A robust and generalizable Python library for reading geoh5/Geoscience Analyst files using kerchunk.
This library provides tools to create kerchunk references for geoh5 files (HDF5-based geoscience data format), enabling efficient cloud-based access to large geoscience datasets without downloading entire files.
- Geoh5Reader: Create kerchunk references from geoh5 files (local or cloud-stored)
- CloudAccessor: Manage cloud storage credentials and access (S3-focused)
- DataNavigator: Navigate and extract data from zarr-backed geoh5 files
- Geoh5Scanner: Analyze geoh5 file structure and generate reports
- Batch Processing: Process multiple files efficiently
- Block Model Support: Specialized tools for working with block model data
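To illustrate why a reference avoids downloading whole files, here is a minimal standard-library sketch of the kerchunk version-1 reference layout: each key maps either to inline data or to a `[path, offset, length]` pointer into the original file. The `read_chunk` helper, the key names, and the temporary stand-in file are hypothetical and are not part of this library's API. A real reader would resolve the pointers through fsspec against local or S3 storage.

```python
import base64
import json
import os
import tempfile


def read_chunk(refs, key):
    """Resolve one key of a kerchunk-style reference set to raw bytes."""
    entry = refs[key]
    if isinstance(entry, str):
        # Inline data lives in the JSON itself, optionally base64-encoded.
        if entry.startswith("base64:"):
            return base64.b64decode(entry[len("base64:"):])
        return entry.encode()
    # Pointer form: only `length` bytes at `offset` are read from the file.
    # (This sketch handles only the three-element form.)
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for a large geoh5 file: padding, a 16-byte "chunk", more padding.
with tempfile.NamedTemporaryFile(delete=False, suffix=".geoh5") as tmp:
    tmp.write(b"\x00" * 1000 + b"0123456789abcdef" + b"\x00" * 1000)
    demo_path = tmp.name

reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Larger chunk: stored as a pointer into the original file.
        "density/0": [demo_path, 1000, 16],
        # Small chunk: inlined, as happens below an inline threshold.
        "labels/0": "base64:" + base64.b64encode(b"\x01\x02").decode(),
    },
}

chunk = read_chunk(reference["refs"], "density/0")
inline = read_chunk(reference["refs"], "labels/0")
print(chunk, inline)

os.remove(demo_path)
```

Only the 16 referenced bytes are read from the 2016-byte file, which is the same idea that lets a byte-range request fetch a single chunk from cloud storage.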
git clone https://github.com/RichardScottOZ/Kerchunk-geoh5.git
cd Kerchunk-geoh5
pip install -e .
# For direct geoh5 reading
pip install -e ".[geoh5py]"
# For visualization support
pip install -e ".[viz]"
# For all optional dependencies
pip install -e ".[all]"

from kerchunk_geoh5 import Geoh5Reader
reader = Geoh5Reader(inline_threshold=100)
reference = reader.create_reference("file.geoh5")
reader.save_reference(reference, "file_reference.json")

from kerchunk_geoh5 import Geoh5Reader, CloudAccessor
# Set up cloud access
cloud = CloudAccessor(profile="myprofile")
storage_options = cloud.get_storage_options()
# Create reference
reader = Geoh5Reader()
reference = reader.create_reference(
    "s3://bucket/file.geoh5",
    storage_options=storage_options
)
reader.save_reference(reference, "reference.json")

from kerchunk_geoh5 import DataNavigator, CloudAccessor
cloud = CloudAccessor(profile="myprofile")
navigator = DataNavigator(
    "reference.json",
    remote_protocol="s3",
    remote_options=cloud.get_remote_options()
)
# List structure
print(navigator.list_groups())
# Find block models (typically 10k-20k elements)
block_models = navigator.find_data_by_shape(10000, 20000)
# Extract data
if block_models:
    data = navigator.get_data(block_models[0]['path'])
    print(f"Data shape: {data.shape}")

from kerchunk_geoh5 import DataNavigator, Geoh5Scanner
navigator = DataNavigator("reference.json")
scanner = Geoh5Scanner(navigator)
# Get statistics
stats = scanner.get_summary_statistics()
print(f"Data items: {stats['valid_data_items']}")
print(f"Objects: {stats['valid_objects']}")
# Export report
scanner.export_structure_report("report.txt")

The examples/ directory contains complete working examples:
- basic_local_file.py - Simple local file reference creation
- cloud_batch_processing.py - Batch process S3-stored files
- navigate_and_extract.py - Navigate and extract data
- scan_structure.py - Analyze file structure
- complete_workflow.py - End-to-end workflow
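The quick-start snippets above save references as JSON files. Assuming those files follow the standard kerchunk version-1 layout (the text above does not confirm this), a saved reference can be inspected with the standard library alone. The sketch below counts metadata, inlined, and remote entries, which gives a rough picture of how an inline threshold affected the output. The function name, the sample keys, and the bucket URL are hypothetical.

```python
import json
import os
import tempfile

ZARR_METADATA = (".zgroup", ".zarray", ".zattrs")


def summarize_reference(path):
    """Count metadata, inlined and remote entries in a reference JSON file."""
    with open(path) as f:
        refs = json.load(f)["refs"]
    summary = {"metadata": 0, "inline": 0, "remote": 0, "remote_bytes": 0}
    for key, entry in refs.items():
        name = key.rsplit("/", 1)[-1]
        if name in ZARR_METADATA:
            summary["metadata"] += 1
        elif isinstance(entry, str):
            # Chunk data embedded directly in the JSON.
            summary["inline"] += 1
        else:
            # [url, offset, length] pointer into the original file.
            summary["remote"] += 1
            if len(entry) == 3:
                summary["remote_bytes"] += entry[2]
    return summary


# Hypothetical reference content standing in for a saved file.
demo = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "density/.zarray": json.dumps({"shape": [50000], "chunks": [25000]}),
        "density/0": ["s3://bucket/file.geoh5", 4096, 120000],
        "density/1": ["s3://bucket/file.geoh5", 124096, 80000],
        "labels/.zarray": json.dumps({"shape": [2], "chunks": [2]}),
        "labels/0": "base64:AQI=",
    },
}

with tempfile.TemporaryDirectory() as tmpdir:
    ref_path = os.path.join(tmpdir, "reference.json")
    with open(ref_path, "w") as f:
        json.dump(demo, f)
    summary = summarize_reference(ref_path)

print(summary)
```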
reader = Geoh5Reader(inline_threshold=100)
reference = reader.create_reference(file_path, storage_options=None)
references = reader.create_references_batch(file_paths, storage_options=None)
reader.save_reference(reference, output_path)
reference = reader.load_reference(reference_path)

cloud = CloudAccessor(profile=None, credentials=None, region="us-west-2")
creds = cloud.load_aws_credentials(profile="default")
s3 = cloud.s3 # Get S3FileSystem instance
storage_opts = cloud.get_storage_options()
remote_opts = cloud.get_remote_options()
files = cloud.list_files(bucket, prefix="", pattern="*.geoh5")

nav = DataNavigator(reference_path, remote_protocol="s3", remote_options=None)
groups = nav.list_groups(path="")
arrays = nav.list_arrays(path="")
data = nav.get_data(path)
attrs = nav.get_attributes(path)
results = nav.find_data_by_shape(min_size=None, max_size=None)
structure = nav.get_geoscience_structure()
vertices = nav.extract_vertices(object_path)

scanner = Geoh5Scanner(navigator)
structure = scanner.scan_full_structure()
data_items = scanner.scan_data_items()
objects = scanner.scan_objects()
block_models = scanner.find_block_models()
stats = scanner.get_summary_statistics()
report = scanner.export_structure_report(output_path=None)

Process large geoh5 files stored in S3 without downloading:
- Create kerchunk references once
- Share lightweight JSON references
- Multiple users access same data efficiently
- Reduce data transfer costs
Work with block model inversions and gravity data:
- Identify block models by size and type
- Extract vertex coordinates
- Analyze density distributions
- Compare multiple inversions
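Identifying block models by size can be sketched directly on the zarr metadata inside a reference, without reading any chunk data. The function below is an illustration of the idea and not the library's `find_data_by_shape` implementation: it assumes zarr v2 `.zarray` entries, and the paths and shapes in the sample are invented.

```python
import json
from math import prod


def find_arrays_by_size(refs, min_size=None, max_size=None):
    """Return arrays whose total element count lies within the given bounds."""
    matches = []
    for key, entry in refs.items():
        if not key.endswith(".zarray"):
            continue
        meta = json.loads(entry) if isinstance(entry, str) else entry
        shape = meta["shape"]
        size = prod(shape)  # total number of elements
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        path = key[: -len(".zarray")].rstrip("/")
        matches.append({"path": path, "shape": tuple(shape), "size": size})
    return matches


# Hypothetical metadata entries as they might appear in a reference file.
refs = {
    "Data/a/.zarray": json.dumps({"shape": [15000]}),
    "Data/b/.zarray": json.dumps({"shape": [120]}),
    "Objects/c/.zarray": json.dumps({"shape": [8000, 3]}),
}

block_models = find_arrays_by_size(refs, 10000, 20000)
print(block_models)
```

Because only metadata is touched, a filter like this stays cheap even when the underlying arrays are stored remotely.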
Explore unknown geoh5 files:
- Scan complete structure
- Generate summary reports
- Find data by characteristics
- Extract metadata
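Scanning the structure of an unfamiliar file comes down to walking the keys of the reference. As a rough sketch, assuming zarr v2 naming where every group has a `.zgroup` key, the direct child groups of a path can be listed as follows. The function is a stand-in for illustration and the group names in the sample are only examples.

```python
def list_child_groups(refs, path=""):
    """List the direct child groups of `path` using '.zgroup' marker keys."""
    prefix = path.strip("/")
    prefix = prefix + "/" if prefix else ""
    marker = "/.zgroup"
    children = set()
    for key in refs:
        if key.endswith(marker) and key.startswith(prefix):
            relative = key[len(prefix):-len(marker)]
            # Keep direct children only, not deeper descendants.
            if relative and "/" not in relative:
                children.add(relative)
    return sorted(children)


# Example keys; values are omitted because only the key names matter here.
refs = {
    ".zgroup": "",
    "GEOSCIENCE/.zgroup": "",
    "GEOSCIENCE/Data/.zgroup": "",
    "GEOSCIENCE/Objects/.zgroup": "",
    "GEOSCIENCE/Objects/model/.zgroup": "",
}

top_level = list_child_groups(refs)
second_level = list_child_groups(refs, "GEOSCIENCE")
print(top_level, second_level)
```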
The library is organized into four main components:
- Reader (reader.py) - Create kerchunk references from geoh5 files
- Cloud (cloud.py) - Handle cloud storage and credentials
- Navigator (navigator.py) - Navigate zarr-backed data structures
- Scanner (scanner.py) - Analyze and report on file structure
This library builds on:
- kerchunk - Create references to chunked data formats
- geoh5py - Read/write geoh5 files
- fsspec - Unified filesystem interface
- zarr - Chunked array storage
Further documentation:
- Installation Guide - Detailed installation instructions
- Contributing Guidelines - How to contribute to the project
- Changelog - Version history and changes
Contributions are welcome! Please read our Contributing Guidelines before submitting issues or pull requests.
MIT License - See LICENSE file for details
The original Jupyter notebook (Kerchunk-Block-Model-Analysis.ipynb) contains examples of:
- AWS Batch framework setup
- PyVista visualization
- Block model analysis workflows
These have been generalized into the library modules.