kerchunk-geoh5

A Python library for reading geoh5 (Geoscience ANALYST) files through kerchunk references.

This library provides tools to create kerchunk references for geoh5 files (HDF5-based geoscience data format), enabling efficient cloud-based access to large geoscience datasets without downloading entire files.
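
To make the idea of a reference concrete, here is a minimal sketch of the kerchunk version 1 reference layout: a JSON object whose `refs` map zarr keys either to inline strings or to `[url, offset, length]` triples pointing into the original file. The file contents, the `density/0` key, and the `resolve_ref` helper are illustrative assumptions, not part of this library; a real reader would resolve references through fsspec rather than `open`.

```python
import json
import os
import tempfile


def resolve_ref(refs, key):
    """Return the bytes that one reference entry points to.

    An entry is either an inline string or a [path, offset, length] triple.
    Only local paths and plain inline strings are handled in this sketch.
    """
    entry = refs[key]
    if isinstance(entry, str):
        return entry.encode()  # inline data stored directly in the JSON
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Build a stand-in "data file" and a reference that points into it.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADER" + b"chunk-payload" + b"TRAILER")
    data_path = tmp.name

reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "density/0": [data_path, 6, 13],  # 13 bytes starting at byte 6
    },
}

chunk = resolve_ref(reference["refs"], "density/0")
print(chunk)
os.remove(data_path)
```

Only the 13 requested bytes are read, which is why a small JSON reference is enough to reach into a much larger file.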

Features

Geoh5Reader: Create kerchunk references from geoh5 files (local or cloud-stored)
CloudAccessor: Manage cloud storage credentials and access (S3-focused)
DataNavigator: Navigate and extract data from zarr-backed geoh5 files
Geoh5Scanner: Analyze geoh5 file structure and generate reports
Batch Processing: Process multiple files efficiently
Block Model Support: Specialized tools for working with block model data

Installation

From source

git clone https://github.com/RichardScottOZ/Kerchunk-geoh5.git
cd Kerchunk-geoh5
pip install -e .

With optional dependencies

# For direct geoh5 reading
pip install -e ".[geoh5py]"

# For visualization support
pip install -e ".[viz]"

# For all optional dependencies
pip install -e ".[all]"

Quick Start

Create a kerchunk reference for a local file

from kerchunk_geoh5 import Geoh5Reader

reader = Geoh5Reader(inline_threshold=100)
reference = reader.create_reference("file.geoh5")
reader.save_reference(reference, "file_reference.json")
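
The `inline_threshold` argument is not explained above. Assuming it is forwarded to kerchunk's HDF5 translator, chunks smaller than the threshold (in bytes) are embedded in the reference JSON and larger chunks are stored as pointers into the source file. The sketch below illustrates that decision only; `encode_chunk` is a hypothetical helper and the offsets are made up.

```python
import base64


def encode_chunk(url, offset, data, inline_threshold=100):
    """Return a reference entry for one chunk.

    Chunks shorter than inline_threshold bytes are embedded in the JSON;
    larger ones become a [url, offset, length] pointer into the source file.
    """
    if len(data) < inline_threshold:
        try:
            return data.decode("ascii")  # plain text stays readable
        except UnicodeDecodeError:
            # Binary data is stored as text with a "base64:" prefix.
            return "base64:" + base64.b64encode(data).decode("ascii")
    return [url, offset, len(data)]


small = encode_chunk("file.geoh5", 2048, b"\xff\xfe\x00\x01")
large = encode_chunk("file.geoh5", 4096, bytes(4000))
print(small)
print(large)
```

A higher threshold produces a larger JSON file but fewer small range requests at read time; a threshold of 0 keeps every chunk as a pointer.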

Work with cloud-stored files

from kerchunk_geoh5 import Geoh5Reader, CloudAccessor

# Set up cloud access
cloud = CloudAccessor(profile="myprofile")
storage_options = cloud.get_storage_options()

# Create reference
reader = Geoh5Reader()
reference = reader.create_reference(
    "s3://bucket/file.geoh5",
    storage_options=storage_options
)
reader.save_reference(reference, "reference.json")

Navigate and extract data

from kerchunk_geoh5 import DataNavigator, CloudAccessor

cloud = CloudAccessor(profile="myprofile")
navigator = DataNavigator(
    "reference.json",
    remote_protocol="s3",
    remote_options=cloud.get_remote_options()
)

# List structure
print(navigator.list_groups())

# Find block models (typically 10k-20k elements)
block_models = navigator.find_data_by_shape(10000, 20000)

# Extract data
if block_models:
    data = navigator.get_data(block_models[0]['path'])
    print(f"Data shape: {data.shape}")
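
How `find_data_by_shape` is implemented is not shown here. As a rough sketch of one way to search by size, the function below reads the `.zarray` metadata entries from a reference's `refs` mapping and filters arrays by element count, with no chunk data fetched. The function name, the sample paths, and the `shape` and `size` result keys are assumptions for illustration; only the `path` key appears in the example above.

```python
import json
from math import prod


def find_arrays_by_size(refs, min_size=None, max_size=None):
    """List arrays whose element count falls within [min_size, max_size]."""
    results = []
    for key, value in refs.items():
        if not key.endswith(".zarray"):
            continue
        # Metadata is usually a JSON string, but may already be a dict.
        meta = json.loads(value) if isinstance(value, str) else value
        size = prod(meta["shape"])
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        path = key[: -len(".zarray")].rstrip("/")
        results.append({"path": path, "shape": tuple(meta["shape"]), "size": size})
    return results


# Illustrative reference entries; real paths depend on the file.
refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "GEOSCIENCE/Data/a/Data/.zarray": json.dumps({"shape": [15000]}),
    "GEOSCIENCE/Data/b/Data/.zarray": json.dumps({"shape": [250]}),
}

matches = find_arrays_by_size(refs, 10000, 20000)
print(matches)
```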

Scan file structure

from kerchunk_geoh5 import DataNavigator, Geoh5Scanner

navigator = DataNavigator("reference.json")
scanner = Geoh5Scanner(navigator)

# Get statistics
stats = scanner.get_summary_statistics()
print(f"Data items: {stats['valid_data_items']}")
print(f"Objects: {stats['valid_objects']}")

# Export report
scanner.export_structure_report("report.txt")
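
For a sense of what a structure scan can compute from the reference alone, the sketch below counts groups, arrays, and chunks by inspecting reference keys. It is not the scanner's implementation: `summarize_refs` and its result keys are hypothetical and differ from the `valid_data_items` and `valid_objects` keys used above.

```python
def summarize_refs(refs):
    """Count zarr groups, arrays, and chunk entries in a reference mapping."""
    stats = {
        "groups": 0,
        "arrays": 0,
        "inline_chunks": 0,
        "remote_chunks": 0,
        "remote_bytes": 0,
    }
    for key, value in refs.items():
        name = key.rsplit("/", 1)[-1]
        if name == ".zgroup":
            stats["groups"] += 1
        elif name == ".zarray":
            stats["arrays"] += 1
        elif name.startswith("."):
            continue  # .zattrs and other metadata entries
        elif isinstance(value, list):
            stats["remote_chunks"] += 1
            if len(value) == 3:  # [url, offset, length]
                stats["remote_bytes"] += value[2]
        else:
            stats["inline_chunks"] += 1
    return stats


# Illustrative reference entries; real keys depend on the file.
sample_refs = {
    ".zgroup": "{}",
    "model/.zgroup": "{}",
    "model/density/.zarray": "{}",
    "model/density/.zattrs": "{}",
    "model/density/0": ["s3://bucket/file.geoh5", 4096, 120000],
    "model/density/1": ["s3://bucket/file.geoh5", 124096, 80000],
    "model/name/.zarray": "{}",
    "model/name/0": "block model",
}

summary = summarize_refs(sample_refs)
print(summary)
```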

Examples

The examples/ directory contains complete working examples:

basic_local_file.py - Simple local file reference creation
cloud_batch_processing.py - Batch process S3-stored files
navigate_and_extract.py - Navigate and extract data
scan_structure.py - Analyze file structure
complete_workflow.py - End-to-end workflow

API Reference

Geoh5Reader

reader = Geoh5Reader(inline_threshold=100)
reference = reader.create_reference(file_path, storage_options=None)
references = reader.create_references_batch(file_paths, storage_options=None)
reader.save_reference(reference, output_path)
reference = reader.load_reference(reference_path)
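
Whether `create_references_batch` runs sequentially or in parallel, and how it reports failures, is not documented here. If per-file error handling or concurrency is needed, one option is a small wrapper like the sketch below, which runs any reference-building callable over many paths and collects failures separately. `build_references` and `fake_make_reference` are hypothetical stand-ins so the example runs without the library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_references(paths, make_reference, max_workers=4):
    """Call make_reference(path) for each path; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(make_reference, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one file fails
                errors[path] = exc
    return results, errors


def fake_make_reference(path):
    """Stand-in for a real reference builder."""
    if not path.endswith(".geoh5"):
        raise ValueError(f"not a geoh5 file: {path}")
    return {"version": 1, "refs": {}}


results, errors = build_references(
    ["a.geoh5", "b.geoh5", "notes.txt"], fake_make_reference
)
print(sorted(results), sorted(errors))
```

With the library, the callable could be something like `lambda p: reader.create_reference(p, storage_options=storage_options)`, using the signature listed above.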

CloudAccessor

cloud = CloudAccessor(profile=None, credentials=None, region="us-west-2")
creds = cloud.load_aws_credentials(profile="default")
s3 = cloud.s3  # Get S3FileSystem instance
storage_opts = cloud.get_storage_options()
remote_opts = cloud.get_remote_options()
files = cloud.list_files(bucket, prefix="", pattern="*.geoh5")
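
How `load_aws_credentials` reads a profile is not shown. As a minimal sketch, assuming the standard INI-style AWS credentials file, a named profile can be read with `configparser` and mapped to the `key`, `secret`, and `token` options that s3fs accepts. `load_profile` is a hypothetical helper and the credential values below are placeholders.

```python
import configparser
import os
import tempfile


def load_profile(credentials_path, profile="default"):
    """Read one profile from an AWS-style credentials file."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)
    if profile not in parser:
        raise KeyError(f"profile {profile!r} not found in {credentials_path}")
    section = parser[profile]
    creds = {
        "key": section["aws_access_key_id"],
        "secret": section["aws_secret_access_key"],
    }
    if "aws_session_token" in section:  # only present for temporary credentials
        creds["token"] = section["aws_session_token"]
    return creds


# Write a throwaway credentials file with placeholder values.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".ini") as tmp:
    tmp.write(
        "[myprofile]\n"
        "aws_access_key_id = EXAMPLEKEY\n"
        "aws_secret_access_key = EXAMPLESECRET\n"
    )
    creds_path = tmp.name

options = load_profile(creds_path, profile="myprofile")
print(sorted(options))
os.remove(creds_path)
```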

DataNavigator

nav = DataNavigator(reference_path, remote_protocol="s3", remote_options=None)
groups = nav.list_groups(path="")
arrays = nav.list_arrays(path="")
data = nav.get_data(path)
attrs = nav.get_attributes(path)
results = nav.find_data_by_shape(min_size=None, max_size=None)
structure = nav.get_geoscience_structure()
vertices = nav.extract_vertices(object_path)

Geoh5Scanner

scanner = Geoh5Scanner(navigator)
structure = scanner.scan_full_structure()
data_items = scanner.scan_data_items()
objects = scanner.scan_objects()
block_models = scanner.find_block_models()
stats = scanner.get_summary_statistics()
report = scanner.export_structure_report(output_path=None)

Use Cases

Cloud-based Geoscience Data Analysis

Process large geoh5 files stored in S3 without downloading:

Create kerchunk references once
Share lightweight JSON references
Multiple users access same data efficiently
Reduce data transfer costs

Block Model Processing

Work with block model inversions and gravity data:

Identify block models by size and type
Extract vertex coordinates
Analyze density distributions
Compare multiple inversions

Data Discovery

Explore unknown geoh5 files:

Scan complete structure
Generate summary reports
Find data by characteristics
Extract metadata

Architecture

The library is organized into four main components:

Reader (reader.py) - Create kerchunk references from geoh5 files
Cloud (cloud.py) - Handle cloud storage and credentials
Navigator (navigator.py) - Navigate zarr-backed data structures
Scanner (scanner.py) - Analyze and report on file structure

Background

This library builds on:

kerchunk - Create references to chunked data formats
geoh5py - Read/write geoh5 files
fsspec - Unified filesystem interface
zarr - Chunked array storage

Documentation

Installation Guide - Detailed installation instructions
Contributing Guidelines - How to contribute to the project
Changelog - Version history and changes

Contributing

Contributions are welcome! Please read our Contributing Guidelines before submitting issues or pull requests.

License

MIT License - See LICENSE file for details

Related Projects

kerchunk - Original kerchunk fork
geoh5py - Geoh5 Python library

Original Examples

The original Jupyter notebook (Kerchunk-Block-Model-Analysis.ipynb) contains examples of:

AWS Batch framework setup
PyVista visualization
Block model analysis workflows

These have been generalized into the library modules.

