kerchunk-geoh5

A Python library for reading geoh5 (Geoscience Analyst) files using kerchunk.

This library provides tools to create kerchunk references for geoh5 files (an HDF5-based geoscience data format). The references let you read selected parts of large, cloud-hosted geoscience datasets without downloading entire files.
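
For orientation, a kerchunk reference is a JSON mapping from zarr-style keys to either inline content or a `[url, offset, length]` triple pointing into the original file. The sketch below builds a tiny hand-written reference and shows how a chunk key resolves to a byte range. The array name, URL, offsets, and the `resolve_chunk` helper are made up for illustration and are not taken from a real geoh5 file or from this library.

```python
import json

# Hand-written example in kerchunk's version 1 reference layout.
# All paths, offsets, and sizes here are illustrative placeholders.
reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "density/.zarray": json.dumps({
            "shape": [15000], "chunks": [15000], "dtype": "<f8",
            "compressor": None, "fill_value": None, "filters": None,
            "order": "C", "zarr_format": 2,
        }),
        # A chunk stored in the remote file: [url, byte offset, byte length]
        "density/0": ["s3://bucket/file.geoh5", 4096, 120000],
    },
}


def resolve_chunk(refs, key):
    """Return ("inline", data) or ("remote", url, offset, length) for a key.

    Whole-file entries of the form [url] are not handled in this sketch.
    """
    entry = refs[key]
    if isinstance(entry, str):
        return ("inline", entry)
    url, offset, length = entry
    return ("remote", url, offset, length)


kind, url, offset, length = resolve_chunk(reference["refs"], "density/0")
print(f"{kind}: bytes {offset}..{offset + length - 1} of {url}")
```

A reader only needs to fetch that byte range to obtain the chunk, which is what makes partial access to a large file possible.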

Features

  • Geoh5Reader: Create kerchunk references from geoh5 files (local or cloud-stored)
  • CloudAccessor: Manage cloud storage credentials and access (S3-focused)
  • DataNavigator: Navigate and extract data from zarr-backed geoh5 files
  • Geoh5Scanner: Analyze geoh5 file structure and generate reports
  • Batch Processing: Process multiple files efficiently
  • Block Model Support: Specialized tools for working with block model data

Installation

From source

git clone https://github.com/RichardScottOZ/Kerchunk-geoh5.git
cd Kerchunk-geoh5
pip install -e .

With optional dependencies

# For direct geoh5 reading
pip install -e ".[geoh5py]"

# For visualization support
pip install -e ".[viz]"

# For all optional dependencies
pip install -e ".[all]"

Quick Start

Create a kerchunk reference for a local file

from kerchunk_geoh5 import Geoh5Reader

reader = Geoh5Reader(inline_threshold=100)
reference = reader.create_reference("file.geoh5")
reader.save_reference(reference, "file_reference.json")
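
The `inline_threshold` argument presumably mirrors kerchunk's option of the same name, where chunks smaller than the threshold (in bytes) are embedded in the reference itself instead of being stored as a pointer. A minimal sketch of that decision, assuming this behaviour; the helper `build_entry` is hypothetical and not part of this library:

```python
import base64


def build_entry(url, offset, chunk_bytes, inline_threshold=100):
    """Hypothetical helper: inline small chunks, point at large ones."""
    if len(chunk_bytes) < inline_threshold:
        # Kerchunk references mark binary inline data with a "base64:" prefix
        return "base64:" + base64.b64encode(chunk_bytes).decode("ascii")
    # Otherwise record where the chunk lives: [url, byte offset, byte length]
    return [url, offset, len(chunk_bytes)]


small = build_entry("file.geoh5", 0, b"\x00" * 16)
large = build_entry("file.geoh5", 2048, b"\x00" * 4096)
print(type(small).__name__, type(large).__name__)
```

A higher threshold makes the JSON larger but saves round trips for small arrays, so the best value depends on how many tiny datasets a file contains.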

Work with cloud-stored files

from kerchunk_geoh5 import Geoh5Reader, CloudAccessor

# Set up cloud access
cloud = CloudAccessor(profile="myprofile")
storage_options = cloud.get_storage_options()

# Create reference
reader = Geoh5Reader()
reference = reader.create_reference(
    "s3://bucket/file.geoh5",
    storage_options=storage_options
)
reader.save_reference(reference, "reference.json")
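
How `CloudAccessor` resolves a profile is not shown here. One plausible approach, sketched below with only the standard library, reads the named section of an AWS shared credentials file and maps it onto the `key`/`secret`/`token` options that s3fs accepts. The function name `storage_options_from_profile` and its parameters are assumptions for illustration.

```python
import configparser
from pathlib import Path


def storage_options_from_profile(profile="default", credentials_path=None):
    """Hypothetical helper: read an AWS profile into s3fs-style options."""
    path = Path(credentials_path or Path.home() / ".aws" / "credentials")
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse
    if not parser.read(path):
        raise FileNotFoundError(f"No credentials file at {path}")
    if profile not in parser:
        raise KeyError(f"Profile {profile!r} not found in {path}")
    section = parser[profile]
    options = {
        "key": section["aws_access_key_id"],
        "secret": section["aws_secret_access_key"],
    }
    # Temporary credentials also carry a session token
    if "aws_session_token" in section:
        options["token"] = section["aws_session_token"]
    return options
```

The returned dictionary has the same shape one would pass as `storage_options` when opening an `s3://` URL through fsspec.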

Navigate and extract data

from kerchunk_geoh5 import DataNavigator, CloudAccessor

cloud = CloudAccessor(profile="myprofile")
navigator = DataNavigator(
    "reference.json",
    remote_protocol="s3",
    remote_options=cloud.get_remote_options()
)

# List structure
print(navigator.list_groups())

# Find block models (typically 10k-20k elements)
block_models = navigator.find_data_by_shape(10000, 20000)

# Extract data
if block_models:
    data = navigator.get_data(block_models[0]['path'])
    print(f"Data shape: {data.shape}")
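
The size filter can be pictured as a scan over the `.zarray` metadata entries in the reference. The sketch below is an assumption about how such a search could work, not the library's implementation; `find_arrays_by_size` is a hypothetical name and the sample keys are invented. It returns dictionaries with a `path` key to match the usage above.

```python
import json
from math import prod


def find_arrays_by_size(refs, min_size=None, max_size=None):
    """Hypothetical: list arrays whose element count lies in [min_size, max_size]."""
    matches = []
    for key, value in refs.items():
        if not key.endswith(".zarray"):
            continue
        meta = json.loads(value) if isinstance(value, str) else value
        size = prod(meta["shape"])  # total number of elements
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        # Strip the trailing "/.zarray" to get the array's path
        path = key[: -len("/.zarray")] if "/" in key else ""
        matches.append({"path": path, "shape": tuple(meta["shape"]), "size": size})
    return matches


refs = {
    "Data/a/.zarray": json.dumps({"shape": [15000]}),
    "Data/b/.zarray": json.dumps({"shape": [200, 3]}),
}
print(find_arrays_by_size(refs, 10000, 20000))
```

Because only metadata is inspected, a search like this does not need to read any chunk data from the remote file.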

Scan file structure

from kerchunk_geoh5 import DataNavigator, Geoh5Scanner

navigator = DataNavigator("reference.json")
scanner = Geoh5Scanner(navigator)

# Get statistics
stats = scanner.get_summary_statistics()
print(f"Data items: {stats['valid_data_items']}")
print(f"Objects: {stats['valid_objects']}")

# Export report
scanner.export_structure_report("report.txt")
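
As a rough picture of what counts like these might be based on, the sketch below tallies distinct entities per category from reference keys. It assumes the usual geoh5 layout in which entities live under a workspace group (named `GEOSCIENCE` here) with categories such as `Data` and `Objects`; the function `count_top_level_entities` and the sample keys are hypothetical, and the library's own statistics may be computed differently.

```python
from collections import Counter


def count_top_level_entities(ref_keys, workspace="GEOSCIENCE"):
    """Hypothetical: count distinct children of each workspace category."""
    seen = set()
    for key in ref_keys:
        parts = key.split("/")
        # Expect keys like "<workspace>/<category>/<entity id>/..."
        if len(parts) >= 4 and parts[0] == workspace:
            seen.add((parts[1], parts[2]))
    return Counter(category for category, _ in seen)


keys = [
    "GEOSCIENCE/.zgroup",
    "GEOSCIENCE/Data/.zgroup",
    "GEOSCIENCE/Data/id-1/.zgroup",
    "GEOSCIENCE/Data/id-1/Data/.zarray",
    "GEOSCIENCE/Data/id-2/.zgroup",
    "GEOSCIENCE/Objects/id-3/.zgroup",
]
print(dict(count_top_level_entities(keys)))
```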

Examples

The examples/ directory contains complete working examples:

  • basic_local_file.py - Simple local file reference creation
  • cloud_batch_processing.py - Batch process S3-stored files
  • navigate_and_extract.py - Navigate and extract data
  • scan_structure.py - Analyze file structure
  • complete_workflow.py - End-to-end workflow

API Reference

Geoh5Reader

reader = Geoh5Reader(inline_threshold=100)
reference = reader.create_reference(file_path, storage_options=None)
references = reader.create_references_batch(file_paths, storage_options=None)
reader.save_reference(reference, output_path)
reference = reader.load_reference(reference_path)

CloudAccessor

cloud = CloudAccessor(profile=None, credentials=None, region="us-west-2")
creds = cloud.load_aws_credentials(profile="default")
s3 = cloud.s3  # Get S3FileSystem instance
storage_opts = cloud.get_storage_options()
remote_opts = cloud.get_remote_options()
files = cloud.list_files(bucket, prefix="", pattern="*.geoh5")
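
The `prefix` and `pattern` arguments of `list_files` suggest a prefix restriction followed by a glob match on file names. A minimal sketch of that filtering step, assuming these semantics; `filter_keys` is a hypothetical helper that operates on an already-fetched list of object keys:

```python
from fnmatch import fnmatchcase


def filter_keys(keys, prefix="", pattern="*.geoh5"):
    """Hypothetical: keep keys under a prefix whose file name matches a glob."""
    return [
        key for key in keys
        # Match the pattern against the last path component only
        if key.startswith(prefix) and fnmatchcase(key.rsplit("/", 1)[-1], pattern)
    ]


keys = ["surveys/a.geoh5", "surveys/notes.txt", "other/c.geoh5"]
print(filter_keys(keys, prefix="surveys/"))
```

`fnmatchcase` is used so that matching is case-sensitive on every platform, which fits S3 key semantics.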

DataNavigator

nav = DataNavigator(reference_path, remote_protocol="s3", remote_options=None)
groups = nav.list_groups(path="")
arrays = nav.list_arrays(path="")
data = nav.get_data(path)
attrs = nav.get_attributes(path)
results = nav.find_data_by_shape(min_size=None, max_size=None)
structure = nav.get_geoscience_structure()
vertices = nav.extract_vertices(object_path)

Geoh5Scanner

scanner = Geoh5Scanner(navigator)
structure = scanner.scan_full_structure()
data_items = scanner.scan_data_items()
objects = scanner.scan_objects()
block_models = scanner.find_block_models()
stats = scanner.get_summary_statistics()
report = scanner.export_structure_report(output_path=None)

Use Cases

Cloud-based Geoscience Data Analysis

Process large geoh5 files stored in S3 without downloading:

  • Create kerchunk references once
  • Share lightweight JSON references
  • Multiple users access the same data efficiently
  • Reduce data transfer costs

Block Model Processing

Work with block model inversions and gravity data:

  • Identify block models by size and type
  • Extract vertex coordinates
  • Analyze density distributions
  • Compare multiple inversions

Data Discovery

Explore unknown geoh5 files:

  • Scan complete structure
  • Generate summary reports
  • Find data by characteristics
  • Extract metadata

Architecture

The library is organized into four main components:

  1. Reader (reader.py) - Create kerchunk references from geoh5 files
  2. Cloud (cloud.py) - Handle cloud storage and credentials
  3. Navigator (navigator.py) - Navigate zarr-backed data structures
  4. Scanner (scanner.py) - Analyze and report on file structure

Background

This library builds on:

  • kerchunk - Create references to chunked data formats
  • geoh5py - Read/write geoh5 files
  • fsspec - Unified filesystem interface
  • zarr - Chunked array storage

Documentation

Contributing

Contributions are welcome! Please read our Contributing Guidelines before submitting issues or pull requests.

License

MIT License. See the LICENSE file for details.

Related Projects

Original Examples

The original Jupyter notebook (Kerchunk-Block-Model-Analysis.ipynb) contains examples of:

  • AWS Batch framework setup
  • PyVista visualization
  • Block model analysis workflows

These have been generalized into the library modules.

About

Examples of reading data from cloud-stored geoh5 / Geoscience Analyst files
