Cloud Optimized Format Investigation for ICESat‐2: HDF5 Study Results and Recommendations
This wiki serves as a living document for the ICESat-2 cloud-optimized format investigation and ATL03 benchmarking activities, as part of the 2023 ICESat-2 Hackweek.
- Pull content from https://github.com/nsidc/uwg_2023_cloud_formats and the Hackweek presentation
- While the initial ICESat-2 Hackweek project included benchmarking of several cloud-optimized formats, including GeoParquet and FlatGeobuf, the investigation shifted its focus to cloud optimizations for HDF5.
Descriptions of formats
Current chunking strategy for the most relevant variables in ATL03. The numbers were taken from an ATL03 granule with high data density over the Antarctic Peninsula; see the data selection notebook.
| dataset | data_type | ~points / ~km covered (max density) | size on disk (bytes) | chunks | chunk shape | avg. chunk size |
|---|---|---|---|---|---|---|
| /gtx/heights/h_ph | float (4 bytes) | 5000 / 2 | 160509659 | 4584 | (10000,) | 35 kB |
| /gtx/heights/lat_ph | float (4 bytes) | 5000 / 2 | 214649295 | 4584 | (10000,) | 46 kB |
| /gtx/heights/lon_ph | float (4 bytes) | 5000 / 2 | 232415912 | 4584 | (10000,) | 50 kB |
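The chunking numbers above can be reproduced directly with `h5py`. The sketch below is a minimal example, assuming a local copy of an ATL03 granule; the file name and beam group are placeholders.

```python
import h5py

# Placeholder file name and beam group; substitute any ATL03 granule / ground track.
GRANULE = "ATL03_example.h5"
VARIABLES = ["gt2l/heights/h_ph", "gt2l/heights/lat_ph", "gt2l/heights/lon_ph"]

with h5py.File(GRANULE, "r") as f:
    for name in VARIABLES:
        dset = f[name]
        n_chunks = -(-dset.shape[0] // dset.chunks[0])  # ceiling division
        stored = dset.id.get_storage_size()             # compressed bytes on disk
        print(
            f"{name}: dtype={dset.dtype}, chunk shape={dset.chunks}, "
            f"chunks={n_chunks}, size on disk={stored} bytes, "
            f"avg chunk size={stored / n_chunks / 1024:.0f} kB"
        )
```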
Describe access tools
We focused on benchmarking in-region access to HDF5 ICESat-2 ATL03 granules stored in AWS S3 buckets in the AWS region us-west-2, which is where NASA data are being stored. The ICESat-2 ATL03 product is Global Geolocated Photon Data (10.5067/ATLAS/ATL03.005) and contains the heights above the ellipsoid, times, latitudes, and longitudes, along with quality control and ancillary data, of individual photons emitted and detected by the ATLAS instrument on board ICESat-2. Each granule file contains estimated heights for photons from pairs of beams for three reference ground tracks. Photons for each beam have a nominal spacing of 0.7 m. File sizes are of the order of 2 GB but can range from 40 MB to 10 GB, depending on the number of background photons detected by the sensor. Background photons are scattered towards the detector from sources outside of the beam emitted by the ATLAS instrument (e.g. from the Sun). ATL03 is a Level 2 product and the lowest-level product likely to be used on a regular basis by the science community. ICESat-2 standard science products and community-developed products are derived from ATL03 or higher-level products. The relatively large file size and common use of ATL03 make it a good test case for benchmarking cloud-optimized formats and access tools.
Benchmarks are measured for six file formats (see [Cloud Optimized Formats](link to section)) and six common access methods. Not all formats can be used with every access method. Figure X shows the benchmark tests as a matrix of file format and access method.
| | Original HDF5 | Repacked HDF5 | Kerchunk Original | Kerchunk Repacked | GeoParquet | FlatGeobuf |
|---|---|---|---|---|---|---|
| `h5py` | X | X | | | | |
| GEDI Subsetter | X | X | | | | |
| `xarray` + `h5netcdf` | X | X | X | X | | |
| `h5coro` | | | | | | |
| `geopandas` + `pyogrio`/GDAL | | | | | | |
| `geopandas` + `parquet` | | | | | | |
Each benchmark test includes in-region file access from an EC2 instance to an S3 bucket, and either calculating the mean of a whole data variable or calculating the mean of a spatial subset of a data variable. The `photon_height` variable was used. Subsetting was done by selecting a latitude range, which required also accessing the `latitude` variable.
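The two benchmark tasks can be summarized by the sketch below, written against an already-open `h5py.File`. The variable paths use the ATL03 dataset names (`h_ph`, `lat_ph`) for one assumed beam group (`gt2l`) rather than the generic `photon_height`/`latitude` labels used above.

```python
import numpy as np
import h5py

def mean_full(h5: h5py.File, beam: str = "gt2l") -> float:
    """Benchmark task 1: mean of a whole data variable (photon heights)."""
    return float(np.mean(h5[f"{beam}/heights/h_ph"][:]))

def mean_latitude_subset(h5: h5py.File, lat_min: float, lat_max: float,
                         beam: str = "gt2l") -> float:
    """Benchmark task 2: mean of a spatial subset selected by a latitude range.

    Note that the subset task reads two variables: the latitudes (to build the
    selection mask) and the photon heights themselves.
    """
    lats = h5[f"{beam}/heights/lat_ph"][:]
    mask = (lats >= lat_min) & (lats <= lat_max)
    return float(np.mean(h5[f"{beam}/heights/h_ph"][:][mask]))
```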
Test granules were selected to represent large and small file sizes. X granules over the Antarctic Peninsula were selected because the highly reflective ice and snow surfaces, along with the complex topography, result in high photon return rates. Y granules over the ocean were selected because darker, less reflective ocean surfaces have lower photon return rates.
| Granule ID | Size (GB) |
|---|---|
| _name_of_granule_ | _size_of_granule_ |
Benchmarking is implemented in Python with a set of benchmark test classes executed in a series of Jupyter notebooks on an AWS EC2 instance with N cores/threads/processes using the CryoCloud JupyterHub. Do we need other information about the hub here?
Add details here on the methodology for the i/o client investigation. Add a table of format / i/o client / environment to describe which combinations were tested?
- Ideal changes to HDF5 and i/o clients for the science use cases we prioritized and tested for ATL03.
- Consider tradeoffs and existing value/benefit of HDF5 compared to the use cases that could benefit from cloud-optimized reformatting.
Specific recommendations for ICESat-2 (just for ATL03 or for all products??)
What is the key recommendation vs what could interim solution(s) look like
- Chunking can be tuned to a specific access pattern (see the sketch below)
- Reference work that JP did: he found that larger chunks were not more performant because of the cost of decompressing each chunk
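To illustrate the chunking trade-off, the minimal sketch below counts how many chunks of a 1-D photon variable must be fetched (and fully decompressed) to read a contiguous index range. The index range and chunk lengths are illustrative only, not values from the benchmarks; the arithmetic simply shows why larger chunks mean fewer GET requests but more bytes decompressed.

```python
def chunks_touched(start: int, stop: int, chunk_len: int) -> int:
    """Number of chunks that must be fetched (and fully decompressed) to read
    photons in the half-open index range [start, stop) of a 1-D variable."""
    first_chunk = start // chunk_len
    last_chunk = (stop - 1) // chunk_len
    return last_chunk - first_chunk + 1

# Illustrative only: reading ~1 million photons out of a granule.
for chunk_len in (10_000, 100_000, 1_000_000):
    n = chunks_touched(start=2_500_000, stop=3_500_000, chunk_len=chunk_len)
    # Every touched chunk is decompressed in full, so larger chunks reduce the
    # number of requests but increase the data decompressed per request.
    print(f"chunk length {chunk_len:>9,}: {n:>4} chunks, "
          f"{n * chunk_len:,} photons decompressed")
```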
Broader lessons learned and recommendations for ICESat-2 and any other future NASA Earthdata mission with similar hierarchical structure
What is the key recommendation vs what could interim solution(s) look like
- Existing limitations and usability challenges
- Still thinking about data access in terms of a POSIX file system
- Any coordinate-based file structure needs multiple reads to access each of the variables across resolutions
- Need to traverse dataset group hierarchy to work with the data
- This is why GeoParquet is beneficial: it has a flatter structure
- Add a table or pseudocode that shows the # of reads needed to grab the decoded time or geolocation variable (see the sketch below this list); access would be faster if the variables were within the same group
- Recommendations:
- All the info you need to access /heights is contained within /heights (even at the expense of a larger file)
- We can improve things only so far with the existing file structure. Need to think about # of reads involved to decode a particular variable (use example of extracting Epoch time)
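As a starting point for the pseudocode requested above, here is a minimal sketch of decoding photon epoch time, assuming the standard ATL03 layout in which per-photon `delta_time` lives under each beam's `/heights` group while the `atlas_sdp_gps_epoch` offset lives in the top-level `/ancillary_data` group. Decoding an absolute timestamp therefore requires traversing, and reading from, two separate parts of the hierarchy.

```python
import h5py

def decoded_photon_times(h5: h5py.File, beam: str = "gt2l"):
    """Return photon times as GPS seconds, counting the reads involved.

    Read 1: per-photon delta_time from the beam's /heights group.
    Read 2: the ATLAS SDP GPS epoch offset from the top-level /ancillary_data group.
    Because the two values live in different groups, every tool must traverse the
    group hierarchy (metadata reads) and issue at least two data reads; if the
    epoch were stored alongside /heights, that one group would be self-contained.
    """
    delta_time = h5[f"{beam}/heights/delta_time"][:]          # read 1
    gps_epoch = h5["ancillary_data/atlas_sdp_gps_epoch"][0]   # read 2
    return delta_time + gps_epoch
```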
- Key recommendation: Modernize the h5 library to make use of cloud optimizations
- Long term this should live in the library - then other tools / languages would benefit beyond python workflows
- H5coro
- Can h5coro work with cloud-optimized HDF5? Yes, and it's the only engine that works better in the tests (out of the box)
- Recommendation for fsspec caching to make reporting more accurate (see the sketch below this list)
- Earthaccess future feature
- Level of effort to implement
- Recommendations for end-users
- Lowest effort (prior to longer term solutions): Tutorial for how to set flags
- End-user discovery of recommended i/o parameters
- CMR, STAC, elsewhere?
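The kind of i/o flags such a tutorial might cover is sketched below, combining fsspec/s3fs caching options with the h5py raw data chunk cache. The S3 URL and all numeric values are illustrative placeholders, not recommended settings from the benchmarks.

```python
import s3fs
import h5py

S3_URL = "s3://example-bucket/ATL03/ATL03_example.h5"  # placeholder

fs = s3fs.S3FileSystem(anon=False)

# fsspec/s3fs flags: block size and caching strategy control how many GET
# requests are issued and how much data is read per request.
s3_file = fs.open(
    S3_URL,
    mode="rb",
    block_size=4 * 1024 * 1024,   # 4 MiB per request (illustrative value)
    cache_type="blockcache",      # cache previously fetched blocks
)

# h5py flags: enlarge the raw data chunk cache so decompressed chunks are reused
# instead of being re-fetched and re-decompressed on repeated access.
h5 = h5py.File(s3_file, "r", rdcc_nbytes=64 * 1024 * 1024, rdcc_nslots=10007)
```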
Relationship between the # of reads that can be decreased through the above methods and the cost of in-region and out-of-region access
- In-region: most important to reduce the # of GET requests
- Out of region: egress is comparable, but there is still a reduced GET request rate
- Out of scope for NSIDC DAAC work but can we say anything else here? Is this where we can ask others to contribute?
- Zarr and a cloud-optimized HDF5 file are basically the same when accessed via Kerchunk (see the sketch below)
- From meeting notes: “What's the time taken to produce the kerchunk JSON file, vs reprocessing a HDF5 file to the repacked format? like 20 minutes vs 2”
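A minimal sketch of the Kerchunk workflow is shown below: scan one HDF5 granule to build a reference set, then open it through the Zarr machinery. The S3 URL is a placeholder, and the exact group path and decoding options may need adjustment for ATL03's dimension handling.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

S3_URL = "s3://example-bucket/ATL03/ATL03_example.h5"  # placeholder

# Step 1: scan the HDF5 file once and record the byte range of every chunk.
with fsspec.open(S3_URL, "rb", anon=False) as f:
    references = SingleHdf5ToZarr(f, S3_URL).translate()

# Step 2: expose the references as a read-only Zarr store and open it with xarray.
ref_fs = fsspec.filesystem("reference", fo=references, remote_protocol="s3")
ds = xr.open_dataset(
    ref_fs.get_mapper("gt2l/heights"),  # one beam's photon-rate group
    engine="zarr",
    consolidated=False,
)
print(ds)
```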
Level of Effort estimates for repacking/rechunking HDF5 & Level of Effort to update i/o clients
- Change the h5 library(?) flags, re-run tests to ensure no breaking changes, etc.
- Estimate effort to run h5repack for all ICESat-2 data
- Would need the Cumulus team to provide an estimate of what adding h5repack into the ingest stream would involve. The command itself is very simple (see the sketch below).
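For illustration, an h5repack invocation is sketched below, wrapped in Python for consistency with the other sketches. The file-space strategy, page size, and chunk values are placeholders, not the project's recommended settings.

```python
import subprocess

# Illustrative h5repack invocation: rewrite a granule with the PAGE file-space
# strategy (consolidated metadata pages) and a different chunk size for one
# photon-rate variable. File names and values are placeholders.
subprocess.run(
    [
        "h5repack",
        "-S", "PAGE",                              # file-space strategy: paged aggregation
        "-G", "8388608",                           # 8 MiB file-space page size
        "-l", "/gt2l/heights/h_ph:CHUNK=100000",   # example per-dataset re-chunking
        "ATL03_example.h5",                        # input granule (placeholder name)
        "ATL03_example_repacked.h5",
    ],
    check=True,
)
```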