Cloud Optimized Format Investigation for ICESat‐2: HDF5 Study Results and Recommendations
This wiki serves as a living document for the ICESat-2 cloud-optimized format investigation and ATL03 benchmarking activities, as part of the 2023 ICESat-2 Hackweek.
- Pull content from https://github.com/nsidc/uwg_2023_cloud_formats and the Hackweek presentation
- While the initial ICESat-2 Hackweek project included benchmarking of several cloud-optimized formats, including GeoParquet and FlatGeobuf, the investigation shifted its focus to cloud optimizations for HDF5.
Descriptions of formats
Current chunking strategy for the most relevant variables in ATL03. The numbers were taken from an ATL03 granule with high data density over the Antarctic Peninsula; see the data selection notebook.
| dataset | data_type | ~points / ~km covered (max density) | size on disk (bytes) | chunks | chunk shape | avg. chunk size |
|---|---|---|---|---|---|---|
| /gtx/heights/h_ph | float (4 bytes) | 5000 / 2 | 160509659 | 4584 | (10000,) | 35 kB |
| /gtx/heights/lat_ph | float (4 bytes) | 5000 / 2 | 214649295 | 4584 | (10000,) | 46 kB |
| /gtx/heights/lon_ph | float (4 bytes) | 5000 / 2 | 232415912 | 4584 | (10000,) | 50 kB |
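The chunking numbers above can be reproduced directly with `h5py`. The sketch below is a minimal example, assuming a local copy of an ATL03 granule; the file name and beam group are placeholders.

```python
import h5py

# Placeholder file name and beam group; substitute any ATL03 granule / ground track.
GRANULE = "ATL03_example.h5"
VARIABLES = ["gt2l/heights/h_ph", "gt2l/heights/lat_ph", "gt2l/heights/lon_ph"]

with h5py.File(GRANULE, "r") as f:
    for name in VARIABLES:
        dset = f[name]
        n_chunks = -(-dset.shape[0] // dset.chunks[0])  # ceiling division
        stored = dset.id.get_storage_size()             # compressed bytes on disk
        print(
            f"{name}: dtype={dset.dtype}, chunk shape={dset.chunks}, "
            f"chunks={n_chunks}, size on disk={stored} bytes, "
            f"avg chunk size={stored / n_chunks / 1024:.0f} kB"
        )
```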
Describe access tools
We focused on benchmarking in-region access to HDF5 ICESat-2 ATL03 granules stored in AWS S3 buckets in the AWS region us-west-2, which is where NASA data are being stored. The ICESat-2 ATL03 product is Global Geolocated Photon Data (10.5067/ATLAS/ATL03.005) and contains the heights above the ellipsoid, times, latitudes, and longitudes, along with quality control and ancillary data, of individual photons emitted and detected by the ATLAS instrument on board ICESat-2. Each granule file contains estimated heights for photons from pairs of beams for three reference ground tracks. Photons for each beam have a nominal spacing of 0.7 m. File sizes are of the order of 2 GB but can range from 40 MB to 10 GB, depending on the number of background photons detected by the sensor. Background photons are scattered towards the detector from sources outside of the beam emitted by the ATLAS instrument (e.g. from the Sun). ATL03 is a Level 2 product and the lowest-level product likely to be used on a regular basis by the science community. ICESat-2 standard science products and community-developed products are derived from ATL03 or higher-level products. The relatively large file size and common use of ATL03 make it a good test case for benchmarking cloud-optimized formats and access tools.
Benchmarks are measured for six file formats (see [Cloud Optimized Formats](link to section)) and six common access methods. Not all formats can be used with every access method. Figure X shows the benchmark tests as a matrix of file format and access method.
| | Original HDF5 | Repacked HDF5 | Kerchunk Original | Kerchunk Repacked | GeoParquet | FlatGeobuf |
|---|---|---|---|---|---|---|
| `h5py` | X | X | | | | |
| GEDI Subsetter | X | X | | | | |
| `xarray` + `h5netcdf` | X | X | X | X | | |
| `h5coro` | | | | | | |
| `geopandas` + `pyogrio`/GDAL | | | | | | |
| `geopandas` + `parquet` | | | | | | |
Each benchmark test includes in-region file access from an EC2 instance to an S3 bucket, and either calculating the mean of a whole data variable or calculating the mean of a spatial subset of a data variable. The `photon_height` variable was used. Subsetting was done by selecting a latitude range, which required also accessing the `latitude` variable.
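The two benchmark tasks can be summarized by the sketch below, written against an already-open `h5py.File`. The variable paths use the ATL03 dataset names (`h_ph`, `lat_ph`) for one assumed beam group (`gt2l`) rather than the generic `photon_height`/`latitude` labels used above.

```python
import numpy as np
import h5py

def mean_full(h5: h5py.File, beam: str = "gt2l") -> float:
    """Benchmark task 1: mean of a whole data variable (photon heights)."""
    return float(np.mean(h5[f"{beam}/heights/h_ph"][:]))

def mean_latitude_subset(h5: h5py.File, lat_min: float, lat_max: float,
                         beam: str = "gt2l") -> float:
    """Benchmark task 2: mean of a spatial subset selected by a latitude range.

    Note that the subset task reads two variables: the latitudes (to build the
    selection mask) and the photon heights themselves.
    """
    lats = h5[f"{beam}/heights/lat_ph"][:]
    mask = (lats >= lat_min) & (lats <= lat_max)
    return float(np.mean(h5[f"{beam}/heights/h_ph"][:][mask]))
```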
Test granules were selected to represent large and small file sizes. X granules over the Antarctic Peninsula were selected because the highly reflective ice and snow surfaces, along with the complex topography, result in high photon return rates. Y granules over the ocean were selected because darker, less reflective ocean surfaces have lower photon return rates.
| Granule ID | Size (GB) |
|---|---|
| _name_of_granule_ | _size_of_granule_ |
Benchmarking is implemented in Python with a set of benchmark test classes executed in a series of Jupyter notebooks on an AWS EC2 instance with N cores/threads/processes using the CryoCloud JupyterHub. Do we need other information about the hub here?
Add details here on the methodology for the i/o client investigation. Add a table of format / i/o client / environment to describe which combinations were tested?
- Ideal changes to HDF5 and i/o clients for the science use cases we prioritized and tested for ATL03.
- Consider tradeoffs and existing value/benefit of HDF5 compared to the use cases that could benefit from cloud-optimized reformatting.
Specific recommendations for ICESat-2 (just for ATL03 or for all products??)
What is the key recommendation vs what could interim solution(s) look like
- Chunking can be tuned to a specific access pattern (see the sketch below)
- Reference work that JP did: he found that larger chunks were not more performant because of the cost of decompressing each chunk
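To illustrate the chunking trade-off, the minimal sketch below counts how many chunks of a 1-D photon variable must be fetched (and fully decompressed) to read a contiguous index range. The index range and chunk lengths are illustrative only, not values from the benchmarks; the arithmetic simply shows why larger chunks mean fewer GET requests but more bytes decompressed.

```python
def chunks_touched(start: int, stop: int, chunk_len: int) -> int:
    """Number of chunks that must be fetched (and fully decompressed) to read
    photons in the half-open index range [start, stop) of a 1-D variable."""
    first_chunk = start // chunk_len
    last_chunk = (stop - 1) // chunk_len
    return last_chunk - first_chunk + 1

# Illustrative only: reading ~1 million photons out of a granule.
for chunk_len in (10_000, 100_000, 1_000_000):
    n = chunks_touched(start=2_500_000, stop=3_500_000, chunk_len=chunk_len)
    # Every touched chunk is decompressed in full, so larger chunks reduce the
    # number of requests but increase the data decompressed per request.
    print(f"chunk length {chunk_len:>9,}: {n:>4} chunks, "
          f"{n * chunk_len:,} photons decompressed")
```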
Broader lessons learned and recommendations for ICESat-2 and any other future NASA Earthdata mission with similar hierarchical structure
What is the key recommendation vs what could interim solution(s) look like
- Existing limitations and usability challenges
- Still thinking about data access in terms of a POSIX file system
- Any coordinate-based file structure needs multiple reads to access each of the variables across resolutions
- Need to traverse dataset group hierarchy to work with the data
- This is why GeoParquet is beneficial: it has a flatter structure
- Add a table or pseudocode that shows the # of reads needed to grab the decoded time or geolocation variable (see the sketch below this list); access would be faster if the variables were within the same group
- Recommendations:
- All the info you need to access /heights is contained within /heights (even at the expense of a larger file)
- We can improve things only so far with the existing file structure. Need to think about # of reads involved to decode a particular variable (use example of extracting Epoch time)
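As a starting point for the pseudocode requested above, here is a minimal sketch of decoding photon epoch time, assuming the standard ATL03 layout in which per-photon `delta_time` lives under each beam's `/heights` group while the `atlas_sdp_gps_epoch` offset lives in the top-level `/ancillary_data` group. Decoding an absolute timestamp therefore requires traversing, and reading from, two separate parts of the hierarchy.

```python
import h5py

def decoded_photon_times(h5: h5py.File, beam: str = "gt2l"):
    """Return photon times as GPS seconds, counting the reads involved.

    Read 1: per-photon delta_time from the beam's /heights group.
    Read 2: the ATLAS SDP GPS epoch offset from the top-level /ancillary_data group.
    Because the two values live in different groups, every tool must traverse the
    group hierarchy (metadata reads) and issue at least two data reads; if the
    epoch were stored alongside /heights, that one group would be self-contained.
    """
    delta_time = h5[f"{beam}/heights/delta_time"][:]          # read 1
    gps_epoch = h5["ancillary_data/atlas_sdp_gps_epoch"][0]   # read 2
    return delta_time + gps_epoch
```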
- Key recommendation: Modernize the h5 library to make use of cloud optimizations
- Long term this should live in the library - then other tools / languages would benefit beyond python workflows
- H5coro
- Can h5coro work with cloud-optimized HDF5? Yes, and it's the only engine that works better in the tests (out of the box)
- Recommendation for fsspec caching to make reporting more accurate (see the sketch below this list)
- Earthaccess future feature
- Level of effort to implement
- Recommendations for end-users
- Lowest effort (prior to longer term solutions): Tutorial for how to set flags
- End-user discovery of recommended i/o parameters
- CMR, STAC, elsewhere?
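The kind of i/o flags such a tutorial might cover is sketched below, combining fsspec/s3fs caching options with the h5py raw data chunk cache. The S3 URL and all numeric values are illustrative placeholders, not recommended settings from the benchmarks.

```python
import s3fs
import h5py

S3_URL = "s3://example-bucket/ATL03/ATL03_example.h5"  # placeholder

fs = s3fs.S3FileSystem(anon=False)

# fsspec/s3fs flags: block size and caching strategy control how many GET
# requests are issued and how much data is read per request.
s3_file = fs.open(
    S3_URL,
    mode="rb",
    block_size=4 * 1024 * 1024,   # 4 MiB per request (illustrative value)
    cache_type="blockcache",      # cache previously fetched blocks
)

# h5py flags: enlarge the raw data chunk cache so decompressed chunks are reused
# instead of being re-fetched and re-decompressed on repeated access.
h5 = h5py.File(s3_file, "r", rdcc_nbytes=64 * 1024 * 1024, rdcc_nslots=10007)
```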
Relationship between the # of reads that can be decreased through the above methods and the cost of in-region and out-of-region access
- In-region: most important to reduce the # of GET requests
- Out of region: egress is comparable, but there is still a reduced GET request rate
- Out of scope for NSIDC DAAC work but can we say anything else here? Is this where we can ask others to contribute?
- Zarr and a cloud-optimized HDF5 file are basically the same when accessed via Kerchunk (see the sketch below)
- From meeting notes: “What's the time taken to produce the kerchunk JSON file, vs reprocessing a HDF5 file to the repacked format? like 20 minutes vs 2”
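A minimal sketch of the Kerchunk workflow is shown below: scan one HDF5 granule to build a reference set, then open it through the Zarr machinery. The S3 URL is a placeholder, and the exact group path and decoding options may need adjustment for ATL03's dimension handling.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

S3_URL = "s3://example-bucket/ATL03/ATL03_example.h5"  # placeholder

# Step 1: scan the HDF5 file once and record the byte range of every chunk.
with fsspec.open(S3_URL, "rb", anon=False) as f:
    references = SingleHdf5ToZarr(f, S3_URL).translate()

# Step 2: expose the references as a read-only Zarr store and open it with xarray.
ref_fs = fsspec.filesystem("reference", fo=references, remote_protocol="s3")
ds = xr.open_dataset(
    ref_fs.get_mapper("gt2l/heights"),  # one beam's photon-rate group
    engine="zarr",
    consolidated=False,
)
print(ds)
```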
Level of Effort estimates for repacking/rechunking HDF5 & Level of Effort to update i/o clients
- Change the h5 library(?) flags, re-run tests to ensure no breaking changes, etc.
- Estimate effort to run h5repack for all ICESat-2 data
- Would need the Cumulus team to provide an estimate of what adding h5repack into the ingest stream would involve. The command itself is very simple (see the sketch below).
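For illustration, an h5repack invocation is sketched below, wrapped in Python for consistency with the other sketches. The file-space strategy, page size, and chunk values are placeholders, not the project's recommended settings.

```python
import subprocess

# Illustrative h5repack invocation: rewrite a granule with the PAGE file-space
# strategy (consolidated metadata pages) and a different chunk size for one
# photon-rate variable. File names and values are placeholders.
subprocess.run(
    [
        "h5repack",
        "-S", "PAGE",                              # file-space strategy: paged aggregation
        "-G", "8388608",                           # 8 MiB file-space page size
        "-l", "/gt2l/heights/h_ph:CHUNK=100000",   # example per-dataset re-chunking
        "ATL03_example.h5",                        # input granule (placeholder name)
        "ATL03_example_repacked.h5",
    ],
    check=True,
)
```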