HDF/NetCDF in the cloud #251

betolink · 2023-06-16T17:06:24Z

betolink
Jun 16, 2023
Maintainer

Based on the latest discussions on Openscapes about HDF in the cloud, I think there should be a more in-depth study of the state of things, in a way a "follow up" to this post from Matt Rocklin https://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

earthaccess uses fsspec for IO but the HDF libraries usually used to read these files(h5py, pyhdf) were not designed for concurrent access via python file-like objects and IO latency for data in the cloud is really problematic. Here is a list of some of the access patterns used for HDF data in remote storage systems:

HDF + h5py (via fsspec) -> this is what earthaccess uses, it works but is really slow(not necessarily fsspec's fault), see: h5py sequential reads in the cloud it may also run into deadlock issues when multiple files are accessed.
HDF + Kerchunk -> seems to work well for gridded NetCDF files, it would be great to have example notebooks and see if it can handle more complex datasets like Icesat's ATL03
HDF + native S3 driver -> doesn't seem to work for NASA data and uses sequential reads too.
HDF + H5Coro -> Only for a subset of HDF files, ~~no stand-alone client library~~ new development!!, H5Coro has a stand alone client and could be a great drop-in replacement for H5Py on certain cases
HDF + OpenDAP = it may work but in practice we delegate the IO and file conversion to OpenDAP and of course we need server side services running. Not all data is supported and/or backed by OpenDAP.

Ideally we could come up with a better understanding of these access patterns: their capabilities, bottlenecks and potential workarounds.

weiji14 · 2023-08-13T21:55:57Z

weiji14
Aug 13, 2023

There's been some work done during the ICESat-2 Hackweek 2023 to benchmark reading ICESat-2 ATL03 data from an s3 bucket using different libraries and file formats. See:

Repository - https://github.com/ICESAT-2HackWeek/h5cloud
Final presentation - https://docs.google.com/presentation/d/1Hsq1W_X_kbL80tDk8XSozqn1OMcxuv7W8aCZZ2n1ro0/edit#slide=id.g267a8d198f4_1_6

Preliminary results from ICESAT-2HackWeek/h5cloud@1f34411:

Observations:

The h5repack format (still HDF5, but optimized chunks) may be faster for newer cloud-optimized libraries like h5coro and kerchunk, but slower for 'classic' libraries like h5py (though it may be due to the rechunking scheme of the h5repack files we tested).
H5coro and xarray-kerchunk both look like fast and promising areas to continue development. Those libraries are still 'young' though, and generally needs more work to support more HDF5 data structures (e.g. 2D variables for h5coro and nested groups in kerchunk are still unsupported?).

Caveats: Note that the above ATL03 dataset is a non-gridded point-cloud-like dataset, and may not apply to other HDF5 data structures, so be careful when generalizing this to all HDF5 data files.

Next steps

Some of these are outlined by @abarciauskas-bgse at ICESAT-2HackWeek/h5cloud#18:

Cloud-optimized data formats:
- Complete benchmarking of various file formats like kerchunk, h5repack, Zarr, COPC, etc
Libraries:
- 2D table-based:
  - gedi_subsetter's H5DataFrame (subclassed from pandas.DataFrame) that uses h5py as the driver, worked on by @chuckwondo et al. Could possibly be extracted as standalone library from GEDI.
- n-dimensional:
  - Investigate adding new NetCDF4/HDF5 backends to xarray besides the current h5netcdf:
    - h5coro (C++ based) - WIP by @jpswinski and @rwegener2 at https://github.com/ICESAT-2HackWeek/xarray (see the h5coro-backend- branches)
    - hidefix (Rust-based) - @gauteh is having an ongoing discussion at Parallel + multi-threaded reading of NetCDF4 + HDF5: Hidefix! pydata/xarray#7446

9 replies

abarciauskas-bgse Aug 15, 2023

@alex-s-gardner As far as I am aware (although I haven't dug into the underlying library modules used, xarray, h5coro, gedi_subsetter and h5py) no multithreading was used. I do know that the tests were run on a cryocloud JupyterHub instance with an allocated 32-40GB Memory and 4-8 CPU.

rwegener2 Aug 15, 2023

Just wanted to add that h5coro implements multithreading under the hood. That is one of the two major advantages it has over h5py. The other is that it caches the structure of the file as it traverses it, I suppose building a sort of "on the fly" internal kerchunk.

gauteh Aug 15, 2023

Hidefix also does parallell reading, as well as some parallell decoding (which turns out to be a significant part of xarray load time).

martindurant Aug 15, 2023

(there is also talk about parallelising decoding in zarr/numcodecs, particularly for GPUs, but this is not yet ready; obviously this is less relevant in dask workflows)

betolink Sep 12, 2023
Maintainer Author

Thanks to @jonm3D we now have a tool to rapidly visualize HDF/NetCDF file's structure https://github.com/jonm3D/h5xray. This tool will be very handy when we get to benchmark access to HDF data in cloud storage.

betolink · 2024-02-06T15:59:59Z

betolink
Feb 6, 2024
Maintainer Author

Cloud optimized HDF5 works!

Well, seems like cloud-optimized HDF5 is indeed possible and the mixed numbers we initially got were the result of not understanding the default IO and caching behavior of h5py and fsspec. With a better understanding of these mechanisms, we are getting consistently better performance regardless of the file size.

There are some caveats, h5py and fsspec need to "know" in advance if we are dealing with a cloud optimized HDF5 and if so, tune their behavior to align it with the chunking strategy used to repack the files in question.
The main caveat of course is that reprocessing entire collections is something that if it happens, will not happen in the short term but these numbers give me hope.

Exhibit A)

Using h5py to calculate a mean (~47000000 data points) in a 7GB file in-region (us-west-2)

Observations

h5py is really efficient once it has the file metadata, the performance gain is not dramatic here but if we look the total requests... there is definitely a monetary factor for the data provider.
Total data transfer could be a bit inaccurate when we use fsspec as some of it comes from the cache, we need a mechanism to separate cache hits vs real data transfers in the logs. Or verify that cache hits are not counted in the reads.

Exhibit B)

Doing the same operation using Xarray, in-region access on the same 7GB file

Observations

Xarray does by default a lot of CF decoding and that touches a lot of the files. We could turn this off but we don't want to figure out how to interpret each variable (scale, fill etc etc).
The overhead of decoding metadata is very noticeable when we open one of the ATL03 granules.
Using cloud optmized files helps even more in the Xarray access pattern since a lot of the data needed for the decoding needs the metadata structures.

Now let's see what happens when we try to do the same operation over 2 files but this time out of region...

Exhibit C)

Out of region access

Observations

This scenario is not possible with the S3 distribution for NASA data (restricted to in-region access). If not throttled this would be similar to the performance on HTTPS through NASA's CloudFront.
This is what cloud-optimized HDF5 could bring in performance and cost savings for the data providers with data in S3, remember each GET request to a bucket costs >0.0 USD.
We are using h5py which we know is not multi-threaded, there is room for improvement and if the HDF group gets enough support to keep modernizing the library, we could have a stack in the no distant future where cloud-optimized HDF5 will work the same way COGs work in the raster world.
If NISAR and ICESat-2 start producing cloud-optimized datasets, I see the case for earthaccess.smart_open() where the library will inspect and align the IO parameters to the strategy in these datasets.

We are working on a set of recommendations specifically for ICESat-2 data and we'll share more findings once we have the document ready.

6 replies

martindurant Feb 6, 2024

why a true Zarr store is better (there is only one copy of the metadata). Important metadata is replicated in each file with no consistency guarantees. So this complexity leaks in to kerchunk too.

Well, kerchunk hopes to decode once, so that reading the created reference set is one-step and doesn't access the constituent data files. Yes, we've had to spend quite some effort getting concatenation in the presence of different CF time "units" fields right.

dcherian Feb 6, 2024

Yes certainly kerchunk is a major performance enhancement, but I'm hoping to convince our data center friends here that even if you can, perhaps you shouldn't when it comes to HDF in the cloud.

martindurant Feb 6, 2024

Of course, if straight-to-zarr is an option, that's different. Or using normal time units. Or at the very least having a consistent epoch... Data centres do still seem to like using the most established format and conventions, and also the ability to just download some part of the data as self-contained, self-describing files.

betolink Feb 6, 2024
Maintainer Author

Hopefully this was not interpreted as a critique to Xarray decoding, I think the overall objective of this topic is to understand different access patterns and their tradeoffs when it comes to accessing archival data in the cloud.

I see cloud-optimized HDF5 as a multiplier for approaches like kerchunk or even pangeo-forge pipelines (if they have to transform HDF to Zarr). If used, generating the references for kerchunk can benefit from already consolidated metadata in the HDF5 file, anecdotally I noticed a ~5x improvement (PR coming to kerchunk). I also wonder if we should copy/follow up this valuable conversation on this thread in Pangeo

martindurant Feb 7, 2024

PR coming to kerchunk

Looking forward to it :)

zequihg50 · 2024-03-20T14:56:12Z

zequihg50
Mar 20, 2024

Hi,

Great results! Have you attempted to use xarray and dask with the processes scheduler? I'm experimenting with this, but for some unknown reason, the volume of data (Total Req) is huge, leading to longer processing times. For instance:

import xarray
import dask
import fsspec

dask.config.set(scheduler="processes")

URL = "..."

fs = fsspec.open(URL, cache_type="blockcache", block_size=8*1024*1024)

ds = xarray.open_dataset(
    fs.open(),
    engine="h5netcdf",
    driver_kwds={
        "page_buf_size": 32*1024*1024,
        "rdcc_nbytes": 8*1024*1024
    }).chunk({...}) # align with file chunking

ds["variable"].mean().compute(num_workers=8)

4 replies

martindurant Mar 20, 2024

The best caching scheme for HDF5 is usually "first", especially if you set the block size big enough to contain all front matter of the file. Of course, the processes cannot share this in-memory cache, so it may well not be doing too much that is helpful for you. You might also want to try "blockcache::" prepended to your URL for (shared) on-disk block caching, but you would need to create your store directory first (cache_storage= kwarg to the caching filesystem).

betolink Mar 20, 2024
Maintainer Author

the process cannot share the cache

Maybe this is what's happening, I'd try using Martin's suggestion for a shared cache on disk. Another thing, the blockcache works better if the files were cloud optimized first and the page size aligns with that of the cache, I'm not sure if the files you're opening have that. If they are public maybe we can test some of these approaches.

zequihg50 Mar 21, 2024

Thank you both for the insightful comments. Indeed, utilizing the "first" caching scheme improves things, but the volume of data read remains huge when employing Dask processes. I will experiment with the disk-shared cache and provide feedback if I make any progress. While HDF5 paged aggregation notably enhances performance, it appears from the numbers you've presented that there is a substantial increase in Total Req Bytes for all Exhibits, which can be exacerbated in some cases, such as process parallelism. Anyway, a lot of interesting stuff happening :)

betolink Mar 27, 2024
Maintainer Author

@zequihg50 I forgot to mention, the total requested bytes in the graph is a bit misleading because fsspec logs do not track cache hits (PR coming!) meaning the actual requested bytes is way less than the ones in the charts I posted.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDF/NetCDF in the cloud #251

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 19 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

HDF/NetCDF in the cloud #251

betolink Jun 16, 2023 Maintainer

Replies: 3 comments · 19 replies

Next steps

betolink Sep 12, 2023 Maintainer Author

betolink Feb 6, 2024 Maintainer Author

Cloud optimized HDF5 works!

Exhibit A)

Exhibit B)

Exhibit C)

betolink Feb 6, 2024 Maintainer Author

betolink Mar 20, 2024 Maintainer Author

betolink Mar 27, 2024 Maintainer Author

betolink
Jun 16, 2023
Maintainer

Replies: 3 comments 19 replies

betolink Sep 12, 2023
Maintainer Author

betolink
Feb 6, 2024
Maintainer Author

betolink Feb 6, 2024
Maintainer Author

betolink Mar 20, 2024
Maintainer Author

betolink Mar 27, 2024
Maintainer Author