files (netCDF, HDF5, GRIB, …)
- Want to put them on the cloud
- Zarr is a much better format
  - Separation between data and metadata
  - Scalable
  - Cloud-optimized access patterns
- But data providers don't want to change formats
- Ideally avoid data duplication
Kerchunk currently:
- Finds byte ranges using e.g. SingleHdf5ToZarr
- Represents them as a nested dict in memory
- Combines dicts using MultiZarrToZarr
- Writes them out in kerchunk reference format (JSON/Parquet)
- Result is sidecar files which behave like a Zarr store when read through fsspec
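To make the byte-range idea concrete, here is a minimal standard-library sketch. It is not kerchunk's own code: `read_chunk` is a hypothetical helper, the temporary file stands in for a real netCDF/HDF5 file, and the dict only imitates the shape of a version 1 reference set (store key mapped to either inline text or `[path, offset, length]`).

```python
import json
import os
import tempfile


def read_chunk(refs, key):
    """Resolve one store key against a reference set (hypothetical helper)."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        # Inline content, typically Zarr metadata stored as JSON text
        return entry.encode()
    # Otherwise the entry points at a byte range inside another file
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an archival file: a 16-byte header followed by two 8-byte chunks
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"H" * 16 + b"A" * 8 + b"B" * 8)
    legacy_path = tmp.name

# The references are the only new artifact; the original bytes are never copied
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/0": [legacy_path, 16, 8],
        "temp/1": [legacy_path, 24, 8],
    },
}

chunk0 = read_chunk(refs, "temp/0")
chunk1 = read_chunk(refs, "temp/1")
os.remove(legacy_path)
```

In the real workflow the offsets come from scanning the file and the reads go through fsspec, but the lookup has the same shape: a store key resolves to a slice of an existing file.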
- Store-level abstractions make many operations hard to express
- MultiZarrToZarr is bespoke and overloaded
- In-memory dict representation is complicated, bespoke, and inefficient
- Output files are not true Zarr stores
  - Can only be understood by fsspec (i.e. currently only in Python…?)
ZEP, then implement reading arbitrary byte ranges in Zarr readers
- Means virtual Zarr stores that can be read in any language
- Opens the door to e.g. JavaScript visualization frameworks pointing at netCDF files…
- New type of Zarr store containing chunk manifest.json files
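As a rough illustration of what a per-array chunk manifest could hold, the sketch below regroups kerchunk-style references into one mapping per array. The entry layout (`path`, `offset`, `length` keyed by chunk index) is an assumption modeled on the proposal, not a finalized specification, and `to_manifests`, the bucket URLs, and the byte offsets are all made up for the example.

```python
import json


def to_manifests(refs):
    """Group chunk references by array (hypothetical helper, assumed layout)."""
    manifests = {}
    for key, entry in refs.items():
        if isinstance(entry, str):
            # Inline metadata such as .zgroup or .zarray, not a chunk reference
            continue
        # "temp/0.1" splits into array name "temp" and chunk key "0.1"
        array_name, _, chunk_key = key.rpartition("/")
        path, offset, length = entry
        manifests.setdefault(array_name, {})[chunk_key] = {
            "path": path,
            "offset": offset,
            "length": length,
        }
    return manifests


# Illustrative references only: URLs and offsets are invented
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/0.0": ["s3://bucket/a.nc", 4096, 1024],
    "temp/0.1": ["s3://bucket/b.nc", 4096, 1024],
}

manifests = to_manifests(refs)
# What a manifest.json for the "temp" array might contain under this assumption
manifest_json = json.dumps(manifests["temp"], indent=2)
```

Because each entry is plain JSON, a reader in any language would only need to issue a byte-range request for `offset` and `length` against `path`, which is what removes the dependency on fsspec.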
cool idea!
- VirtualiZarr package exists as alternative to kerchunk
- Some rough edges but progressing quickly
- Can be used today to write kerchunk-format references
- Uses xarray API so should be intuitive
- Plan is to upstream sidecar formats as Zarr enhancements

Go try it! https://github.com/TomNicholas/VirtualiZarr