This issue describes a concept for a Zarr v3 Storage Transformer to enable generic indirection between the Zarr keys and the names of the underlying objects in a store. It is not a new idea (see below) but this design is meant to cover a broader set of use cases.
Goals
There has been a lot written on this subject already (see issues linked above) so I'm going to attempt to jump straight into the design. The key difference between this design and prior proposals is that the manifest will be local to the Array. The reason for this is to increase the scalability, portability, and composability of the manifest concept.
Store layout
The manifest store layout will resemble that of a regular Zarr V3 store. Consider the following directory store representation:
a/zarr.json <- group metadata
a/foo/zarr.json <- array metadata
a/foo/manifest.json <- array manifest
...
b/baz/zarr.json <- array metadata
b/baz/c/1/1 <- "regular" chunk
...
Note: array a/foo is a manifest array but array b/baz is a regular Zarr array.
Manifest style arrays will need to declare a storage transformer configuration:
{
  "node_type": "array",
  ...
  "storage_transformers": [
    {
      "name": "chunk-manifest-json",
      "configuration": {
        "manifest": "./manifest.json"
      }
    }
  ]
}
Note: small manifests could also be inlined directly into the array metadata object.
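As a rough illustration of how a reader might tell a manifest array from a regular one, here is a minimal sketch in Python. The helper name `manifest_key` is hypothetical, and the sketch assumes the `manifest` value is resolved relative to the array's own prefix (which is what the `a/foo/manifest.json` layout above suggests, but is not specified anywhere yet).

```python
import json
import posixpath
from typing import Optional

# Hypothetical helper, not part of any Zarr implementation.
def manifest_key(array_path: str, metadata: dict) -> Optional[str]:
    """Return the store key of the array's manifest, or None for a regular array."""
    for transformer in metadata.get("storage_transformers", []):
        if transformer.get("name") == "chunk-manifest-json":
            relative = transformer["configuration"]["manifest"]
            # Assumption: the manifest path is relative to the array prefix.
            return posixpath.normpath(posixpath.join(array_path, relative))
    return None

metadata = json.loads("""
{
  "node_type": "array",
  "storage_transformers": [
    {"name": "chunk-manifest-json",
     "configuration": {"manifest": "./manifest.json"}}
  ]
}
""")

print(manifest_key("a/foo", metadata))
print(manifest_key("b/baz", {"node_type": "array"}))
```

Under these assumptions the first call resolves to the manifest next to `a/foo/zarr.json`, and the second returns `None` because `b/baz` declares no storage transformer.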
Manifest object
In my example above, the array a/foo includes a manifest object (a/foo/manifest.json) which will store the mapping of chunk keys to keys in the store:
{
  "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
  "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
  "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
}
path would be the only required key. offset, length, checksum, etc. could all be added as optional keys to a) inform the store how to fetch bytes from the chunk or b) provide the store with additional metadata about the chunk.
Note 1: Kerchunk also supports inline data in place of the path. That could also be supported here.
Note 2: I'm using JSON as a manifest type here, but many other options exist, including Parquet or even Zarr arrays.
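To make the lookup concrete, here is a minimal sketch of how a store might turn a chunk key into a location to read. The function names `resolve_chunk` and `range_header` are hypothetical, the sketch only handles the `path`/`offset`/`length` keys from the example manifest, and it treats a key that is absent from the manifest as a missing chunk.

```python
import json
from typing import Optional, Tuple

MANIFEST_JSON = """
{
  "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100}
}
"""

ByteRange = Tuple[int, int]  # inclusive (first_byte, last_byte)

# Hypothetical helper: map a chunk key to (path, byte range or None).
def resolve_chunk(manifest: dict, chunk_key: str) -> Optional[Tuple[str, Optional[ByteRange]]]:
    entry = manifest.get(chunk_key)
    if entry is None:
        return None  # assumption: an absent key means the chunk is missing
    path = entry["path"]  # the only required key
    if "offset" in entry and "length" in entry:
        start = entry["offset"]
        return path, (start, start + entry["length"] - 1)
    return path, None  # no range given: read the whole object

# Hypothetical helper: format an inclusive byte range as an HTTP Range header value.
def range_header(byte_range: ByteRange) -> str:
    first, last = byte_range
    return f"bytes={first}-{last}"

manifest = json.loads(MANIFEST_JSON)
path, byte_range = resolve_chunk(manifest, "0.0.1")
print(path, range_header(byte_range))
```

An entry with only `path` falls through to a whole-object read, which matches the idea that `offset` and `length` are optional hints rather than required fields.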
Edit: Feb 6 7:20p PT - After thinking about this more, I'm beginning to think serialization of concatenated arrays is a trickier problem than should be addressed in the initial iteration here. The main tricky bit is how to combine arrays with compatible dtypes/shapes/chunks but with differing codecs. Details from my original ideas below but consider this redacted from the proposal for now.
Details
One of the goals above is to enable concatenating multiple Zarr arrays. The manifest approach supports a zero-copy way to achieve this. The concept here closely resembles the approach from [Kerchunk's MultiZarrToZarr](https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), except that it targets individual arrays and could be made to work with any Zarr array (not just Kerchunk references). The idea is that concatenating arrays can be done in Zarr, provided a set of constraints are met, by simply rewriting the keys. Implementations could provide an API for doing this concatenation like:
arr_a: zarr.Array = zarr.open(store_a, path='foo')  # shape=(10, 4, 5), chunks=(2, 4, 5)
arr_b: zarr.Array = zarr.open(store_b, path='bar')  # shape=(6, 4, 5), chunks=(2, 4, 5)
arr_ab: zarr.Array = zarr.concatenate([arr_a, arr_b], axis=0, store=store_c)  # shape=(16, 4, 5), chunks=(2, 4, 5)
In this example, zarr.concatenate would act similarly to numpy.concatenate, returning a new zarr.Array object after creating the new manifest in store_c. This could also be done in two steps by adding a save_manifest method to the Zarr arrays.
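The key rewriting itself could look something like the sketch below. This is only an illustration working on plain manifest dicts: `concatenate_manifests` is a hypothetical function, the `a.nc`/`b.nc` paths are made up, keys are assumed to be dot-separated chunk indices as in the manifest example, and the one constraint it checks is that every array except the last covers a whole number of chunks along the concatenation axis. It does not address differing codecs.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical sketch: merge manifests by shifting chunk indices along `axis`.
def concatenate_manifests(
    manifests: List[Dict[str, dict]],
    shapes: List[Sequence[int]],
    chunks: Sequence[int],
    axis: int = 0,
) -> Dict[str, dict]:
    combined: Dict[str, dict] = {}
    chunk_offset = 0  # number of chunks already placed along `axis`
    last = len(manifests) - 1
    for i, (manifest, shape) in enumerate(zip(manifests, shapes)):
        if i != last and shape[axis] % chunks[axis] != 0:
            raise ValueError("only the last array may end with a partial chunk")
        for key, entry in manifest.items():
            index = [int(part) for part in key.split(".")]
            index[axis] += chunk_offset
            combined[".".join(str(n) for n in index)] = entry
        chunk_offset += math.ceil(shape[axis] / chunks[axis])
    return combined

# Made-up manifests matching the shapes in the example above.
manifest_a = {f"{i}.0.0": {"path": "s3://bucket/a.nc", "offset": 100 * i, "length": 100} for i in range(5)}
manifest_b = {f"{i}.0.0": {"path": "s3://bucket/b.nc", "offset": 100 * i, "length": 100} for i in range(3)}

merged = concatenate_manifests(
    [manifest_a, manifest_b],
    shapes=[(10, 4, 5), (6, 4, 5)],
    chunks=(2, 4, 5),
    axis=0,
)
print(sorted(merged))
```

No chunk bytes are read or copied here; only the keys change, which is what makes the approach zero-copy.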
I've tried very hard to keep the scope of this as small as possible. There are currently few v3 storage transformers to emulate so I think the best next step is to try out this simple approach before spending too much time on a spec or elaborating on future options. That said, there are some obvious ways to extend this:
🙌 to those that have done a great job pushing this subject forward already: @martindurant, @alimanfoo, @rabernat among others.