Manifest storage transformer · Issue #287 · zarr-developers/zarr-specs · GitHub

This issue describes a concept for a Zarr v3 Storage Transformer that enables generic indirection between Zarr keys and the names of the underlying objects in a store. It is not a new idea (see below), but this design is meant to cover a broader set of use cases.

Goals

Enable content-addressable storage schemes (see "Content-addressable storage transformer (v3 protocol extension)", #82, for an early proposal)
Enable stores that reference bytes created outside Zarr (e.g. Kerchunk)
Enable static snapshots of stores (see "Beyond consolidated metadata for V3: inspiration from Apache Iceberg", #154)
Enable concatenating multiple arrays without copying any chunk keys (see the comment on "Refactor MultiZarrToZarr into multiple functions", fsspec/kerchunk#377)
Enable creating Zarr stores that are a mix of "reference" arrays (i.e. Kerchunk) and native Zarr arrays

Design

There has been a lot written on this subject already (see issues linked above) so I'm going to attempt to jump straight into the design. The key difference between this design and prior proposals is that the manifest will be local to the Array. The reason for this is to increase the scalability, portability, and composability of the manifest concept.

Store layout

The manifest store layout will resemble that of a regular Zarr V3 store. Consider the following directory store representation:

a/zarr.json  <- group metadata
a/foo/zarr.json  <- array metadata
a/foo/manifest.json <- array manifest 
...
b/baz/zarr.json <- array metadata
b/baz/c/1/1 <- "regular" chunk
...

Note: array a/foo is a manifest array but array b/baz is a regular zarr array.

Array metadata

Manifest style arrays will need to declare a storage transformer configuration:

{
  "node_type": "array",
  ...
  "storage_transformers": [
    {
      "name": "chunk-manifest-json",
      "configuration": {
        "manifest": "./manifest.json"
      }
    }
  ]
}

Note: small manifests could also be inlined directly into the array metadata object.
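
A minimal sketch of how a reader might tell a manifest array from a regular one, using only the metadata shown above. The function name manifest_key is hypothetical, and resolving "./manifest.json" relative to the array's own path is an assumption based on the a/foo/manifest.json layout; the proposal does not spell out the resolution rule.

```python
import json
import posixpath

MANIFEST_TRANSFORMER = "chunk-manifest-json"


def manifest_key(array_path, array_metadata):
    """Return the store key of the array's manifest, or None for a regular array."""
    for transformer in array_metadata.get("storage_transformers", []):
        if transformer.get("name") == MANIFEST_TRANSFORMER:
            relative = transformer["configuration"]["manifest"]
            # Assumption: the manifest path is relative to the array's own prefix.
            return posixpath.normpath(posixpath.join(array_path, relative))
    return None


metadata = json.loads("""
{
  "node_type": "array",
  "storage_transformers": [
    {"name": "chunk-manifest-json", "configuration": {"manifest": "./manifest.json"}}
  ]
}
""")

foo_key = manifest_key("a/foo", metadata)                 # manifest array
baz_key = manifest_key("b/baz", {"node_type": "array"})   # regular array
```

Here a/foo resolves to a/foo/manifest.json, while b/baz has no transformer and falls through to ordinary chunk keys.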

Manifest object

In my example above, the array a/foo includes a manifest object (a/foo/manifest.json) which will store the mapping of chunk keys to keys in the store:

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
}

path would be the only required key. Other keys such as offset, length, or checksum could be added to a) tell the store how to fetch the chunk's bytes or b) give the store additional metadata about the chunk.

Note 1: Kerchunk also supports inline data in place of the path. That could also be supported here.
Note 2: I'm using JSON as a manifest type here, but many other options exist, including Parquet or even Zarr arrays.
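
To make the lookup concrete, here is a rough sketch of how a store could turn a chunk key into a fetch instruction using the manifest above. The helper names resolve_chunk and range_header are hypothetical. Treating an entry without offset/length as "read the whole object" and a missing key as "chunk not present" are assumptions, not part of the proposal.

```python
manifest = {
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}


def resolve_chunk(manifest, chunk_key):
    """Map a chunk key to (path, inclusive byte range), or None if absent."""
    entry = manifest.get(chunk_key)
    if entry is None:
        return None  # assumption: a missing key means the chunk was never written
    offset = entry.get("offset")
    length = entry.get("length")
    if offset is None or length is None:
        return entry["path"], None  # assumption: no range means the whole object
    return entry["path"], (offset, offset + length - 1)


def range_header(byte_range):
    """Format an inclusive byte range as an HTTP Range header value."""
    start, end = byte_range
    return f"bytes={start}-{end}"


path, byte_range = resolve_chunk(manifest, "0.0.1")
header = range_header(byte_range)
```

For chunk "0.0.1" this yields the object s3://bucket/foo.nc and the byte range 200 through 299.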

Concatenating arrays:

Edit: Feb 6 7:20p PT - After thinking about this more, I'm beginning to think that serializing concatenated arrays is a trickier problem than the initial iteration here should address. The main tricky bit is how to combine arrays with compatible dtypes/shapes/chunks but differing codecs. My original ideas are in the details below, but consider them withdrawn from the proposal for now.

Details

One of the goals above is to enable concatenating multiple Zarr arrays. The manifest approach supports a zero-copy way to achieve this. The concept closely resembles the approach of Kerchunk's MultiZarrToZarr (https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), except that it targets individual arrays and could be made to work with any Zarr array (not just Kerchunk references). The idea is that arrays can be concatenated in Zarr, provided a set of constraints is met, by simply rewriting the keys. Implementations could provide an API for doing this concatenation like:

import zarr

# Proposed API sketch: zarr.concatenate does not exist yet; store_a, store_b, and store_c are assumed to be open stores.
arr_a: zarr.Array = zarr.open(store_a, path='foo')  # shape=(10, 4, 5), chunks=(2, 4, 5)
arr_b: zarr.Array = zarr.open(store_b, path='bar')  # shape=(6, 4, 5), chunks=(2, 4, 5)
arr_ab: zarr.Array = zarr.concatenate([arr_a, arr_b], axis=0, store=store_c)  # shape=(16, 4, 5), chunks=(2, 4, 5)

In this example, zarr.concatenate would act similarly to numpy.concatenate, returning a new zarr.Array object after creating the new manifest in store_c. This could also be done in two steps by adding a save_manifest method to Zarr arrays.
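
The key rewriting itself can be sketched on plain manifest dictionaries. This is only an illustration under stated assumptions: concatenate_manifests is a hypothetical helper, the a.nc/b.nc paths are placeholders, all inputs are assumed to share chunk shape and codecs, and every array except the last is assumed to be a whole number of chunks long along the concatenation axis. It does not address the codec-mismatch problem noted in the edit above.

```python
def concatenate_manifests(manifests, shapes, chunks, axis=0, sep="."):
    """Merge manifests by shifting each chunk index along `axis`; no bytes are copied."""
    merged = {}
    chunk_offset = 0
    last = len(manifests) - 1
    for i, (manifest, shape) in enumerate(zip(manifests, shapes)):
        # A partial trailing chunk in the middle would misalign every later chunk.
        if i != last and shape[axis] % chunks[axis] != 0:
            raise ValueError("only the last array may end in a partial chunk")
        for key, entry in manifest.items():
            index = [int(part) for part in key.split(sep)]
            index[axis] += chunk_offset
            merged[sep.join(map(str, index))] = entry
        # Number of chunks this array contributes along the axis (ceiling division).
        chunk_offset += -(-shape[axis] // chunks[axis])
    return merged


# Placeholder manifests matching the shapes in the example above.
manifest_a = {f"{i}.0.0": {"path": "s3://bucket/a.nc", "offset": 100 * i, "length": 100}
              for i in range(5)}  # shape=(10, 4, 5), chunks=(2, 4, 5) -> 5 chunks
manifest_b = {f"{i}.0.0": {"path": "s3://bucket/b.nc", "offset": 100 * i, "length": 100}
              for i in range(3)}  # shape=(6, 4, 5), chunks=(2, 4, 5) -> 3 chunks

manifest_ab = concatenate_manifests(
    [manifest_a, manifest_b],
    shapes=[(10, 4, 5), (6, 4, 5)],
    chunks=(2, 4, 5),
    axis=0,
)
```

The chunks of arr_b keep their original path/offset/length entries and are simply re-keyed from 0.0.0 through 2.0.0 to 5.0.0 through 7.0.0.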

Possible extensions

I've tried very hard to keep the scope of this as small as possible. There are currently few v3 storage transformers to emulate so I think the best next step is to try out this simple approach before spending too much time on a spec or elaborating on future options. That said, there are some obvious ways to extend this:

Supporting writes to manifest arrays (possible, but there are many edge cases to consider)
Enable content addressable storage by hashing keys during writes
Support non-JSON manifests (many options)
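
As a rough illustration of the content-addressable extension, one possible reading is that a write hashes the chunk bytes, stores them under the digest, and records the digest in the manifest. Everything here is an assumption for illustration: write_chunk is a hypothetical helper, SHA-256 is an arbitrary choice, and an in-memory dict stands in for the object store.

```python
import hashlib


def write_chunk(objects, manifest, chunk_key, data):
    """Store `data` under its SHA-256 digest and point the chunk key at it."""
    digest = hashlib.sha256(data).hexdigest()
    objects.setdefault(digest, data)  # identical bytes are stored only once
    manifest[chunk_key] = {"path": digest, "length": len(data)}
    return digest


objects = {}   # stand-in for the underlying object store
manifest = {}  # chunk key -> manifest entry

write_chunk(objects, manifest, "0.0.0", b"\x00" * 100)
write_chunk(objects, manifest, "0.0.1", b"\x00" * 100)  # same bytes, same object
write_chunk(objects, manifest, "0.1.0", b"\x01" * 100)
```

Two chunk keys with identical bytes end up pointing at the same stored object, which is the deduplication property such a scheme is after.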

Props

🙌 to those that have done a great job pushing this subject forward already: @martindurant, @alimanfoo, @rabernat among others.
