
FROST (Federated Registry Of Scientific Things)

FROST is a decentralized subscribable data catalog protocol for sharing all scientific data globally.

(See the FAQ for an explanation of the GIF.)

Warning

This doesn't actually exist yet; this repo is just for brainstorming ideas. Please contribute!

Context: The best way to store and provide access to big scientific datasets is as ARCO (analysis-ready, cloud-optimised) data in S3-compatible cloud object storage. We now have scalable cloud-optimised formats that are version-controlled at rest in object storage (particularly Icechunk for arrays and Iceberg for tables). This is huge, as even dynamically-updated datasets can now be distributed via raw S3, with no other server needed. All the data providers who are paying attention are about to put their data in these formats, and then they will try to advertise the S3 URLs to the world via ad-hoc data catalogs.
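
For example, a consumer could open such a dataset lazily, straight from object storage, with no server in the middle. A minimal sketch using xarray and Zarr - the bucket and path are hypothetical, and an s3fs-style backend is assumed to be installed:

```python
# Minimal sketch: reading an ARCO dataset straight from S3-compatible object
# storage. The bucket and path are hypothetical; requires xarray + s3fs.
import xarray as xr

# No server involved: the Zarr store is just objects behind an S3 URL.
ds = xr.open_zarr(
    "s3://example-org-data/sea-surface-temp.zarr",
    storage_options={"anon": True},  # public bucket, no credentials needed
)
print(ds)  # lazy arrays; chunks are fetched from S3 only on demand
```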

Problem: ⛓️‍💥 Everyone's catalogs are disconnected from everyone else's ⛓️‍💥.

This means:

Solution: A federated catalog protocol with a cross-org publish-subscribe model.

How do we build it?: Not sure exactly, but the problem is analogous to creating federated alternatives to centralized social media (e.g. Bluesky and Mastodon vs Twitter). Perhaps we can piggyback off of Bluesky's ATproto or Mastodon's ActivityPub?
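
Purely as a sketch of that idea, a FROST update event might be expressed as an ATproto-style record under some FROST lexicon. The lexicon name and every field below are hypothetical, invented for illustration:

```python
# Hypothetical sketch of a dataset-update event as an ATproto-style record.
# Nothing here is an existing schema; it only illustrates the shape of the idea.
update_record = {
    "$type": "org.frost.dataset.update",  # hypothetical lexicon name
    "dataset": "s3://nasa-example/imagery.icechunk",  # where the data lives
    "version": "main@snapshot-7f3a",  # datasets are version-controlled at rest
    "publishedAt": "2025-01-01T00:00:00Z",
    "derivedFrom": [],  # graph edges to upstream datasets
    "metadata": {"keywords": ["satellite", "imagery"]},  # arbitrary JSON
}
```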

  1. Hello World

A network with a single dataset node. It is periodically updated, and publishes the fact that it has been updated. (See the sketch after this list.)

  2. Related data

A network with two dataset nodes, one which refers to the other as being related in some way, so the graph has a single edge.

  3. Leader-Follower

A network with two dataset nodes: a downstream node which refers to an upstream node, stating that the downstream one has been derived from the upstream one in some specific programmatic way, which is retriggered upon each update to the upstream dataset. The graph has a single edge.

  4. Cross-Org Catalog

A network with two nodes, belonging to different organisations, that are not connected. We check that both nodes can be listed by both orgs.

  5. Cross-Org Follower

A network with two nodes belonging to different organisations: a downstream node which refers to an upstream node, stating that the downstream one has been derived from the upstream one in some specific programmatic way, which is retriggered upon each update to the upstream dataset. The graph has a single edge. We check that both nodes can be listed by both orgs.
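
To make the first milestone concrete, here is a toy in-process sketch in which a single node publishes update notifications onto a "firehose" queue. The queue is only a stand-in for whatever real transport (ATproto, ActivityPub, etc.) FROST might use, and all names are hypothetical:

```python
# Toy sketch of milestone 1 ("Hello World"): one dataset node that updates
# periodically and publishes the fact. The queue stands in for a real transport.
import queue
import time

firehose: "queue.Queue[dict]" = queue.Queue()

def publish_update(dataset_url: str, version: str) -> None:
    """Broadcast that a dataset has a new version."""
    firehose.put({
        "dataset": dataset_url,
        "version": version,
        "publishedAt": time.time(),
    })

# The single node updates twice, announcing each new snapshot.
for snapshot in ("snapshot-1", "snapshot-2"):
    publish_update("s3://example-org/demo.icechunk", snapshot)

# Any subscriber can drain the firehose and react.
while not firehose.empty():
    print("update seen:", firehose.get())
```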

Once such a protocol is in place, many different use cases / business models / services become easier to build.

Data provider organisations could publish their datasets and be confident that anyone interested can programmatically find their data and track their updates. They can still build a catalog showing just their own org's datasets, but they can also broadcast their data offerings in a way that is easy for other organisations to track. (We distinguish between the global registry of all datasets and updates, and various catalogs, which are subsets of the registry of interest to one organisation or community.)

As anyone could consume the "firehose" of public dataset updates, anyone could build a website which filters or queries those entries in any way before displaying them. In particular, they could experiment with different models of search, independent of how the actual registry data is disseminated. The simplest example would be keyword-based search (e.g. NOAA could provide a page that displays datasets from any org, but only those with "ocean" in the metadata), but the same architecture would also allow for more complex semantic search services using ML techniques.
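
A sketch of that keyword-search example, filtering a hard-coded (and entirely hypothetical) slice of the registry down to entries whose metadata mentions "ocean", regardless of which org published them:

```python
# Sketch of keyword-based search over the registry. The entries and their
# fields are hypothetical stand-ins for real registry records.
registry_entries = [
    {"dataset": "s3://noaa-example/sst.icechunk",
     "metadata": {"keywords": ["ocean", "temperature"]}},
    {"dataset": "s3://nasa-example/aerosols.icechunk",
     "metadata": {"keywords": ["atmosphere"]}},
]

ocean_catalog = [
    entry for entry in registry_entries
    if "ocean" in entry["metadata"].get("keywords", [])
]
print(ocean_catalog)  # only the NOAA sea-surface-temperature entry survives
```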

A platform that provides additional services on top of an organisation's private data is known as a data lake. In addition to public cataloging features, the federated protocol would allow datasets in data lakes to receive and act upon updates from public datasets in other organisations, even if the downstream datasets in the data lake were not publicly exposed.

A data marketplace is basically just a data catalog with one extra layer between the registry and the storage - an access control layer which grants access to the raw data only upon authenticated payment. This allows for an entirely different business model with almost exactly the same architecture. Sellers of data could broadcast their offerings globally, and if some measure of price were included in the registry schema, their prices could automatically be displayed in anyone else's catalogs. As the price in the catalog need not be the actual price paid after negotiating with the data provider, the resulting experience would be somewhat like using Facebook Marketplace, where the listed price is only intended as a rough expectation and actual transactions occur outside of the FROST network.

With all links to public datasets made available, anyone could easily find and pull all the datasets they consider important into a replica as a backup.

Updates to datasets should propagate through the network automatically and quickly (in seconds to minutes using ATproto). This enables real-time data services to be built upon the data sharing network, e.g. recomputing wildfire risk each time new satellite imagery becomes available.

Q: That GIF is pretty, but what does it mean?

A: The GIF is intended to show notifications of dataset updates propagating through a federated network.

Each node is a version-controlled dataset sitting in S3, in either Icechunk or Iceberg format. The datasets are spread across three organisations: NASA, NOAA, and a startup (the rocketship). Although each dataset sits in the owning organisation's object storage, the locations, versions, and dependencies of each are shared publicly via the FROST protocol. They thus form a cross-org (federated) network: the FROST network.

A source of new data (the satellite) causes the NASA dataset to be updated. A notification of this update is broadcast to its dependent datasets. A re-computation of these dependents is triggered, and updated versions of each are written out.

Q: Where are the computations which create the new versions of each dataset running?

A: Not within the FROST network - the compute task deployments are deliberately separate. Those spinning cogs in the GIF are just meant to indicate some task running somewhere that was triggered by a notification sent via FROST, and which will publish an update to FROST once the task is complete. (The tasks could even be manual - i.e. a human receives a notification telling them to look at the updated upstream data before deciding how to update their derived dataset.) The compute layer is an area where platform providers could innovate and compete - they could provide something like GitHub Actions, but for automatically updating datasets instead of codebases. But their solutions should be quite general - being too prescriptive about how these computations must be done will discourage people from using the network.
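
A sketch of that separation: FROST delivers only the notification, and an external compute layer (not specified by the protocol) does the actual work. The handler and the job-runner stand-in below are hypothetical:

```python
# Hypothetical sketch: a notification handler that hands off to an external
# compute layer. FROST itself runs no tasks; submit_job is a stand-in for
# e.g. a GitHub-Actions-style runner or a task queue.
def submit_job(**kwargs) -> None:
    """Stand-in for a real task deployment living outside FROST."""
    print("would trigger compute task:", kwargs)

def on_upstream_update(event: dict) -> None:
    """Called when a subscribed upstream dataset publishes a new version."""
    submit_job(
        task="recompute-derived-dataset",  # hypothetical task name
        upstream=event["dataset"],
        upstream_version=event["version"],
    )

on_upstream_update({"dataset": "s3://nasa-example/imagery.icechunk",
                    "version": "snapshot-8"})
```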

Q: Why is Icechunk/Iceberg such a big deal?

A: Icechunk is the biggest thing since Zarr. In case you've been living under a rock, or more sympathetically if “Serverless ACID Transactional Array Database” doesn’t mean anything to you (it didn’t to me when I first read it either), let me summarize the implications:

Icechunk is directly inspired by Apache Iceberg (and some similar formats like Delta Tables), which is the same thing but for tabular data.

Q: Why does the network need to span across different fields of science?

A: Any attempt to split up the catalog by fields of science will inevitably divide some community's interdisciplinary field in two. Imagine if GitHub only let you host code for neuroscience analysis. That's fine at the labelling level, but it's not cool to force that community to bridge two separate networks to receive all the updates they care about. Your data probably isn't that special anyway - you almost certainly could fit it into this framework.

Q: Can’t we just use STAC?

A: No, it’s not general enough (lots of scientific data that isn’t a Spatio-Temporal view of the Earth). See also the section in the motivation blog post.

Q: Could I catalog <some data type> with this?

A: The set of allowed data models should be extensible, but restricted to those which have the following properties:

Important examples which should already meet these criteria are:

Q: Can the catalog layer have a field for <my domain-specific metadata tag>?

A: No, bad. It's crucial that the catalog remain domain-agnostic. Adding domain-specific choices to the catalog schema is one of the main reasons why so many existing projects in this space don't generalize.

Instead, this is a problem to be solved at the level of metadata standards. With the data catalog able to attach arbitrary metadata (e.g. JSON), microscopists can work out amongst themselves a convention for the standard schema of their metadata and what it means to microscopists, whilst climate and weather people can make sure their data follows the CF conventions, and so on. This approach is the only one compatible with what a standard actually is - a community-agreed schema that is extremely useful when followed, but which nobody is forced to follow.
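
For example, a registry entry might look something like the sketch below: the outer fields are the domain-agnostic catalog layer (their names are hypothetical), while the inner metadata blob happens to follow the real CF conventions:

```python
# Sketch of the layering: FROST standardizes only the outer, domain-agnostic
# fields (names hypothetical) and treats metadata as an opaque JSON blob whose
# meaning is defined by a community convention (here, CF).
entry = {
    # Domain-agnostic catalog layer - the only part FROST would standardize:
    "dataset": "s3://climate-org-example/tas.icechunk",
    "version": "snapshot-42",
    # Arbitrary JSON metadata - its meaning belongs to the community:
    "metadata": {
        "Conventions": "CF-1.10",
        "standard_name": "air_temperature",
        "units": "K",
    },
}
```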

Q: But shouldn't we enforce that the data at least has <requirement>?

A: No. Basically nothing other than the bare minimum for the system to work should be enforced. As soon as you enforce anything, it raises the barrier to entry, reducing adoption. Your enforcement will also inevitably bake in assumptions that seem reasonable in your field but can't be met in general, making the system less generalizable. Note that GitHub enforces nothing, not even having a license or README (though it does very strongly suggest them). It doesn't try to force you to use pyproject.toml for a Python project or anything like that - it leaves that entirely up to the Python community.

Every type of quality control and metadata standardization should similarly be left up to the relevant community. A layered architecture facilitates this - for example, you could create a community-specific public catalog website that only displays entries in the registry if their metadata matches some community-standardized schema. That would incentivise data providers in your community to make their metadata compliant, but not block them from sharing it if they don't.
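
A sketch of such a community catalog, using the real jsonschema package to decide which (hypothetical) registry entries get displayed. The schema itself is an invented stand-in for a community standard:

```python
# Sketch: display only registry entries whose metadata matches a
# community-standardized schema. Entries and schema are hypothetical.
from jsonschema import ValidationError, validate

COMMUNITY_SCHEMA = {
    "type": "object",
    "required": ["Conventions", "units"],
    "properties": {"units": {"type": "string"}},
}

registry_entries = [
    {"dataset": "s3://climate-org-example/tas.icechunk",
     "metadata": {"Conventions": "CF-1.10", "units": "K"}},
    {"dataset": "s3://other-org-example/misc.icechunk",
     "metadata": {"notes": "no standard fields"}},
]

def compliant(entry: dict) -> bool:
    """True if the entry's metadata matches the community schema."""
    try:
        validate(instance=entry["metadata"], schema=COMMUNITY_SCHEMA)
        return True
    except ValidationError:
        return False  # still in the registry, just not displayed here

community_catalog = [e for e in registry_entries if compliant(e)]
print(community_catalog)  # only the CF-compliant entry is shown
```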

Q: Shouldn't we decentralize the storage of the actual data too?

A: Sure, if you like. It's possible to do that with OSN pods or even cryptographically securely with IPFS. But that's a separate layer from what FROST is concerned with. FROST only catalogs references (i.e. URLs) to where the data exists, and decentralizes the network of records of where the data actually lives. The actual data is stored outside of the network, for example in some organization's S3 bucket. The storage layer is therefore configurable, with the only requirement being that the location of the data and metadata can be expressed as a single public URL.

Apache 2.0

