`open_datatree` performance · Issue #330 · xarray-contrib/datatree

Hi everyone,

Today, I would like to open a discussion about the open_datatree method when opening files stored in Zarr.

First, I would like to give some background on what we (@kmuehlbauer and @mgrover1) are working on. We propose a hierarchical, tree-like data model to store weather radar data following the FAIR principles (Findable, Accessible, Interoperable, Reusable). A Radar Volume Scan (RVS), comprising data collected through multiple cone-like sweeps at various elevation angles, is produced every 5 to 10 minutes, often exceeds several megabytes in size, and is usually stored as individual files. Radar data storage relies on proprietary formats that demand extensive input/output (I/O) operations, leading to long computation times and high hardware requirements. In response, our study introduces a new data model designed to address these challenges. Leveraging the FM301 hierarchical tree structure, which is based on the Climate and Forecast (CF) conventions and endorsed by the World Meteorological Organization (WMO), together with Analysis-Ready Cloud-Optimized (ARCO) formats, we aim to develop an open data model to arrange, manage, and store radar data efficiently in cloud-storage buckets.

Thus, the idea is to create a hierarchical tree as follows:

The root corresponds to the radar name, with one node per sweep within the RVS and an xarray.Dataset at each sweep node. Our datatree then contains not only all sweeps of each RVS, with their azimuth and range dimensions, but also a third dimension, vcp_time, which allows us to build a time series.
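For illustration, here is a minimal sketch of how such a hierarchy could be assembled with DataTree.from_dict. The sweep names, variable name, and array sizes below are hypothetical placeholders, not our actual data:

```python
import numpy as np
import xarray as xr
from datatree import DataTree

# Hypothetical example: one small Dataset per sweep, each with
# vcp_time, azimuth, and range dimensions.
def make_sweep(elevation):
    return xr.Dataset(
        {"DBZH": (("vcp_time", "azimuth", "range"), np.zeros((2, 360, 100)))},
        coords={"elevation": elevation},
    )

tree = DataTree.from_dict(
    {
        "/radar_parameters": xr.Dataset(),   # metadata group
        "/sweep_0": make_sweep(0.5),          # lowest elevation angle
        "/sweep_1": make_sweep(1.5),
        # ... one node per sweep up to ~20 degrees
    }
)
```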

We used data from the Colombian radar network, which is available in this S3 bucket, and put it all together in this repo. Our xarray DataTree object has 11 nodes: one for the radar_parameter group and ten for the different sweep elevations, ranging from 0 to 20 degrees, as follows.

Looking in detail at sweep_0 (0.5-degree elevation angle), we can see that we now have around two thousand 5-minute measurements along the vcp_time dimension. We have a time series!
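For context, the time series is built by appending each new volume scan to the corresponding Zarr group. A minimal sketch of that idea, using xarray's to_zarr with append_dim (the mapping of sweep name to Dataset and the store path are placeholders):

```python
import xarray as xr

def append_volume_scan(volume: dict[str, xr.Dataset], store: str) -> None:
    """Append one volume scan (mapping sweep name -> Dataset) to an
    existing Zarr store, growing each group along vcp_time."""
    for name, ds in volume.items():
        ds.to_zarr(store, group=name, mode="a", append_dim="vcp_time")
```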

However, we found that as we append more datasets along the vcp_time dimension, the opening/loading time grows. For example, in our case, two thousand measurements, corresponding to around ten days of consecutive 5-minute measurements, took around 46.4 seconds to open.
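A sketch of how such a timing can be reproduced (the store path is a placeholder, and the exact number will of course vary with network and hardware):

```python
import time
from datatree import open_datatree

store = "s3://some-bucket/radar.zarr"  # placeholder path

start = time.perf_counter()
dtree = open_datatree(store, engine="zarr", chunks={})
print(f"open_datatree took {time.perf_counter() - start:.1f} s")
```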

Now, if we consider scaling this to longer storage periods (e.g., 10+ years of radar datasets), this issue will become a weak spot in our data model.

We consider this data model revolutionary for radar data analysis, since we can now perform sizeable computational analyses quickly. For example, we computed a Quasi-Vertical Profile (QVP) analysis, which took a few lines of code and about 5 seconds to execute.
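As an illustration of what a QVP involves (not our exact code), a typical QVP reduces each sweep to an azimuthal mean, which maps naturally onto the tree with map_over_subtree; the dtree variable is the tree opened above, and the dimension check is just a guard for metadata-only nodes:

```python
from datatree import map_over_subtree

@map_over_subtree
def qvp(ds):
    # Azimuthal mean per sweep; nodes without an azimuth dimension pass through.
    return ds.mean(dim="azimuth") if "azimuth" in ds.dims else ds

qvp_tree = qvp(dtree)
```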

We looked into the open_datatree method and found that the function loops over each node, opening/loading the Dataset for each one sequentially. This sequential loop is likely the cause of the slowdown. It could be improved, and we would happily lend a hand.

https://github.com/xarray-contrib/datatree/blob/0afaa6cc1d6800987d8b9c37a604dc0a8c68aeaa/datatree/io.py#L84C1-L94C36
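One possible direction, sketched below under the assumption that the group paths can be listed up front (as the current implementation already does), would be to open the groups concurrently instead of in a Python-level loop. This is only an idea for discussion, not a drop-in patch:

```python
from concurrent.futures import ThreadPoolExecutor

import xarray as xr
from datatree import DataTree

def open_groups_concurrently(store, group_paths, max_workers=8):
    """Open each Zarr group in a thread pool and assemble a DataTree.
    `group_paths` is assumed to be the list of node paths in the store."""
    def _open(path):
        # Each group is opened lazily with the zarr engine.
        return path, xr.open_dataset(store, group=path, engine="zarr", chunks={})

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        datasets = dict(pool.map(_open, group_paths))

    return DataTree.from_dict(datasets)
```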

Please let us know your thoughts.
