Hi everyone,
Today, I would like to open a discussion on the open_datatree method and its performance when opening files stored in Zarr.
First, I would like to give some background on what we (@kmuehlbauer and @mgrover1) are working on. We propose a hierarchical, tree-like data model to store weather radar data following the FAIR principles (Findable, Accessible, Interoperable, Reusable). A Radar Volume Scan (RVS), comprising data collected through multiple cone-like sweeps at various elevation angles, is produced every 5 to 10 minutes, often exceeds several megabytes in size, and is usually stored as an individual file. Radar data storage involves proprietary formats that demand extensive input/output (I/O) operations, leading to long computation times and high hardware requirements. In response, our study introduces a novel data model designed to address these challenges. Leveraging the Climate and Forecast (CF) conventions-based FM301 hierarchical tree structure, endorsed by the World Meteorological Organization (WMO), and Analysis-Ready Cloud-Optimized (ARCO) formats, we aim to develop an open data model to arrange, manage, and store radar data efficiently in cloud-storage buckets.
Thus, the idea is to create a hierarchical tree as follows: the root corresponds to the radar name, with a node for each sweep within the RVS and an xarray.Dataset at each node. Our datatree then contains not only all sweeps for each RVS, with the azimuth and range dimensions, but also a third dimension, vcp_time, which allows us to create a time series.
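The hierarchy described above can be sketched with plain xarray Datasets keyed by node path. This is a minimal, hypothetical sketch: the variable name (DBZH), the array sizes, the elevation angles, and the /radar/sweep_i paths are illustrative assumptions, not the actual archive layout.

```python
import numpy as np
import xarray as xr

# Hypothetical sketch of the proposed hierarchy: root = radar name,
# one child node per sweep, and an xarray.Dataset at each node.
# DBZH, the dimension sizes, and the elevation angles are assumptions.
def make_sweep(elevation, n_time=3, n_az=360, n_range=100):
    return xr.Dataset(
        {"DBZH": (("vcp_time", "azimuth", "range"),
                  np.zeros((n_time, n_az, n_range), dtype="float32"))},
        coords={"vcp_time": np.arange(n_time),
                "azimuth": np.linspace(0.0, 359.0, n_az),
                "range": np.arange(n_range) * 250.0},
        attrs={"sweep_fixed_angle": elevation},
    )

# Paths mirror the tree layout: /<radar_name>/sweep_<i>
tree = {f"/radar/sweep_{i}": make_sweep(angle)
        for i, angle in enumerate([0.5, 1.5, 2.5])}
```

Each node carries the (vcp_time, azimuth, range) dimensions described above, so appending further scans only grows vcp_time.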
We used data from the Colombian radar network, which is available in this S3 bucket, and put it all together in this repo. Our xarray.DataTree object has 11 nodes: one for the radar_parameter group, and the following ten are sweeps at elevations ranging from 0 to 20 degrees, as follows.
Looking in detail at sweep_0 (0.5-degree elevation angle), we can see that we now have around two thousand 5-min measurements along the vcp_time dimension. We have time series!
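The growth along vcp_time can be pictured as appending one Dataset per 5-minute scan. A hedged sketch with synthetic data follows; the variable and dimension names (DBZH, vcp_time, azimuth, range) follow the convention above, while the sizes are made up for illustration.

```python
import numpy as np
import xarray as xr

# Each 5-minute scan yields one Dataset for a sweep; concatenating them
# along vcp_time builds the time series. Sizes here are tiny on purpose.
def scan(t):
    return xr.Dataset(
        {"DBZH": (("vcp_time", "azimuth", "range"),
                  np.full((1, 4, 5), float(t), dtype="float32"))},
        coords={"vcp_time": [t],
                "azimuth": np.arange(4),
                "range": np.arange(5)},
    )

# Three consecutive scans -> a three-step time series along vcp_time.
series = xr.concat([scan(t) for t in range(3)], dim="vcp_time")
```

In the real archive the same pattern is repeated roughly two thousand times per sweep, which is what makes the node so large to open.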
However, we found that as we append datasets along the vcp_time dimension, the opening/loading time grows. For example, in our case, two thousand measurements, which correspond to around ten days of consecutive 5-min measurements, took around 46.4 seconds to open. If we consider adapting this to longer storage periods (e.g., 10+ years), this issue will become a weak spot in our data model.
We consider this data model revolutionary for radar data analysis, since we can now perform sizeable computational analyses quickly. For example, we computed a Quasi-Vertical Profile (QVP) analysis, which took a few lines of code and about 5 seconds to execute.
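For readers unfamiliar with QVPs: a common formulation averages a high-elevation sweep over azimuth, collapsing each cone to one vertical profile per time step. The sketch below uses synthetic data and assumed names (DBZH, vcp_time, azimuth, range); it is not the actual code from our repo.

```python
import numpy as np
import xarray as xr

# Hedged QVP sketch: averaging over azimuth leaves one profile per
# time step, i.e. dimensions (vcp_time, range). Data are synthetic.
sweep = xr.Dataset(
    {"DBZH": (("vcp_time", "azimuth", "range"),
              np.ones((2, 360, 10), dtype="float32"))},
    coords={"vcp_time": np.arange(2),
            "azimuth": np.arange(360.0),
            "range": np.arange(10) * 250.0},
)

# The azimuthal mean is a single xarray reduction; with the time series
# already assembled along vcp_time, this is why the analysis is so short.
qvp = sweep["DBZH"].mean(dim="azimuth")
```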
We looked into the open_datatree method and found that the function loops through the nodes, opening/loading the Dataset at each one sequentially. This sequential looping is a likely cause of the slow opening times. It could be improved, and we would happily lend a hand.
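To illustrate why the sequential loop hurts, here is a stdlib-only sketch, not xarray's actual implementation: load_node, the sleep standing in for per-group Zarr I/O latency, and the group names are all assumptions. If per-node opens are dominated by I/O latency rather than CPU, running them through a thread pool lets the latencies overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# 11 nodes, mirroring our tree: ten sweeps plus a parameters group.
GROUPS = [f"sweep_{i}" for i in range(10)] + ["radar_parameter"]

def load_node(group):
    # Stand-in for opening one Zarr group (network/metadata latency).
    time.sleep(0.05)
    return group, {"group": group}

def open_all_sequential(groups):
    # Analogous to the loop we observed: one node after another.
    return dict(load_node(g) for g in groups)

def open_all_parallel(groups, max_workers=8):
    # Same work, but node opens overlap in a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(load_node, groups))

t0 = time.perf_counter()
seq = open_all_sequential(GROUPS)
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
par = open_all_parallel(GROUPS)
t_par = time.perf_counter() - t0
```

With 11 nodes and 8 workers, the parallel pass finishes in roughly two batches of latency instead of eleven, while producing the same mapping. Whether threads, async I/O, or consolidated metadata is the right fix for open_datatree is exactly the design question we would like to discuss.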
Please let us know your thoughts.