Here are some quick examples of what you can do with xarray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.
To begin, import numpy, pandas and xarray using their customary abbreviations:
import numpy as np
import pandas as pd
import xarray as xr

Create a DataArray
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
data

<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 0.62502503, -0.37642159,  0.40151989],
       [ 0.6724865 , -1.51058594, -0.26854215]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
In this case, we have generated a 2D array, assigned the names x and y to the two dimensions respectively, and associated two coordinate labels, 10 and 20, with the two locations along the x dimension. If you supply a pandas Series or DataFrame, metadata is copied directly:
xr.DataArray(pd.Series(range(3), index=list("abc"), name="foo"))
<xarray.DataArray 'foo' (dim_0: 3)> Size: 24B
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 24B 'a' 'b' 'c'
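The same metadata copying applies to a DataFrame. The following is a minimal sketch with made-up data (the names x and col are purely illustrative): when the index and the columns carry names, those names are expected to become the dimension names, and the labels become coordinates.

```python
import pandas as pd
import xarray as xr

# Hypothetical example data: both the index and the columns are named.
df = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4]},
    index=pd.Index([10, 20], name="x"),
)
df.columns.name = "col"

# No dims or coords are passed; they are taken from the DataFrame itself.
da = xr.DataArray(df)

print(da.dims)
print(int(da.sel(x=20, col="b")))
```

If the index had no name, xarray would fall back to a default dimension name, as with dim_0 in the Series example above.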
Here are the key properties for a DataArray:
# like in pandas, values is a numpy array that you can modify in-place
data.values
data.dims
data.coords
# you can use this dictionary to store arbitrary metadata
data.attrs

Indexing
Xarray supports four kinds of indexing. Since we have assigned coordinate labels to the x dimension, we can use label-based indexing along that dimension just like pandas. The four examples below all yield the same result (the values at x=10) but at varying levels of convenience and intuitiveness.
# positional and by integer label, like numpy
data[0, :]

# loc or "location": positional and coordinate label, like pandas
data.loc[10]

# isel or "integer select": by dimension name and integer label
data.isel(x=0)

# sel or "select": by dimension name and coordinate label
data.sel(x=10)

<xarray.DataArray (y: 3)> Size: 24B
array([ 0.62502503, -0.37642159,  0.40151989])
Coordinates:
    x        int64 8B 10
Dimensions without coordinates: y
Unlike positional indexing, label-based indexing frees us from having to know how our array is organized. All we need to know are the dimension name and the label we wish to index, i.e. data.sel(x=10) works regardless of whether x is the first or second dimension of the array and regardless of whether 10 is the first or second element of x. We have already told xarray that x is the first dimension when we created data: xarray keeps track of this so we don’t have to. For more, see Indexing and selecting data.
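To make the point about dimension order concrete, here is a small sketch using deterministic values instead of random ones (the variable names are illustrative). After transposing, the label-based selection still returns the same values, whereas the positional one now selects along y.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3), dims=("x", "y"), coords={"x": [10, 20]}
)
# Same data, but with y as the first dimension.
flipped = data.transpose("y", "x")

# Label-based selection does not care about the dimension order.
by_label = data.sel(x=10)
by_label_flipped = flipped.sel(x=10)

# Positional indexing does: this now picks y=0 rather than x=10.
by_position_flipped = flipped[0, :]

print(by_label.identical(by_label_flipped))
print(by_position_flipped.dims)
```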
While you’re setting up your DataArray, it’s often a good idea to set metadata attributes. A useful choice is to set data.attrs['long_name'] and data.attrs['units'] since xarray will use these, if present, to automatically label your plots. These special names were chosen following the NetCDF Climate and Forecast (CF) Metadata Conventions. attrs is just a Python dictionary, so you can assign anything you wish.
data.attrs["long_name"] = "random velocity"
data.attrs["units"] = "metres/sec"
data.attrs["description"] = "A random variable created as an example."
data.attrs["random_attribute"] = 123
data.attrs

# you can add metadata to coordinates too
data.x.attrs["units"] = "x units"

Computation
Data arrays work very similarly to numpy ndarrays:
data + 10
np.sin(data)
# transpose
data.T
data.sum()

<xarray.DataArray ()> Size: 8B
array(-0.45651827)
However, aggregation operations can use dimension names instead of axis numbers, for example data.mean(dim="x"):

<xarray.DataArray (y: 3)> Size: 24B
array([ 0.64875576, -0.94350377,  0.06648887])
Dimensions without coordinates: y
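As a quick check of aggregating by dimension name, the sketch below (with illustrative, deterministic values) compares a mean taken over the named dimension x with the equivalent numpy call that needs the axis number.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))

# Reduce over the dimension called "x", wherever it sits in the array.
by_name = data.mean(dim="x")

# The numpy equivalent requires knowing that "x" is axis 0.
by_axis = data.values.mean(axis=0)

print(by_name.dims)
print(np.allclose(by_name.values, by_axis))
```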
Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:
a = xr.DataArray(np.random.randn(3), [data.coords["y"]])
b = xr.DataArray(np.random.randn(4), dims="z")
a
b
a + b

<xarray.DataArray (y: 3, z: 4)> Size: 96B
array([[-1.08471222,  0.23782075, -0.49706651, -1.12372828],
       [-2.51571846, -1.19318549, -1.92807276, -2.55473453],
       [-2.02735762, -0.70482465, -1.43971192, -2.06637369]])
Coordinates:
  * y        (y) int64 24B 0 1 2
Dimensions without coordinates: z
It also means that in most cases you do not need to worry about the order of dimensions. For example, data - data.T gives zeros everywhere:

<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[0., 0., 0.],
       [0., 0., 0.]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
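The sketch below, using small made-up arrays, contrasts name-based broadcasting with the manual numpy version, where dummy axes have to be inserted by hand. It also repeats the transpose subtraction with deterministic values.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims="y")
b = xr.DataArray(np.array([10.0, 20.0, 30.0, 40.0]), dims="z")

# xarray: the result gets both dimensions, matched by name.
total = a + b

# numpy: the same result needs explicit dummy axes.
manual = a.values[:, np.newaxis] + b.values[np.newaxis, :]

# Dimension order does not matter either: operands are matched by name.
data = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))
diff = data - data.T

print(total.dims, total.shape)
print(bool((diff == 0).all()))
```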
Operations also align based on index labels. For example, data[:-1] - data[:1] keeps only the label the two operands share, x=10:

<xarray.DataArray (x: 1, y: 3)> Size: 24B
array([[0., 0., 0.]])
Coordinates:
  * x        (x) int64 8B 10
Dimensions without coordinates: y
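Here is a minimal sketch of label alignment with two made-up arrays whose x labels only partly overlap. It assumes xarray's default arithmetic join, which keeps only the labels present in both operands.

```python
import xarray as xr

left = xr.DataArray([1, 2, 3], dims="x", coords={"x": [10, 20, 30]})
right = xr.DataArray([100, 200, 300], dims="x", coords={"x": [20, 30, 40]})

# Values are paired by label, not by position; x=10 and x=40 drop out.
total = left + right

print(total["x"].values)
print(total.values)
```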
For more, see Computation.
GroupBy
Xarray supports grouped operations using a very similar API to pandas (see GroupBy: Group and Bin Data):
labels = xr.DataArray(["E", "F", "E"], [data.coords["y"]], name="labels")
labels
data.groupby(labels).mean("y")
data.groupby(labels).map(lambda x: x - x.min())

<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[0.89356717, 1.13416435, 0.67006204],
       [0.94102864, 0.        , 0.        ]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
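With random data it is hard to verify a grouped result by eye, so here is a sketch with hand-picked values. Instead of a separate labels array, it attaches the labels as a coordinate (an alternative to the approach above) and groups by that coordinate's name.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.array([[1.0, 2.0, 6.0], [3.0, 4.0, 10.0]]),
    dims=("x", "y"),
    coords={"x": [10, 20]},
)
# Attach the group labels as a non-index coordinate along y.
data = data.assign_coords(labels=("y", ["E", "F", "E"]))

# Average the y positions that share a label.
means = data.groupby("labels").mean("y")

print(means.sel(labels="E").values)
print(means.sel(labels="F").values)
```

Group E averages the first and third columns, and group F contains only the second column.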
Visualizing your datasets is quick and convenient, for example with data.plot():
<matplotlib.collections.QuadMesh at 0x795bdd69a660>
Note the automatic labeling with names and units. Our effort in adding metadata attributes has paid off! Many aspects of these figures are customizable: see Plotting.
pandas
Xarray objects can be easily converted to and from pandas objects using the to_series(), to_dataframe() and to_xarray() methods:
series = data.to_series()
series

# convert back
series.to_xarray()

<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 0.62502503, -0.37642159,  0.40151989],
       [ 0.6724865 , -1.51058594, -0.26854215]])
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) int64 24B 0 1 2
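The round trip can be sketched with deterministic values as follows. The Series gets a MultiIndex with one level per dimension, and converting back turns every level into a labelled coordinate, which is why y gains the labels 0, 1, 2 in the output above.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6.0).reshape(2, 3), dims=("x", "y"), coords={"x": [10, 20]}
)

# One MultiIndex level per dimension.
series = data.to_series()

# Convert back; each index level becomes a dimension with coordinates.
roundtrip = series.to_xarray()

print(list(series.index.names))
print(roundtrip["y"].values)
```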
xarray.Dataset is a dict-like container of aligned DataArray objects. You can think of it as a multi-dimensional generalization of the pandas.DataFrame:
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds

<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B 0.625 -0.3764 0.4015 0.6725 -1.511 -0.2685
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142
This creates a dataset with three DataArrays named foo, bar and baz. Use dictionary or dot indexing (ds["foo"] or ds.foo) to pull out Dataset variables as DataArray objects, but note that assignment only works with dictionary indexing:

<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[ 0.62502503, -0.37642159,  0.40151989],
       [ 0.6724865 , -1.51058594, -0.26854215]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
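The difference between reading and assigning can be sketched like this (the variable names double and triple are made up for the example). Reading works both ways, but assigning a new variable through an attribute is rejected.

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": [10, 20]})

# Reading: dictionary and dot access return the same DataArray.
same = ds["foo"].identical(ds.foo)

# Assigning: dictionary style works.
ds["double"] = ds["foo"] * 2

# Assigning through an attribute raises an AttributeError.
try:
    ds.triple = ds["foo"] * 3
    attribute_assignment_failed = False
except AttributeError:
    attribute_assignment_failed = True

print(same, attribute_assignment_failed)
print(list(ds.data_vars))
```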
When creating ds, we specified that foo is identical to data created earlier, bar is one-dimensional with single dimension x and associated values 1 and 2, and baz is a scalar not associated with any dimension in ds. Variables in datasets can have different dtype and even different dimensions, but all dimensions are assumed to refer to points in the same shared coordinate system, i.e. if two variables have dimension x, that dimension must be identical in both variables.
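As an illustration of the shared coordinate system, the sketch below (with made-up variables a and b) shows that two variables may use different dimensions, but a Dataset cannot be built when the same dimension name has two different lengths.

```python
import xarray as xr

# Different dimensions in one Dataset are fine.
ok = xr.Dataset({"a": ("x", [1, 2]), "b": ("y", [1, 2, 3])})

# The same dimension name with conflicting lengths is rejected.
try:
    xr.Dataset({"a": ("x", [1, 2]), "b": ("x", [1, 2, 3])})
    conflict_detected = False
except ValueError:
    conflict_detected = True

print(dict(ok.sizes))
print(conflict_detected)
```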
For example, when creating ds, xarray automatically aligns bar with the DataArray foo, i.e., they share the same coordinate system so that ds.bar['x'] == ds.foo['x'] == ds['x']. Consequently, a selection such as ds.bar.sel(x=10) works without explicitly specifying the coordinate x when creating ds['bar']:

<xarray.DataArray 'bar' ()> Size: 8B
array(1)
Coordinates:
    x        int64 8B 10
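The following sketch rebuilds a dataset of the same shape with deterministic values and checks that bar, which was given only a dimension name, picks up the x coordinate that came with foo.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), coords={"x": [10, 20]})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))

# bar was created without coordinates, yet it can be selected by label.
picked = ds.bar.sel(x=10)

print("x" in ds["bar"].coords)
print(int(picked))
```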
You can do almost everything you can do with DataArray objects with Dataset objects (including indexing and arithmetic) if you prefer to work with multiple variables at once.
NetCDF is the recommended file format for xarray objects. Users from the geosciences will recognize that the Dataset data model looks very similar to a netCDF file (which, in fact, inspired it).
You can directly read and write xarray objects to disk using to_netcdf(), open_dataset() and open_dataarray():
ds.to_netcdf("example.nc")
reopened = xr.open_dataset("example.nc")
reopened

<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B ...
    bar      (x) int64 16B ...
    baz      float64 8B ...
It is common for datasets to be distributed across multiple files (commonly one file per timestep). Xarray supports this use case by providing the open_mfdataset() and save_mfdataset() functions. For more, see Reading and writing files.
xarray.DataTree is a tree-like container of DataArray objects, organised into multiple mutually alignable groups. You can think of it like a (recursive) dict of Dataset objects, where coordinate variables and their indexes are inherited down to children.
Let’s first make some example xarray datasets:
import numpy as np
import xarray as xr

data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "baz": np.pi})
ds

ds2 = ds.interp(coords={"x": [10, 12, 14, 16, 18, 20]})
ds2

ds3 = xr.Dataset(
    {"people": ["alice", "bob"], "heights": ("people", [1.57, 1.82])},
    coords={"species": "human"},
)
ds3

<xarray.Dataset> Size: 76B
Dimensions:  (people: 2)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
    species  <U5 20B 'human'
Data variables:
    heights  (people) float64 16B 1.57 1.82
Now we’ll put these datasets into a hierarchical DataTree:
dt = xr.DataTree.from_dict(
    {"simulation/coarse": ds, "simulation/fine": ds2, "/": ds3}
)
dt

<xarray.DatasetView> Size: 76B
Dimensions:  (people: 2)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
    species  <U5 20B 'human'
Data variables:
    heights  (people) float64 16B 1.57 1.82

<xarray.DatasetView> Size: 40B
Dimensions:  (people: 2)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
Data variables:
    *empty*

<xarray.DatasetView> Size: 128B
Dimensions:  (people: 2, x: 2, y: 3)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B -0.7636 0.2557 0.2947 0.7252 0.9361 -0.8931
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142

<xarray.DatasetView> Size: 288B
Dimensions:  (people: 2, x: 6, y: 3)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
  * x        (x) int64 48B 10 12 14 16 18 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 144B -0.7636 0.2557 0.2947 ... 0.7252 0.9361 -0.8931
    bar      (x) float64 48B 1.0 1.2 1.4 1.6 1.8 2.0
    baz      float64 8B 3.142
This created a DataTree with nested groups. We have one root group, containing information about individual people. This root group can be named, but here it is unnamed, and is referenced with "/". This structure is similar to a unix-like filesystem. The root group then has one subgroup simulation, which contains no data itself but does contain another two subgroups, named fine and coarse.
The (sub)subgroups fine and coarse contain two very similar datasets. They both have an "x" dimension, but the dimension is of different lengths in each group, which makes the data in each group unalignable. In the root group we placed some completely unrelated information, in order to show how a tree can store heterogeneous data.
Remember to keep unalignable dimensions in sibling groups because a DataTree inherits coordinates down through its child nodes. You can see this inheritance in the above representation of the DataTree: the indexed coordinate people defined in the root / node also appears in the child nodes /simulation/coarse and /simulation/fine, while the scalar coordinate species is listed only at the root. All coordinates in a parent-descendant lineage must be alignable to form a DataTree. If your input data is not aligned, you can still get a nested dict of Dataset objects with open_groups() and then apply any required changes to ensure alignment before converting to a DataTree.
The constraints on each group are the same as the constraints on DataArrays within a single dataset, with the additional requirement of parent-descendant coordinate agreement.
We created the subgroups using a filesystem-like syntax, and accessing groups works the same way. We can access individual DataArrays in a similar fashion.
dt["simulation/coarse/foo"]
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[-0.76362178,  0.25566475,  0.29465509],
       [ 0.72524167,  0.93609699, -0.89309011]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
We can also view the data in a particular group as a read-only DatasetView using xarray.DataTree.dataset:
dt["simulation/coarse"].dataset
<xarray.DatasetView> Size: 128B
Dimensions:  (x: 2, y: 3, people: 2)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B -0.7636 0.2557 0.2947 0.7252 0.9361 -0.8931
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142
We can get a copy of the Dataset including the inherited coordinates by calling the to_dataset method:
ds_inherited = dt["simulation/coarse"].to_dataset()
ds_inherited

<xarray.Dataset> Size: 128B
Dimensions:  (x: 2, y: 3, people: 2)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B -0.7636 0.2557 0.2947 0.7252 0.9361 -0.8931
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142
And you can get a copy of just the node-local values as a Dataset by setting the inherit keyword to False:
ds_node_local = dt["simulation/coarse"].to_dataset(inherit=False)
ds_node_local

<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B -0.7636 0.2557 0.2947 0.7252 0.9361 -0.8931
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142
Note
We intend to eventually implement most Dataset methods (indexing, aggregation, arithmetic, etc.) on DataTree objects, but many methods have not been implemented yet.
Tip
If all of your variables are mutually alignable (i.e., they live on the same grid, such that every common dimension name maps to the same length), then you probably don’t need xarray.DataTree, and should consider just sticking with xarray.Dataset.