year-vs-climatology / hvplot_docs /09-Gridded_Datasets.md
ahuang11's picture
Upload 52 files
b9a0f21 verified
|
raw
history blame
13.3 kB
# Gridded Datasets
```python
import xarray as xr
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('matplotlib')
opts.defaults(opts.Scatter3D(color='Value', cmap='fire', edgecolor='black', s=50))
```
In the [Tabular Data](./08-Tabular_Datasets.ipynb) guide we covered how to work with columnar data in HoloViews. Apart from tabular or column based data there is another data format that is particularly common in the science and engineering contexts, namely multi-dimensional arrays. The gridded data interfaces allow working with grid-based datasets directly.
Grid-based datasets have two types of dimensions:
* they have coordinate or key dimensions, which describe the sampling of each dimension in the value arrays
* they have value dimensions which describe the quantity of the multi-dimensional value arrays
There are many different types of gridded datasets, which each approximate or measure a surface or space at discretely specified coordinates. In HoloViews, gridded datasets are typically one of three possible types: Regular rectilinear grids, irregular rectilinear grids, and curvilinear grids. Regular rectilinear grids can be defined by 1D coordinate arrays specifying the spacing along each dimension, while the other types require grid coordinates with the same dimensionality as the underlying value arrays, specifying the full n-dimensional coordinates of the corresponding array value. HoloViews provides many different elements supporting regularly spaced rectilinear grids, but currently only QuadMesh supports irregularly spaced rectilinear and curvilinear grids.
The difference between uniform, rectilinear and curvilinear grids is best illustrated by the figure below:
<figure>
<img src="http://earthsystemmodeling.org/docs/release/ESMF_8_1_1/ESMC_crefdoc/img11.png" alt="grid-types">
<figcaption>Types of logically rectangular grid tiles. Red circles show the values needed to specify grid coordinates for each type. Reproduced from <a href="http://earthsystemmodeling.org/docs/release/ESMF_8_1_1/ESMC_crefdoc/node5.html">ESMF documentation</a></figcaption>
</figure>
In this section we will first discuss how to work with the simpler rectilinear grids and then describe how to define a curvilinear grid with 2D coordinate arrays.
## Declaring gridded data
All Elements that support a ColumnInterface also support the GridInterface. The simplest example of a multi-dimensional (or more precisely 2D) gridded dataset is an image, which has implicit or explicit x-coordinates, y-coordinates and an array representing the values for each combination of these coordinates. Let us start by declaring an Image with explicit x- and y-coordinates:
```python
img = hv.Image((range(10), range(5), np.random.rand(5, 10)), datatype=['grid'])
img
```
In the above example we defined that there would be 10 samples along the x-axis, 5 samples along the y-axis and then defined a random ``5x10`` array, matching those dimensions. This follows the NumPy (row, column) indexing convention. When passing a tuple HoloViews will use the first gridded data interface, which stores the coordinates and value arrays as a dictionary mapping the dimension name to a NumPy array representing the data:
```python
img.data
```
However HoloViews also ships with an interface for ``xarray`` and the [GeoViews](https://geoviews.org) library ships with an interface for ``iris`` objects, which are two common libraries for working with multi-dimensional datasets:
```python
arr_img = img.clone(datatype=['image'])
print(type(arr_img.data))
try:
xr_img = img.clone(datatype=['xarray'])
print(type(xr_img.data))
except:
print('xarray interface could not be imported.')
```
In the case of an Image HoloViews also has a simple image representation which stores the data as a single array and converts the x- and y-coordinates to a set of bounds:
```python
print("Array type: %s with bounds %s" % (type(arr_img.data), arr_img.bounds))
```
To summarize the constructor accepts a number of formats where the value arrays should always match the shape of the coordinate arrays:
1. A simple np.ndarray along with (l, b, r, t) bounds
2. A tuple of the coordinate and value arrays
3. A dictionary of the coordinate and value arrays indexed by their dimension names
3. XArray DataArray or XArray Dataset
4. An Iris cube
# Working with a multi-dimensional dataset
A gridded Dataset may have as many dimensions as desired, however individual Element types only support data of a certain dimensionality. Therefore we usually declare a ``Dataset`` to hold our multi-dimensional data and take it from there.
```python
dataset3d = hv.Dataset((range(3), range(5), range(7), np.random.randn(7, 5, 3)),
['x', 'y', 'z'], 'Value')
dataset3d
```
This is because even a 3D multi-dimensional array represents volumetric data which we can display easily only if it contains few samples. In this simple case we can get an overview of what this data looks like by casting it to a ``Scatter3D`` Element (which will help us visualize the operations we are applying to the data:
```python
hv.Scatter3D(dataset3d)
```
### Indexing
In order to explore the dataset we therefore often want to define a lower dimensional slice into the array and then convert the dataset:
```python
dataset3d.select(x=1).to(hv.Image, ['y', 'z']) + hv.Scatter3D(dataset3d.select(x=1))
```
### Groupby
Another common method to apply to our data is to facet or animate the data using ``groupby`` operations. HoloViews provides a convenient interface to apply ``groupby`` operations and select which dimensions to visualize.
```python
(dataset3d.to(hv.Image, ['y', 'z'], 'Value', ['x']) +
hv.HoloMap({x: hv.Scatter3D(dataset3d.select(x=x)) for x in range(3)}, kdims='x'))
```
### Aggregating
Another common operation is to aggregate the data with a function thereby reducing a dimension. You can either ``aggregate`` the data by passing the dimensions to aggregate or ``reduce`` a specific dimension. Both have the same function:
```python
hv.Image(dataset3d.aggregate(['x', 'y'], np.mean)) + hv.Image(dataset3d.reduce(z=np.mean))
```
By aggregating the data we can reduce it to any number of dimensions we want. We can for example compute the spread of values for each z-coordinate and plot it using a ``Spread`` and ``Curve`` Element. We simply aggregate by that dimension and pass the aggregation functions we want to apply:
```python
hv.Spread(dataset3d.aggregate('z', np.mean, np.std)) * hv.Curve(dataset3d.aggregate('z', np.mean))
```
It is also possible to generate lower-dimensional views into the dataset which can be useful to summarize the statistics of the data along a particular dimension. A simple example is a box-whisker of the ``Value`` for each x-coordinate. Using the ``.to`` conversion interface we declare that we want a ``BoxWhisker`` Element indexed by the ``x`` dimension showing the ``Value`` dimension. Additionally we have to ensure to set ``groupby`` to an empty list because by default the interface will group over any remaining dimension.
```python
dataset3d.to(hv.BoxWhisker, 'x', 'Value', groupby=[])
```
Similarly we can generate a ``Distribution`` Element showing the ``Value`` dimension, group by the 'x' dimension and then overlay the distributions, giving us another statistical summary of the data:
```python
dataset3d.to(hv.Distribution, 'Value', [], groupby='x').overlay()
```
## Categorical dimensions
The key dimensions of the multi-dimensional arrays do not have to represent continuous values, we can display datasets with categorical variables as a ``HeatMap`` Element:
```python
heatmap = hv.HeatMap((['A', 'B', 'C'], ['a', 'b', 'c', 'd', 'e'], np.random.rand(5, 3)))
heatmap + hv.Table(heatmap)
```
## Non-uniform rectilinear grids
As discussed above, there are two main types of grids handled by HoloViews. So far, we have mainly dealt with uniform, rectilinear grids, but we can use the ``QuadMesh`` element to work with non-uniform rectilinear grids and curvilinear grids.
In order to define a non-uniform, rectilinear grid we can declare explicit irregularly spaced x- and y-coordinates. In the example below we specify the x/y-coordinate bin edges of the grid as arrays of shape ``M+1`` and ``N+1`` and a value array (``zs``) of shape ``NxM``:
```python
n = 8 # Number of bins in each direction
xs = np.logspace(1, 3, n)
ys = np.linspace(1, 10, n)
zs = np.arange((n-1)**2).reshape(n-1, n-1)
print('Shape of x-coordinates:', xs.shape)
print('Shape of y-coordinates:', ys.shape)
print('Shape of value array:', zs.shape)
hv.QuadMesh((xs, ys, zs))
```
## Curvilinear grids
To define a curvilinear grid the x/y-coordinates of the grid should be defined as 2D arrays of shape ``NxM`` or ``N+1xM+1``, i.e. either as the bin centers or the bin edges of each 2D bin.
```python
n=20
coords = np.linspace(-1.5,1.5,n)
X,Y = np.meshgrid(coords, coords);
Qx = np.cos(Y) - np.cos(X)
Qy = np.sin(Y) + np.sin(X)
Z = np.sqrt(X**2 + Y**2)
print('Shape of x-coordinates:', Qx.shape)
print('Shape of y-coordinates:', Qy.shape)
print('Shape of value array:', Z.shape)
qmesh = hv.QuadMesh((Qx, Qy, Z))
qmesh
```
## Working with xarray data types
As demonstrated previously, `Dataset` comes with support for the `xarray` library, which offers a powerful way to work with multi-dimensional, regularly spaced data. In this example, we'll load an example dataset, turn it into a HoloViews `Dataset` and visualize it. First, let's have a look at the xarray dataset's contents:
```python
xr_ds = xr.tutorial.open_dataset("air_temperature").load()
xr_ds
```
It is trivial to turn this xarray Dataset into a Holoviews `Dataset` (the same also works for DataArray):
```python
hv_ds = hv.Dataset(xr_ds)[:, :, "2013-01-01"]
print(hv_ds)
```
We have used the usual slice notation in order to select one single day in the rather large dataset. Finally, let's visualize the dataset by converting it to a `HoloMap` of `Images` using the `to()` method. We need to specify which of the dataset's key dimensions will be consumed by the images (in this case "lat" and "lon"), where the remaining key dimensions will be associated with the HoloMap (here: "time"). We'll use the slice notation again to clip the longitude.
```python
airtemp = hv_ds.to(hv.Image, kdims=["lon", "lat"], dynamic=False)
airtemp[:, 220:320, :].opts(colorbar=True, fig_size=200)
```
Here, we have explicitly specified the default behaviour `dynamic=False`, which returns a HoloMap. Note, that this approach immediately converts all available data to images, which will take up a lot of RAM for large datasets. For these situations, use `dynamic=True` to generate a [DynamicMap](./07-Live_Data.ipynb) instead. Additionally, [xarray features dask support](http://xarray.pydata.org/en/stable/dask.html), which is helpful when dealing with large amounts of data.
It is also possible to render curvilinear grids with xarray, and here we will load one such example. The dataset below defines a curvilinear grid of air temperatures varying over time. The curvilinear grid can be identified by the fact that the ``xc`` and ``yc`` coordinates are defined as two-dimensional arrays:
```python
rasm = xr.tutorial.open_dataset("rasm").load()
rasm.coords
```
To simplify the example we will select a single timepoint and add explicit coordinates for the x and y dimensions:
```python
rasm = rasm.isel(time=0, x=slice(0, 200)).assign_coords(x=np.arange(200), y=np.arange(205))
rasm.coords
```
Now that we have defined both rectilinear and curvilinear coordinates we can visualize the difference between the two by explicitly defining which set of coordinates to use:
```python
hv.QuadMesh(rasm, ['x', 'y']) + hv.QuadMesh(rasm, ['xc', 'yc'])
```
Additional examples of visualizing xarrays in the context of geographical data can be found in the GeoViews documentation: [Gridded Datasets I](http://geoviews.org/user_guide/Gridded_Datasets_I.html) and
[Gridded Datasets II](http://geoviews.org/user_guide/Gridded_Datasets_II.html). These guides also contain useful information on the interaction between xarray data structures and HoloViews Datasets in general.
# API
## Accessing the data
In order to be able to work with data in different formats Holoviews defines a general interface to access the data. The dimension_values method allows returning underlying arrays.
#### Key dimensions (coordinates)
By default ``dimension_values`` will return the expanded columnar format of the data:
```python
heatmap.dimension_values('x')
```
To access just the unique coordinates along a dimension simply supply the ``expanded=False`` keyword:
```python
heatmap.dimension_values('x', expanded=False)
```
Finally we can also get a non-flattened, expanded coordinate array returning a coordinate array of the same shape as the value arrays
```python
heatmap.dimension_values('x', flat=False)
```
#### Value dimensions
When accessing a value dimension the method will similarly return a flat view of the data:
```python
heatmap.dimension_values('z')
```
We can pass the ``flat=False`` argument to access the multi-dimensional array:
```python
heatmap.dimension_values('z', flat=False)
```