As we discovered in the [Introduction](Introduction.ipynb), HoloViews allows plotting a variety of data types. Here we will use the sample data module and load the pandas and dask hvPlot API:
```python
import numpy as np
import hvplot.pandas # noqa
import hvplot.dask # noqa
```
As we learned, the hvPlot API closely mirrors the [Pandas plotting API](https://pandas.pydata.org/pandas-docs/stable/visualization.html), but instead of generating static images when used in a notebook, it uses HoloViews to generate either static or dynamically streaming Bokeh plots. Static plots can be used in any context, while streaming plots require a live [Jupyter notebook](https://jupyter.org), a deployed [Bokeh Server app](https://bokeh.pydata.org/en/latest/docs/user_guide/server.html), or a deployed [Panel](https://panel.holoviz.org) app.
HoloViews provides an extensive, very rich set of objects along with a powerful set of operations to apply, as you can find out in the [HoloViews User Guide](https://holoviews.org/user_guide/index.html). But here we will focus on the most essential mechanisms needed to make your data visualizable, without having to worry about the mechanics going on behind the scenes.
We will be focusing on two different datasets:
- A small CSV file of US crime data, broken down by state
- A larger Parquet-format file of airline flight data
The ``hvplot.sample_data`` module makes these datasets available as an Intake data catalogue, which we can load either using pandas:
```python
from hvplot.sample_data import us_crime, airline_flights
crime = us_crime.read()
print(type(crime))
crime.head()
```
Or using dask as a ``dask.DataFrame``:
```python
flights = airline_flights.to_dask().persist()
print(type(flights))
flights.head()
```
## The plot interface
The ``dask.dataframe.DataFrame.hvplot``, ``pandas.DataFrame.hvplot`` and ``intake.DataSource.plot`` interfaces (and Series equivalents) from hvPlot provide a powerful high-level API to generate complex plots. The ``.hvplot`` API can be called directly or used as a namespace to generate specific plot types.
### The plot method
The most explicit way to use the plotting API is to specify the names of columns to plot on the ``x``- and ``y``-axis respectively:
```python
crime.hvplot.line(x='Year', y='Violent Crime rate')
```
As you'll see in more detail below, you can choose which kind of plot you want to use for the data:
```python
crime.hvplot(x='Year', y='Violent Crime rate', kind='scatter')
```
To group the data by one or more additional columns, specify an additional ``by`` variable. As an example here we will plot the departure delay ('depdelay') as a function of 'distance', grouping the data by the 'carrier'. There are many available carriers, so we will select only two of them so that the plot is readable:
```python
flight_subset = flights[flights.carrier.isin(['OH', 'F9'])]
flight_subset.hvplot(x='distance', y='depdelay', by='carrier', kind='scatter', alpha=0.2, persist=True)
```
Here we have specified the `x` axis explicitly, which can be omitted if the Pandas index column is already the desired x axis. Similarly, here we specified the `y` axis; by default all of the non-index columns would be plotted (which would be a lot of data in this case). If you don't specify the 'y' axis, it will have a default label named 'value', but you can then provide a y axis label explicitly using the ``value_label`` option.
Putting all of this together we will plot violent crime, robbery, and burglary rates on the y-axis, specifying 'Year' as the x, and relabel the y-axis to display the 'Rate'.
```python
crime.hvplot(x='Year', y=['Violent Crime rate', 'Robbery rate', 'Burglary rate'],
             value_label='Rate (per 100k people)')
```
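When ``y`` is a list of columns, all of the values share one axis, which is why a single ``value_label`` applies to them. One way to picture this is as a wide-to-long reshape. The sketch below uses plain pandas on a tiny made-up frame; the numbers are invented, and the ``Variable``/``value`` column names are illustrative assumptions rather than a description of how hvPlot does it internally.

```python
import pandas as pd

# Tiny made-up stand-in for the crime table (values are invented).
wide = pd.DataFrame({
    'Year': [2000, 2001],
    'Robbery rate': [145.0, 148.5],
    'Burglary rate': [728.8, 741.8],
})

# Reshape so that every row is one (Year, Variable, value) observation.
# 'Variable' and 'value' are hypothetical names chosen for this sketch.
long = wide.melt(id_vars='Year', var_name='Variable', value_name='value')
print(long)
```

In the long form each original column becomes one group of rows, which corresponds to one line in the plot.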
### The hvplot namespace
Instead of using the ``kind`` argument to the plot call, we can use the ``hvplot`` namespace, which lets us easily discover the range of plot types that are supported. Use tab completion to explore the available plot types:
```python
crime.hvplot.<TAB>
```
Plot types available include:
* <a href="#area">``.area()``</a>: Plots an area chart, similar to a line chart except that it fills the area under the curve and can optionally be stacked
* <a href="#bars">``.bar()``</a>: Plots a bar chart that can be stacked or grouped
* <a href="#bivariate">``.bivariate()``</a>: Plots 2D density of a set of points
* <a href="#box-whisker-plots">``.box()``</a>: Plots a box-whisker chart comparing the distribution of one or more variables
* <a href="#heatmap">``.heatmap()``</a>: Plots a heatmap to visualize a variable across two independent dimensions
* <a href="#hexbins">``.hexbin()``</a>: Plots hex bins
* <a href="#histogram">``.hist()``</a>: Plots the distribution of one or more variables as a set of bins
* <a href="#kde-density">``.kde()``</a>: Plots the kernel density estimate of one or more variables.
* <a href="#the-plot-method">``.line()``</a>: Plots a line chart (such as for a time series)
* <a href="#scatter">``.scatter()``</a>: Plots a scatter chart comparing two variables
* <a href="#step">``.step()``</a>: Plots a step chart akin to a line plot
* <a href="#tables">``.table()``</a>: Generates a SlickGrid DataTable
* <a href="#groupby">``.violin()``</a>: Plots a violin plot comparing the distribution of one or more variables using the kernel density estimate
#### Area
Like most other plot types, the ``area`` chart supports the three ways of defining a plot outlined above. An area chart is most useful when plotting multiple variables in a stacked chart. This can be achieved by specifying ``x``, ``y``, and ``by`` columns or by using the ``columns`` and ``index``/``use_index`` (equivalent to ``x``) options:
```python
crime.hvplot.area(x='Year', y=['Robbery', 'Aggravated assault'])
```
We can also explicitly set ``stacked`` to False and define an ``alpha`` value to compare the values directly:
```python
crime.hvplot.area(x='Year', y=['Aggravated assault', 'Robbery'], stacked=False, alpha=0.4)
```
Another use for an area plot is to visualize the spread of a value. For instance, using the flights dataset we may want to see the spread in mean delay values across carriers. For that purpose we compute the mean delay by day and carrier, and then the min/max of those mean delays across all carriers. Since the output of ``hvplot`` is just a regular HoloViews object, we can use the overlay operator (\*) to place the plots on top of each other.
```python
delay_min_max = flights.groupby(['day', 'carrier'])['carrier_delay'].mean().groupby('day').agg({'min': np.min, 'max': np.max})
delay_mean = flights.groupby('day')['carrier_delay'].mean()
delay_min_max.hvplot.area(x='day', y='min', y2='max', alpha=0.2) * delay_mean.hvplot()
```
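To make the two-stage aggregation concrete, here is a minimal pandas sketch of the same computation on a few made-up rows. The values are invented, and it uses an in-memory pandas frame rather than the dask ``flights`` frame.

```python
import pandas as pd

# Made-up delays for two carriers over two days (values are invented).
df = pd.DataFrame({
    'day': [1, 1, 1, 2, 2, 2],
    'carrier': ['AA', 'AA', 'OH', 'AA', 'OH', 'OH'],
    'carrier_delay': [10.0, 20.0, 5.0, 0.0, 30.0, 50.0],
})

# Stage 1: mean delay for every (day, carrier) pair.
per_carrier = df.groupby(['day', 'carrier'])['carrier_delay'].mean()

# Stage 2: smallest and largest of those carrier means on each day.
spread = per_carrier.groupby(level='day').agg(['min', 'max'])

# Overall mean delay per day, which is what the overlaid line shows.
overall = df.groupby('day')['carrier_delay'].mean()
print(spread)
print(overall)
```

The ``min`` and ``max`` columns give the lower and upper edges of the shaded band (``y`` and ``y2``), and the overall daily mean falls between them.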
#### Bars
In the simplest case we can use ``.hvplot.bar`` to plot ``x`` against ``y``. We'll use ``rot=90`` to rotate the tick labels on the x-axis making the years easier to read:
```python
crime.hvplot.bar(x='Year', y='Violent Crime rate', rot=90)
```
If we want to compare multiple columns instead we can set ``y`` to a list of columns. Using the ``stacked`` option we can then compare the column values more easily:
```python
crime.hvplot.bar(x='Year', y=['Violent crime total', 'Property crime total'],
                 stacked=True, rot=90, width=800, legend='top_left')
```
#### Scatter
The scatter plot supports many of the same features as the other chart types we have seen so far but can also be colored by another variable using the ``c`` option.
```python
crime.hvplot.scatter(x='Violent Crime rate', y='Burglary rate', c='Year')
```
Anytime that color is being used to represent a dimension, the ``cmap`` option can be used to control the colormap that is used to represent that dimension. Additionally, the colorbar can be disabled using ``colorbar=False``.
#### Step
A step chart is very similar to a line chart but instead of linearly interpolating between samples the step chart visualizes discrete steps. The point at which to step can be controlled via the ``where`` keyword allowing `'pre'`, `'mid'` (default) and `'post'` values:
```python
crime.hvplot.step(x='Year', y=['Robbery', 'Aggravated assault'])
```
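The effect of ``where`` is easiest to see from the corner points of the resulting path. The helper below is a hypothetical, dependency-free sketch (``step_vertices`` is not part of hvPlot), and it assumes the usual convention also used by matplotlib's step styles: ``'pre'`` jumps at the previous sample, ``'post'`` at the next one, and ``'mid'`` halfway between them.

```python
def step_vertices(xs, ys, where='mid'):
    """Return the (x, y) corner points of a step path through the samples."""
    points = [(xs[0], ys[0])]
    for i in range(1, len(xs)):
        if where == 'pre':
            # Jump to the new value at the previous sample's x position.
            jump_x = xs[i - 1]
        elif where == 'post':
            # Hold the old value until the next sample's x position.
            jump_x = xs[i]
        elif where == 'mid':
            # Jump halfway between the two samples.
            jump_x = (xs[i - 1] + xs[i]) / 2
        else:
            raise ValueError(f"unknown where: {where!r}")
        points += [(jump_x, ys[i - 1]), (jump_x, ys[i]), (xs[i], ys[i])]

    # Drop consecutive duplicates produced when the jump sits on a sample.
    path = [points[0]]
    for point in points[1:]:
        if point != path[-1]:
            path.append(point)
    return path


xs, ys = [0, 2, 4], [1, 3, 2]
for mode in ('pre', 'mid', 'post'):
    print(mode, step_vertices(xs, ys, where=mode))
```

All three modes pass through the same samples; they differ only in where the vertical jumps are placed.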
#### HexBins
You can create hexagonal bin plots with the ``hexbin`` method. Hexbin plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually. Since these data are not regularly distributed, we'll use the ``logz`` option to map the z-axis (color) to a log-scale colorbar.
```python
flights.hvplot.hexbin(x='airtime', y='arrdelay', width=600, height=500, logz=True)
```
#### Bivariate
You can create a 2D density plot with the ``bivariate`` method. Bivariate plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually.
```python
crime.hvplot.bivariate(x='Violent Crime rate', y='Burglary rate', width=600, height=500)
```
#### HeatMap
A ``HeatMap`` lets us view the relationship between three variables, so we specify the 'x' and 'y' variables and an additional 'C' variable. Additionally we can define a ``reduce_function`` that computes the values for each bin from the samples that fall into it. Here we plot the 'depdelay' (i.e. departure delay) for each day of the month and carrier in the dataset:
```python
flights.compute().hvplot.heatmap(x='day', y='carrier', C='depdelay', reduce_function=np.mean, colorbar=True)
```
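The role of the reduce function can be sketched with a plain pandas pivot table: every (x, y) cell collects the samples that fall into it, and the function collapses them to one number. The rows below are invented, and the sketch illustrates the idea only, not hvPlot's internals.

```python
import pandas as pd

# Made-up departure delays (values are invented).
df = pd.DataFrame({
    'day': [1, 1, 1, 2],
    'carrier': ['AA', 'AA', 'OH', 'AA'],
    'depdelay': [4.0, 8.0, 10.0, -2.0],
})

# One cell per (carrier, day); samples sharing a cell are averaged.
grid = df.pivot_table(index='carrier', columns='day',
                      values='depdelay', aggfunc='mean')
print(grid)
```

Cells with no samples, such as carrier ``OH`` on day 2 here, are left empty (``NaN``).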
#### Tables
Unlike all other plot types, a table only supports one signature: either all columns are plotted, or a subset of columns can be selected by defining the ``columns`` explicitly:
```python
crime.hvplot.table(columns=['Year', 'Population', 'Violent Crime rate'], width=400)
```
### Distributions
Plotting distributions differs slightly from other plots since they plot only one variable in the simple case rather than plotting two or more variables against each other. Therefore when plotting these plot types no ``index`` or ``x`` value needs to be supplied. Instead:
1. Declare a single ``y`` variable, e.g. ``source.plot.hist(variable)``, or
2. Declare a ``y`` variable and ``by`` variable, e.g. ``source.plot.hist(variable, by='Group')``, or
3. Declare columns or plot all columns, e.g. ``source.plot.hist()`` or ``source.plot.hist(columns=['A', 'B', 'C'])``
#### Histogram
The Histogram is the simplest example of a distribution; often we simply plot the distribution of a single variable, in this case the 'Violent Crime rate'. Additionally we can define a range over which to compute the histogram and the number of bins using the ``bin_range`` and ``bins`` arguments respectively:
```python
crime.hvplot.hist(y='Violent Crime rate')
```
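What a bin count and a bin range mean can be shown with NumPy's own histogram function. This is a sketch on invented values; hvPlot's exact handling of edge cases may differ from ``np.histogram``.

```python
import numpy as np

# Made-up rates (values are invented); 980 lies outside the chosen range.
values = np.array([120.0, 180.0, 250.0, 310.0, 390.0, 450.0, 980.0])

# Five equal-width bins over the range 0 to 500.
counts, edges = np.histogram(values, bins=5, range=(0, 500))
print(edges)   # bin boundaries
print(counts)  # number of samples per bin
```

With ``np.histogram``, samples outside the range are not counted, so restricting the range is a simple way to keep a few extreme values from stretching the bins.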
Or we can plot the distribution of multiple columns:
```python
columns = ['Violent Crime rate', 'Property crime rate', 'Burglary rate']
crime.hvplot.hist(y=columns, bins=50, alpha=0.5, legend='top', height=400)
```
We can also group the data by another variable. Here we'll use ``subplots`` to split each carrier out into its own plot:
```python
flight_subset = flights[flights.carrier.isin(['AA', 'US', 'OH'])]
flight_subset.hvplot.hist('depdelay', by='carrier', bins=20, bin_range=(-20, 100), width=300, subplots=True)
```
#### KDE (density)
You can also create density plots using ``hvplot.kde()`` or ``hvplot.density()``:
```python
crime.hvplot.kde(y='Violent Crime rate')
```
Comparing the distribution of multiple columns is also possible:
```python
columns=['Violent Crime rate', 'Property crime rate', 'Burglary rate']
crime.hvplot.kde(y=columns, alpha=0.5, value_label='Rate', legend='top_right')
```
``hvplot.kde`` also supports the ``by`` keyword:
```python
flight_subset = flights[flights.carrier.isin(['AA', 'US', 'OH'])]
flight_subset.hvplot.kde('depdelay', by='carrier', xlim=(-20, 70), width=300, subplots=True)
```
#### Box-Whisker Plots
Just like the other distribution-based plot types, the box-whisker plot supports plotting a single column:
```python
crime.hvplot.box(y='Violent Crime rate')
```
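For reference, the quantities a box-whisker plot summarizes can be computed with the standard library. The sketch below assumes the common Tukey convention (whiskers reach the most extreme points within 1.5 times the interquartile range); the ``box_stats`` helper is hypothetical, the values are invented, and the plotting backend may compute quartiles slightly differently.

```python
import statistics


def box_stats(values):
    """Return quartiles, whisker ends and outliers for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme samples still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {'q1': q1, 'median': median, 'q3': q3,
            'whiskers': (inside[0], inside[-1]), 'outliers': outliers}


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the value 30 lies beyond the upper fence, so it would be drawn as an individual outlier point rather than extending the whisker.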
It also supports multiple columns and the same options as seen previously (``legend``, ``invert``, ``value_label``):
```python
columns = ['Burglary rate', 'Larceny-theft rate', 'Motor vehicle theft rate',
           'Property crime rate', 'Violent Crime rate']
crime.hvplot.box(y=columns, group_label='Crime', legend=False, value_label='Rate (per 100k)', invert=True)
```
Lastly, it also supports using the ``by`` keyword to split the data into multiple subsets:
```python
flight_subset = flights[flights.carrier.isin(['AA', 'US', 'OH'])]
flight_subset.hvplot.box('depdelay', by='carrier', ylim=(-10, 70))
```
## Composing Plots
One of the core strengths of HoloViews is the ease of composing
different plots. Individual plots can be composed using the ``*`` and
``+`` operators, which overlay and compose plots into layouts
respectively. For more information on composing objects, see the
HoloViews [User Guide](https://holoviews.org/user_guide/Composing_Elements.html).
By using these operators we can combine multiple plots into composite plots. A simple example is overlaying two plot types:
```python
crime.hvplot(x='Year', y='Violent Crime rate') * crime.hvplot.scatter(x='Year', y='Violent Crime rate', c='k')
```
We can also lay out different plots and tables together:
```python
(crime.hvplot.bar(x='Year', y='Violent Crime rate', rot=90, width=550) +
 crime.hvplot.table(['Year', 'Population', 'Violent Crime rate'], width=420))
```
## Large data
The previous examples summarized the fairly large airline dataset using statistical plot types that aggregate the data into a feasible subset for plotting. We can instead aggregate the data directly into the viewable image using [datashader](https://datashader.org), which provides a rendering of the entire set of raw data available (as far as the resolution of the screen allows). Here we plot the 'airtime' against the 'distance':
```python
flights.hvplot.scatter(x='distance', y='airtime', datashade=True)
```
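The core idea of aggregating points into an image can be sketched with a 2D histogram: every pixel stores how many points fall into it, so the size of the result depends on the image resolution rather than on the number of rows. This is not Datashader's implementation, only an illustration with NumPy on synthetic data; the grid size and the relation between the two variables are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic stand-ins for the 'distance' and 'airtime' columns.
distance = rng.uniform(100, 3000, n)
airtime = distance / 8 + rng.normal(0, 10, n)

# Count the points that land in each cell of a 300 x 200 "pixel" grid.
counts, xedges, yedges = np.histogram2d(distance, airtime, bins=(300, 200))
print(counts.shape, int(counts.sum()))
```

The ``counts`` array is what would be color-mapped into an image; 100,000 points collapse to a fixed 300 x 200 grid.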
## Groupby
Thanks to the ability of HoloViews to explore a parameter space with a set of widgets we can apply a groupby along a particular column or dimension. For example we can view the distribution of departure delays by carrier grouped by day, allowing the user to choose which day to display:
```python
flights.hvplot.violin(y='depdelay', by='carrier', groupby='dayofweek', ylim=(-20, 60), height=500)
```
This user guide merely provided an overview of the available plot types; for a detailed description of how to customize plots, see the [Customization](Customization.ipynb) user guide.