Introduction to data visualization with Python

Introduction to data visualization with Python#

The Python Visualisation Landscape#

Can’t see the forest for the trees?#

PyViz.org attempts to shine light on common patterns and use cases, comparing or discussing multiple plotting libraries.
Have a look at Bence Arató’s talk at EuroPython 2020 (30 min.) if you prefer a video.
Use the Grammar of Graphics as a mental model:
- Wilkinson’s original Grammar of Graphics
- Wickham’s extension into a Layered Grammar of Graphics which is implemented in ggplot2 in R and plotnine
- Vega-Lite’s extension that includes a Grammar of Interactions

Learning objectives#

Know how to apply a grammar of interactions to make interactive charts with the most commonly used Python graphing libraries (bokeh, plotly, altair)
Know how to make single-page apps in Jupyter notebook with the most commonly used Python dashboarding and analytic app libraries (panel, plotly dash)

Today we focus on how to make interactive visualisations#

Apply Grammar of (Interactive) Graphics to dissect different implementations of Gapminder
Look into under the hood to get better understanding how everything works, so you can start using the libraries properly
Dissect different designs patterns of interactive graphs that are useful voor data analytics

We will not talk about the what of visualisation#

Go through the slides and online resources of Sara Sprinkhuizen’s lecture on Data visualisation to gain a better understanding what makes a good data viz
Use one of the many good online resources to help you choose the right chart for your task:
Stephen Few’s classic book on Information Dashboard Design

Grammar of (Interactive) Graphics#

Leland’s seven classes#

GoG class	Description	Comments and examples
1. Varset	A set of one or more variables	More generic definition than just a table or `dataframe`
2. Algebra	Produce combinations of variables	Join, concatenate, group by
3. Scales	Scale variables	Transformation like taking log or normalizing
4. Statistics	Compute statistical summeries	Generates a new varset
5. Geometry	Control the type of plot	point, line, area, path, bar, polygon, edge etc.
6. Coordinates	The coordinate system and faceting	Usually Cartesian, but also polar or geographic coordinates
7. Aesthetics	Actual mapping of variables to a perceivable graphic	Visual variables include position, size, shape, orientation, brightness, color, granularity. For interactive graphics also blur, sound and motion.

Vega-Lite’s Grammar of Interactions#

Selection component	Description	Comments and examples
type	Way in which backing points are selected as minimal set to identify all selected points	point, list, interval
predicate	Logic to determine selected points	Inside or outside dragged area, within a range etc.
domain or range	Invert screen position to data values	Click on a mark for selecting single point, drag to select points in area etc.
event	The actual input event	Mouseover, selection by dragging
init	Initialize selection with specific points	Used for automatically determining scale extents
transforms	Manipulate selection	E.g. moving a rectangular selection
resolve	Re-evaluate visual encodings as selections change	Change color (highlighting), use selection as input for other encodings (cross-filtering), re-define scales etc.

Choosing your Python libraries for interactive visualizations and dashboards#

Pythonistas are somewhat envious of that the fact that the R stack has set the standard for interactive data visualizations and dashboarding with ggplot2 and Shiny. But recently Python has caught up, going by the number of stars on GitHub.

For you interactive data visualization work, you need to make two choices:

Choose your interactive plotting library for making figures. Altair, Bokeh and Plotly are the most popular ones, you can find a more detailed comparison here
Choose your dashboarding library for making, you guessed it, dashboards. Streamlit, Voilá, Plotly Dash and Panel are the most popular ones, you can read more here and here

Note that although many plotting libraries are supported in the dashboarding libraries, some integrations work better than others. Sticking to the same ecosystem yields the following combinations:

Plotly + Plotly Dash: backed by a Canadian company under the same name, this is an excellent stack to work in. Over time you can upgrade to a paid (enterprise) version including low-code development environments and hosting for ease of sharing apps.
Bokeh + Panel: pure open source libraries which are financially supported by the NumFOCUS and Anaconda. Complete freedom to integrate these libraries into your own stack without ever having to worry about licensing.
Altair + Streamlit: the new kids on the block, but with an impressive pedigree. University of Washinton Interactive Data Lab are the core developers of Altair, with Jeffrey Heer, Jake VanderPlas and Mike Bohstock amongst their ranks. Note that Tableau is a spin-off from this community, too. Streamlit is incorporated in the US, but it’s creators are spread all over the world. It was acquired by Snowflake in 2022.

This curriculum focuses on Altair and Streamlit.

Gapminder in many different ways#

Let’s look at different implementations of the classic Gapminder bubble chart in a Streamlit dashboard. The source code is as follows

from dataclasses import dataclass

import altair as alt
from bokeh.models import (Button, CategoricalColorMapper, ColumnDataSource, HoverTool, Label, LogTicker, Slider)
from bokeh.palettes import Spectral6
from bokeh.plotting import figure
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import streamlit as st


@dataclass
class Gapminder:
    """Class for storing Gapminder data and plots"""

    url: str = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"
    year: int = 1952
    show_data: bool = False
    show_legend: bool = True
    chart_height: int = 500

    def __post_init__(self):
        self.dataset = pd.read_csv(self.url)
        self.df = self.get_data()
        self.title = f"Life expectancy vs. GPD ({self.year}"
        self.xlabel = "GDP per capita (2000 dollars)"
        self.ylabel = "Life expectancy (years)"
        self.xlim = (self.df['gdpPercap'].min()-100,self.df['gdpPercap'].max()+1000)
        self.ylim = (20, 90)

    def get_data(self):
        """Return gapminder data for a given year.

        Countries with gdpPercap lower than 10,000 are discarded.
        """
        df = self.dataset[
            (self.dataset.year == self.year) & (self.dataset.gdpPercap < 10000)
        ].copy()
        df["size"] = np.sqrt(df["pop"] * 2.666051223553066e-05)
        return df

    def altair(self):
        legend = {} if self.show_legend else {"legend": None}
        plot = (
            alt.Chart(self.df)
            .mark_circle()
            .  (
                alt.X(
                    "gdpPercap:Q",
                    scale=alt.Scale(type="log"),
                    axis=alt.Axis(title=self.xlabel),
                ),
                alt.Y(
                    "lifeExp:Q",
                    scale=alt.Scale(zero=False, domain=self.ylim),
                    axis=alt.Axis(title=self.ylabel),
                ),
                size=alt.Size("pop:Q", scale=alt.Scale(type="log"), legend=None),
                color=alt.Color(
                    "continent", scale=alt.Scale(scheme="category10"), **legend
                ),
                tooltip=["continent", "country", "gdpPercap", "lifeExp"],
            )
            .properties(title="Altair", height=self.chart_height)
            .configure_title(anchor="start")
        )

        return plot.interactive()

    def plotly(self):
        traces = []
        for continent, self.df in self.df.groupby("continent"):
            marker = dict(
                symbol="circle",
                sizemode="area",
                sizeref=0.1,
                size=self.df["size"],
                line=dict(width=2),
            )
            traces.append(
                go.Scatter(
                    x=self.df.gdpPercap,
                    y=self.df.lifeExp,
                    mode="markers",
                    marker=marker,
                    name=continent,
                    text=self.df.country,
                )
            )

        axis_opts = dict(
            gridcolor="rgb(255, 255, 255)", zerolinewidth=1, ticklen=5, gridwidth=2
        )
        layout = go.Layout(
            title="Plotly",
            showlegend=self.show_legend,
            height=self.chart_height,
            xaxis=dict(title=self.xlabel, type="log", **axis_opts),
            yaxis=dict(title=self.ylabel, **axis_opts),
        )

        return go.Figure(data=traces, layout=layout)

    
    def bokeh(self):
        # note bokeh version issue https://discuss.streamlit.io/t/bokeh-2-0-potentially-broken-in-streamlit/2025/8
        source = ColumnDataSource(self.df)
        color_mapper = CategoricalColorMapper(palette=Spectral6, factors=self.df.continent.unique())
        plot = figure(title="Bokeh", x_axis_type="log", height=self.chart_height)
        plot.xaxis.axis_label = self.xlabel
        plot.xaxis.ticker=LogTicker()
        plot.yaxis.axis_label = self.ylabel
        plot.scatter(
            x="gdpPercap",
            y="lifeExp",
            size="size",
            source=source,
            fill_color={"field": "continent", "transform": color_mapper},
            fill_alpha=0.8,
            line_color="#7c7e71",
            line_width=0.5,
            line_alpha=0.5,
            legend_group="continent"
            )
        plot.add_tools(HoverTool(tooltips=[
            ("continent:", "@continent"),
            ("country:", "@country"),
            ("GDP per capita:", "@gdpPercap"),
            ("Life expectancy:", "@lifeExp")], show_arrow=False, point_policy="follow_mouse"))
        return plot

    
    def pyplot(self):
            data = self.df
            title = "Matplotlib"
            fig, ax = plt.subplots(figsize=(3, 3))
            ax.set_xscale("log")
            ax.set_title(title, fontsize=16)
            ax.set_xlabel(self.xlabel, fontsize=10)
            ax.set_ylabel(self.ylabel, fontsize=10)
            ax.set_ylim(self.ylim)
            ax.set_xlim(self.xlim)

            for continent, df in data.groupby('continent'):
                ax.scatter(df.gdpPercap, y=df.lifeExp, s=df['size']*5,
                        edgecolor='black', label=continent)
                
            if self.show_legend:
                ax.legend(loc=4)
            
            return fig


# initiate
gapminder = Gapminder()
st.set_page_config(layout="wide")

# side bar
st.sidebar.subheader("Widgets")
st.sidebar.markdown("Use the slider to show data from subsequent years.")
gapminder.year = st.sidebar.slider(label="", min_value=1952, max_value=2007, step=5)
gapminder.show_legend = st.sidebar.checkbox("Toggle legend", gapminder.show_legend)
gapminder.df = gapminder.get_data()


# main body
st.title("Gapminder in different ways")
st.markdown(
    """Demo of different interactive plotting libraries reproducing the classic
    [Gapminder bubble chart](https://discuss.streamlit.io/t/bokeh-2-0-potentially-broken-in-streamlit/2025/8).
    """
)

with st.expander("Show data"):
    st.dataframe(gapminder.df)

col1, col2 = st.columns([1, 1])

with col1:
    st.altair_chart(gapminder.altair(), True)
    st.plotly_chart(gapminder.plotly(), True)

with col2:
    st.bokeh_chart(gapminder.bokeh(), use_container_width=True)
    st.pyplot(gapminder.pyplot(), False)

  Cell In[1], line 49
    .  (
       ^
SyntaxError: invalid syntax

Altair#

Structure of a Altair plot#


alt	convention to `import altair as alt`
.Chart(data)	instantiate `Chart` object with data
.transform{aggregate\|bin\|calculate\|…}_	apply transformations before visualization
.mark{area\|bar\|circle\|\|…}_	choose the geometry c.q. type of plot
.encode(x=.., y=.., color=..}	mapping of variables to a perceivable graphic
.add_selection(…)	define type and predicates for interactive selections
.transform_filter(…)	apply selection filter
.properties(width=…, height=…)	set properties of figure
.interactive()	enable panning and zooming

Workshop exercises#

We are going to use Altair for exploratory data analysis (EDA), with a dataset of choice. Try to make the following, starting with a simple graph and building up to more complex interactions

Create a histogram of a feature of interest using alt.Chart().mark_bar()
Create a small multiple of histograms for various features using .facet or .repeat. Refer to the section on multi-view composition
Add an interactive average to your barchart, as shown in the example below
Create a dynamic query, where the histogram is changed interactively based on some input. See the section on Altair interaction
Create a dynamic query where the filter is based on an other graph
Create a Streamlit dashboard that presents your main findings and conclusions from your EDA

Example to get started#

import altair as alt
from vega_datasets import data


df = data.seattle_weather()
df

	date	precipitation	temp_max	temp_min	wind	weather
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle
1	2012-01-02	10.9	10.6	2.8	4.5	rain
2	2012-01-03	0.8	11.7	7.2	2.3	rain
3	2012-01-04	20.3	12.2	5.6	4.7	rain
4	2012-01-05	1.3	8.9	2.8	6.1	rain
...	...	...	...	...	...	...
1456	2015-12-27	8.6	4.4	1.7	2.9	fog
1457	2015-12-28	1.5	5.0	1.7	1.3	fog
1458	2015-12-29	0.0	7.2	0.6	2.6	fog
1459	2015-12-30	0.0	5.6	-1.0	3.4	sun
1460	2015-12-31	0.0	5.6	-2.1	3.5	sun

1461 rows × 6 columns

step 1: basic bar chart#

bar = alt.Chart(df).mark_bar().encode(x="month(date):T", y="mean(precipitation):Q")
bar

Step 2: add a rule with at the average#

rule = alt.Chart(df).mark_rule(color="firebrick").encode(y="mean(precipitation):Q")

Challenge#

Extend your code to build a small multiple with interaction for each multiple. Feel free to submit your solution, so it can be included in this notebook.

Closing remarks#

Plotly is a good alternative to Altair#

Animations with Plotly#

import plotly.express as px

gapminder = px.data.gapminder()
gapminder_animate = px.scatter(gapminder, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country", facet_col="continent",
           log_x=True, size_max=45, range_x=[100,100000], range_y=[25,90], width=800, height=400)

gapminder_animate.show()

Read more on how you can use plotly.express directly from pandas.

Know your limits#

Single-Page Apps (SPAs) are do-able
Beware when you are moving into advanced development territory
- Multi-page apps
- Sharing data between callbacks: when your datasets get too large, you need to store it somewhere and keep track of state of your dataset.
- Working with callbacks does have limitations. If you notice that you need to nest callbacks (callback A -> callback B -> callback C -> final result), you are on your way to the Callback Hell a.k.a. the Pyramid of Doom. Stop and reconsider before continuing.

My personal recommendations#

Choose any interactive plotting library and get to know it: altair, bokeh or plotly
Choose any of the higher level APIs to be productive in you data analysis work: plotly Express or Altair
Choose any of the dashboarding libraries to make an interactive notebook app: Streamlit or Dash
Don’t try to build your own BI tool. Buy one. It saves time and money. (PS: have a look at redash.io)

Introduction to data visualization with Python

Contents

Introduction to data visualization with Python#

The Python Visualisation Landscape#

Can’t see the forest for the trees?#

Learning objectives#

Today we focus on how to make interactive visualisations#

We will not talk about the what of visualisation#

Grammar of (Interactive) Graphics#

Leland’s seven classes#

Vega-Lite’s Grammar of Interactions#

Choosing your Python libraries for interactive visualizations and dashboards#

Gapminder in many different ways#

Altair#

Structure of a Altair plot#

Workshop exercises#

Example to get started#

step 1: basic bar chart#

Step 2: add a rule with at the average#

Step 3: Create small multiple with `.facet`#

Step 4: Add interaction#

Challenge#

Closing remarks#

Plotly is a good alternative to Altair#

Animations with Plotly#

Know your limits#

My personal recommendations#

Introduction to data visualization with Python

Contents

Introduction to data visualization with Python#

The Python Visualisation Landscape#

Can’t see the forest for the trees?#

Learning objectives#

Today we focus on how to make interactive visualisations#

We will not talk about the what of visualisation#

Grammar of (Interactive) Graphics#

Leland’s seven classes#

Vega-Lite’s Grammar of Interactions#

Choosing your Python libraries for interactive visualizations and dashboards#

Gapminder in many different ways#

Altair#

Structure of a Altair plot#

Workshop exercises#

Example to get started#

step 1: basic bar chart#

Step 2: add a rule with at the average#

Step 3: Create small multiple with .facet#

Step 4: Add interaction#

Challenge#

Closing remarks#

Plotly is a good alternative to Altair#

Animations with Plotly#

Know your limits#

My personal recommendations#

Step 3: Create small multiple with `.facet`#