Introduction to data visualization with Python#

The Python Visualisation Landscape#

Can’t see the forest for the trees?#

Learning objectives#

  • Know how to apply a grammar of interactions to make interactive charts with the most commonly used Python graphing libraries (bokeh, plotly, altair)

  • Know how to make single-page apps in Jupyter notebook with the most commonly used Python dashboarding and analytic app libraries (panel, plotly dash)

Today we focus on how to make interactive visualisations#

  • Apply Grammar of (Interactive) Graphics to dissect different implementations of Gapminder

  • Look under the hood to get a better understanding of how everything works, so you can start using the libraries properly

  • Dissect different design patterns of interactive graphs that are useful for data analytics

We will not talk about the what of visualisation#

Grammar of (Interactive) Graphics#

Leland’s seven classes#

| GoG class | Description | Comments and examples |
| --- | --- | --- |
| 1. Varset | A set of one or more variables | More generic definition than just a table or dataframe |
| 2. Algebra | Produce combinations of variables | Join, concatenate, group by |
| 3. Scales | Scale variables | Transformations like taking the log or normalizing |
| 4. Statistics | Compute statistical summaries | Generates a new varset |
| 5. Geometry | Control the type of plot | point, line, area, path, bar, polygon, edge etc. |
| 6. Coordinates | The coordinate system and faceting | Usually Cartesian, but also polar or geographic coordinates |
| 7. Aesthetics | Actual mapping of variables to a perceivable graphic | Visual variables include position, size, shape, orientation, brightness, color, granularity. For interactive graphics also blur, sound and motion. |

Vega-Lite’s Grammar of Interactions#

| Selection component | Description | Comments and examples |
| --- | --- | --- |
| type | Way in which backing points are selected as minimal set to identify all selected points | point, list, interval |
| predicate | Logic to determine selected points | Inside or outside dragged area, within a range etc. |
| domain or range | Invert screen position to data values | Click on a mark for selecting single point, drag to select points in area etc. |
| event | The actual input event | Mouseover, selection by dragging |
| init | Initialize selection with specific points | Used for automatically determining scale extents |
| transforms | Manipulate selection | E.g. moving a rectangular selection |
| resolve | Re-evaluate visual encodings as selections change | Change color (highlighting), use selection as input for other encodings (cross-filtering), re-define scales etc. |
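
The components above can be sketched in plain Python without any plotting library. This is a toy model under stated assumptions, not how Vega-Lite is implemented: the class `IntervalSelection`, the function `resolve_opacity` and the data values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntervalSelection:
    """Type: an interval, backed by just two values (low and high end)."""

    field: str
    lo: float
    hi: float

    def predicate(self, row: dict) -> bool:
        """Predicate: a row is selected when its value lies within the range."""
        return self.lo <= row[self.field] <= self.hi

    def translate(self, delta: float) -> "IntervalSelection":
        """Transform: move the selected interval without changing its width."""
        return IntervalSelection(self.field, self.lo + delta, self.hi + delta)


def resolve_opacity(rows, selection):
    """Resolve: re-evaluate a visual encoding (opacity) for the current selection."""
    return [1.0 if selection.predicate(row) else 0.3 for row in rows]


# Made-up monthly values, for illustration only
rows = [
    {"month": 1, "precipitation": 4.5},
    {"month": 2, "precipitation": 3.2},
    {"month": 3, "precipitation": 3.6},
    {"month": 4, "precipitation": 2.7},
]

brush = IntervalSelection("month", lo=2, hi=3)  # init: start with months 2-3
opacities = resolve_opacity(rows, brush)
moved = resolve_opacity(rows, brush.translate(1))  # drag the brush one month on
```

The point of the sketch is that a selection is small (two numbers), while the highlighted marks are derived from it on every change.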

Choosing your Python libraries for interactive visualizations and dashboards#

Pythonistas are somewhat envious of the fact that the R stack has set the standard for interactive data visualizations and dashboarding with ggplot2 and Shiny. But recently Python has caught up, going by the number of stars on GitHub.

For your interactive data visualization work, you need to make two choices:

  • Choose your interactive plotting library for making figures. Altair, Bokeh and Plotly are the most popular ones; you can find a more detailed comparison here

  • Choose your dashboarding library for making, you guessed it, dashboards. Streamlit, Voilà, Plotly Dash and Panel are the most popular ones; you can read more here and here

Note that although many plotting libraries are supported in the dashboarding libraries, some integrations work better than others. Sticking to the same ecosystem yields the following combinations:

  • Plotly + Plotly Dash: backed by a Canadian company of the same name, this is an excellent stack to work in. Over time you can upgrade to a paid (enterprise) version including low-code development environments and hosting for ease of sharing apps.

  • Bokeh + Panel: pure open source libraries which are financially supported by NumFOCUS and Anaconda. Complete freedom to integrate these libraries into your own stack without ever having to worry about licensing.

  • Altair + Streamlit: the new kids on the block, but with an impressive pedigree. The University of Washington Interactive Data Lab is home to the core developers of Altair, with Jeffrey Heer, Jake VanderPlas and Mike Bostock amongst their ranks. Note that Tableau is a spin-off from this community, too. Streamlit is incorporated in the US, but its creators are spread all over the world. It was acquired by Snowflake in 2022.

This curriculum focuses on Altair and Streamlit.

Gapminder in many different ways#

Let’s look at different implementations of the classic Gapminder bubble chart in a Streamlit dashboard. The source code is as follows:

from dataclasses import dataclass

import altair as alt
from bokeh.models import (Button, CategoricalColorMapper, ColumnDataSource, HoverTool, Label, LogTicker, Slider)
from bokeh.palettes import Spectral6
from bokeh.plotting import figure
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import streamlit as st


@dataclass
class Gapminder:
    """Class for storing Gapminder data and plots"""

    url: str = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"
    year: int = 1952
    show_data: bool = False
    show_legend: bool = True
    chart_height: int = 500

    def __post_init__(self):
        self.dataset = pd.read_csv(self.url)
        self.df = self.get_data()
        self.title = f"Life expectancy vs. GDP ({self.year})"
        self.xlabel = "GDP per capita (2000 dollars)"
        self.ylabel = "Life expectancy (years)"
        self.xlim = (self.df['gdpPercap'].min()-100,self.df['gdpPercap'].max()+1000)
        self.ylim = (20, 90)

    def get_data(self):
        """Return gapminder data for a given year.

        Countries with gdpPercap of 10,000 or higher are discarded.
        """
        df = self.dataset[
            (self.dataset.year == self.year) & (self.dataset.gdpPercap < 10000)
        ].copy()
        df["size"] = np.sqrt(df["pop"] * 2.666051223553066e-05)
        return df

    def altair(self):
        legend = {} if self.show_legend else {"legend": None}
        plot = (
            alt.Chart(self.df)
            .mark_circle()
            .encode(
                alt.X(
                    "gdpPercap:Q",
                    scale=alt.Scale(type="log"),
                    axis=alt.Axis(title=self.xlabel),
                ),
                alt.Y(
                    "lifeExp:Q",
                    scale=alt.Scale(zero=False, domain=self.ylim),
                    axis=alt.Axis(title=self.ylabel),
                ),
                size=alt.Size("pop:Q", scale=alt.Scale(type="log"), legend=None),
                color=alt.Color(
                    "continent", scale=alt.Scale(scheme="category10"), **legend
                ),
                tooltip=["continent", "country", "gdpPercap", "lifeExp"],
            )
            .properties(title="Altair", height=self.chart_height)
            .configure_title(anchor="start")
        )

        return plot.interactive()

    def plotly(self):
        traces = []
        # use a local loop variable: rebinding self.df here would leave the
        # instance holding only the last continent after the loop
        for continent, group in self.df.groupby("continent"):
            marker = dict(
                symbol="circle",
                sizemode="area",
                sizeref=0.1,
                size=group["size"],
                line=dict(width=2),
            )
            traces.append(
                go.Scatter(
                    x=group.gdpPercap,
                    y=group.lifeExp,
                    mode="markers",
                    marker=marker,
                    name=continent,
                    text=group.country,
                )
            )

        axis_opts = dict(
            gridcolor="rgb(255, 255, 255)", zerolinewidth=1, ticklen=5, gridwidth=2
        )
        layout = go.Layout(
            title="Plotly",
            showlegend=self.show_legend,
            height=self.chart_height,
            xaxis=dict(title=self.xlabel, type="log", **axis_opts),
            yaxis=dict(title=self.ylabel, **axis_opts),
        )

        return go.Figure(data=traces, layout=layout)

    
    def bokeh(self):
        # note bokeh version issue https://discuss.streamlit.io/t/bokeh-2-0-potentially-broken-in-streamlit/2025/8
        source = ColumnDataSource(self.df)
        color_mapper = CategoricalColorMapper(palette=Spectral6, factors=self.df.continent.unique())
        plot = figure(title="Bokeh", x_axis_type="log", height=self.chart_height)
        plot.xaxis.axis_label = self.xlabel
        plot.xaxis.ticker=LogTicker()
        plot.yaxis.axis_label = self.ylabel
        plot.scatter(
            x="gdpPercap",
            y="lifeExp",
            size="size",
            source=source,
            fill_color={"field": "continent", "transform": color_mapper},
            fill_alpha=0.8,
            line_color="#7c7e71",
            line_width=0.5,
            line_alpha=0.5,
            legend_group="continent"
            )
        plot.add_tools(HoverTool(tooltips=[
            ("continent:", "@continent"),
            ("country:", "@country"),
            ("GDP per capita:", "@gdpPercap"),
            ("Life expectancy:", "@lifeExp")], show_arrow=False, point_policy="follow_mouse"))
        return plot

    
    def pyplot(self):
        data = self.df
        title = "Matplotlib"
        fig, ax = plt.subplots(figsize=(3, 3))
        ax.set_xscale("log")
        ax.set_title(title, fontsize=16)
        ax.set_xlabel(self.xlabel, fontsize=10)
        ax.set_ylabel(self.ylabel, fontsize=10)
        ax.set_ylim(self.ylim)
        ax.set_xlim(self.xlim)

        for continent, df in data.groupby("continent"):
            ax.scatter(df.gdpPercap, df.lifeExp, s=df["size"] * 5,
                       edgecolor="black", label=continent)

        if self.show_legend:
            ax.legend(loc=4)

        return fig


# initiate
gapminder = Gapminder()
st.set_page_config(layout="wide")

# side bar
st.sidebar.subheader("Widgets")
st.sidebar.markdown("Use the slider to show data from subsequent years.")
gapminder.year = st.sidebar.slider(label="Year", min_value=1952, max_value=2007, step=5)
gapminder.show_legend = st.sidebar.checkbox("Toggle legend", gapminder.show_legend)
gapminder.df = gapminder.get_data()


# main body
st.title("Gapminder in different ways")
st.markdown(
    """Demo of different interactive plotting libraries reproducing the classic
    [Gapminder bubble chart](https://www.gapminder.org/tools/).
    """
)

with st.expander("Show data"):
    st.dataframe(gapminder.df)

col1, col2 = st.columns([1, 1])

with col1:
    st.altair_chart(gapminder.altair(), True)
    st.plotly_chart(gapminder.plotly(), True)

with col2:
    st.bokeh_chart(gapminder.bokeh(), use_container_width=True)
    st.pyplot(gapminder.pyplot(), False)

Altair#

Structure of an Altair plot#

  • alt: convention to import altair as alt

  • .Chart(data): instantiate Chart object with data

  • .transform_{aggregate|bin|calculate|…}: apply transformations before visualization

  • .mark_{area|bar|circle|…}: choose the geometry, i.e. the type of plot

  • .encode(x=.., y=.., color=..): mapping of variables to a perceivable graphic

  • .add_params(…) (.add_selection(…) before Altair 5): define type and predicates for interactive selections

  • .transform_filter(…): apply selection filter

  • .properties(width=…, height=…): set properties of figure

  • .interactive(): enable panning and zooming

Workshop exercises#

We are going to use Altair for exploratory data analysis (EDA), with a dataset of choice. Try to make the following, starting with a simple graph and building up to more complex interactions

  1. Create a histogram of a feature of interest using alt.Chart().mark_bar()

  2. Create a small multiple of histograms for various features using .facet or .repeat. Refer to the section on multi-view composition

  3. Add an interactive average to your bar chart, as shown in the example below

  4. Create a dynamic query, where the histogram is changed interactively based on some input. See the section on Altair interaction

  5. Create a dynamic query where the filter is based on another graph

  6. Create a Streamlit dashboard that presents your main findings and conclusions from your EDA

Example to get started#

import altair as alt
from vega_datasets import data


df = data.seattle_weather()
df
            date  precipitation  temp_max  temp_min  wind  weather
0     2012-01-01            0.0      12.8       5.0   4.7  drizzle
1     2012-01-02           10.9      10.6       2.8   4.5     rain
2     2012-01-03            0.8      11.7       7.2   2.3     rain
3     2012-01-04           20.3      12.2       5.6   4.7     rain
4     2012-01-05            1.3       8.9       2.8   6.1     rain
...          ...            ...       ...       ...   ...      ...
1456  2015-12-27            8.6       4.4       1.7   2.9      fog
1457  2015-12-28            1.5       5.0       1.7   1.3      fog
1458  2015-12-29            0.0       7.2       0.6   2.6      fog
1459  2015-12-30            0.0       5.6      -1.0   3.4      sun
1460  2015-12-31            0.0       5.6      -2.1   3.5      sun

1461 rows × 6 columns

Step 1: Basic bar chart#

bar = alt.Chart(df).mark_bar().encode(x="month(date):T", y="mean(precipitation):Q")
bar

Step 2: Add a rule at the average#

rule = alt.Chart(df).mark_rule(color="firebrick").encode(y="mean(precipitation):Q")
# layer the rule on top of the bars to display both
bar + rule

Step 3: Create small multiple with .facet#

Create a small multiple of the previous graph for each type of weather. The facet plot should have 2 columns.

(bar + rule).properties(width=400).facet(facet="weather:O", columns=2)
# more verbose solution, to show how you can parametrize composition in Altair
bar_ = (alt
       .Chart()
       .mark_bar()
       .encode(x="month(date):T", y="mean(precipitation):Q"))
rule_ = (alt
        .Chart()
        .mark_rule(color="firebrick")
        .encode(y="mean(precipitation):Q")
       )
alt.layer(bar_, rule_, data=df).facet(facet="weather:O", columns=2)

Step 4: Add interaction#

See also mean based selection

selection = alt.selection_interval(encodings=['x'])

base = alt.Chart(df)

bar_i = base.mark_bar().encode(
    x="month(date):T",
    y="mean(precipitation):Q",
    opacity=alt.condition(selection, alt.value(1.0), alt.value(0.7)),
).add_params(selection)  # add_selection is deprecated as of Altair 5

# the rule and its label are computed only from the rows inside the selection
rule_i = base.mark_rule(color="firebrick").transform_filter(selection).encode(y="mean(precipitation):Q")
text = rule_i.mark_text(angle=0, color="firebrick", dy=-10).encode(text=alt.Text("mean(precipitation):Q", format=",.2r"))

(bar_i + rule_i + text).properties(width=600)

Challenge#

Extend your code to build a small multiple with interaction for each multiple. Feel free to submit your solution, so it can be included in this notebook.

Closing remarks#

Plotly is a good alternative to Altair#

Animations with Plotly#

import plotly.express as px

gapminder = px.data.gapminder()
gapminder_animate = px.scatter(gapminder, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country", facet_col="continent",
           log_x=True, size_max=45, range_x=[100,100000], range_y=[25,90], width=800, height=400)
gapminder_animate.show()

Read more on how you can use plotly.express directly from pandas.

Know your limits#

  • Single-Page Apps (SPAs) are doable

  • Beware when you are moving into advanced development territory

    • Multi-page apps

    • Sharing data between callbacks: when your datasets get too large, you need to store them somewhere and keep track of their state.

    • Working with callbacks does have limitations. If you notice that you need to nest callbacks (callback A -> callback B -> callback C -> final result), you are on your way to Callback Hell, a.k.a. the Pyramid of Doom. Stop and reconsider before continuing.
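
The nesting problem can be illustrated in plain Python, independent of any dashboarding library. The functions and data below are hypothetical stand-ins for three dependent update steps. The first version threads the result through nested callbacks, the second computes the same value in one flat function.

```python
# Made-up data, for illustration only
DATA = [
    {"year": 1952, "continent": "Asia", "lifeExp": 46.0},
    {"year": 1952, "continent": "Europe", "lifeExp": 64.0},
    {"year": 2007, "continent": "Asia", "lifeExp": 70.0},
    {"year": 2007, "continent": "Europe", "lifeExp": 78.0},
]


# Nested style: every step hands its result to the next callback
def filter_year(year, callback):
    callback([row for row in DATA if row["year"] == year])


def filter_continent(rows, continent, callback):
    callback([row for row in rows if row["continent"] == continent])


def summarise(rows, callback):
    callback(sum(row["lifeExp"] for row in rows) / len(rows))


nested_result = []
filter_year(
    2007,
    lambda rows: filter_continent(
        rows,
        "Asia",
        lambda subset: summarise(subset, nested_result.append),
    ),
)


# Flat style: one function takes all inputs and returns the final result
def mean_life_exp(year, continent):
    rows = [
        row
        for row in DATA
        if row["year"] == year and row["continent"] == continent
    ]
    return sum(row["lifeExp"] for row in rows) / len(rows)


flat_result = mean_life_exp(2007, "Asia")
```

Both versions produce the same number, but only the flat one can be tested and reasoned about without tracing three levels of indirection.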

My personal recommendations#

  • Choose any interactive plotting library and get to know it: altair, bokeh or plotly

  • Choose any of the higher-level APIs to be productive in your data analysis work: Plotly Express or Altair

  • Choose any of the dashboarding libraries to make an interactive notebook app: Streamlit or Dash

  • Don’t try to build your own BI tool. Buy one. It saves time and money. (PS: have a look at redash.io)