
Introduction to FPP3#

Forecasting: Principles & Practice, Third Edition (FPP3)#

Rob Hyndman is a renowned Australian statistician known for his work on forecasting and time series. We use the third edition of his seminal book Forecasting: Principles & Practice for this section on time-series forecasting. The whole book is freely available online and provides a comprehensive introduction to forecasting methods and presents enough information about each method for readers to use them sensibly.

The book includes examples in R using the fable package, which provides a collection of commonly used univariate and multivariate time-series forecasting models, including exponential smoothing via state-space models and automatic ARIMA modelling.

To get started with time-series forecasting in Python:

  • Read the lecture notes here

  • Read the book, at least covering the basics in chapters 1 to 9

  • Try the exercise below and predict the number of bike shares per day

  • Go through the example solution

Exercise: predict number of bike shares per day#

Predict bike-rentals per day#

Using the bike-rentals dataset, your task is to predict bike rentals per day with a forecast horizon of one month, including confidence intervals. The original data comes from the UCI Machine Learning repository; we use a slightly modified version with extra features.

The code below gets you started by reading two dataframes:

  • hourly, with observations per hour and

  • daily, … you guessed it.

It is up to you to decide how to tackle the problem and which library to use. Consider these libraries as suggestions:

  • statsmodels.tsa as the basis for time-series analysis

  • pmdarima which wraps statsmodels into a convenient auto_arima function, like auto.arima in R

  • sktime as the new, unified framework for machine learning in Python

  • prophet
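Before reaching for any of these libraries, it can help to pin down what the deliverable looks like. The sketch below builds a seasonal-naive baseline with approximate 95% intervals in plain pandas, on a synthetic daily series standing in for cnt; all names here are illustrative, not part of the exercise code, and the constant interval width is a simplification (a proper seasonal-naive interval widens with the horizon).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily `cnt` series (the real data is loaded below).
rng = np.random.default_rng(42)
idx = pd.date_range("2011-01-01", periods=365, freq="D")
cnt = pd.Series(
    2000 + 500 * np.sin(2 * np.pi * idx.dayofyear / 365) + rng.normal(0, 100, len(idx)),
    index=idx,
)

season, horizon = 7, 30  # weekly seasonality, one-month forecast horizon

# Seasonal-naive point forecast: repeat the last observed week across the horizon.
last_week = cnt.iloc[-season:].to_numpy()
point = np.resize(last_week, horizon)

# Normal-approximation interval from the in-sample seasonal-naive residuals.
sigma = (cnt - cnt.shift(season)).std()

future_idx = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=horizon, freq="D")
forecast = pd.DataFrame(
    {"yhat": point, "lo95": point - 1.96 * sigma, "hi95": point + 1.96 * sigma},
    index=future_idx,
)
print(forecast.head())
```

Any of the libraries above should beat this baseline; it mainly fixes the expected output shape: a 30-row frame with a point forecast and interval bounds per day.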

import altair as alt
import numpy as np
import pandas as pd


# https://altair-viz.github.io/user_guide/faq.html#local-filesystem
# alt.data_transformers.enable("json")

# https://towardsdatascience.com/how-to-build-a-time-series-dashboard-in-python-with-panel-altair-and-a-jupyter-notebook-c0ed40f02289
# alt.renderers.enable("default")

Data understanding#

Data Set Information#

Bike-sharing systems are a new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from interesting real-world applications of bike-sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This turns a bike-sharing system into a virtual sensor network that can be used to sense mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Attribute Information#

Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv:

  • instant: record index

  • dteday : date

  • season : season (1:winter, 2:spring, 3:summer, 4:fall)

  • yr : year (0: 2011, 1:2012)

  • mnth : month ( 1 to 12)

  • hr : hour (0 to 23)

  • holiday : whether the day is a holiday or not (extracted from [Web Link])

  • weekday : day of the week

  • workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0

  • weathersit :

    • 1: Clear, Few clouds, Partly cloudy

    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

    • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

  • temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

  • atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)

  • hum: Normalized humidity. The values are divided to 100 (max)

  • windspeed: Normalized wind speed. The values are divided to 67 (max)

  • casual: count of casual users

  • registered: count of registered users

  • cnt: count of total rental bikes including both casual and registered
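Since temp, atemp, hum, and windspeed are min-max normalized, raw units can be recovered by inverting the stated formula. A small sketch for temp, using the t_min/t_max constants from the description above (the helper name is ours, not part of the dataset):

```python
def denormalize_temp(temp_norm, t_min=-8, t_max=39):
    """Invert the min-max normalization (t - t_min) / (t_max - t_min)."""
    return temp_norm * (t_max - t_min) + t_min

# A normalized temp of 0.5 maps to the midpoint of [-8, 39] degrees Celsius.
print(denormalize_temp(0.5))  # 15.5
```

The same inversion applies to atemp with t_min=-16, t_max=+50, and to hum and windspeed with their respective maxima of 100 and 67.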

def parse_date_hour(date, hour):
    """Construct datetime for index of hourly data."""
    return pd.to_datetime(" ".join([date, str(hour).zfill(2)]), format="%Y-%m-%d %H")


daily = pd.read_csv("https://github.com/jads-nl/public-lectures/blob/main/time-series-forecasting/datasets/bike-sharing/bike-sharing-daily-processed.csv?raw=true", parse_dates=["dteday"]).drop(
    columns=["instant", "Unnamed: 0"]
)
hourly = pd.read_csv("https://github.com/jads-nl/public-lectures/blob/main/time-series-forecasting/datasets/bike-sharing/hour.csv?raw=true").drop(columns=["instant"])
hourly.index = pd.DatetimeIndex(
    hourly.apply(lambda row: parse_date_hour(row.dteday, row.hr), axis=1),
    name="timestamp",
)
daily.head()
brush = alt.selection_interval(encodings=['x'])

base = (
    alt.Chart(daily)
    .mark_line()
    .encode(x='dteday', y='cnt')
    .properties(width=800, height=200)
)

overview = base.properties(height=50).add_params(brush)
detail = base.encode(alt.X('dteday:T', scale=alt.Scale(domain=brush)))
detail & overview
monthly = daily.groupby(['yr', 'mnth'], as_index=False)['cnt'].sum()
monthly['yr_mnth'] = monthly.apply(lambda row: '-'.join([str(row.yr), str(row.mnth).zfill(2)]), axis=1)

def simple_ts_plot(df, x='yr_mnth', y='cnt', width=800, height=200):
    return alt.Chart(df).mark_line().encode(x=x, y=y).properties(width=width, height=height)
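A note on the aggregation above: because yr is 0/1-coded, the string key yr_mnth does not give a true date axis. An alternative, sketched here on a synthetic frame reusing the dataset's column names, is to resample on dteday directly, which yields a proper monthly DatetimeIndex:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `daily`, with the same `dteday` and `cnt` columns.
dates = pd.date_range("2011-01-01", "2012-12-31", freq="D")
daily_demo = pd.DataFrame({"dteday": dates, "cnt": np.arange(len(dates))})

# Resampling with "MS" (month start) sums each calendar month and keeps
# real timestamps as the index, so Altair can treat the axis as temporal.
monthly_demo = daily_demo.set_index("dteday")["cnt"].resample("MS").sum()
print(monthly_demo.head())
```

With a DatetimeIndex, the x-encoding can simply be 'dteday:T' instead of an ordinal string key.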
simple_ts_plot(monthly)
# ... and off you go from here