Lab 10.9.6: Recurrent Neural Networks#

Attribution#

This notebook follows lab 10.9.6 from ISLRv2. The R-code has been ported to Python by Daniel Kapitan (30-01-2022). In this lab we fit the models illustrated in Section 10.5.

Sequential Models for Document Classification#

Here we fit a simple LSTM RNN for sentiment analysis with the IMDb movie-review data, as discussed in Section 10.5.1. We showed how to input the data in Lab 10.9.5. We reproduce a shorter version of the code here.

import itertools
import os
import pickle
from pprint import pprint

import altair as alt
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.sparse import SparseTensor, reorder


# let's keep our keras backend tensorflow quiet
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# load the data
MAX_FEATURES = 10_000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)

We first calculate the lengths of the documents.

wc = [len(review) for review in X_train]
print(f"median word-length of reviews: {np.median(wc)}")
print(f"fraction of reviews that is 500 words or less: {len([r for r in wc if r <= 500]) / len(wc)}")
median word-length of reviews: 178.0
fraction of reviews that is 500 words or less: 0.91568

We see that over 91% of the documents have 500 words or fewer. Our RNN requires all the document sequences to have the same length. We hence restrict the document lengths to the last \(L = 500\) words and pad the beginning of the shorter ones with blanks.

maxlen = 500
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
for name, X in [("X_train", X_train), ("X_test", X_test)]:
    print(f"Dimension of {name}: {X.shape}")
Dimension of X_train: (25000, 500)
Dimension of X_test: (25000, 500)
X_train[0, 490:500]
array([4472,  113,  103,   32,   15,   16, 5345,   19,  178,   32],
      dtype=int32)

The last expression shows the last few words in the first document. At this stage, each of the 500 words in the document is represented using an integer corresponding to the location of that word in the 10,000-word dictionary. The first layer of the RNN is an embedding layer of size 32, which will be learned during training. This layer one-hot encodes each document as a matrix of dimension 500 \(\times\) 10,000, and then maps these 10,000 dimensions down to 32.
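For reference, these integer codes can be mapped back to words using the index returned by imdb.get_word_index(); a short sketch (Keras reserves indices 0, 1 and 2 for padding, start-of-sequence and unknown tokens, so the dictionary ranks are shifted by 3):

# map each integer code back to its word; ranks are shifted by 3 because
# load_data() reserves 0 (padding), 1 (start of sequence) and 2 (unknown)
word_index = imdb.get_word_index()
decode = {rank + 3: word for word, rank in word_index.items()}
print([decode.get(i, "<?>") for i in X_train[0, 490:500]])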

model = Sequential()
model.add(Embedding(input_dim=10_000, output_dim=32))
model.add(LSTM(units=32))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 32)          320000    
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
=================================================================
Total params: 328353 (1.25 MB)
Trainable params: 328353 (1.25 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

The second layer is an LSTM with 32 units, and the output layer is a single sigmoid for the binary classification task.
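As a quick sanity check on the summary above, the parameter counts can be reproduced by hand; the LSTM count follows the standard four-gate formula:

embedding_params = 10_000 * 32           # one 32-dimensional vector per dictionary word
lstm_params = 4 * (32 * (32 + 32) + 32)  # four gates, each with input weights, recurrent weights and a bias
dense_params = 32 * 1 + 1                # weights from the 32 LSTM units plus one bias
print(embedding_params, lstm_params, dense_params)  # 320000, 8320, 33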

The rest is now similar to other networks we have fit. We track the test performance as the network is fit, and see that it attains 87% accuracy.

%%time
model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer="rmsprop")
history = model.fit(
    X_train, y_train, batch_size=128, epochs=10, validation_data=(X_test, y_test), verbose=0
)
CPU times: user 13min 41s, sys: 1min 20s, total: 15min 1s
Wall time: 4min 52s
_ = (
    pd.DataFrame(history.history)
    .reset_index()
    .rename(columns={"index": "epoch"})
    .assign(epoch=lambda df: df.epoch + 1)
)
df = pd.concat(
    [
        _.iloc[:, 0:3].assign(fold="training"),
        _.iloc[:, [0, -2, -1]]
        .rename(columns={"val_accuracy": "accuracy", "val_loss": "loss"})
        .assign(fold="validation"),
    ],
    axis=0,
).reset_index(drop=True)
base = alt.Chart(df).mark_line(point=True)
loss = base.encode(x="epoch:Q", y="loss", color="fold", tooltip=["epoch", alt.Tooltip("loss", format=",.2f")])
accuracy = base.encode(x="epoch:Q", y="accuracy", color="fold", tooltip=["epoch", alt.Tooltip("accuracy", format=",.2f")] )
loss | accuracy
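To read off the final test accuracy directly rather than from the chart, we can evaluate the fitted model on the held-out data:

# evaluate() returns the loss followed by the metrics passed to compile()
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"test accuracy: {test_accuracy:.3f}")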

Time Series Prediction#

We now show how to fit the models in section 10.5.2 for time series prediction. We first set up the data, and standardize each of the variables.

NYSE = pd.read_csv("../datasets/NYSE.csv")
xdata = NYSE.loc[:, ["DJ_return", "log_volume", "log_volatility"]]
istrain = NYSE.loc[:, "train"]
xdata = StandardScaler().fit_transform(xdata)

The variable istrain contains a True for each year that is in the training set, and a False for each year in the test set.
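Although the remainder of this lab is still to be ported, the mask can already be used to split the standardized matrix into training and test rows. A minimal sketch, assuming the train column is parsed as booleans by read_csv and using placeholder variable names:

# placeholder names, not part of the original lab
mask = istrain.values  # boolean array marking the training rows
xdata_train, xdata_test = xdata[mask], xdata[~mask]
print(xdata_train.shape, xdata_test.shape)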

We first write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix. It simply inserts \(L\) rows of NA at the top, and truncates the bottom.

# To be continued
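Pending the rest of the port, here is a minimal sketch of such a lag helper built on numpy; the name lagm and its signature are assumptions, loosely mirroring the helper in the R version of the lab:

def lagm(x, lag=1):
    """Shift the rows of matrix x down by `lag` positions.

    Inserts `lag` rows of NaN at the top and drops the last `lag` rows,
    so the result has the same shape as the input.
    """
    n_rows, n_cols = x.shape
    padding = np.full((lag, n_cols), np.nan)
    return np.vstack([padding, x[:-lag]])


# example: one-day lagged version of the standardized NYSE series
xlag1 = lagm(xdata, lag=1)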