Lab 10.9.5: IMDb Document Classification#
Attribution#
This notebook follows lab 10.9.5 from ISLRv2. The R code has been ported to Python by Daniel Kapitan (23-01-2022).
Data preparation#
We perform document classification on the IMDb dataset, which is available as part of tensorflow.keras. We limit the dictionary size to the 10,000 most frequently used words and tokens.
import itertools
import os
import pickle
from pprint import pprint
import altair as alt
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
# let's keep our keras backend tensorflow quiet
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# load the data
MAX_FEATURES = 10_000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 0s 0us/step
The tuple assignment in the last line unpacks the train and test splits in a single statement. Each element of X_train is a vector of integers between 0 and 9999 (one document), referring to the words found in the dictionary. For example, the first training document, X_train[0], is the positive review on page 419 of ISLRv2. The indices of the first 12 words of the second document, X_train[1], are given below.
X_train[1][:12]
[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012]
To see the words, we create a function, decode_review(), that provides a simple interface to the dictionary.
word_index = imdb.get_word_index()
def decode_review(text, word_index=word_index, start_char=1, oov_char=2, index_from=3):
    """Decodes an integer-encoded IMDb review back into words.

    The default values match those of `imdb.load_data`, see https://keras.io/api/datasets/imdb/
    """
    reverse_word_index = {v + index_from: k for k, v in word_index.items()}
    # add special tags for padding, start-of-sequence and out-of-vocabulary markers
    tags = {0: "<PAD>", start_char: "<START>", oov_char: "<UNK>"}
    reverse_word_index = {**tags, **reverse_word_index}
    return " ".join([reverse_word_index.get(i, "<index not found>") for i in text])
decode_review(X_train[0])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 [==============================] - 0s 0us/step
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Using sparse binary matrices#
Next we write a function to “one-hot” encode each document in a list of documents, and return a binary matrix in sparse-matrix format.
def one_hot_encode(sequences, dimension=MAX_FEATURES):
    """One-hot encodes IMDb reviews as a SciPy sparse matrix.

    Uses csr_matrix, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
    For more on sparse matrices, see https://machinelearningmastery.com/sparse-matrices-for-machine-learning/
    """
    seqlen = [len(sequence) for sequence in sequences]
    n = len(seqlen)
    row_index = np.repeat(range(n), seqlen)
    col_index = list(itertools.chain(*sequences))
    data = np.ones(len(row_index), dtype="int")
    return csr_matrix((data, (row_index, col_index)), shape=(n, dimension))
To construct the matrix, one supplies just the entries that are nonzero. In the last line we call csr_matrix() and supply the row indices corresponding to each document and the column indices corresponding to the words in each document, while data is literally an array of ones. Note that csr_matrix sums duplicate entries, so a word that appears more than once in a given document is recorded with its count rather than a single one.
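A minimal illustration of this behaviour, using a hypothetical two-document toy corpus (not part of the original lab):
# word index 5 appears twice in the first toy document, so its entry becomes 2
toy = [[1, 5, 5, 9], [2, 3]]
one_hot_encode(toy, dimension=10).toarray()
# expected:
# array([[0, 1, 0, 0, 0, 2, 0, 0, 0, 1],
#        [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]])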
X_train_1h = one_hot_encode(X_train)
X_test_1h = one_hot_encode(X_test)
y_train = np.array(y_train).astype("float32")
y_test = np.array(y_test).astype("float32")
X_train_1h.shape
(25000, 10000)
# .sum() adds up word counts (csr_matrix sums duplicate entries), not the number of nonzero cells
X_train_1h.sum().sum() / (25_000 * 10_000)
0.023871364
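The figure above is the average number of word tokens per cell rather than the fraction of nonzero entries, because duplicate words are summed into counts. The actual sparsity can be checked with the matrix's nnz attribute (a quick sketch, not part of the original lab):
# number of stored nonzero entries divided by the total number of cells
X_train_1h.nnz / (25_000 * 10_000)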
Only 1.3% of the entries are nonzero, so this amounts to considerable savings in memory. We create a validation set of size 2,000, leaving 23,000 for training.
Fit Lasso Logistic Regression#
First we fit a regularized logistic regression model using sklearn.linear_model.LogisticRegression() on the training data, and evaluate its performance on the validation data. In order to plot the accuracy as a function of the shrinkage parameter \(\lambda\), we iterate over the grid \(\lambda = 2^{k/2}\) for \(k = 1, 2, \ldots, 19\). Note that sklearn uses C as the regularization parameter, which is the inverse of \(\lambda\), so we fit with \(C = 2^{-k/2}\). Note also that LogisticRegression applies an \(\ell_2\) (ridge) penalty by default, so the loop below is ridge- rather than lasso-penalized; a lasso variant matching ISLRv2 is sketched after the training output. Similar expressions compute the performance on the test data, and were used to produce the left plot in Figure 10.11. The code takes advantage of the sparse-matrix format of X_train_1h and runs in about two minutes; in the usual dense format it would take considerably longer.
# hold out 2,000 reviews for validation and train on the rest
# note: np.random.choice samples with replacement by default, so ival may
# contain a few duplicate indices; pass replace=False for an exact split
np.random.seed(3)
ival = sorted(np.random.choice(range(len(y_train)), 2000))
mask = np.ones_like(y_train, dtype=bool)
mask[ival] = False
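A quick sanity check of the resulting split sizes (a sketch; the exact counts depend on how many duplicate indices were drawn above):
# number of unique validation indices and number of remaining training rows
len(set(ival)), int(mask.sum())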
%%time
models = {}
for log_lambda in range(1, 20):
    # sklearn's C is the inverse of the shrinkage parameter lambda = 2 ** (log_lambda / 2)
    C = 1 / (2 ** (log_lambda / 2))
    print(f"fitting LogisticRegression with C = {C:.3f}")
    models[C] = LogisticRegression(C=C, max_iter=1000, random_state=0).fit(
        X_train_1h[mask, :], y_train[mask]
    )
    print("... done")
fitting LogisticRegression with C = 0.707
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
... done
fitting LogisticRegression with C = 0.500
... done
fitting LogisticRegression with C = 0.354
... done
fitting LogisticRegression with C = 0.250
... done
fitting LogisticRegression with C = 0.177
... done
fitting LogisticRegression with C = 0.125
... done
fitting LogisticRegression with C = 0.088
... done
fitting LogisticRegression with C = 0.062
... done
fitting LogisticRegression with C = 0.044
... done
fitting LogisticRegression with C = 0.031
... done
fitting LogisticRegression with C = 0.022
... done
fitting LogisticRegression with C = 0.016
... done
fitting LogisticRegression with C = 0.011
... done
fitting LogisticRegression with C = 0.008
... done
fitting LogisticRegression with C = 0.006
... done
fitting LogisticRegression with C = 0.004
... done
fitting LogisticRegression with C = 0.003
... done
fitting LogisticRegression with C = 0.002
... done
fitting LogisticRegression with C = 0.001
... done
CPU times: user 2min 22s, sys: 3min 51s, total: 6min 14s
Wall time: 1min 34s
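As noted above, the loop uses LogisticRegression's default \(\ell_2\) penalty. A lasso fit over the same grid, as in ISLRv2, could look like the following sketch (assuming the liblinear solver, which supports \(\ell_1\) penalties on sparse input; lasso_models is an illustrative name and this variant is not run here):
# hypothetical lasso variant of the loop above (not part of the timed run)
lasso_models = {}
for log_lambda in range(1, 20):
    C = 2 ** (-log_lambda / 2)
    lasso_models[C] = LogisticRegression(
        penalty="l1", solver="liblinear", C=C, max_iter=1000, random_state=0
    ).fit(X_train_1h[mask, :], y_train[mask])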
accuracy = [
    (
        (k, "train", accuracy_score(y_train[mask], v.predict(X_train_1h[mask, :]))),
        (k, "validation", accuracy_score(y_train[ival], v.predict(X_train_1h[ival, :]))),
    )
    for k, v in models.items()
]
df_lr = pd.DataFrame(itertools.chain(*accuracy), columns=["C", "fold", "accuracy"])
plot_lr = (
    alt.Chart(df_lr)
    .mark_line(point=True)
    .encode(x="log_C:Q", y="accuracy", color="fold")
    .transform_calculate(log_C="log(datum.C)")
)
plot_lr
Build a two-layer feedforward network#
# building a linear stack of layers with the sequential model
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(10_000,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 160016
dense_1 (Dense) (None, 16) 272
dense_2 (Dense) (None, 1) 17
=================================================================
Total params: 160305 (626.19 KB)
Trainable params: 160305 (626.19 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
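The parameter counts in the summary follow from (number of inputs + 1 bias) × number of units in each layer; a quick check by hand:
# dense:   (10,000 inputs + 1 bias) * 16 units = 160,016
# dense_1: (16 inputs + 1 bias) * 16 units     =     272
# dense_2: (16 inputs + 1 bias) * 1 unit       =      17
10_000 * 16 + 16, 16 * 16 + 16, 16 * 1 + 1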
%%time
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='rmsprop')
history = model.fit(X_train_1h[mask, :], y_train[mask],
                    batch_size=512, epochs=20,
                    verbose=2,
                    validation_data=(X_train_1h[ival, :], y_train[ival]))
Epoch 1/20
46/46 - 1s - loss: 0.5119 - accuracy: 0.7766 - val_loss: 0.4165 - val_accuracy: 0.8385 - 867ms/epoch - 19ms/step
Epoch 2/20
46/46 - 0s - loss: 0.3192 - accuracy: 0.8942 - val_loss: 0.3318 - val_accuracy: 0.8735 - 268ms/epoch - 6ms/step
Epoch 3/20
46/46 - 0s - loss: 0.2538 - accuracy: 0.9135 - val_loss: 0.3071 - val_accuracy: 0.8820 - 268ms/epoch - 6ms/step
Epoch 4/20
46/46 - 0s - loss: 0.2171 - accuracy: 0.9251 - val_loss: 0.2795 - val_accuracy: 0.8915 - 269ms/epoch - 6ms/step
Epoch 5/20
46/46 - 0s - loss: 0.1862 - accuracy: 0.9360 - val_loss: 0.2822 - val_accuracy: 0.8885 - 276ms/epoch - 6ms/step
Epoch 6/20
46/46 - 0s - loss: 0.1675 - accuracy: 0.9430 - val_loss: 0.3178 - val_accuracy: 0.8830 - 269ms/epoch - 6ms/step
Epoch 7/20
46/46 - 0s - loss: 0.1509 - accuracy: 0.9484 - val_loss: 0.3325 - val_accuracy: 0.8775 - 267ms/epoch - 6ms/step
Epoch 8/20
46/46 - 0s - loss: 0.1366 - accuracy: 0.9529 - val_loss: 0.3191 - val_accuracy: 0.8865 - 266ms/epoch - 6ms/step
Epoch 9/20
46/46 - 0s - loss: 0.1276 - accuracy: 0.9576 - val_loss: 0.4159 - val_accuracy: 0.8620 - 267ms/epoch - 6ms/step
Epoch 10/20
46/46 - 0s - loss: 0.1165 - accuracy: 0.9619 - val_loss: 0.3451 - val_accuracy: 0.8760 - 261ms/epoch - 6ms/step
Epoch 11/20
46/46 - 0s - loss: 0.1037 - accuracy: 0.9666 - val_loss: 0.3624 - val_accuracy: 0.8780 - 274ms/epoch - 6ms/step
Epoch 12/20
46/46 - 0s - loss: 0.0961 - accuracy: 0.9689 - val_loss: 0.3697 - val_accuracy: 0.8765 - 264ms/epoch - 6ms/step
Epoch 13/20
46/46 - 0s - loss: 0.0871 - accuracy: 0.9728 - val_loss: 0.4283 - val_accuracy: 0.8700 - 267ms/epoch - 6ms/step
Epoch 14/20
46/46 - 0s - loss: 0.0764 - accuracy: 0.9771 - val_loss: 0.4655 - val_accuracy: 0.8670 - 264ms/epoch - 6ms/step
Epoch 15/20
46/46 - 0s - loss: 0.0711 - accuracy: 0.9784 - val_loss: 0.4456 - val_accuracy: 0.8760 - 270ms/epoch - 6ms/step
Epoch 16/20
46/46 - 0s - loss: 0.0596 - accuracy: 0.9834 - val_loss: 0.4501 - val_accuracy: 0.8735 - 260ms/epoch - 6ms/step
Epoch 17/20
46/46 - 0s - loss: 0.0634 - accuracy: 0.9815 - val_loss: 0.4657 - val_accuracy: 0.8740 - 266ms/epoch - 6ms/step
Epoch 18/20
46/46 - 0s - loss: 0.0499 - accuracy: 0.9861 - val_loss: 0.4848 - val_accuracy: 0.8710 - 261ms/epoch - 6ms/step
Epoch 19/20
46/46 - 0s - loss: 0.0505 - accuracy: 0.9844 - val_loss: 0.5629 - val_accuracy: 0.8640 - 266ms/epoch - 6ms/step
Epoch 20/20
46/46 - 0s - loss: 0.0382 - accuracy: 0.9911 - val_loss: 0.5837 - val_accuracy: 0.8610 - 262ms/epoch - 6ms/step
CPU times: user 12.4 s, sys: 866 ms, total: 13.3 s
Wall time: 6.12 s
The history object has a history attribute, a dictionary that records the training and validation loss and accuracy at each epoch. We'll wrangle it into long format for plotting with Altair.
wide = (
    pd.DataFrame(history.history)
    .reset_index()
    .rename(columns={"index": "epoch"})
    .assign(epoch=lambda df: df.epoch + 1)
)
df_mlp = pd.concat(
    [
        wide.iloc[:, 0:3].assign(fold="training"),
        wide.iloc[:, [0, -2, -1]]
        .rename(columns={"val_accuracy": "accuracy", "val_loss": "loss"})
        .assign(fold="validation"),
    ],
    axis=0,
).reset_index(drop=True)
plot_mlp = (
    alt.Chart(df_mlp)
    .mark_line(point=True)
    .encode(
        x="epoch:Q",
        y="accuracy",
        color="fold",
        tooltip=["epoch", alt.Tooltip("accuracy", format=",.2f")],
    )
)
plot_mlp
To compute the test accuracy, we replace the last line of the fit above with the call below, using the test set as validation data. Note that this continues training the already-fitted model rather than rerunning the entire sequence from scratch, which is why the training loss picks up where the previous run left off; a direct evaluation on the test set is sketched at the end of this section.
history_test = model.fit(X_train_1h[mask, :], y_train[mask],
                         batch_size=512, epochs=20,
                         verbose=2,
                         validation_data=(X_test_1h, y_test))
Epoch 1/20
46/46 - 0s - loss: 0.0415 - accuracy: 0.9885 - val_loss: 0.5478 - val_accuracy: 0.8621 - 475ms/epoch - 10ms/step
Epoch 2/20
46/46 - 0s - loss: 0.0325 - accuracy: 0.9918 - val_loss: 0.5820 - val_accuracy: 0.8615 - 379ms/epoch - 8ms/step
Epoch 3/20
46/46 - 0s - loss: 0.0313 - accuracy: 0.9931 - val_loss: 0.6006 - val_accuracy: 0.8603 - 372ms/epoch - 8ms/step
Epoch 4/20
46/46 - 0s - loss: 0.0304 - accuracy: 0.9924 - val_loss: 0.6116 - val_accuracy: 0.8593 - 369ms/epoch - 8ms/step
Epoch 5/20
46/46 - 0s - loss: 0.0237 - accuracy: 0.9942 - val_loss: 0.6363 - val_accuracy: 0.8588 - 370ms/epoch - 8ms/step
Epoch 6/20
46/46 - 0s - loss: 0.0241 - accuracy: 0.9941 - val_loss: 0.6508 - val_accuracy: 0.8600 - 368ms/epoch - 8ms/step
Epoch 7/20
46/46 - 0s - loss: 0.0212 - accuracy: 0.9951 - val_loss: 0.6782 - val_accuracy: 0.8584 - 373ms/epoch - 8ms/step
Epoch 8/20
46/46 - 0s - loss: 0.0217 - accuracy: 0.9946 - val_loss: 0.6914 - val_accuracy: 0.8588 - 370ms/epoch - 8ms/step
Epoch 9/20
46/46 - 0s - loss: 0.0174 - accuracy: 0.9957 - val_loss: 0.7138 - val_accuracy: 0.8582 - 375ms/epoch - 8ms/step
Epoch 10/20
46/46 - 0s - loss: 0.0192 - accuracy: 0.9958 - val_loss: 0.7304 - val_accuracy: 0.8583 - 376ms/epoch - 8ms/step
Epoch 11/20
46/46 - 0s - loss: 0.0062 - accuracy: 0.9995 - val_loss: 0.7501 - val_accuracy: 0.8589 - 367ms/epoch - 8ms/step
Epoch 12/20
46/46 - 0s - loss: 0.0148 - accuracy: 0.9965 - val_loss: 0.8109 - val_accuracy: 0.8561 - 372ms/epoch - 8ms/step
Epoch 13/20
46/46 - 0s - loss: 0.0139 - accuracy: 0.9969 - val_loss: 0.7983 - val_accuracy: 0.8575 - 369ms/epoch - 8ms/step
Epoch 14/20
46/46 - 0s - loss: 0.0166 - accuracy: 0.9961 - val_loss: 0.8167 - val_accuracy: 0.8578 - 371ms/epoch - 8ms/step
Epoch 15/20
46/46 - 0s - loss: 0.0048 - accuracy: 0.9993 - val_loss: 0.9552 - val_accuracy: 0.8484 - 373ms/epoch - 8ms/step
Epoch 16/20
46/46 - 0s - loss: 0.0037 - accuracy: 0.9997 - val_loss: 0.8486 - val_accuracy: 0.8582 - 367ms/epoch - 8ms/step
Epoch 17/20
46/46 - 0s - loss: 0.0111 - accuracy: 0.9975 - val_loss: 0.8582 - val_accuracy: 0.8591 - 375ms/epoch - 8ms/step
Epoch 18/20
46/46 - 0s - loss: 0.0140 - accuracy: 0.9965 - val_loss: 0.8703 - val_accuracy: 0.8583 - 372ms/epoch - 8ms/step
Epoch 19/20
46/46 - 0s - loss: 0.0021 - accuracy: 0.9998 - val_loss: 0.9009 - val_accuracy: 0.8590 - 371ms/epoch - 8ms/step
Epoch 20/20
46/46 - 0s - loss: 0.0196 - accuracy: 0.9957 - val_loss: 0.9005 - val_accuracy: 0.8588 - 370ms/epoch - 8ms/step
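Alternatively, the trained model can be evaluated directly on the held-out test set without further training; a minimal sketch using Keras's evaluate method:
# report loss and accuracy of the current model weights on the test set
test_loss, test_accuracy = model.evaluate(X_test_1h, y_test, verbose=0)
print(f"test accuracy: {test_accuracy:.3f}")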