diff --git a/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb b/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb new file mode 100644 index 0000000000..f2b5f2f8ca --- /dev/null +++ b/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb @@ -0,0 +1,672 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "# End-to-end Masked Language Modeling with BERT\n", + "\n", + "**Author:** [Ankur Singh](https://twitter.com/ankur310794)
\n", + "**Date created:** 2020/09/18
\n", + "**Last modified:** 2020/09/18
\n", + "**Description:** Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Introduction\n", + "\n", + "Masked Language Modeling is a fill-in-the-blank task,\n", + "where a model uses the context words surrounding a mask token to try to predict what the\n", + "masked word should be.\n", + "\n", + "For an input that contains one or more mask tokens,\n", + "the model will generate the most likely substitution for each.\n", + "\n", + "Example:\n", + "\n", + "- Input: \"I have watched this [MASK] and it was awesome.\"\n", + "- Output: \"I have watched this movie and it was awesome.\"\n", + "\n", + "Masked language modeling is a great way to train a language\n", + "model in a self-supervised setting (without human-annotated labels).\n", + "Such a model can then be fine-tuned to accomplish various supervised\n", + "NLP tasks.\n", + "\n", + "This example teaches you how to build a BERT model from scratch,\n", + "train it with the masked language modeling task,\n", + "and then fine-tune this model on a sentiment classification task.\n", + "\n", + "We will use the Keras `TextVectorization` and `MultiHeadAttention` layers\n", + "to create a BERT Transformer-Encoder network architecture.\n", + "\n", + "Note: This example should be run with `tf-nightly`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Setup\n", + "\n", + "Install `tf-nightly` via `pip install tf-nightly`." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "from tensorflow.keras.layers.experimental.preprocessing import TextVectorization\n", + "from dataclasses import dataclass\n", + "import pandas as pd\n", + "import numpy as np\n", + "import glob\n", + "import re\n", + "from pprint import pprint" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Set-up Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "\n", + "@dataclass\n", + "class Config:\n", + " MAX_LEN = 256\n", + " BATCH_SIZE = 32\n", + " LR = 0.001\n", + " VOCAB_SIZE = 30000\n", + " EMBED_DIM = 128\n", + " NUM_HEAD = 8 # used in bert model\n", + " FF_DIM = 128 # used in bert model\n", + " NUM_LAYERS = 1\n", + "\n", + "\n", + "config = Config()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Load the data\n", + "\n", + "We will first download the IMDB data and load into a Pandas dataframe." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n", + "!tar -xf aclImdb_v1.tar.gz" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "\n", + "def get_text_list_from_files(files):\n", + " text_list = []\n", + " for name in files:\n", + " with open(name) as f:\n", + " for line in f:\n", + " text_list.append(line)\n", + " return text_list\n", + "\n", + "\n", + "def get_data_from_text_files(folder_name):\n", + "\n", + " pos_files = glob.glob(\"aclImdb/\" + folder_name + \"/pos/*.txt\")\n", + " pos_texts = get_text_list_from_files(pos_files)\n", + " neg_files = glob.glob(\"aclImdb/\" + folder_name + \"/neg/*.txt\")\n", + " neg_texts = get_text_list_from_files(neg_files)\n", + " df = pd.DataFrame(\n", + " {\n", + " \"review\": pos_texts + neg_texts,\n", + " \"sentiment\": [0] * len(pos_texts) + [1] * len(neg_texts),\n", + " }\n", + " )\n", + " df = df.sample(len(df)).reset_index(drop=True)\n", + " return df\n", + "\n", + "\n", + "train_df = get_data_from_text_files(\"train\")\n", + "test_df = get_data_from_text_files(\"test\")\n", + "\n", + "all_data = train_df.append(test_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Dataset preparation\n", + "\n", + "We will use the `TextVectorization` layer to vectorize the text into integer token ids.\n", + "It transforms a batch of strings into either\n", + "a sequence of token indices (one sample = 1D array of integer token indices, in order)\n", + "or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).\n", + "\n", + "Below, we define 3 preprocessing functions.\n", + "\n", + "1. The `get_vectorize_layer` function builds the `TextVectorization` layer.\n", + "2. The `encode` function encodes raw text into integer token ids.\n", + "3. The `get_masked_input_and_labels` function will mask input token ids.\n", + "It masks 15% of all input tokens in each sequence at random." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "\n", + "def custom_standardization(input_data):\n", + " lowercase = tf.strings.lower(input_data)\n", + " stripped_html = tf.strings.regex_replace(lowercase, \"
\", \" \")\n", + " return tf.strings.regex_replace(\n", + " stripped_html, \"[%s]\" % re.escape(\"!#$%&'()*+,-./:;<=>?@\\^_`{|}~\"), \"\"\n", + " )\n", + "\n", + "\n", + "def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=[\"[MASK]\"]):\n", + " \"\"\"Build Text vectorization layer\n", + "\n", + " Args:\n", + " texts (list): List of string i.e input texts\n", + " vocab_size (int): vocab size\n", + " max_seq (int): Maximum sequence lenght.\n", + " special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].\n", + "\n", + " Returns:\n", + " layers.Layer: Return TextVectorization Keras Layer\n", + " \"\"\"\n", + " vectorize_layer = TextVectorization(\n", + " max_tokens=vocab_size,\n", + " output_mode=\"int\",\n", + " standardize=custom_standardization,\n", + " output_sequence_length=max_seq,\n", + " )\n", + " vectorize_layer.adapt(texts)\n", + "\n", + " # Insert mask token in vocabulary\n", + " vocab = vectorize_layer.get_vocabulary()\n", + " vocab = vocab[2 : vocab_size - len(special_tokens)] + [\"[mask]\"]\n", + " vectorize_layer.set_vocabulary(vocab)\n", + " return vectorize_layer\n", + "\n", + "\n", + "vectorize_layer = get_vectorize_layer(\n", + " all_data.review.values.tolist(),\n", + " config.VOCAB_SIZE,\n", + " config.MAX_LEN,\n", + " special_tokens=[\"[mask]\"],\n", + ")\n", + "\n", + "# Get mask token id for masked language model\n", + "mask_token_id = vectorize_layer([\"[mask]\"]).numpy()[0][0]\n", + "\n", + "\n", + "def encode(texts):\n", + " encoded_texts = vectorize_layer(texts)\n", + " return encoded_texts.numpy()\n", + "\n", + "\n", + "def get_masked_input_and_labels(encoded_texts):\n", + " # 15% BERT masking\n", + " inp_mask = np.random.rand(*encoded_texts.shape) < 0.15\n", + " # Do not mask special tokens\n", + " inp_mask[encoded_texts <= 2] = False\n", + " # Set targets to -1 by default, it means ignore\n", + " labels = -1 * np.ones(encoded_texts.shape, dtype=int)\n", + " # Set labels for masked tokens\n", + " labels[inp_mask] = encoded_texts[inp_mask]\n", + "\n", + " # Prepare input\n", + " encoded_texts_masked = np.copy(encoded_texts)\n", + " # Set input to [MASK] which is the last token for the 90% of tokens\n", + " # This means leaving 10% unchanged\n", + " inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)\n", + " encoded_texts_masked[\n", + " inp_mask_2mask\n", + " ] = mask_token_id # mask token is the last in the dict\n", + "\n", + " # Set 10% to a random token\n", + " inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)\n", + " encoded_texts_masked[inp_mask_2random] = np.random.randint(\n", + " 3, mask_token_id, inp_mask_2random.sum()\n", + " )\n", + "\n", + " # Prepare sample_weights to pass to .fit() method\n", + " sample_weights = np.ones(labels.shape)\n", + " sample_weights[labels == -1] = 0\n", + "\n", + " # y_labels would be same as encoded_texts i.e input tokens\n", + " y_labels = np.copy(encoded_texts)\n", + "\n", + " return encoded_texts_masked, y_labels, sample_weights\n", + "\n", + "\n", + "# We have 25000 examples for training\n", + "x_train = encode(train_df.review.values) # encode reviews with vectorizer\n", + "y_train = train_df.sentiment.values\n", + "train_classifier_ds = (\n", + " tf.data.Dataset.from_tensor_slices((x_train, y_train))\n", + " .shuffle(1000)\n", + " .batch(config.BATCH_SIZE)\n", + ")\n", + "\n", + "# We have 25000 examples for testing\n", + "x_test = encode(test_df.review.values)\n", + "y_test = test_df.sentiment.values\n", + "test_classifier_ds 
= tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(\n", + " config.BATCH_SIZE\n", + ")\n", + "\n", + "# Build dataset for end to end model input (will be used at the end)\n", + "test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices(\n", + " (test_df.review.values, y_test)\n", + ").batch(config.BATCH_SIZE)\n", + "\n", + "# Prepare data for masked language model\n", + "x_all_review = encode(all_data.review.values)\n", + "x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(\n", + " x_all_review\n", + ")\n", + "\n", + "mlm_ds = tf.data.Dataset.from_tensor_slices(\n", + " (x_masked_train, y_masked_labels, sample_weights)\n", + ")\n", + "mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Create BERT model (Pretraining Model) for masked language modeling\n", + "\n", + "We will create a BERT-like pretraining model architecture\n", + "using the `MultiHeadAttention` layer.\n", + "It will take token ids as inputs (including masked tokens)\n", + "and it will predict the correct ids for the masked input tokens." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "\n", + "def bert_module(query, key, value, i):\n", + " # Multi headed self-attention\n", + " attention_output = layers.MultiHeadAttention(\n", + " num_heads=config.NUM_HEAD,\n", + " key_dim=config.EMBED_DIM // config.NUM_HEAD,\n", + " name=\"encoder_{}/multiheadattention\".format(i),\n", + " )(query, key, value)\n", + " attention_output = layers.Dropout(0.1, name=\"encoder_{}/att_dropout\".format(i))(\n", + " attention_output\n", + " )\n", + " attention_output = layers.LayerNormalization(\n", + " epsilon=1e-6, name=\"encoder_{}/att_layernormalization\".format(i)\n", + " )(query + attention_output)\n", + "\n", + " # Feed-forward layer\n", + " ffn = keras.Sequential(\n", + " [\n", + " layers.Dense(config.FF_DIM, activation=\"relu\"),\n", + " layers.Dense(config.EMBED_DIM),\n", + " ],\n", + " name=\"encoder_{}/ffn\".format(i),\n", + " )\n", + " ffn_output = ffn(attention_output)\n", + " ffn_output = layers.Dropout(0.1, name=\"encoder_{}/ffn_dropout\".format(i))(\n", + " ffn_output\n", + " )\n", + " sequence_output = layers.LayerNormalization(\n", + " epsilon=1e-6, name=\"encoder_{}/ffn_layernormalization\".format(i)\n", + " )(attention_output + ffn_output)\n", + " return sequence_output\n", + "\n", + "\n", + "def get_pos_encoding_matrix(max_len, d_emb):\n", + " pos_enc = np.array(\n", + " [\n", + " [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]\n", + " if pos != 0\n", + " else np.zeros(d_emb)\n", + " for pos in range(max_len)\n", + " ]\n", + " )\n", + " pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i\n", + " pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1\n", + " return pos_enc\n", + "\n", + "\n", + "loss_fn = keras.losses.SparseCategoricalCrossentropy(\n", + " reduction=tf.keras.losses.Reduction.NONE\n", + ")\n", + "loss_tracker = tf.keras.metrics.Mean(name=\"loss\")\n", + "\n", + "\n", + "class MaskedLanguageModel(tf.keras.Model):\n", + " def train_step(self, inputs):\n", + " if len(inputs) == 3:\n", + " features, labels, sample_weight = inputs\n", + " else:\n", + " features, labels = inputs\n", + " sample_weight = None\n", + "\n", + " with tf.GradientTape() as tape:\n", + " predictions = self(features, training=True)\n", + " loss = loss_fn(labels, predictions, 
sample_weight=sample_weight)\n", + "\n", + " # Compute gradients\n", + " trainable_vars = self.trainable_variables\n", + " gradients = tape.gradient(loss, trainable_vars)\n", + "\n", + " # Update weights\n", + " self.optimizer.apply_gradients(zip(gradients, trainable_vars))\n", + "\n", + " # Compute our own metrics\n", + " loss_tracker.update_state(loss, sample_weight=sample_weight)\n", + "\n", + " # Return a dict mapping metric names to current value\n", + " return {\"loss\": loss_tracker.result()}\n", + "\n", + " @property\n", + " def metrics(self):\n", + " # We list our `Metric` objects here so that `reset_states()` can be\n", + " # called automatically at the start of each epoch\n", + " # or at the start of `evaluate()`.\n", + " # If you don't implement this property, you have to call\n", + " # `reset_states()` yourself at the time of your choosing.\n", + " return [loss_tracker]\n", + "\n", + "\n", + "def create_masked_language_bert_model():\n", + " inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)\n", + "\n", + " word_embeddings = layers.Embedding(\n", + " config.VOCAB_SIZE, config.EMBED_DIM, name=\"word_embedding\"\n", + " )(inputs)\n", + " position_embeddings = layers.Embedding(\n", + " input_dim=config.MAX_LEN,\n", + " output_dim=config.EMBED_DIM,\n", + " weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],\n", + " name=\"position_embedding\",\n", + " )(tf.range(start=0, limit=config.MAX_LEN, delta=1))\n", + " embeddings = word_embeddings + position_embeddings\n", + "\n", + " encoder_output = embeddings\n", + " for i in range(config.NUM_LAYERS):\n", + " encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)\n", + "\n", + " mlm_output = layers.Dense(config.VOCAB_SIZE, name=\"mlm_cls\", activation=\"softmax\")(\n", + " encoder_output\n", + " )\n", + " mlm_model = MaskedLanguageModel(inputs, mlm_output, name=\"masked_bert_model\")\n", + "\n", + " optimizer = keras.optimizers.Adam(learning_rate=config.LR)\n", + " mlm_model.compile(optimizer=optimizer)\n", + " return mlm_model\n", + "\n", + "\n", + "id2token = dict(enumerate(vectorize_layer.get_vocabulary()))\n", + "token2id = {y: x for x, y in id2token.items()}\n", + "\n", + "\n", + "class MaskedTextGenerator(keras.callbacks.Callback):\n", + " def __init__(self, sample_tokens, top_k=5):\n", + " self.sample_tokens = sample_tokens\n", + " self.k = top_k\n", + "\n", + " def decode(self, tokens):\n", + " return \" \".join([id2token[t] for t in tokens if t != 0])\n", + "\n", + " def convert_ids_to_tokens(self, id):\n", + " return id2token[id]\n", + "\n", + " def on_epoch_end(self, epoch, logs=None):\n", + " prediction = self.model.predict(self.sample_tokens)\n", + "\n", + " masked_index = np.where(self.sample_tokens == mask_token_id)\n", + " masked_index = masked_index[1]\n", + " mask_prediction = prediction[0][masked_index]\n", + "\n", + " top_indices = mask_prediction[0].argsort()[-self.k :][::-1]\n", + " values = mask_prediction[0][top_indices]\n", + "\n", + " for i in range(len(top_indices)):\n", + " p = top_indices[i]\n", + " v = values[i]\n", + " tokens = np.copy(sample_tokens[0])\n", + " tokens[masked_index[0]] = p\n", + " result = {\n", + " \"input_text\": self.decode(sample_tokens[0].numpy()),\n", + " \"prediction\": self.decode(tokens),\n", + " \"probability\": v,\n", + " \"predicted mask token\": self.convert_ids_to_tokens(p),\n", + " }\n", + " pprint(result)\n", + "\n", + "\n", + "sample_tokens = vectorize_layer([\"I have watched this [mask] and it was awesome\"])\n", + 
"generator_callback = MaskedTextGenerator(sample_tokens.numpy())\n", + "\n", + "bert_masked_model = create_masked_language_bert_model()\n", + "bert_masked_model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Train and Save" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback])\n", + "bert_masked_model.save(\"bert_mlm_imdb.h5\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Fine-tune a sentiment classification model\n", + "\n", + "We will fine-tune our self-supervised model on a downstream task of sentiment classification.\n", + "To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the\n", + "pretrained BERT features." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "# Load pretrained bert model\n", + "mlm_model = keras.models.load_model(\n", + " \"bert_mlm_imdb.h5\", custom_objects={\"MaskedLanguageModel\": MaskedLanguageModel}\n", + ")\n", + "pretrained_bert_model = tf.keras.Model(\n", + " mlm_model.input, mlm_model.get_layer(\"encoder_0/ffn_layernormalization\").output\n", + ")\n", + "\n", + "# Freeze it\n", + "pretrained_bert_model.trainable = False\n", + "\n", + "\n", + "def create_classifier_bert_model():\n", + " inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)\n", + " sequence_output = pretrained_bert_model(inputs)\n", + " pooled_output = layers.GlobalMaxPooling1D()(sequence_output)\n", + " hidden_layer = layers.Dense(64, activation=\"relu\")(pooled_output)\n", + " outputs = layers.Dense(1, activation=\"sigmoid\")(hidden_layer)\n", + " classifer_model = keras.Model(inputs, outputs, name=\"classification\")\n", + " optimizer = keras.optimizers.Adam()\n", + " classifer_model.compile(\n", + " optimizer=optimizer, loss=\"binary_crossentropy\", metrics=[\"accuracy\"]\n", + " )\n", + " return classifer_model\n", + "\n", + "\n", + "classifer_model = create_classifier_bert_model()\n", + "classifer_model.summary()\n", + "\n", + "# Train the classifier with frozen BERT stage\n", + "classifer_model.fit(\n", + " train_classifier_ds,\n", + " epochs=5,\n", + " validation_data=test_classifier_ds,\n", + ")\n", + "\n", + "# Unfreeze the BERT model for fine-tuning\n", + "pretrained_bert_model.trainable = True\n", + "optimizer = keras.optimizers.Adam()\n", + "classifer_model.compile(\n", + " optimizer=optimizer, loss=\"binary_crossentropy\", metrics=[\"accuracy\"]\n", + ")\n", + "classifer_model.fit(\n", + " train_classifier_ds,\n", + " epochs=5,\n", + " validation_data=test_classifier_ds,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text" + }, + "source": [ + "## Create an end-to-end model and evaluate it\n", + "\n", + "When you want to deploy a model, it's best if it already includes its preprocessing\n", + "pipeline, so that you don't have to reimplement the preprocessing logic in your\n", + "production environment. Let's create an end-to-end model that incorporates\n", + "the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings\n", + "as input." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab_type": "code" + }, + "outputs": [], + "source": [ + "\n", + "def get_end_to_end(model):\n", + " inputs_string = keras.Input(shape=(1,), dtype=\"string\")\n", + " indices = vectorize_layer(inputs_string)\n", + " outputs = model(indices)\n", + " end_to_end_model = keras.Model(inputs_string, outputs, name=\"end_to_end_model\")\n", + " optimizer = keras.optimizers.Adam(learning_rate=config.LR)\n", + " end_to_end_model.compile(\n", + " optimizer=optimizer, loss=\"binary_crossentropy\", metrics=[\"accuracy\"]\n", + " )\n", + " return end_to_end_model\n", + "\n", + "\n", + "end_to_end_classification_model = get_end_to_end(classifer_model)\n", + "end_to_end_classification_model.evaluate(test_raw_classifier_ds)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "mlm_and_finetune_with_bert", + "private_outputs": false, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.0" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/examples/nlp/md/mlm_and_finetune_with_bert.md b/examples/nlp/md/mlm_and_finetune_with_bert.md new file mode 100644 index 0000000000..bb8e302951 --- /dev/null +++ b/examples/nlp/md/mlm_and_finetune_with_bert.md @@ -0,0 +1,704 @@ + +# End-to-end Masked Language Modeling with BERT + +**Author:** [Ankur Singh](https://twitter.com/ankur310794)
+**Date created:** 2020/09/18<br>
+**Last modified:** 2020/09/18<br>
+ + + [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb) [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/examples/nlp/mlm_and_finetune_with_bert.py) + + +**Description:** Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset. + +--- +## Introduction + +Masked Language Modeling is a fill-in-the-blank task, +where a model uses the context words surrounding a mask token to try to predict what the +masked word should be. + +For an input that contains one or more mask tokens, +the model will generate the most likely substitution for each. + +Example: + +- Input: "I have watched this [MASK] and it was awesome." +- Output: "I have watched this movie and it was awesome." + +Masked language modeling is a great way to train a language +model in a self-supervised setting (without human-annotated labels). +Such a model can then be fine-tuned to accomplish various supervised +NLP tasks. + +This example teaches you how to build a BERT model from scratch, +train it with the masked language modeling task, +and then fine-tune this model on a sentiment classification task. + +We will use the Keras `TextVectorization` and `MultiHeadAttention` layers +to create a BERT Transformer-Encoder network architecture. + +Note: This example should be run with `tf-nightly`. + +--- +## Setup + +Install `tf-nightly` via `pip install tf-nightly`. + + +```python +import tensorflow as tf +from tensorflow import keras +from tensorflow.keras import layers +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +from dataclasses import dataclass +import pandas as pd +import numpy as np +import glob +import re +from pprint import pprint +``` + +--- +## Set-up Configuration + + +```python + +@dataclass +class Config: + MAX_LEN = 256 + BATCH_SIZE = 32 + LR = 0.001 + VOCAB_SIZE = 30000 + EMBED_DIM = 128 + NUM_HEAD = 8 # used in bert model + FF_DIM = 128 # used in bert model + NUM_LAYERS = 1 + + +config = Config() +``` + +--- +## Load the data + +We will first download the IMDB data and load into a Pandas dataframe. + + +```python +!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz +!tar -xf aclImdb_v1.tar.gz +``` + +```python + +def get_text_list_from_files(files): + text_list = [] + for name in files: + with open(name) as f: + for line in f: + text_list.append(line) + return text_list + + +def get_data_from_text_files(folder_name): + + pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt") + pos_texts = get_text_list_from_files(pos_files) + neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt") + neg_texts = get_text_list_from_files(neg_files) + df = pd.DataFrame( + { + "review": pos_texts + neg_texts, + "sentiment": [0] * len(pos_texts) + [1] * len(neg_texts), + } + ) + df = df.sample(len(df)).reset_index(drop=True) + return df + + +train_df = get_data_from_text_files("train") +test_df = get_data_from_text_files("test") + +all_data = train_df.append(test_df) +``` +
+``` + % Total % Received % Xferd Average Speed Time Time Time Current + Dload Upload Total Spent Left Speed +100 80.2M 100 80.2M 0 0 45.3M 0 0:00:01 0:00:01 --:--:-- 45.3M + +``` +
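+
+Before moving on, it is worth a quick sanity check of what was loaded. The snippet below
+is an illustrative addition (not part of the original pipeline) that only inspects the
+dataframes built above. Note that, as the labels were assigned, positive reviews carry
+`sentiment = 0` and negative reviews carry `sentiment = 1`.
+
+```python
+# Optional sanity check: each IMDB split should contain 25,000 balanced reviews.
+print(train_df.shape, test_df.shape)
+print(train_df.sentiment.value_counts())
+
+# Peek at the first 200 characters of one review.
+print(train_df.review.iloc[0][:200])
+```
+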
+--- +## Dataset preparation + +We will use the `TextVectorization` layer to vectorize the text into integer token ids. +It transforms a batch of strings into either +a sequence of token indices (one sample = 1D array of integer token indices, in order) +or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens). + +Below, we define 3 preprocessing functions. + +1. The `get_vectorize_layer` function builds the `TextVectorization` layer. +2. The `encode` function encodes raw text into integer token ids. +3. The `get_masked_input_and_labels` function will mask input token ids. +It masks 15% of all input tokens in each sequence at random. + + +```python + +def custom_standardization(input_data): + lowercase = tf.strings.lower(input_data) + stripped_html = tf.strings.regex_replace(lowercase, "
", " ") + return tf.strings.regex_replace( + stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@\^_`{|}~"), "" + ) + + +def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]): + """Build Text vectorization layer + + Args: + texts (list): List of string i.e input texts + vocab_size (int): vocab size + max_seq (int): Maximum sequence lenght. + special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]']. + + Returns: + layers.Layer: Return TextVectorization Keras Layer + """ + vectorize_layer = TextVectorization( + max_tokens=vocab_size, + output_mode="int", + standardize=custom_standardization, + output_sequence_length=max_seq, + ) + vectorize_layer.adapt(texts) + + # Insert mask token in vocabulary + vocab = vectorize_layer.get_vocabulary() + vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"] + vectorize_layer.set_vocabulary(vocab) + return vectorize_layer + + +vectorize_layer = get_vectorize_layer( + all_data.review.values.tolist(), + config.VOCAB_SIZE, + config.MAX_LEN, + special_tokens=["[mask]"], +) + +# Get mask token id for masked language model +mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0] + + +def encode(texts): + encoded_texts = vectorize_layer(texts) + return encoded_texts.numpy() + + +def get_masked_input_and_labels(encoded_texts): + # 15% BERT masking + inp_mask = np.random.rand(*encoded_texts.shape) < 0.15 + # Do not mask special tokens + inp_mask[encoded_texts <= 2] = False + # Set targets to -1 by default, it means ignore + labels = -1 * np.ones(encoded_texts.shape, dtype=int) + # Set labels for masked tokens + labels[inp_mask] = encoded_texts[inp_mask] + + # Prepare input + encoded_texts_masked = np.copy(encoded_texts) + # Set input to [MASK] which is the last token for the 90% of tokens + # This means leaving 10% unchanged + inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90) + encoded_texts_masked[ + inp_mask_2mask + ] = mask_token_id # mask token is the last in the dict + + # Set 10% to a random token + inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9) + encoded_texts_masked[inp_mask_2random] = np.random.randint( + 3, mask_token_id, inp_mask_2random.sum() + ) + + # Prepare sample_weights to pass to .fit() method + sample_weights = np.ones(labels.shape) + sample_weights[labels == -1] = 0 + + # y_labels would be same as encoded_texts i.e input tokens + y_labels = np.copy(encoded_texts) + + return encoded_texts_masked, y_labels, sample_weights + + +# We have 25000 examples for training +x_train = encode(train_df.review.values) # encode reviews with vectorizer +y_train = train_df.sentiment.values +train_classifier_ds = ( + tf.data.Dataset.from_tensor_slices((x_train, y_train)) + .shuffle(1000) + .batch(config.BATCH_SIZE) +) + +# We have 25000 examples for testing +x_test = encode(test_df.review.values) +y_test = test_df.sentiment.values +test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch( + config.BATCH_SIZE +) + +# Build dataset for end to end model input (will be used at the end) +test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices( + (test_df.review.values, y_test) +).batch(config.BATCH_SIZE) + +# Prepare data for masked language model +x_all_review = encode(all_data.review.values) +x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels( + x_all_review +) + +mlm_ds = tf.data.Dataset.from_tensor_slices( + (x_masked_train, y_masked_labels, sample_weights) +) +mlm_ds = 
mlm_ds.shuffle(1000).batch(config.BATCH_SIZE) +``` + +--- +## Create BERT model (Pretraining Model) for masked language modeling + +We will create a BERT-like pretraining model architecture +using the `MultiHeadAttention` layer. +It will take token ids as inputs (including masked tokens) +and it will predict the correct ids for the masked input tokens. + + +```python + +def bert_module(query, key, value, i): + # Multi headed self-attention + attention_output = layers.MultiHeadAttention( + num_heads=config.NUM_HEAD, + key_dim=config.EMBED_DIM // config.NUM_HEAD, + name="encoder_{}/multiheadattention".format(i), + )(query, key, value) + attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))( + attention_output + ) + attention_output = layers.LayerNormalization( + epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i) + )(query + attention_output) + + # Feed-forward layer + ffn = keras.Sequential( + [ + layers.Dense(config.FF_DIM, activation="relu"), + layers.Dense(config.EMBED_DIM), + ], + name="encoder_{}/ffn".format(i), + ) + ffn_output = ffn(attention_output) + ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))( + ffn_output + ) + sequence_output = layers.LayerNormalization( + epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i) + )(attention_output + ffn_output) + return sequence_output + + +def get_pos_encoding_matrix(max_len, d_emb): + pos_enc = np.array( + [ + [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)] + if pos != 0 + else np.zeros(d_emb) + for pos in range(max_len) + ] + ) + pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i + pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1 + return pos_enc + + +loss_fn = keras.losses.SparseCategoricalCrossentropy( + reduction=tf.keras.losses.Reduction.NONE +) +loss_tracker = tf.keras.metrics.Mean(name="loss") + + +class MaskedLanguageModel(tf.keras.Model): + def train_step(self, inputs): + if len(inputs) == 3: + features, labels, sample_weight = inputs + else: + features, labels = inputs + sample_weight = None + + with tf.GradientTape() as tape: + predictions = self(features, training=True) + loss = loss_fn(labels, predictions, sample_weight=sample_weight) + + # Compute gradients + trainable_vars = self.trainable_variables + gradients = tape.gradient(loss, trainable_vars) + + # Update weights + self.optimizer.apply_gradients(zip(gradients, trainable_vars)) + + # Compute our own metrics + loss_tracker.update_state(loss, sample_weight=sample_weight) + + # Return a dict mapping metric names to current value + return {"loss": loss_tracker.result()} + + @property + def metrics(self): + # We list our `Metric` objects here so that `reset_states()` can be + # called automatically at the start of each epoch + # or at the start of `evaluate()`. + # If you don't implement this property, you have to call + # `reset_states()` yourself at the time of your choosing. 
+ return [loss_tracker] + + +def create_masked_language_bert_model(): + inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64) + + word_embeddings = layers.Embedding( + config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding" + )(inputs) + position_embeddings = layers.Embedding( + input_dim=config.MAX_LEN, + output_dim=config.EMBED_DIM, + weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)], + name="position_embedding", + )(tf.range(start=0, limit=config.MAX_LEN, delta=1)) + embeddings = word_embeddings + position_embeddings + + encoder_output = embeddings + for i in range(config.NUM_LAYERS): + encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i) + + mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")( + encoder_output + ) + mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model") + + optimizer = keras.optimizers.Adam(learning_rate=config.LR) + mlm_model.compile(optimizer=optimizer) + return mlm_model + + +id2token = dict(enumerate(vectorize_layer.get_vocabulary())) +token2id = {y: x for x, y in id2token.items()} + + +class MaskedTextGenerator(keras.callbacks.Callback): + def __init__(self, sample_tokens, top_k=5): + self.sample_tokens = sample_tokens + self.k = top_k + + def decode(self, tokens): + return " ".join([id2token[t] for t in tokens if t != 0]) + + def convert_ids_to_tokens(self, id): + return id2token[id] + + def on_epoch_end(self, epoch, logs=None): + prediction = self.model.predict(self.sample_tokens) + + masked_index = np.where(self.sample_tokens == mask_token_id) + masked_index = masked_index[1] + mask_prediction = prediction[0][masked_index] + + top_indices = mask_prediction[0].argsort()[-self.k :][::-1] + values = mask_prediction[0][top_indices] + + for i in range(len(top_indices)): + p = top_indices[i] + v = values[i] + tokens = np.copy(sample_tokens[0]) + tokens[masked_index[0]] = p + result = { + "input_text": self.decode(sample_tokens[0].numpy()), + "prediction": self.decode(tokens), + "probability": v, + "predicted mask token": self.convert_ids_to_tokens(p), + } + pprint(result) + + +sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"]) +generator_callback = MaskedTextGenerator(sample_tokens.numpy()) + +bert_masked_model = create_masked_language_bert_model() +bert_masked_model.summary() +``` + +
+``` +Model: "masked_bert_model" +__________________________________________________________________________________________________ +Layer (type) Output Shape Param # Connected to +================================================================================================== +input_1 (InputLayer) [(None, 256)] 0 +__________________________________________________________________________________________________ +word_embedding (Embedding) (None, 256, 128) 3840000 input_1[0][0] +__________________________________________________________________________________________________ +tf.__operators__.add (TFOpLambd (None, 256, 128) 0 word_embedding[0][0] +__________________________________________________________________________________________________ +encoder_0/multiheadattention (M (None, 256, 128) 66048 tf.__operators__.add[0][0] + tf.__operators__.add[0][0] + tf.__operators__.add[0][0] +__________________________________________________________________________________________________ +encoder_0/att_dropout (Dropout) (None, 256, 128) 0 encoder_0/multiheadattention[0][0 +__________________________________________________________________________________________________ +tf.__operators__.add_1 (TFOpLam (None, 256, 128) 0 tf.__operators__.add[0][0] + encoder_0/att_dropout[0][0] +__________________________________________________________________________________________________ +encoder_0/att_layernormalizatio (None, 256, 128) 256 tf.__operators__.add_1[0][0] +__________________________________________________________________________________________________ +encoder_0/ffn (Sequential) (None, 256, 128) 33024 encoder_0/att_layernormalization[ +__________________________________________________________________________________________________ +encoder_0/ffn_dropout (Dropout) (None, 256, 128) 0 encoder_0/ffn[0][0] +__________________________________________________________________________________________________ +tf.__operators__.add_2 (TFOpLam (None, 256, 128) 0 encoder_0/att_layernormalization[ + encoder_0/ffn_dropout[0][0] +__________________________________________________________________________________________________ +encoder_0/ffn_layernormalizatio (None, 256, 128) 256 tf.__operators__.add_2[0][0] +__________________________________________________________________________________________________ +mlm_cls (Dense) (None, 256, 30000) 3870000 encoder_0/ffn_layernormalization[ +================================================================================================== +Total params: 7,809,584 +Trainable params: 7,809,584 +Non-trainable params: 0 +__________________________________________________________________________________________________ + +``` +
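+
+To make the masking strategy concrete, the snippet below (an illustrative addition, not
+part of the original example) applies `get_masked_input_and_labels` to a single encoded
+review and decodes the result with the `id2token` mapping defined above. Roughly 15% of
+the non-special tokens should receive a non-zero sample weight, and most of those
+positions should appear as `[mask]` in the decoded text.
+
+```python
+# Illustrative only: inspect the MLM input/target pair for one review.
+sample = encode(train_df.review.values[:1])  # shape: (1, MAX_LEN)
+masked_sample, target_ids, weights = get_masked_input_and_labels(sample)
+
+# Decode the masked input the same way the callback does (dropping padding id 0).
+print(" ".join(id2token[t] for t in masked_sample[0] if t != 0)[:300])
+
+# Number of positions that will contribute to the MLM loss for this review.
+print("masked positions:", int(weights[0].sum()))
+```
+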
+--- +## Train and Save + + +```python +bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback]) +bert_masked_model.save("bert_mlm_imdb.h5") +``` + +
+``` +Epoch 1/5 +1563/1563 [==============================] - ETA: 0s - loss: 7.0111{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'this', + 'prediction': 'i have watched this this and it was awesome', + 'probability': 0.086307295} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'i', + 'prediction': 'i have watched this i and it was awesome', + 'probability': 0.066265985} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'movie', + 'prediction': 'i have watched this movie and it was awesome', + 'probability': 0.044195656} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'a', + 'prediction': 'i have watched this a and it was awesome', + 'probability': 0.04020928} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'was', + 'prediction': 'i have watched this was and it was awesome', + 'probability': 0.027878676} +1563/1563 [==============================] - 661s 423ms/step - loss: 7.0111 +Epoch 2/5 +1563/1563 [==============================] - ETA: 0s - loss: 6.4498{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'movie', + 'prediction': 'i have watched this movie and it was awesome', + 'probability': 0.44448906} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'film', + 'prediction': 'i have watched this film and it was awesome', + 'probability': 0.1507494} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'is', + 'prediction': 'i have watched this is and it was awesome', + 'probability': 0.06385628} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'one', + 'prediction': 'i have watched this one and it was awesome', + 'probability': 0.023549262} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'was', + 'prediction': 'i have watched this was and it was awesome', + 'probability': 0.022277055} +1563/1563 [==============================] - 660s 422ms/step - loss: 6.4498 +Epoch 3/5 +1563/1563 [==============================] - ETA: 0s - loss: 5.8709{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'movie', + 'prediction': 'i have watched this movie and it was awesome', + 'probability': 0.4759983} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'film', + 'prediction': 'i have watched this film and it was awesome', + 'probability': 0.18642229} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'one', + 'prediction': 'i have watched this one and it was awesome', + 'probability': 0.045611132} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'is', + 'prediction': 'i have watched this is and it was awesome', + 'probability': 0.028308254} +{'input_text': 'i have watched this [mask] and it was awesome', + 'predicted mask token': 'series', + 'prediction': 'i have watched this series and it was awesome', + 'probability': 0.027862877} +1563/1563 [==============================] - 661s 423ms/step - loss: 5.8709 +Epoch 4/5 + 771/1563 [=============>................] - ETA: 5:35 - loss: 5.3782 + +``` +
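+
+The `MaskedTextGenerator` callback already prints the top predictions at the end of each
+epoch. As an illustrative addition, the same query can be issued directly against the
+trained model, outside of any callback:
+
+```python
+# Illustrative only: ask the trained MLM for the most likely tokens at the masked position.
+tokens = vectorize_layer(["I have watched this [mask] and it was awesome"]).numpy()
+probs = bert_masked_model.predict(tokens)  # shape: (1, MAX_LEN, VOCAB_SIZE)
+masked_pos = np.where(tokens[0] == mask_token_id)[0][0]
+top5 = probs[0, masked_pos].argsort()[-5:][::-1]
+print([id2token[i] for i in top5])
+```
+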
+--- +## Fine-tune a sentiment classification model + +We will fine-tune our self-supervised model on a downstream task of sentiment classification. +To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the +pretrained BERT features. + + +```python +# Load pretrained bert model +mlm_model = keras.models.load_model( + "bert_mlm_imdb.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel} +) +pretrained_bert_model = tf.keras.Model( + mlm_model.input, mlm_model.get_layer("encoder_0/ffn_layernormalization").output +) + +# Freeze it +pretrained_bert_model.trainable = False + + +def create_classifier_bert_model(): + inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64) + sequence_output = pretrained_bert_model(inputs) + pooled_output = layers.GlobalMaxPooling1D()(sequence_output) + hidden_layer = layers.Dense(64, activation="relu")(pooled_output) + outputs = layers.Dense(1, activation="sigmoid")(hidden_layer) + classifer_model = keras.Model(inputs, outputs, name="classification") + optimizer = keras.optimizers.Adam() + classifer_model.compile( + optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"] + ) + return classifer_model + + +classifer_model = create_classifier_bert_model() +classifer_model.summary() + +# Train the classifier with frozen BERT stage +classifer_model.fit( + train_classifier_ds, + epochs=5, + validation_data=test_classifier_ds, +) + +# Unfreeze the BERT model for fine-tuning +pretrained_bert_model.trainable = True +optimizer = keras.optimizers.Adam() +classifer_model.compile( + optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"] +) +classifer_model.fit( + train_classifier_ds, + epochs=5, + validation_data=test_classifier_ds, +) +``` + +
+``` +Model: "classification" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +input_2 (InputLayer) [(None, 256)] 0 +_________________________________________________________________ +model (Functional) (None, 256, 128) 3939584 +_________________________________________________________________ +global_max_pooling1d (Global (None, 128) 0 +_________________________________________________________________ +dense_2 (Dense) (None, 64) 8256 +_________________________________________________________________ +dense_3 (Dense) (None, 1) 65 +================================================================= +Total params: 3,947,905 +Trainable params: 8,321 +Non-trainable params: 3,939,584 +_________________________________________________________________ +Epoch 1/5 +782/782 [==============================] - 15s 19ms/step - loss: 0.8096 - accuracy: 0.5498 - val_loss: 0.6406 - val_accuracy: 0.6329 +Epoch 2/5 +782/782 [==============================] - 14s 18ms/step - loss: 0.6551 - accuracy: 0.6220 - val_loss: 0.6423 - val_accuracy: 0.6338 +Epoch 3/5 +782/782 [==============================] - 14s 18ms/step - loss: 0.6473 - accuracy: 0.6310 - val_loss: 0.6380 - val_accuracy: 0.6350 +Epoch 4/5 +782/782 [==============================] - 14s 18ms/step - loss: 0.6307 - accuracy: 0.6471 - val_loss: 0.6432 - val_accuracy: 0.6312 +Epoch 5/5 +782/782 [==============================] - 14s 18ms/step - loss: 0.6278 - accuracy: 0.6465 - val_loss: 0.6107 - val_accuracy: 0.6678 +Epoch 1/5 +782/782 [==============================] - 46s 59ms/step - loss: 0.5234 - accuracy: 0.7373 - val_loss: 0.3533 - val_accuracy: 0.8427 +Epoch 2/5 +782/782 [==============================] - 45s 57ms/step - loss: 0.2808 - accuracy: 0.8814 - val_loss: 0.3252 - val_accuracy: 0.8633 +Epoch 3/5 +782/782 [==============================] - 43s 55ms/step - loss: 0.1493 - accuracy: 0.9413 - val_loss: 0.4374 - val_accuracy: 0.8486 +Epoch 4/5 +782/782 [==============================] - 43s 55ms/step - loss: 0.0600 - accuracy: 0.9803 - val_loss: 0.6422 - val_accuracy: 0.8380 +Epoch 5/5 +782/782 [==============================] - 43s 55ms/step - loss: 0.0305 - accuracy: 0.9893 - val_loss: 0.6064 - val_accuracy: 0.8440 + + + +``` +
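+
+Before wrapping the classifier into an end-to-end model, here is an illustrative look
+(not part of the original example) at a few raw predictions. Keep in mind the label
+convention used when the dataframe was built: 0 marks a positive review and 1 a negative
+one, so a sigmoid output close to 1 corresponds to "predicted negative".
+
+```python
+# Illustrative only: compare a handful of classifier outputs against the true labels.
+probs = classifer_model.predict(x_test[:5])
+for p, label in zip(probs[:, 0], y_test[:5]):
+    print(f"predicted={p:.3f}  true label={label}")
+```
+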
+--- +## Create an end-to-end model and evaluate it + +When you want to deploy a model, it's best if it already includes its preprocessing +pipeline, so that you don't have to reimplement the preprocessing logic in your +production environment. Let's create an end-to-end model that incorporates +the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings +as input. + + +```python + +def get_end_to_end(model): + inputs_string = keras.Input(shape=(1,), dtype="string") + indices = vectorize_layer(inputs_string) + outputs = model(indices) + end_to_end_model = keras.Model(inputs_string, outputs, name="end_to_end_model") + optimizer = keras.optimizers.Adam(learning_rate=config.LR) + end_to_end_model.compile( + optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"] + ) + return end_to_end_model + + +end_to_end_classification_model = get_end_to_end(classifer_model) +end_to_end_classification_model.evaluate(test_raw_classifier_ds) +``` + +
+``` +782/782 [==============================] - 8s 11ms/step - loss: 0.5967 - accuracy: 0.8446 + +[0.6064175963401794, 0.8439599871635437] + +``` +
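+
+Finally, as an illustrative addition, the end-to-end model can be called on raw strings
+directly. The two reviews below are made-up examples rather than samples from the
+dataset, and, given the label convention above, a higher score means a more negative
+prediction.
+
+```python
+# Illustrative only: the end-to-end model consumes raw strings, no manual vectorization needed.
+examples = tf.constant(
+    [
+        "i have watched this movie and it was awesome",
+        "this movie was a complete waste of time",
+    ]
+)
+print(end_to_end_classification_model.predict(examples))
+```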
\ No newline at end of file diff --git a/examples/nlp/mlm_and_finetune_with_bert.py b/examples/nlp/mlm_and_finetune_with_bert.py new file mode 100644 index 0000000000..9b4ebc07e9 --- /dev/null +++ b/examples/nlp/mlm_and_finetune_with_bert.py @@ -0,0 +1,505 @@ +""" +Title: End-to-end Masked Language Modeling with BERT +Author: [Ankur Singh](https://twitter.com/ankur310794) +Date created: 2020/09/18 +Last modified: 2020/09/18 +Description: Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset. +""" +""" +## Introduction + +Masked Language Modeling is a fill-in-the-blank task, +where a model uses the context words surrounding a mask token to try to predict what the +masked word should be. + +For an input that contains one or more mask tokens, +the model will generate the most likely substitution for each. + +Example: + +- Input: "I have watched this [MASK] and it was awesome." +- Output: "I have watched this movie and it was awesome." + +Masked language modeling is a great way to train a language +model in a self-supervised setting (without human-annotated labels). +Such a model can then be fine-tuned to accomplish various supervised +NLP tasks. + +This example teaches you how to build a BERT model from scratch, +train it with the masked language modeling task, +and then fine-tune this model on a sentiment classification task. + +We will use the Keras `TextVectorization` and `MultiHeadAttention` layers +to create a BERT Transformer-Encoder network architecture. + +Note: This example should be run with `tf-nightly`. +""" + +""" +## Setup + +Install `tf-nightly` via `pip install tf-nightly`. +""" + +import tensorflow as tf +from tensorflow import keras +from tensorflow.keras import layers +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +from dataclasses import dataclass +import pandas as pd +import numpy as np +import glob +import re +from pprint import pprint + +""" +## Set-up Configuration +""" + + +@dataclass +class Config: + MAX_LEN = 256 + BATCH_SIZE = 32 + LR = 0.001 + VOCAB_SIZE = 30000 + EMBED_DIM = 128 + NUM_HEAD = 8 # used in bert model + FF_DIM = 128 # used in bert model + NUM_LAYERS = 1 + + +config = Config() + +""" +## Load the data + +We will first download the IMDB data and load into a Pandas dataframe. +""" + +"""shell +curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz +tar -xf aclImdb_v1.tar.gz +""" + + +def get_text_list_from_files(files): + text_list = [] + for name in files: + with open(name) as f: + for line in f: + text_list.append(line) + return text_list + + +def get_data_from_text_files(folder_name): + + pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt") + pos_texts = get_text_list_from_files(pos_files) + neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt") + neg_texts = get_text_list_from_files(neg_files) + df = pd.DataFrame( + { + "review": pos_texts + neg_texts, + "sentiment": [0] * len(pos_texts) + [1] * len(neg_texts), + } + ) + df = df.sample(len(df)).reset_index(drop=True) + return df + + +train_df = get_data_from_text_files("train") +test_df = get_data_from_text_files("test") + +all_data = train_df.append(test_df) + +""" +## Dataset preparation + +We will use the `TextVectorization` layer to vectorize the text into integer token ids. 
+It transforms a batch of strings into either +a sequence of token indices (one sample = 1D array of integer token indices, in order) +or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens). + +Below, we define 3 preprocessing functions. + +1. The `get_vectorize_layer` function builds the `TextVectorization` layer. +2. The `encode` function encodes raw text into integer token ids. +3. The `get_masked_input_and_labels` function will mask input token ids. +It masks 15% of all input tokens in each sequence at random. +""" + + +def custom_standardization(input_data): + lowercase = tf.strings.lower(input_data) + stripped_html = tf.strings.regex_replace(lowercase, "
", " ") + return tf.strings.regex_replace( + stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@\^_`{|}~"), "" + ) + + +def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]): + """Build Text vectorization layer + + Args: + texts (list): List of string i.e input texts + vocab_size (int): vocab size + max_seq (int): Maximum sequence lenght. + special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]']. + + Returns: + layers.Layer: Return TextVectorization Keras Layer + """ + vectorize_layer = TextVectorization( + max_tokens=vocab_size, + output_mode="int", + standardize=custom_standardization, + output_sequence_length=max_seq, + ) + vectorize_layer.adapt(texts) + + # Insert mask token in vocabulary + vocab = vectorize_layer.get_vocabulary() + vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"] + vectorize_layer.set_vocabulary(vocab) + return vectorize_layer + + +vectorize_layer = get_vectorize_layer( + all_data.review.values.tolist(), + config.VOCAB_SIZE, + config.MAX_LEN, + special_tokens=["[mask]"], +) + +# Get mask token id for masked language model +mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0] + + +def encode(texts): + encoded_texts = vectorize_layer(texts) + return encoded_texts.numpy() + + +def get_masked_input_and_labels(encoded_texts): + # 15% BERT masking + inp_mask = np.random.rand(*encoded_texts.shape) < 0.15 + # Do not mask special tokens + inp_mask[encoded_texts <= 2] = False + # Set targets to -1 by default, it means ignore + labels = -1 * np.ones(encoded_texts.shape, dtype=int) + # Set labels for masked tokens + labels[inp_mask] = encoded_texts[inp_mask] + + # Prepare input + encoded_texts_masked = np.copy(encoded_texts) + # Set input to [MASK] which is the last token for the 90% of tokens + # This means leaving 10% unchanged + inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90) + encoded_texts_masked[ + inp_mask_2mask + ] = mask_token_id # mask token is the last in the dict + + # Set 10% to a random token + inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9) + encoded_texts_masked[inp_mask_2random] = np.random.randint( + 3, mask_token_id, inp_mask_2random.sum() + ) + + # Prepare sample_weights to pass to .fit() method + sample_weights = np.ones(labels.shape) + sample_weights[labels == -1] = 0 + + # y_labels would be same as encoded_texts i.e input tokens + y_labels = np.copy(encoded_texts) + + return encoded_texts_masked, y_labels, sample_weights + + +# We have 25000 examples for training +x_train = encode(train_df.review.values) # encode reviews with vectorizer +y_train = train_df.sentiment.values +train_classifier_ds = ( + tf.data.Dataset.from_tensor_slices((x_train, y_train)) + .shuffle(1000) + .batch(config.BATCH_SIZE) +) + +# We have 25000 examples for testing +x_test = encode(test_df.review.values) +y_test = test_df.sentiment.values +test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch( + config.BATCH_SIZE +) + +# Build dataset for end to end model input (will be used at the end) +test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices( + (test_df.review.values, y_test) +).batch(config.BATCH_SIZE) + +# Prepare data for masked language model +x_all_review = encode(all_data.review.values) +x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels( + x_all_review +) + +mlm_ds = tf.data.Dataset.from_tensor_slices( + (x_masked_train, y_masked_labels, sample_weights) +) +mlm_ds = 
mlm_ds.shuffle(1000).batch(config.BATCH_SIZE) + +""" +## Create BERT model (Pretraining Model) for masked language modeling + +We will create a BERT-like pretraining model architecture +using the `MultiHeadAttention` layer. +It will take token ids as inputs (including masked tokens) +and it will predict the correct ids for the masked input tokens. +""" + + +def bert_module(query, key, value, i): + # Multi headed self-attention + attention_output = layers.MultiHeadAttention( + num_heads=config.NUM_HEAD, + key_dim=config.EMBED_DIM // config.NUM_HEAD, + name="encoder_{}/multiheadattention".format(i), + )(query, key, value) + attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))( + attention_output + ) + attention_output = layers.LayerNormalization( + epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i) + )(query + attention_output) + + # Feed-forward layer + ffn = keras.Sequential( + [ + layers.Dense(config.FF_DIM, activation="relu"), + layers.Dense(config.EMBED_DIM), + ], + name="encoder_{}/ffn".format(i), + ) + ffn_output = ffn(attention_output) + ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))( + ffn_output + ) + sequence_output = layers.LayerNormalization( + epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i) + )(attention_output + ffn_output) + return sequence_output + + +def get_pos_encoding_matrix(max_len, d_emb): + pos_enc = np.array( + [ + [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)] + if pos != 0 + else np.zeros(d_emb) + for pos in range(max_len) + ] + ) + pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i + pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1 + return pos_enc + + +loss_fn = keras.losses.SparseCategoricalCrossentropy( + reduction=tf.keras.losses.Reduction.NONE +) +loss_tracker = tf.keras.metrics.Mean(name="loss") + + +class MaskedLanguageModel(tf.keras.Model): + def train_step(self, inputs): + if len(inputs) == 3: + features, labels, sample_weight = inputs + else: + features, labels = inputs + sample_weight = None + + with tf.GradientTape() as tape: + predictions = self(features, training=True) + loss = loss_fn(labels, predictions, sample_weight=sample_weight) + + # Compute gradients + trainable_vars = self.trainable_variables + gradients = tape.gradient(loss, trainable_vars) + + # Update weights + self.optimizer.apply_gradients(zip(gradients, trainable_vars)) + + # Compute our own metrics + loss_tracker.update_state(loss, sample_weight=sample_weight) + + # Return a dict mapping metric names to current value + return {"loss": loss_tracker.result()} + + @property + def metrics(self): + # We list our `Metric` objects here so that `reset_states()` can be + # called automatically at the start of each epoch + # or at the start of `evaluate()`. + # If you don't implement this property, you have to call + # `reset_states()` yourself at the time of your choosing. 
+ return [loss_tracker] + + +def create_masked_language_bert_model(): + inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64) + + word_embeddings = layers.Embedding( + config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding" + )(inputs) + position_embeddings = layers.Embedding( + input_dim=config.MAX_LEN, + output_dim=config.EMBED_DIM, + weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)], + name="position_embedding", + )(tf.range(start=0, limit=config.MAX_LEN, delta=1)) + embeddings = word_embeddings + position_embeddings + + encoder_output = embeddings + for i in range(config.NUM_LAYERS): + encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i) + + mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")( + encoder_output + ) + mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model") + + optimizer = keras.optimizers.Adam(learning_rate=config.LR) + mlm_model.compile(optimizer=optimizer) + return mlm_model + + +id2token = dict(enumerate(vectorize_layer.get_vocabulary())) +token2id = {y: x for x, y in id2token.items()} + + +class MaskedTextGenerator(keras.callbacks.Callback): + def __init__(self, sample_tokens, top_k=5): + self.sample_tokens = sample_tokens + self.k = top_k + + def decode(self, tokens): + return " ".join([id2token[t] for t in tokens if t != 0]) + + def convert_ids_to_tokens(self, id): + return id2token[id] + + def on_epoch_end(self, epoch, logs=None): + prediction = self.model.predict(self.sample_tokens) + + masked_index = np.where(self.sample_tokens == mask_token_id) + masked_index = masked_index[1] + mask_prediction = prediction[0][masked_index] + + top_indices = mask_prediction[0].argsort()[-self.k :][::-1] + values = mask_prediction[0][top_indices] + + for i in range(len(top_indices)): + p = top_indices[i] + v = values[i] + tokens = np.copy(sample_tokens[0]) + tokens[masked_index[0]] = p + result = { + "input_text": self.decode(sample_tokens[0].numpy()), + "prediction": self.decode(tokens), + "probability": v, + "predicted mask token": self.convert_ids_to_tokens(p), + } + pprint(result) + + +sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"]) +generator_callback = MaskedTextGenerator(sample_tokens.numpy()) + +bert_masked_model = create_masked_language_bert_model() +bert_masked_model.summary() + +""" +## Train and Save +""" + +bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback]) +bert_masked_model.save("bert_mlm_imdb.h5") + +""" +## Fine-tune a sentiment classification model + +We will fine-tune our self-supervised model on a downstream task of sentiment classification. +To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the +pretrained BERT features. 
+ +""" + +# Load pretrained bert model +mlm_model = keras.models.load_model( + "bert_mlm_imdb.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel} +) +pretrained_bert_model = tf.keras.Model( + mlm_model.input, mlm_model.get_layer("encoder_0/ffn_layernormalization").output +) + +# Freeze it +pretrained_bert_model.trainable = False + + +def create_classifier_bert_model(): + inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64) + sequence_output = pretrained_bert_model(inputs) + pooled_output = layers.GlobalMaxPooling1D()(sequence_output) + hidden_layer = layers.Dense(64, activation="relu")(pooled_output) + outputs = layers.Dense(1, activation="sigmoid")(hidden_layer) + classifer_model = keras.Model(inputs, outputs, name="classification") + optimizer = keras.optimizers.Adam() + classifer_model.compile( + optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"] + ) + return classifer_model + + +classifer_model = create_classifier_bert_model() +classifer_model.summary() + +# Train the classifier with frozen BERT stage +classifer_model.fit( + train_classifier_ds, epochs=5, validation_data=test_classifier_ds, +) + +# Unfreeze the BERT model for fine-tuning +pretrained_bert_model.trainable = True +optimizer = keras.optimizers.Adam() +classifer_model.compile( + optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"] +) +classifer_model.fit( + train_classifier_ds, epochs=5, validation_data=test_classifier_ds, +) + +""" +## Create an end-to-end model and evaluate it + +When you want to deploy a model, it's best if it already includes its preprocessing +pipeline, so that you don't have to reimplement the preprocessing logic in your +production environment. Let's create an end-to-end model that incorporates +the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings +as input. +""" + + +def get_end_to_end(model): + inputs_string = keras.Input(shape=(1,), dtype="string") + indices = vectorize_layer(inputs_string) + outputs = model(indices) + end_to_end_model = keras.Model(inputs_string, outputs, name="end_to_end_model") + optimizer = keras.optimizers.Adam(learning_rate=config.LR) + end_to_end_model.compile( + optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"] + ) + return end_to_end_model + + +end_to_end_classification_model = get_end_to_end(classifer_model) +end_to_end_classification_model.evaluate(test_raw_classifier_ds)