diff --git a/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb b/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb
new file mode 100644
index 0000000000..f2b5f2f8ca
--- /dev/null
+++ b/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb
@@ -0,0 +1,672 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "# End-to-end Masked Language Modeling with BERT\n",
+ "\n",
+    "**Author:** [Ankur Singh](https://twitter.com/ankur310794)<br>\n",
+    "**Date created:** 2020/09/18<br>\n",
+    "**Last modified:** 2020/09/18<br>\n",
+ "**Description:** Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "Masked Language Modeling is a fill-in-the-blank task,\n",
+ "where a model uses the context words surrounding a mask token to try to predict what the\n",
+ "masked word should be.\n",
+ "\n",
+ "For an input that contains one or more mask tokens,\n",
+ "the model will generate the most likely substitution for each.\n",
+ "\n",
+ "Example:\n",
+ "\n",
+ "- Input: \"I have watched this [MASK] and it was awesome.\"\n",
+ "- Output: \"I have watched this movie and it was awesome.\"\n",
+ "\n",
+ "Masked language modeling is a great way to train a language\n",
+ "model in a self-supervised setting (without human-annotated labels).\n",
+ "Such a model can then be fine-tuned to accomplish various supervised\n",
+ "NLP tasks.\n",
+ "\n",
+ "This example teaches you how to build a BERT model from scratch,\n",
+ "train it with the masked language modeling task,\n",
+ "and then fine-tune this model on a sentiment classification task.\n",
+ "\n",
+ "We will use the Keras `TextVectorization` and `MultiHeadAttention` layers\n",
+ "to create a BERT Transformer-Encoder network architecture.\n",
+ "\n",
+ "Note: This example should be run with `tf-nightly`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Setup\n",
+ "\n",
+ "Install `tf-nightly` via `pip install tf-nightly`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "from tensorflow import keras\n",
+ "from tensorflow.keras import layers\n",
+ "from tensorflow.keras.layers.experimental.preprocessing import TextVectorization\n",
+ "from dataclasses import dataclass\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import glob\n",
+ "import re\n",
+ "from pprint import pprint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Set-up Configuration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "\n",
+ "@dataclass\n",
+ "class Config:\n",
+ " MAX_LEN = 256\n",
+ " BATCH_SIZE = 32\n",
+ " LR = 0.001\n",
+ " VOCAB_SIZE = 30000\n",
+ " EMBED_DIM = 128\n",
+ " NUM_HEAD = 8 # used in bert model\n",
+ " FF_DIM = 128 # used in bert model\n",
+ " NUM_LAYERS = 1\n",
+ "\n",
+ "\n",
+ "config = Config()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Load the data\n",
+ "\n",
+    "We will first download the IMDB data and load it into a Pandas dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
+ "!tar -xf aclImdb_v1.tar.gz"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "\n",
+ "def get_text_list_from_files(files):\n",
+ " text_list = []\n",
+ " for name in files:\n",
+ " with open(name) as f:\n",
+ " for line in f:\n",
+ " text_list.append(line)\n",
+ " return text_list\n",
+ "\n",
+ "\n",
+ "def get_data_from_text_files(folder_name):\n",
+ "\n",
+ " pos_files = glob.glob(\"aclImdb/\" + folder_name + \"/pos/*.txt\")\n",
+ " pos_texts = get_text_list_from_files(pos_files)\n",
+ " neg_files = glob.glob(\"aclImdb/\" + folder_name + \"/neg/*.txt\")\n",
+ " neg_texts = get_text_list_from_files(neg_files)\n",
+ " df = pd.DataFrame(\n",
+ " {\n",
+ " \"review\": pos_texts + neg_texts,\n",
+ " \"sentiment\": [0] * len(pos_texts) + [1] * len(neg_texts),\n",
+ " }\n",
+ " )\n",
+ " df = df.sample(len(df)).reset_index(drop=True)\n",
+ " return df\n",
+ "\n",
+ "\n",
+ "train_df = get_data_from_text_files(\"train\")\n",
+ "test_df = get_data_from_text_files(\"test\")\n",
+ "\n",
+ "all_data = train_df.append(test_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Dataset preparation\n",
+ "\n",
+ "We will use the `TextVectorization` layer to vectorize the text into integer token ids.\n",
+ "It transforms a batch of strings into either\n",
+ "a sequence of token indices (one sample = 1D array of integer token indices, in order)\n",
+ "or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).\n",
+ "\n",
+ "Below, we define 3 preprocessing functions.\n",
+ "\n",
+ "1. The `get_vectorize_layer` function builds the `TextVectorization` layer.\n",
+ "2. The `encode` function encodes raw text into integer token ids.\n",
+ "3. The `get_masked_input_and_labels` function will mask input token ids.\n",
+ "It masks 15% of all input tokens in each sequence at random."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "\n",
+ "def custom_standardization(input_data):\n",
+ " lowercase = tf.strings.lower(input_data)\n",
+    "    stripped_html = tf.strings.regex_replace(lowercase, \"<br />\", \" \")\n",
+ " return tf.strings.regex_replace(\n",
+ " stripped_html, \"[%s]\" % re.escape(\"!#$%&'()*+,-./:;<=>?@\\^_`{|}~\"), \"\"\n",
+ " )\n",
+ "\n",
+ "\n",
+ "def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=[\"[MASK]\"]):\n",
+ " \"\"\"Build Text vectorization layer\n",
+ "\n",
+ " Args:\n",
+    "        texts (list): List of strings, i.e. the input texts\n",
+    "        vocab_size (int): Vocabulary size\n",
+    "        max_seq (int): Maximum sequence length.\n",
+ " special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].\n",
+ "\n",
+ " Returns:\n",
+ " layers.Layer: Return TextVectorization Keras Layer\n",
+ " \"\"\"\n",
+ " vectorize_layer = TextVectorization(\n",
+ " max_tokens=vocab_size,\n",
+ " output_mode=\"int\",\n",
+ " standardize=custom_standardization,\n",
+ " output_sequence_length=max_seq,\n",
+ " )\n",
+ " vectorize_layer.adapt(texts)\n",
+ "\n",
+ " # Insert mask token in vocabulary\n",
+ " vocab = vectorize_layer.get_vocabulary()\n",
+ " vocab = vocab[2 : vocab_size - len(special_tokens)] + [\"[mask]\"]\n",
+ " vectorize_layer.set_vocabulary(vocab)\n",
+ " return vectorize_layer\n",
+ "\n",
+ "\n",
+ "vectorize_layer = get_vectorize_layer(\n",
+ " all_data.review.values.tolist(),\n",
+ " config.VOCAB_SIZE,\n",
+ " config.MAX_LEN,\n",
+ " special_tokens=[\"[mask]\"],\n",
+ ")\n",
+ "\n",
+ "# Get mask token id for masked language model\n",
+ "mask_token_id = vectorize_layer([\"[mask]\"]).numpy()[0][0]\n",
+ "\n",
+ "\n",
+ "def encode(texts):\n",
+ " encoded_texts = vectorize_layer(texts)\n",
+ " return encoded_texts.numpy()\n",
+ "\n",
+ "\n",
+ "def get_masked_input_and_labels(encoded_texts):\n",
+ " # 15% BERT masking\n",
+ " inp_mask = np.random.rand(*encoded_texts.shape) < 0.15\n",
+ " # Do not mask special tokens\n",
+ " inp_mask[encoded_texts <= 2] = False\n",
+    "    # Set targets to -1 by default; -1 means ignore\n",
+ " labels = -1 * np.ones(encoded_texts.shape, dtype=int)\n",
+ " # Set labels for masked tokens\n",
+ " labels[inp_mask] = encoded_texts[inp_mask]\n",
+ "\n",
+ " # Prepare input\n",
+ " encoded_texts_masked = np.copy(encoded_texts)\n",
+    "    # Of the selected tokens, set 90% to the [MASK] token; 1/9 of those are later\n",
+    "    # replaced by a random id, giving the usual 80% [MASK] / 10% random / 10% unchanged split\n",
+ " inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)\n",
+ " encoded_texts_masked[\n",
+ " inp_mask_2mask\n",
+ " ] = mask_token_id # mask token is the last in the dict\n",
+ "\n",
+ " # Set 10% to a random token\n",
+ " inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)\n",
+ " encoded_texts_masked[inp_mask_2random] = np.random.randint(\n",
+ " 3, mask_token_id, inp_mask_2random.sum()\n",
+ " )\n",
+ "\n",
+ " # Prepare sample_weights to pass to .fit() method\n",
+ " sample_weights = np.ones(labels.shape)\n",
+ " sample_weights[labels == -1] = 0\n",
+ "\n",
+ " # y_labels would be same as encoded_texts i.e input tokens\n",
+ " y_labels = np.copy(encoded_texts)\n",
+ "\n",
+ " return encoded_texts_masked, y_labels, sample_weights\n",
+ "\n",
+ "\n",
+ "# We have 25000 examples for training\n",
+ "x_train = encode(train_df.review.values) # encode reviews with vectorizer\n",
+ "y_train = train_df.sentiment.values\n",
+ "train_classifier_ds = (\n",
+ " tf.data.Dataset.from_tensor_slices((x_train, y_train))\n",
+ " .shuffle(1000)\n",
+ " .batch(config.BATCH_SIZE)\n",
+ ")\n",
+ "\n",
+ "# We have 25000 examples for testing\n",
+ "x_test = encode(test_df.review.values)\n",
+ "y_test = test_df.sentiment.values\n",
+ "test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(\n",
+ " config.BATCH_SIZE\n",
+ ")\n",
+ "\n",
+ "# Build dataset for end to end model input (will be used at the end)\n",
+ "test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices(\n",
+ " (test_df.review.values, y_test)\n",
+ ").batch(config.BATCH_SIZE)\n",
+ "\n",
+ "# Prepare data for masked language model\n",
+ "x_all_review = encode(all_data.review.values)\n",
+ "x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(\n",
+ " x_all_review\n",
+ ")\n",
+ "\n",
+ "mlm_ds = tf.data.Dataset.from_tensor_slices(\n",
+ " (x_masked_train, y_masked_labels, sample_weights)\n",
+ ")\n",
+ "mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Create BERT model (Pretraining Model) for masked language modeling\n",
+ "\n",
+ "We will create a BERT-like pretraining model architecture\n",
+ "using the `MultiHeadAttention` layer.\n",
+ "It will take token ids as inputs (including masked tokens)\n",
+ "and it will predict the correct ids for the masked input tokens."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "\n",
+ "def bert_module(query, key, value, i):\n",
+ " # Multi headed self-attention\n",
+ " attention_output = layers.MultiHeadAttention(\n",
+ " num_heads=config.NUM_HEAD,\n",
+ " key_dim=config.EMBED_DIM // config.NUM_HEAD,\n",
+ " name=\"encoder_{}/multiheadattention\".format(i),\n",
+ " )(query, key, value)\n",
+ " attention_output = layers.Dropout(0.1, name=\"encoder_{}/att_dropout\".format(i))(\n",
+ " attention_output\n",
+ " )\n",
+ " attention_output = layers.LayerNormalization(\n",
+ " epsilon=1e-6, name=\"encoder_{}/att_layernormalization\".format(i)\n",
+ " )(query + attention_output)\n",
+ "\n",
+ " # Feed-forward layer\n",
+ " ffn = keras.Sequential(\n",
+ " [\n",
+ " layers.Dense(config.FF_DIM, activation=\"relu\"),\n",
+ " layers.Dense(config.EMBED_DIM),\n",
+ " ],\n",
+ " name=\"encoder_{}/ffn\".format(i),\n",
+ " )\n",
+ " ffn_output = ffn(attention_output)\n",
+ " ffn_output = layers.Dropout(0.1, name=\"encoder_{}/ffn_dropout\".format(i))(\n",
+ " ffn_output\n",
+ " )\n",
+ " sequence_output = layers.LayerNormalization(\n",
+ " epsilon=1e-6, name=\"encoder_{}/ffn_layernormalization\".format(i)\n",
+ " )(attention_output + ffn_output)\n",
+ " return sequence_output\n",
+ "\n",
+ "\n",
+ "def get_pos_encoding_matrix(max_len, d_emb):\n",
+ " pos_enc = np.array(\n",
+ " [\n",
+ " [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]\n",
+ " if pos != 0\n",
+ " else np.zeros(d_emb)\n",
+ " for pos in range(max_len)\n",
+ " ]\n",
+ " )\n",
+ " pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i\n",
+ " pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1\n",
+ " return pos_enc\n",
+ "\n",
+ "\n",
+ "loss_fn = keras.losses.SparseCategoricalCrossentropy(\n",
+ " reduction=tf.keras.losses.Reduction.NONE\n",
+ ")\n",
+ "loss_tracker = tf.keras.metrics.Mean(name=\"loss\")\n",
+ "\n",
+ "\n",
+ "class MaskedLanguageModel(tf.keras.Model):\n",
+ " def train_step(self, inputs):\n",
+ " if len(inputs) == 3:\n",
+ " features, labels, sample_weight = inputs\n",
+ " else:\n",
+ " features, labels = inputs\n",
+ " sample_weight = None\n",
+ "\n",
+ " with tf.GradientTape() as tape:\n",
+ " predictions = self(features, training=True)\n",
+ " loss = loss_fn(labels, predictions, sample_weight=sample_weight)\n",
+ "\n",
+ " # Compute gradients\n",
+ " trainable_vars = self.trainable_variables\n",
+ " gradients = tape.gradient(loss, trainable_vars)\n",
+ "\n",
+ " # Update weights\n",
+ " self.optimizer.apply_gradients(zip(gradients, trainable_vars))\n",
+ "\n",
+ " # Compute our own metrics\n",
+ " loss_tracker.update_state(loss, sample_weight=sample_weight)\n",
+ "\n",
+ " # Return a dict mapping metric names to current value\n",
+ " return {\"loss\": loss_tracker.result()}\n",
+ "\n",
+ " @property\n",
+ " def metrics(self):\n",
+ " # We list our `Metric` objects here so that `reset_states()` can be\n",
+ " # called automatically at the start of each epoch\n",
+ " # or at the start of `evaluate()`.\n",
+ " # If you don't implement this property, you have to call\n",
+ " # `reset_states()` yourself at the time of your choosing.\n",
+ " return [loss_tracker]\n",
+ "\n",
+ "\n",
+ "def create_masked_language_bert_model():\n",
+ " inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)\n",
+ "\n",
+ " word_embeddings = layers.Embedding(\n",
+ " config.VOCAB_SIZE, config.EMBED_DIM, name=\"word_embedding\"\n",
+ " )(inputs)\n",
+ " position_embeddings = layers.Embedding(\n",
+ " input_dim=config.MAX_LEN,\n",
+ " output_dim=config.EMBED_DIM,\n",
+ " weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],\n",
+ " name=\"position_embedding\",\n",
+ " )(tf.range(start=0, limit=config.MAX_LEN, delta=1))\n",
+ " embeddings = word_embeddings + position_embeddings\n",
+ "\n",
+ " encoder_output = embeddings\n",
+ " for i in range(config.NUM_LAYERS):\n",
+ " encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)\n",
+ "\n",
+ " mlm_output = layers.Dense(config.VOCAB_SIZE, name=\"mlm_cls\", activation=\"softmax\")(\n",
+ " encoder_output\n",
+ " )\n",
+ " mlm_model = MaskedLanguageModel(inputs, mlm_output, name=\"masked_bert_model\")\n",
+ "\n",
+ " optimizer = keras.optimizers.Adam(learning_rate=config.LR)\n",
+ " mlm_model.compile(optimizer=optimizer)\n",
+ " return mlm_model\n",
+ "\n",
+ "\n",
+ "id2token = dict(enumerate(vectorize_layer.get_vocabulary()))\n",
+ "token2id = {y: x for x, y in id2token.items()}\n",
+ "\n",
+ "\n",
+ "class MaskedTextGenerator(keras.callbacks.Callback):\n",
+ " def __init__(self, sample_tokens, top_k=5):\n",
+ " self.sample_tokens = sample_tokens\n",
+ " self.k = top_k\n",
+ "\n",
+ " def decode(self, tokens):\n",
+ " return \" \".join([id2token[t] for t in tokens if t != 0])\n",
+ "\n",
+ " def convert_ids_to_tokens(self, id):\n",
+ " return id2token[id]\n",
+ "\n",
+ " def on_epoch_end(self, epoch, logs=None):\n",
+ " prediction = self.model.predict(self.sample_tokens)\n",
+ "\n",
+ " masked_index = np.where(self.sample_tokens == mask_token_id)\n",
+ " masked_index = masked_index[1]\n",
+ " mask_prediction = prediction[0][masked_index]\n",
+ "\n",
+ " top_indices = mask_prediction[0].argsort()[-self.k :][::-1]\n",
+ " values = mask_prediction[0][top_indices]\n",
+ "\n",
+ " for i in range(len(top_indices)):\n",
+ " p = top_indices[i]\n",
+ " v = values[i]\n",
+ " tokens = np.copy(sample_tokens[0])\n",
+ " tokens[masked_index[0]] = p\n",
+ " result = {\n",
+ " \"input_text\": self.decode(sample_tokens[0].numpy()),\n",
+ " \"prediction\": self.decode(tokens),\n",
+ " \"probability\": v,\n",
+ " \"predicted mask token\": self.convert_ids_to_tokens(p),\n",
+ " }\n",
+ " pprint(result)\n",
+ "\n",
+ "\n",
+ "sample_tokens = vectorize_layer([\"I have watched this [mask] and it was awesome\"])\n",
+ "generator_callback = MaskedTextGenerator(sample_tokens.numpy())\n",
+ "\n",
+ "bert_masked_model = create_masked_language_bert_model()\n",
+ "bert_masked_model.summary()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Train and Save"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback])\n",
+ "bert_masked_model.save(\"bert_mlm_imdb.h5\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Fine-tune a sentiment classification model\n",
+ "\n",
+ "We will fine-tune our self-supervised model on a downstream task of sentiment classification.\n",
+ "To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the\n",
+ "pretrained BERT features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "# Load pretrained bert model\n",
+ "mlm_model = keras.models.load_model(\n",
+ " \"bert_mlm_imdb.h5\", custom_objects={\"MaskedLanguageModel\": MaskedLanguageModel}\n",
+ ")\n",
+ "pretrained_bert_model = tf.keras.Model(\n",
+ " mlm_model.input, mlm_model.get_layer(\"encoder_0/ffn_layernormalization\").output\n",
+ ")\n",
+ "\n",
+ "# Freeze it\n",
+ "pretrained_bert_model.trainable = False\n",
+ "\n",
+ "\n",
+ "def create_classifier_bert_model():\n",
+ " inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)\n",
+ " sequence_output = pretrained_bert_model(inputs)\n",
+ " pooled_output = layers.GlobalMaxPooling1D()(sequence_output)\n",
+ " hidden_layer = layers.Dense(64, activation=\"relu\")(pooled_output)\n",
+ " outputs = layers.Dense(1, activation=\"sigmoid\")(hidden_layer)\n",
+ " classifer_model = keras.Model(inputs, outputs, name=\"classification\")\n",
+ " optimizer = keras.optimizers.Adam()\n",
+ " classifer_model.compile(\n",
+ " optimizer=optimizer, loss=\"binary_crossentropy\", metrics=[\"accuracy\"]\n",
+ " )\n",
+ " return classifer_model\n",
+ "\n",
+ "\n",
+ "classifer_model = create_classifier_bert_model()\n",
+ "classifer_model.summary()\n",
+ "\n",
+ "# Train the classifier with frozen BERT stage\n",
+ "classifer_model.fit(\n",
+ " train_classifier_ds,\n",
+ " epochs=5,\n",
+ " validation_data=test_classifier_ds,\n",
+ ")\n",
+ "\n",
+ "# Unfreeze the BERT model for fine-tuning\n",
+ "pretrained_bert_model.trainable = True\n",
+ "optimizer = keras.optimizers.Adam()\n",
+ "classifer_model.compile(\n",
+ " optimizer=optimizer, loss=\"binary_crossentropy\", metrics=[\"accuracy\"]\n",
+ ")\n",
+ "classifer_model.fit(\n",
+ " train_classifier_ds,\n",
+ " epochs=5,\n",
+ " validation_data=test_classifier_ds,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text"
+ },
+ "source": [
+ "## Create an end-to-end model and evaluate it\n",
+ "\n",
+ "When you want to deploy a model, it's best if it already includes its preprocessing\n",
+ "pipeline, so that you don't have to reimplement the preprocessing logic in your\n",
+ "production environment. Let's create an end-to-end model that incorporates\n",
+ "the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings\n",
+ "as input."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab_type": "code"
+ },
+ "outputs": [],
+ "source": [
+ "\n",
+ "def get_end_to_end(model):\n",
+ " inputs_string = keras.Input(shape=(1,), dtype=\"string\")\n",
+ " indices = vectorize_layer(inputs_string)\n",
+ " outputs = model(indices)\n",
+ " end_to_end_model = keras.Model(inputs_string, outputs, name=\"end_to_end_model\")\n",
+ " optimizer = keras.optimizers.Adam(learning_rate=config.LR)\n",
+ " end_to_end_model.compile(\n",
+ " optimizer=optimizer, loss=\"binary_crossentropy\", metrics=[\"accuracy\"]\n",
+ " )\n",
+ " return end_to_end_model\n",
+ "\n",
+ "\n",
+ "end_to_end_classification_model = get_end_to_end(classifer_model)\n",
+ "end_to_end_classification_model.evaluate(test_raw_classifier_ds)"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [],
+ "name": "mlm_and_finetune_with_bert",
+ "private_outputs": false,
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file
diff --git a/examples/nlp/md/mlm_and_finetune_with_bert.md b/examples/nlp/md/mlm_and_finetune_with_bert.md
new file mode 100644
index 0000000000..bb8e302951
--- /dev/null
+++ b/examples/nlp/md/mlm_and_finetune_with_bert.md
@@ -0,0 +1,704 @@
+
+# End-to-end Masked Language Modeling with BERT
+
+**Author:** [Ankur Singh](https://twitter.com/ankur310794)<br>
+**Date created:** 2020/09/18<br>
+**Last modified:** 2020/09/18<br>
+
+
+ [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/mlm_and_finetune_with_bert.ipynb) • [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/examples/nlp/mlm_and_finetune_with_bert.py)
+
+
+**Description:** Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset.
+
+---
+## Introduction
+
+Masked Language Modeling is a fill-in-the-blank task,
+where a model uses the context words surrounding a mask token to try to predict what the
+masked word should be.
+
+For an input that contains one or more mask tokens,
+the model will generate the most likely substitution for each.
+
+Example:
+
+- Input: "I have watched this [MASK] and it was awesome."
+- Output: "I have watched this movie and it was awesome."
+
+Masked language modeling is a great way to train a language
+model in a self-supervised setting (without human-annotated labels).
+Such a model can then be fine-tuned to accomplish various supervised
+NLP tasks.
+
+This example teaches you how to build a BERT model from scratch,
+train it with the masked language modeling task,
+and then fine-tune this model on a sentiment classification task.
+
+We will use the Keras `TextVectorization` and `MultiHeadAttention` layers
+to create a BERT Transformer-Encoder network architecture.
+
+Note: This example should be run with `tf-nightly`.
+
+---
+## Setup
+
+Install `tf-nightly` via `pip install tf-nightly`.
+
+
+```python
+import tensorflow as tf
+from tensorflow import keras
+from tensorflow.keras import layers
+from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
+from dataclasses import dataclass
+import pandas as pd
+import numpy as np
+import glob
+import re
+from pprint import pprint
+```
+
+---
+## Set-up Configuration
+
+
+```python
+
+@dataclass
+class Config:
+ MAX_LEN = 256
+ BATCH_SIZE = 32
+ LR = 0.001
+ VOCAB_SIZE = 30000
+ EMBED_DIM = 128
+ NUM_HEAD = 8 # used in bert model
+ FF_DIM = 128 # used in bert model
+ NUM_LAYERS = 1
+
+
+config = Config()
+```
+
+---
+## Load the data
+
+We will first download the IMDB data and load it into a Pandas dataframe.
+
+
+```python
+!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
+!tar -xf aclImdb_v1.tar.gz
+```
+
+```python
+
+def get_text_list_from_files(files):
+ text_list = []
+ for name in files:
+ with open(name) as f:
+ for line in f:
+ text_list.append(line)
+ return text_list
+
+
+def get_data_from_text_files(folder_name):
+
+ pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt")
+ pos_texts = get_text_list_from_files(pos_files)
+ neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt")
+ neg_texts = get_text_list_from_files(neg_files)
+ df = pd.DataFrame(
+ {
+ "review": pos_texts + neg_texts,
+ "sentiment": [0] * len(pos_texts) + [1] * len(neg_texts),
+ }
+ )
+ df = df.sample(len(df)).reset_index(drop=True)
+ return df
+
+
+train_df = get_data_from_text_files("train")
+test_df = get_data_from_text_files("test")
+
+all_data = train_df.append(test_df)
+```
+
+```
+ % Total % Received % Xferd Average Speed Time Time Time Current
+ Dload Upload Total Spent Left Speed
+100 80.2M 100 80.2M 0 0 45.3M 0 0:00:01 0:00:01 --:--:-- 45.3M
+
+```
+
+---
+## Dataset preparation
+
+We will use the `TextVectorization` layer to vectorize the text into integer token ids.
+It transforms a batch of strings into either
+a sequence of token indices (one sample = 1D array of integer token indices, in order)
+or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).
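+
+As a quick illustration of the "int" mode we use below, here is a minimal sketch
+(relying on the `TextVectorization` import from the setup above; the toy sentences and
+the printed ids are made up for illustration):
+
+```python
+toy_vectorizer = TextVectorization(
+    max_tokens=10, output_mode="int", output_sequence_length=6
+)
+toy_vectorizer.adapt(["the movie was great", "the movie was bad"])
+# Each sample becomes an ordered sequence of integer token ids, zero-padded to length 6,
+# e.g. something like [[2 3 4 5 0 0]] depending on the learned vocabulary.
+print(toy_vectorizer(["the movie was great"]))
+```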
+
+Below, we define 3 preprocessing functions.
+
+1. The `get_vectorize_layer` function builds the `TextVectorization` layer.
+2. The `encode` function encodes raw text into integer token ids.
+3. The `get_masked_input_and_labels` function will mask input token ids.
+It masks 15% of all input tokens in each sequence at random.
+
+
+```python
+
+def custom_standardization(input_data):
+ lowercase = tf.strings.lower(input_data)
+    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
+ return tf.strings.regex_replace(
+ stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@\^_`{|}~"), ""
+ )
+
+
+def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
+ """Build Text vectorization layer
+
+ Args:
+        texts (list): List of strings, i.e. the input texts
+        vocab_size (int): Vocabulary size
+        max_seq (int): Maximum sequence length.
+ special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].
+
+ Returns:
+ layers.Layer: Return TextVectorization Keras Layer
+ """
+ vectorize_layer = TextVectorization(
+ max_tokens=vocab_size,
+ output_mode="int",
+ standardize=custom_standardization,
+ output_sequence_length=max_seq,
+ )
+ vectorize_layer.adapt(texts)
+
+ # Insert mask token in vocabulary
+ vocab = vectorize_layer.get_vocabulary()
+ vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
+ vectorize_layer.set_vocabulary(vocab)
+ return vectorize_layer
+
+
+vectorize_layer = get_vectorize_layer(
+ all_data.review.values.tolist(),
+ config.VOCAB_SIZE,
+ config.MAX_LEN,
+ special_tokens=["[mask]"],
+)
+
+# Get mask token id for masked language model
+mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0]
+
+
+def encode(texts):
+ encoded_texts = vectorize_layer(texts)
+ return encoded_texts.numpy()
+
+
+def get_masked_input_and_labels(encoded_texts):
+ # 15% BERT masking
+ inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
+ # Do not mask special tokens
+ inp_mask[encoded_texts <= 2] = False
+    # Set targets to -1 by default; -1 means ignore
+ labels = -1 * np.ones(encoded_texts.shape, dtype=int)
+ # Set labels for masked tokens
+ labels[inp_mask] = encoded_texts[inp_mask]
+
+ # Prepare input
+ encoded_texts_masked = np.copy(encoded_texts)
+    # Of the selected tokens, set 90% to the [MASK] token; 1/9 of those are later
+    # replaced by a random id, giving the usual 80% [MASK] / 10% random / 10% unchanged split
+ inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
+ encoded_texts_masked[
+ inp_mask_2mask
+ ] = mask_token_id # mask token is the last in the dict
+
+ # Set 10% to a random token
+ inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
+ encoded_texts_masked[inp_mask_2random] = np.random.randint(
+ 3, mask_token_id, inp_mask_2random.sum()
+ )
+
+ # Prepare sample_weights to pass to .fit() method
+ sample_weights = np.ones(labels.shape)
+ sample_weights[labels == -1] = 0
+
+ # y_labels would be same as encoded_texts i.e input tokens
+ y_labels = np.copy(encoded_texts)
+
+ return encoded_texts_masked, y_labels, sample_weights
+
+
+# We have 25000 examples for training
+x_train = encode(train_df.review.values) # encode reviews with vectorizer
+y_train = train_df.sentiment.values
+train_classifier_ds = (
+ tf.data.Dataset.from_tensor_slices((x_train, y_train))
+ .shuffle(1000)
+ .batch(config.BATCH_SIZE)
+)
+
+# We have 25000 examples for testing
+x_test = encode(test_df.review.values)
+y_test = test_df.sentiment.values
+test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(
+ config.BATCH_SIZE
+)
+
+# Build dataset for end to end model input (will be used at the end)
+test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices(
+ (test_df.review.values, y_test)
+).batch(config.BATCH_SIZE)
+
+# Prepare data for masked language model
+x_all_review = encode(all_data.review.values)
+x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(
+ x_all_review
+)
+
+mlm_ds = tf.data.Dataset.from_tensor_slices(
+ (x_masked_train, y_masked_labels, sample_weights)
+)
+mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)
+```
+
+---
+## Create BERT model (Pretraining Model) for masked language modeling
+
+We will create a BERT-like pretraining model architecture
+using the `MultiHeadAttention` layer.
+It will take token ids as inputs (including masked tokens)
+and it will predict the correct ids for the masked input tokens.
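+
+Before assembling the full model, here is a short standalone sketch of the self-attention
+call that `bert_module` below is built around (shapes only; the tensors here are random
+placeholders, not part of the pipeline above):
+
+```python
+x = tf.random.uniform((2, config.MAX_LEN, config.EMBED_DIM))  # (batch, sequence, embedding)
+mha = layers.MultiHeadAttention(
+    num_heads=config.NUM_HEAD, key_dim=config.EMBED_DIM // config.NUM_HEAD
+)
+attn_out = mha(query=x, value=x, key=x)  # self-attention: query, key and value are the same tensor
+print(attn_out.shape)  # (2, 256, 128): same shape as the input, which lets us stack encoder blocks
+```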
+
+
+```python
+
+def bert_module(query, key, value, i):
+ # Multi headed self-attention
+ attention_output = layers.MultiHeadAttention(
+ num_heads=config.NUM_HEAD,
+ key_dim=config.EMBED_DIM // config.NUM_HEAD,
+ name="encoder_{}/multiheadattention".format(i),
+ )(query, key, value)
+ attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))(
+ attention_output
+ )
+ attention_output = layers.LayerNormalization(
+ epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i)
+ )(query + attention_output)
+
+ # Feed-forward layer
+ ffn = keras.Sequential(
+ [
+ layers.Dense(config.FF_DIM, activation="relu"),
+ layers.Dense(config.EMBED_DIM),
+ ],
+ name="encoder_{}/ffn".format(i),
+ )
+ ffn_output = ffn(attention_output)
+ ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))(
+ ffn_output
+ )
+ sequence_output = layers.LayerNormalization(
+ epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i)
+ )(attention_output + ffn_output)
+ return sequence_output
+
+
+def get_pos_encoding_matrix(max_len, d_emb):
+ pos_enc = np.array(
+ [
+ [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]
+ if pos != 0
+ else np.zeros(d_emb)
+ for pos in range(max_len)
+ ]
+ )
+ pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i
+ pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1
+ return pos_enc
+
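+# Note: `get_pos_encoding_matrix` above implements the sinusoidal encoding from
+# "Attention Is All You Need": PE(pos, 2i) = sin(pos / 10000^(2i / d_emb)) and
+# PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_emb)), with the row for pos = 0 left as zeros.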
+
+loss_fn = keras.losses.SparseCategoricalCrossentropy(
+ reduction=tf.keras.losses.Reduction.NONE
+)
+loss_tracker = tf.keras.metrics.Mean(name="loss")
+
+
+class MaskedLanguageModel(tf.keras.Model):
+ def train_step(self, inputs):
+ if len(inputs) == 3:
+ features, labels, sample_weight = inputs
+ else:
+ features, labels = inputs
+ sample_weight = None
+
+ with tf.GradientTape() as tape:
+ predictions = self(features, training=True)
+ loss = loss_fn(labels, predictions, sample_weight=sample_weight)
+
+ # Compute gradients
+ trainable_vars = self.trainable_variables
+ gradients = tape.gradient(loss, trainable_vars)
+
+ # Update weights
+ self.optimizer.apply_gradients(zip(gradients, trainable_vars))
+
+ # Compute our own metrics
+ loss_tracker.update_state(loss, sample_weight=sample_weight)
+
+ # Return a dict mapping metric names to current value
+ return {"loss": loss_tracker.result()}
+
+ @property
+ def metrics(self):
+ # We list our `Metric` objects here so that `reset_states()` can be
+ # called automatically at the start of each epoch
+ # or at the start of `evaluate()`.
+ # If you don't implement this property, you have to call
+ # `reset_states()` yourself at the time of your choosing.
+ return [loss_tracker]
+
+
+def create_masked_language_bert_model():
+ inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)
+
+ word_embeddings = layers.Embedding(
+ config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
+ )(inputs)
+ position_embeddings = layers.Embedding(
+ input_dim=config.MAX_LEN,
+ output_dim=config.EMBED_DIM,
+ weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],
+ name="position_embedding",
+ )(tf.range(start=0, limit=config.MAX_LEN, delta=1))
+ embeddings = word_embeddings + position_embeddings
+
+ encoder_output = embeddings
+ for i in range(config.NUM_LAYERS):
+ encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)
+
+ mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
+ encoder_output
+ )
+ mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")
+
+ optimizer = keras.optimizers.Adam(learning_rate=config.LR)
+ mlm_model.compile(optimizer=optimizer)
+ return mlm_model
+
+
+id2token = dict(enumerate(vectorize_layer.get_vocabulary()))
+token2id = {y: x for x, y in id2token.items()}
+
+
+class MaskedTextGenerator(keras.callbacks.Callback):
+ def __init__(self, sample_tokens, top_k=5):
+ self.sample_tokens = sample_tokens
+ self.k = top_k
+
+ def decode(self, tokens):
+ return " ".join([id2token[t] for t in tokens if t != 0])
+
+ def convert_ids_to_tokens(self, id):
+ return id2token[id]
+
+ def on_epoch_end(self, epoch, logs=None):
+ prediction = self.model.predict(self.sample_tokens)
+
+ masked_index = np.where(self.sample_tokens == mask_token_id)
+ masked_index = masked_index[1]
+ mask_prediction = prediction[0][masked_index]
+
+ top_indices = mask_prediction[0].argsort()[-self.k :][::-1]
+ values = mask_prediction[0][top_indices]
+
+ for i in range(len(top_indices)):
+ p = top_indices[i]
+ v = values[i]
+ tokens = np.copy(sample_tokens[0])
+ tokens[masked_index[0]] = p
+ result = {
+ "input_text": self.decode(sample_tokens[0].numpy()),
+ "prediction": self.decode(tokens),
+ "probability": v,
+ "predicted mask token": self.convert_ids_to_tokens(p),
+ }
+ pprint(result)
+
+
+sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])
+generator_callback = MaskedTextGenerator(sample_tokens.numpy())
+
+bert_masked_model = create_masked_language_bert_model()
+bert_masked_model.summary()
+```
+
+
+```
+Model: "masked_bert_model"
+__________________________________________________________________________________________________
+Layer (type) Output Shape Param # Connected to
+==================================================================================================
+input_1 (InputLayer) [(None, 256)] 0
+__________________________________________________________________________________________________
+word_embedding (Embedding) (None, 256, 128) 3840000 input_1[0][0]
+__________________________________________________________________________________________________
+tf.__operators__.add (TFOpLambd (None, 256, 128) 0 word_embedding[0][0]
+__________________________________________________________________________________________________
+encoder_0/multiheadattention (M (None, 256, 128) 66048 tf.__operators__.add[0][0]
+ tf.__operators__.add[0][0]
+ tf.__operators__.add[0][0]
+__________________________________________________________________________________________________
+encoder_0/att_dropout (Dropout) (None, 256, 128) 0 encoder_0/multiheadattention[0][0
+__________________________________________________________________________________________________
+tf.__operators__.add_1 (TFOpLam (None, 256, 128) 0 tf.__operators__.add[0][0]
+ encoder_0/att_dropout[0][0]
+__________________________________________________________________________________________________
+encoder_0/att_layernormalizatio (None, 256, 128) 256 tf.__operators__.add_1[0][0]
+__________________________________________________________________________________________________
+encoder_0/ffn (Sequential) (None, 256, 128) 33024 encoder_0/att_layernormalization[
+__________________________________________________________________________________________________
+encoder_0/ffn_dropout (Dropout) (None, 256, 128) 0 encoder_0/ffn[0][0]
+__________________________________________________________________________________________________
+tf.__operators__.add_2 (TFOpLam (None, 256, 128) 0 encoder_0/att_layernormalization[
+ encoder_0/ffn_dropout[0][0]
+__________________________________________________________________________________________________
+encoder_0/ffn_layernormalizatio (None, 256, 128) 256 tf.__operators__.add_2[0][0]
+__________________________________________________________________________________________________
+mlm_cls (Dense) (None, 256, 30000) 3870000 encoder_0/ffn_layernormalization[
+==================================================================================================
+Total params: 7,809,584
+Trainable params: 7,809,584
+Non-trainable params: 0
+__________________________________________________________________________________________________
+
+```
+
+---
+## Train and Save
+
+
+```python
+bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback])
+bert_masked_model.save("bert_mlm_imdb.h5")
+```
+
+
+```
+Epoch 1/5
+1563/1563 [==============================] - ETA: 0s - loss: 7.0111{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'this',
+ 'prediction': 'i have watched this this and it was awesome',
+ 'probability': 0.086307295}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'i',
+ 'prediction': 'i have watched this i and it was awesome',
+ 'probability': 0.066265985}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'movie',
+ 'prediction': 'i have watched this movie and it was awesome',
+ 'probability': 0.044195656}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'a',
+ 'prediction': 'i have watched this a and it was awesome',
+ 'probability': 0.04020928}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'was',
+ 'prediction': 'i have watched this was and it was awesome',
+ 'probability': 0.027878676}
+1563/1563 [==============================] - 661s 423ms/step - loss: 7.0111
+Epoch 2/5
+1563/1563 [==============================] - ETA: 0s - loss: 6.4498{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'movie',
+ 'prediction': 'i have watched this movie and it was awesome',
+ 'probability': 0.44448906}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'film',
+ 'prediction': 'i have watched this film and it was awesome',
+ 'probability': 0.1507494}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'is',
+ 'prediction': 'i have watched this is and it was awesome',
+ 'probability': 0.06385628}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'one',
+ 'prediction': 'i have watched this one and it was awesome',
+ 'probability': 0.023549262}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'was',
+ 'prediction': 'i have watched this was and it was awesome',
+ 'probability': 0.022277055}
+1563/1563 [==============================] - 660s 422ms/step - loss: 6.4498
+Epoch 3/5
+1563/1563 [==============================] - ETA: 0s - loss: 5.8709{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'movie',
+ 'prediction': 'i have watched this movie and it was awesome',
+ 'probability': 0.4759983}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'film',
+ 'prediction': 'i have watched this film and it was awesome',
+ 'probability': 0.18642229}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'one',
+ 'prediction': 'i have watched this one and it was awesome',
+ 'probability': 0.045611132}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'is',
+ 'prediction': 'i have watched this is and it was awesome',
+ 'probability': 0.028308254}
+{'input_text': 'i have watched this [mask] and it was awesome',
+ 'predicted mask token': 'series',
+ 'prediction': 'i have watched this series and it was awesome',
+ 'probability': 0.027862877}
+1563/1563 [==============================] - 661s 423ms/step - loss: 5.8709
+Epoch 4/5
+ 771/1563 [=============>................] - ETA: 5:35 - loss: 5.3782
+
+```
+
+---
+## Fine-tune a sentiment classification model
+
+We will fine-tune our self-supervised model on a downstream task of sentiment classification.
+To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the
+pretrained BERT features.
+
+
+```python
+# Load pretrained bert model
+mlm_model = keras.models.load_model(
+ "bert_mlm_imdb.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
+)
+pretrained_bert_model = tf.keras.Model(
+ mlm_model.input, mlm_model.get_layer("encoder_0/ffn_layernormalization").output
+)
+
+# Freeze it
+pretrained_bert_model.trainable = False
+
+
+def create_classifier_bert_model():
+ inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)
+ sequence_output = pretrained_bert_model(inputs)
+ pooled_output = layers.GlobalMaxPooling1D()(sequence_output)
+ hidden_layer = layers.Dense(64, activation="relu")(pooled_output)
+ outputs = layers.Dense(1, activation="sigmoid")(hidden_layer)
+ classifer_model = keras.Model(inputs, outputs, name="classification")
+ optimizer = keras.optimizers.Adam()
+ classifer_model.compile(
+ optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
+ )
+ return classifer_model
+
+
+classifer_model = create_classifier_bert_model()
+classifer_model.summary()
+
+# Train the classifier with frozen BERT stage
+classifer_model.fit(
+ train_classifier_ds,
+ epochs=5,
+ validation_data=test_classifier_ds,
+)
+
+# Unfreeze the BERT model for fine-tuning
+pretrained_bert_model.trainable = True
+optimizer = keras.optimizers.Adam()
+classifer_model.compile(
+ optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
+)
+classifer_model.fit(
+ train_classifier_ds,
+ epochs=5,
+ validation_data=test_classifier_ds,
+)
+```
+
+
+```
+Model: "classification"
+_________________________________________________________________
+Layer (type) Output Shape Param #
+=================================================================
+input_2 (InputLayer) [(None, 256)] 0
+_________________________________________________________________
+model (Functional) (None, 256, 128) 3939584
+_________________________________________________________________
+global_max_pooling1d (Global (None, 128) 0
+_________________________________________________________________
+dense_2 (Dense) (None, 64) 8256
+_________________________________________________________________
+dense_3 (Dense) (None, 1) 65
+=================================================================
+Total params: 3,947,905
+Trainable params: 8,321
+Non-trainable params: 3,939,584
+_________________________________________________________________
+Epoch 1/5
+782/782 [==============================] - 15s 19ms/step - loss: 0.8096 - accuracy: 0.5498 - val_loss: 0.6406 - val_accuracy: 0.6329
+Epoch 2/5
+782/782 [==============================] - 14s 18ms/step - loss: 0.6551 - accuracy: 0.6220 - val_loss: 0.6423 - val_accuracy: 0.6338
+Epoch 3/5
+782/782 [==============================] - 14s 18ms/step - loss: 0.6473 - accuracy: 0.6310 - val_loss: 0.6380 - val_accuracy: 0.6350
+Epoch 4/5
+782/782 [==============================] - 14s 18ms/step - loss: 0.6307 - accuracy: 0.6471 - val_loss: 0.6432 - val_accuracy: 0.6312
+Epoch 5/5
+782/782 [==============================] - 14s 18ms/step - loss: 0.6278 - accuracy: 0.6465 - val_loss: 0.6107 - val_accuracy: 0.6678
+Epoch 1/5
+782/782 [==============================] - 46s 59ms/step - loss: 0.5234 - accuracy: 0.7373 - val_loss: 0.3533 - val_accuracy: 0.8427
+Epoch 2/5
+782/782 [==============================] - 45s 57ms/step - loss: 0.2808 - accuracy: 0.8814 - val_loss: 0.3252 - val_accuracy: 0.8633
+Epoch 3/5
+782/782 [==============================] - 43s 55ms/step - loss: 0.1493 - accuracy: 0.9413 - val_loss: 0.4374 - val_accuracy: 0.8486
+Epoch 4/5
+782/782 [==============================] - 43s 55ms/step - loss: 0.0600 - accuracy: 0.9803 - val_loss: 0.6422 - val_accuracy: 0.8380
+Epoch 5/5
+782/782 [==============================] - 43s 55ms/step - loss: 0.0305 - accuracy: 0.9893 - val_loss: 0.6064 - val_accuracy: 0.8440
+
+
+
+```
+
+---
+## Create an end-to-end model and evaluate it
+
+When you want to deploy a model, it's best if it already includes its preprocessing
+pipeline, so that you don't have to reimplement the preprocessing logic in your
+production environment. Let's create an end-to-end model that incorporates
+the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings
+as input.
+
+
+```python
+
+def get_end_to_end(model):
+ inputs_string = keras.Input(shape=(1,), dtype="string")
+ indices = vectorize_layer(inputs_string)
+ outputs = model(indices)
+ end_to_end_model = keras.Model(inputs_string, outputs, name="end_to_end_model")
+ optimizer = keras.optimizers.Adam(learning_rate=config.LR)
+ end_to_end_model.compile(
+ optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
+ )
+ return end_to_end_model
+
+
+end_to_end_classification_model = get_end_to_end(classifer_model)
+end_to_end_classification_model.evaluate(test_raw_classifier_ds)
+```
+
+
+```
+782/782 [==============================] - 8s 11ms/step - loss: 0.5967 - accuracy: 0.8446
+
+[0.6064175963401794, 0.8439599871635437]
+
+```
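+
+As a final sanity check, the end-to-end model can be called directly on raw strings
+(a usage sketch, not part of the recorded run above; the sample reviews are made up):
+
+```python
+samples = tf.constant(
+    [["this movie was absolutely wonderful"], ["a dull and disappointing film"]]
+)
+# One sigmoid score per review. Note that in this example positive reviews were
+# labeled 0 and negative reviews 1, so higher scores lean towards "negative".
+print(end_to_end_classification_model.predict(samples))
+```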
+
\ No newline at end of file
diff --git a/examples/nlp/mlm_and_finetune_with_bert.py b/examples/nlp/mlm_and_finetune_with_bert.py
new file mode 100644
index 0000000000..9b4ebc07e9
--- /dev/null
+++ b/examples/nlp/mlm_and_finetune_with_bert.py
@@ -0,0 +1,505 @@
+"""
+Title: End-to-end Masked Language Modeling with BERT
+Author: [Ankur Singh](https://twitter.com/ankur310794)
+Date created: 2020/09/18
+Last modified: 2020/09/18
+Description: Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset.
+"""
+"""
+## Introduction
+
+Masked Language Modeling is a fill-in-the-blank task,
+where a model uses the context words surrounding a mask token to try to predict what the
+masked word should be.
+
+For an input that contains one or more mask tokens,
+the model will generate the most likely substitution for each.
+
+Example:
+
+- Input: "I have watched this [MASK] and it was awesome."
+- Output: "I have watched this movie and it was awesome."
+
+Masked language modeling is a great way to train a language
+model in a self-supervised setting (without human-annotated labels).
+Such a model can then be fine-tuned to accomplish various supervised
+NLP tasks.
+
+This example teaches you how to build a BERT model from scratch,
+train it with the masked language modeling task,
+and then fine-tune this model on a sentiment classification task.
+
+We will use the Keras `TextVectorization` and `MultiHeadAttention` layers
+to create a BERT Transformer-Encoder network architecture.
+
+Note: This example should be run with `tf-nightly`.
+"""
+
+"""
+## Setup
+
+Install `tf-nightly` via `pip install tf-nightly`.
+"""
+
+import tensorflow as tf
+from tensorflow import keras
+from tensorflow.keras import layers
+from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
+from dataclasses import dataclass
+import pandas as pd
+import numpy as np
+import glob
+import re
+from pprint import pprint
+
+"""
+## Set-up Configuration
+"""
+
+
+@dataclass
+class Config:
+ MAX_LEN = 256
+ BATCH_SIZE = 32
+ LR = 0.001
+ VOCAB_SIZE = 30000
+ EMBED_DIM = 128
+ NUM_HEAD = 8 # used in bert model
+ FF_DIM = 128 # used in bert model
+ NUM_LAYERS = 1
+
+
+config = Config()
+
+"""
+## Load the data
+
+We will first download the IMDB data and load it into a Pandas dataframe.
+"""
+
+"""shell
+curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
+tar -xf aclImdb_v1.tar.gz
+"""
+
+
+def get_text_list_from_files(files):
+ text_list = []
+ for name in files:
+ with open(name) as f:
+ for line in f:
+ text_list.append(line)
+ return text_list
+
+
+def get_data_from_text_files(folder_name):
+
+ pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt")
+ pos_texts = get_text_list_from_files(pos_files)
+ neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt")
+ neg_texts = get_text_list_from_files(neg_files)
+ df = pd.DataFrame(
+ {
+ "review": pos_texts + neg_texts,
+ "sentiment": [0] * len(pos_texts) + [1] * len(neg_texts),
+ }
+ )
+ df = df.sample(len(df)).reset_index(drop=True)
+ return df
+
+
+train_df = get_data_from_text_files("train")
+test_df = get_data_from_text_files("test")
+
+all_data = train_df.append(test_df)
+
+"""
+## Dataset preparation
+
+We will use the `TextVectorization` layer to vectorize the text into integer token ids.
+It transforms a batch of strings into either
+a sequence of token indices (one sample = 1D array of integer token indices, in order)
+or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).
+
+Below, we define 3 preprocessing functions.
+
+1. The `get_vectorize_layer` function builds the `TextVectorization` layer.
+2. The `encode` function encodes raw text into integer token ids.
+3. The `get_masked_input_and_labels` function will mask input token ids.
+It masks 15% of all input tokens in each sequence at random.
+"""
+
+
+def custom_standardization(input_data):
+ lowercase = tf.strings.lower(input_data)
+    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
+ return tf.strings.regex_replace(
+ stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@\^_`{|}~"), ""
+ )
+
+
+def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
+ """Build Text vectorization layer
+
+ Args:
+        texts (list): List of strings, i.e. the input texts
+        vocab_size (int): Vocabulary size
+        max_seq (int): Maximum sequence length.
+ special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].
+
+ Returns:
+ layers.Layer: Return TextVectorization Keras Layer
+ """
+ vectorize_layer = TextVectorization(
+ max_tokens=vocab_size,
+ output_mode="int",
+ standardize=custom_standardization,
+ output_sequence_length=max_seq,
+ )
+ vectorize_layer.adapt(texts)
+
+ # Insert mask token in vocabulary
+ vocab = vectorize_layer.get_vocabulary()
+ vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
+ vectorize_layer.set_vocabulary(vocab)
+ return vectorize_layer
+
+
+vectorize_layer = get_vectorize_layer(
+ all_data.review.values.tolist(),
+ config.VOCAB_SIZE,
+ config.MAX_LEN,
+ special_tokens=["[mask]"],
+)
+
+# Get mask token id for masked language model
+mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0]
+
+
+def encode(texts):
+ encoded_texts = vectorize_layer(texts)
+ return encoded_texts.numpy()
+
+
+def get_masked_input_and_labels(encoded_texts):
+ # 15% BERT masking
+ inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
+ # Do not mask special tokens
+ inp_mask[encoded_texts <= 2] = False
+    # Set targets to -1 by default; -1 means ignore
+ labels = -1 * np.ones(encoded_texts.shape, dtype=int)
+ # Set labels for masked tokens
+ labels[inp_mask] = encoded_texts[inp_mask]
+
+ # Prepare input
+ encoded_texts_masked = np.copy(encoded_texts)
+    # Of the selected tokens, set 90% to the [MASK] token; 1/9 of those are later
+    # replaced by a random id, giving the usual 80% [MASK] / 10% random / 10% unchanged split
+ inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
+ encoded_texts_masked[
+ inp_mask_2mask
+ ] = mask_token_id # mask token is the last in the dict
+
+ # Set 10% to a random token
+ inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
+ encoded_texts_masked[inp_mask_2random] = np.random.randint(
+ 3, mask_token_id, inp_mask_2random.sum()
+ )
+
+ # Prepare sample_weights to pass to .fit() method
+ sample_weights = np.ones(labels.shape)
+ sample_weights[labels == -1] = 0
+
+ # y_labels would be same as encoded_texts i.e input tokens
+ y_labels = np.copy(encoded_texts)
+
+ return encoded_texts_masked, y_labels, sample_weights
+
+
+# We have 25000 examples for training
+x_train = encode(train_df.review.values) # encode reviews with vectorizer
+y_train = train_df.sentiment.values
+train_classifier_ds = (
+ tf.data.Dataset.from_tensor_slices((x_train, y_train))
+ .shuffle(1000)
+ .batch(config.BATCH_SIZE)
+)
+
+# We have 25000 examples for testing
+x_test = encode(test_df.review.values)
+y_test = test_df.sentiment.values
+test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(
+ config.BATCH_SIZE
+)
+
+# Build dataset for end to end model input (will be used at the end)
+test_raw_classifier_ds = tf.data.Dataset.from_tensor_slices(
+ (test_df.review.values, y_test)
+).batch(config.BATCH_SIZE)
+
+# Prepare data for masked language model
+x_all_review = encode(all_data.review.values)
+x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(
+ x_all_review
+)
+
+mlm_ds = tf.data.Dataset.from_tensor_slices(
+ (x_masked_train, y_masked_labels, sample_weights)
+)
+mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)
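+
+"""
+As a quick sanity check (an illustrative aside, relying only on the variables defined
+above), we can peek at one masked sample: positions with a `sample_weights` value of 1
+are the ones the masked language model loss will be computed over.
+"""
+
+inspect_vocab = vectorize_layer.get_vocabulary()
+sample_id = 0
+masked_positions = np.where(sample_weights[sample_id] == 1)[0]
+print("Masked input ids :", x_masked_train[sample_id][masked_positions])
+print("Tokens to predict:", [inspect_vocab[int(t)] for t in y_masked_labels[sample_id][masked_positions]])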
+
+"""
+## Create BERT model (Pretraining Model) for masked language modeling
+
+We will create a BERT-like pretraining model architecture
+using the `MultiHeadAttention` layer.
+It will take token ids as inputs (including masked tokens)
+and it will predict the correct ids for the masked input tokens.
+"""
+
+
+def bert_module(query, key, value, i):
+ # Multi headed self-attention
+ attention_output = layers.MultiHeadAttention(
+ num_heads=config.NUM_HEAD,
+ key_dim=config.EMBED_DIM // config.NUM_HEAD,
+ name="encoder_{}/multiheadattention".format(i),
+ )(query, key, value)
+ attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))(
+ attention_output
+ )
+ attention_output = layers.LayerNormalization(
+ epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i)
+ )(query + attention_output)
+
+ # Feed-forward layer
+ ffn = keras.Sequential(
+ [
+ layers.Dense(config.FF_DIM, activation="relu"),
+ layers.Dense(config.EMBED_DIM),
+ ],
+ name="encoder_{}/ffn".format(i),
+ )
+ ffn_output = ffn(attention_output)
+ ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))(
+ ffn_output
+ )
+ sequence_output = layers.LayerNormalization(
+ epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i)
+ )(attention_output + ffn_output)
+ return sequence_output
+
+
+def get_pos_encoding_matrix(max_len, d_emb):
+ pos_enc = np.array(
+ [
+ [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]
+ if pos != 0
+ else np.zeros(d_emb)
+ for pos in range(max_len)
+ ]
+ )
+ pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i
+ pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1
+ return pos_enc
+
+
+loss_fn = keras.losses.SparseCategoricalCrossentropy(
+ reduction=tf.keras.losses.Reduction.NONE
+)
+loss_tracker = tf.keras.metrics.Mean(name="loss")
+
+
+class MaskedLanguageModel(tf.keras.Model):
+ def train_step(self, inputs):
+ if len(inputs) == 3:
+ features, labels, sample_weight = inputs
+ else:
+ features, labels = inputs
+ sample_weight = None
+
+ with tf.GradientTape() as tape:
+ predictions = self(features, training=True)
+ loss = loss_fn(labels, predictions, sample_weight=sample_weight)
+
+ # Compute gradients
+ trainable_vars = self.trainable_variables
+ gradients = tape.gradient(loss, trainable_vars)
+
+ # Update weights
+ self.optimizer.apply_gradients(zip(gradients, trainable_vars))
+
+ # Compute our own metrics
+ loss_tracker.update_state(loss, sample_weight=sample_weight)
+
+ # Return a dict mapping metric names to current value
+ return {"loss": loss_tracker.result()}
+
+ @property
+ def metrics(self):
+ # We list our `Metric` objects here so that `reset_states()` can be
+ # called automatically at the start of each epoch
+ # or at the start of `evaluate()`.
+ # If you don't implement this property, you have to call
+ # `reset_states()` yourself at the time of your choosing.
+ return [loss_tracker]
+
+
+def create_masked_language_bert_model():
+ inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)
+
+ word_embeddings = layers.Embedding(
+ config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
+ )(inputs)
+ position_embeddings = layers.Embedding(
+ input_dim=config.MAX_LEN,
+ output_dim=config.EMBED_DIM,
+ weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],
+ name="position_embedding",
+ )(tf.range(start=0, limit=config.MAX_LEN, delta=1))
+ embeddings = word_embeddings + position_embeddings
+
+ encoder_output = embeddings
+ for i in range(config.NUM_LAYERS):
+ encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)
+
+ mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
+ encoder_output
+ )
+ mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")
+
+ optimizer = keras.optimizers.Adam(learning_rate=config.LR)
+ mlm_model.compile(optimizer=optimizer)
+ return mlm_model
+
+
+id2token = dict(enumerate(vectorize_layer.get_vocabulary()))
+token2id = {y: x for x, y in id2token.items()}
+
+
+class MaskedTextGenerator(keras.callbacks.Callback):
+ def __init__(self, sample_tokens, top_k=5):
+ self.sample_tokens = sample_tokens
+ self.k = top_k
+
+ def decode(self, tokens):
+ return " ".join([id2token[t] for t in tokens if t != 0])
+
+ def convert_ids_to_tokens(self, id):
+ return id2token[id]
+
+ def on_epoch_end(self, epoch, logs=None):
+ prediction = self.model.predict(self.sample_tokens)
+
+ masked_index = np.where(self.sample_tokens == mask_token_id)
+ masked_index = masked_index[1]
+ mask_prediction = prediction[0][masked_index]
+
+ top_indices = mask_prediction[0].argsort()[-self.k :][::-1]
+ values = mask_prediction[0][top_indices]
+
+ for i in range(len(top_indices)):
+ p = top_indices[i]
+ v = values[i]
+ tokens = np.copy(sample_tokens[0])
+ tokens[masked_index[0]] = p
+ result = {
+ "input_text": self.decode(sample_tokens[0].numpy()),
+ "prediction": self.decode(tokens),
+ "probability": v,
+ "predicted mask token": self.convert_ids_to_tokens(p),
+ }
+ pprint(result)
+
+
+sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])
+generator_callback = MaskedTextGenerator(sample_tokens.numpy())
+
+bert_masked_model = create_masked_language_bert_model()
+bert_masked_model.summary()
+
+"""
+## Train and Save
+"""
+
+bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback])
+bert_masked_model.save("bert_mlm_imdb.h5")
+
+"""
+## Fine-tune a sentiment classification model
+
+We will fine-tune our self-supervised model on a downstream task of sentiment classification.
+To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the
+pretrained BERT features.
+
+"""
+
+# Load pretrained bert model
+mlm_model = keras.models.load_model(
+ "bert_mlm_imdb.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
+)
+pretrained_bert_model = tf.keras.Model(
+ mlm_model.input, mlm_model.get_layer("encoder_0/ffn_layernormalization").output
+)
+
+# Freeze it
+pretrained_bert_model.trainable = False
+
+
+def create_classifier_bert_model():
+ inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)
+ sequence_output = pretrained_bert_model(inputs)
+ pooled_output = layers.GlobalMaxPooling1D()(sequence_output)
+ hidden_layer = layers.Dense(64, activation="relu")(pooled_output)
+ outputs = layers.Dense(1, activation="sigmoid")(hidden_layer)
+ classifer_model = keras.Model(inputs, outputs, name="classification")
+ optimizer = keras.optimizers.Adam()
+ classifer_model.compile(
+ optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
+ )
+ return classifer_model
+
+
+classifer_model = create_classifier_bert_model()
+classifer_model.summary()
+
+# Train the classifier with frozen BERT stage
+classifer_model.fit(
+ train_classifier_ds, epochs=5, validation_data=test_classifier_ds,
+)
+
+# Unfreeze the BERT model for fine-tuning
+pretrained_bert_model.trainable = True
+optimizer = keras.optimizers.Adam()
+classifer_model.compile(
+ optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
+)
+classifer_model.fit(
+ train_classifier_ds, epochs=5, validation_data=test_classifier_ds,
+)
+
+"""
+## Create an end-to-end model and evaluate it
+
+When you want to deploy a model, it's best if it already includes its preprocessing
+pipeline, so that you don't have to reimplement the preprocessing logic in your
+production environment. Let's create an end-to-end model that incorporates
+the `TextVectorization` layer, and let's evaluate. Our model will accept raw strings
+as input.
+"""
+
+
+def get_end_to_end(model):
+ inputs_string = keras.Input(shape=(1,), dtype="string")
+ indices = vectorize_layer(inputs_string)
+ outputs = model(indices)
+ end_to_end_model = keras.Model(inputs_string, outputs, name="end_to_end_model")
+ optimizer = keras.optimizers.Adam(learning_rate=config.LR)
+ end_to_end_model.compile(
+ optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
+ )
+ return end_to_end_model
+
+
+end_to_end_classification_model = get_end_to_end(classifer_model)
+end_to_end_classification_model.evaluate(test_raw_classifier_ds)