Add Masked language modeling with Bert example #244

Merged
merged 13 commits on Sep 23, 2020
mlm and fine tune with bert from scratch
I have implemented a Masked Language Model and a fine-tuned model with BERT from scratch using tf-nightly.
Ankur3107 authored Sep 16, 2020
commit c5f5ef140793776c495c1a12764278a894622f82
290 changes: 290 additions & 0 deletions examples/nlp/mlm_and_finetune_with_bert.py
@@ -0,0 +1,290 @@
"""
Title: End to End Masked Language Modeling & Fine-Tuning with BERT from Scratch
Author: [Ankur Singh](https://twitter.com/ankur310794)
Date created: 2020/09/03
Last modified: 2020/09/16
Description: Implement Masked Language Modeling with BERT and fine-tune it on the IMDB Reviews dataset.
"""
"""
## Introduction
Masked language modeling is a fill-in-the-blank task, where a model uses the context words surrounding a [MASK] token to try to predict what the [MASK] word should be. For example, given the input "I have watched this [MASK] and it was awesome", the model should predict a likely word such as "movie" for the masked position.
"""

"""
## Setup
Install `tf-nightly` via `pip install tf-nightly`.
"""

import tensorflow as tf
from dataclasses import dataclass
import pandas as pd
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import numpy as np

"""
## Set-up Configuration
"""


@dataclass
class Config:
    MAX_LEN = 128
    BATCH_SIZE = 16
    LR = 1e-5
    VOCAB_SIZE = 20000
    EMBED_DIM = 32
    NUM_HEAD = 2  # used in bert model
    FF_DIM = 32  # used in bert model


flags = Config()

"""
## Load Data
"""

"""shell
wget https://raw.githubusercontent.com/SrinidhiRaghavan/AI-Sentiment-Analysis-on-IMDB-Dataset/master/imdb_tr.csv
"""

data = pd.read_csv("imdb_tr.csv", encoding="ISO-8859-1")

"""
## Dataset Preparation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each subsection, please add a text paragraph explaining what you're doing:

  • How do you prepare the dataset? what do your samples end up looking like?
  • What does the masked language model do (what are its inputs are outputs)?
  • What does the classification model do?
  • What does the end to end model do?
  • How to interpret evaluation metrics?

"""


def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
    """Build Text vectorization layer.

    Args:
      texts (list): List of strings, i.e. input texts.
      vocab_size (int): Vocabulary size.
      max_seq (int): Maximum sequence length.
      special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].
Contributor:
In practice, this can just be mask_token='[MASK]'.


    Returns:
        tf.keras.layers.Layer: Return TextVectorization Keras Layer.
    """
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=None,
        output_sequence_length=max_seq,
    )

    vectorize_layer.adapt(texts)

    vocab = vectorize_layer.get_vocabulary()
    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[MASK]"]
Contributor:
If you put it at the start of the list you won't need to compute mask_token_id by hand (it will be 0).

Contributor Author (@Ankur3107, Sep 17, 2020):
I am a little bit confused. My understanding is that index 0 is reserved for the padding token. (Reference)

Also, there does not seem to be an argument called mask_token in the class constructor of TextVectorization. (Reference)

I need to insert special tokens ('[MASK]' in this case) into the vocabulary, and I would like to keep index zero for padding.

Am I thinking right?

Contributor:
Got it. The mask token is normally index 0, so you could just insert it at index 0, which is what I was saying. Note that there is a mask token already; by default it is the empty string. You could simply replace [MASK] with the empty string in your code and you wouldn't need to configure it (and its index would be 0).

Note, the purpose of the default mask token is to handle padding, so your data will already contain 0s for sections that are padded (sequences shorter than MAX_LEN). Not sure if that forces you to use a separate mask token...

I agree with you that it should be configurable in the constructor of TextVectorization. We will add it (alongside a couple of other arguments). Internally, TextVectorization uses a StringLookup layer which allows this customization. However, if we configure it in this way it would still be used for padding as well.

    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer


vectorize_layer = get_vectorize_layer(
    data.text.values.tolist(),
    flags.VOCAB_SIZE,
    flags.MAX_LEN,
    special_tokens=["[MASK]"],
)

# get mask token id for mlm
mask_token_id = vectorize_layer(["[MASK]"]).numpy()[0][0]


class IMDBReviewsDataGenerator(tf.keras.utils.Sequence):
Contributor:
Is there any way to handle this using tf.data.Dataset instead?

Contributor Author:
Ok, I will try to implement it with tf.data.Dataset. (A sketch of this approach follows the generator instantiation below.)

"""Generates batches of data

Args:
texts (list): List of String i.e input texts
labels (list): List of Labels i.e for classification model only
batch_size (int): batch size
vectorizer (TextVectorization): Keras TextVectorization layer
is_training (bool, optional): Is training dataset generator. Defaults to True.
Contributor:
Prefer using include_labels=True, which is more explicit.

      model_type (str, optional): Used as a flag for mlm and classifier. Defaults to 'classification'.
    """

    def __init__(
        self,
        texts,
        labels,
        batch_size,
        vectorizer,
        is_training=True,
        model_type="classification",
    ):
        self.texts = texts
        self.labels = labels
        self.batch_size = batch_size
        self.vectorizer = vectorizer
        self.is_training = is_training
        self.model_type = model_type
        self.indexes = np.arange(len(self.texts))

    def __len__(self):
        # Denotes the number of batches per epoch.
        return len(self.texts) // self.batch_size

    def __getitem__(self, idx):
        # Retrieves the batch of index.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        texts = self.texts[indexes]

        encoded = self.vectorizer(texts).numpy()

        if self.is_training and self.model_type == "mlm":
            X_mlm, y_labels, sample_weights = self.prepare_mlm_input_and_labels(encoded)
            return (X_mlm, y_labels, sample_weights)

        elif self.is_training and self.model_type == "classification":
            labels = np.array(self.labels[indexes], dtype="int32")
            return (encoded, labels)

        else:
            return encoded

    def prepare_mlm_input_and_labels(self, X):
        # 15% BERT masking
        inp_mask = np.random.rand(*X.shape) < 0.15
        # do not mask special tokens
        inp_mask[X <= 2] = False
        # set targets to -1 by default, it means ignore
        labels = -1 * np.ones(X.shape, dtype=int)
        # set labels for masked tokens
        labels[inp_mask] = X[inp_mask]

        # prepare input
        X_mlm = np.copy(X)
Contributor:
Style nit: variable names should be lowercase. Please always use fully-spelled-out variable names for better readability, instead of abbreviations or single letters.

        # set input to [MASK] (the last token in the dict) for 90% of the masked tokens;
        # this means leaving 10% unchanged
        inp_mask_2mask = inp_mask & (np.random.rand(*X.shape) < 0.90)
        X_mlm[inp_mask_2mask] = mask_token_id  # mask token is the last in the dict

        # set 10% to a random token
        inp_mask_2random = inp_mask_2mask & (np.random.rand(*X.shape) < 1 / 9)
        X_mlm[inp_mask_2random] = np.random.randint(
            3, mask_token_id, inp_mask_2random.sum()
        )

        # prepare sample_weights to pass to .fit() method
        sample_weights = np.ones(labels.shape)
        sample_weights[labels == -1] = 0

        # y_labels would be the same as X, i.e. the input tokens
        y_labels = np.copy(X)

        return X_mlm, y_labels, sample_weights


classifier_train_data = IMDBReviewsDataGenerator(
    data["text"].values.astype("str"),
    data["polarity"].values.astype("int"),
    flags.BATCH_SIZE,
    vectorize_layer,
)
mlm_train_data = IMDBReviewsDataGenerator(
    data["text"].values.astype("str"),
    None,
    flags.BATCH_SIZE,
    vectorize_layer,
    model_type="mlm",
)
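
For reference, a minimal sketch (not part of this commit) of the tf.data.Dataset approach suggested in the review thread above, shown for the classification data only (the numpy-based MLM masking in prepare_mlm_input_and_labels would still need a separate step). It reuses data, vectorize_layer, and flags from above; the name classifier_train_ds is hypothetical:

# Sketch only: batch the raw strings first, then vectorize each batch of texts.
classifier_train_ds = (
    tf.data.Dataset.from_tensor_slices(
        (data["text"].values.astype("str"), data["polarity"].values.astype("int32"))
    )
    .shuffle(1000)
    .batch(flags.BATCH_SIZE)
    .map(lambda texts, labels: (vectorize_layer(texts), labels))
)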

"""
## Create BERT Model From Scratch
"""


class BERT:
Contributor:
This class is intended to be used for Functional model construction, apparently. Currently it looks a lot like a subclassed model (with layer creation in __call__), which is confusing. Please refactor it as a function.

Contributor Author:
Ok, I will refactor it into a function. (A sketch of such a refactor follows the class below.)

    def __init__(self, flags):
        self.flags = flags

    def __call__(self, inputs):
        # embedding layer
        embedding_layer = tf.keras.layers.Embedding(flags.VOCAB_SIZE, flags.EMBED_DIM)(
            inputs
        )

        # query, value, key for attention
        query = tf.keras.layers.Dense(flags.EMBED_DIM)(embedding_layer)
Contributor:
Still nit: use

from tensorflow.keras import layers

then layers.Embedding, etc.

Contributor Author:
ok I will refactor in the same way.

        value = tf.keras.layers.Dense(flags.EMBED_DIM)(embedding_layer)
        key = tf.keras.layers.Dense(flags.EMBED_DIM)(embedding_layer)

        # multi-headed self-attention
        attention_output = tf.keras.layers.MultiHeadAttention(
            num_heads=flags.NUM_HEAD, key_dim=flags.EMBED_DIM // flags.NUM_HEAD
        )(query, value, key)
        attention_output = tf.keras.layers.Dropout(0.1)(attention_output)
        attention_output = tf.keras.layers.LayerNormalization(epsilon=1e-6)(
            embedding_layer + attention_output
        )

        # feed-forward layer
        ffn = tf.keras.Sequential(
Contributor:
Style nit: keras.Sequential

Contributor Author:
ok I will refactor in the same way.

            [
                tf.keras.layers.Dense(flags.FF_DIM, activation="relu"),
                tf.keras.layers.Dense(flags.EMBED_DIM),
            ]
        )
        ffn_output = ffn(attention_output)
        ffn_output = tf.keras.layers.Dropout(0.1)(ffn_output)
        sequence_output = tf.keras.layers.LayerNormalization(epsilon=1e-6)(
            attention_output + ffn_output
        )

        pooled_output = tf.keras.layers.GlobalMaxPooling1D()(ffn_output)

        return sequence_output, pooled_output
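
For reference, a minimal sketch (not part of this commit) of the function-style refactor requested above, also using the "from tensorflow.keras import layers" style suggested in a later comment; the name bert_module is hypothetical:

from tensorflow.keras import layers


def bert_module(inputs, flags):
    # Token embeddings.
    embedding_layer = layers.Embedding(flags.VOCAB_SIZE, flags.EMBED_DIM)(inputs)

    # Query, value, key projections for self-attention.
    query = layers.Dense(flags.EMBED_DIM)(embedding_layer)
    value = layers.Dense(flags.EMBED_DIM)(embedding_layer)
    key = layers.Dense(flags.EMBED_DIM)(embedding_layer)

    # Multi-headed self-attention, with residual connection and layer norm.
    attention_output = layers.MultiHeadAttention(
        num_heads=flags.NUM_HEAD, key_dim=flags.EMBED_DIM // flags.NUM_HEAD
    )(query, value, key)
    attention_output = layers.Dropout(0.1)(attention_output)
    attention_output = layers.LayerNormalization(epsilon=1e-6)(
        embedding_layer + attention_output
    )

    # Feed-forward block, with residual connection and layer norm.
    ffn = tf.keras.Sequential(
        [
            layers.Dense(flags.FF_DIM, activation="relu"),
            layers.Dense(flags.EMBED_DIM),
        ]
    )
    ffn_output = ffn(attention_output)
    ffn_output = layers.Dropout(0.1)(ffn_output)
    sequence_output = layers.LayerNormalization(epsilon=1e-6)(
        attention_output + ffn_output
    )

    # Pooled representation for the classification head.
    pooled_output = layers.GlobalMaxPooling1D()(ffn_output)

    return sequence_output, pooled_output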


def build_mlm_model():
    tf.keras.backend.clear_session()
Contributor:
You should be able to remove this statement.

Contributor Author:
ok I will refactor.

    inputs = tf.keras.layers.Input((flags.MAX_LEN,), dtype=tf.int32)
    sequence_output, pooled_output = BERT(flags)(inputs)
    outputs = tf.keras.layers.Dense(flags.VOCAB_SIZE)(sequence_output)
    mlm_model = tf.keras.Model(inputs, outputs)

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE, from_logits=True
Contributor:
Why no reduction? Since you're using compile/fit, the results will be averaged for reduction anyway (since some form of reduction is necessary).

    )
    optimizer = tf.keras.optimizers.Adam(learning_rate=flags.LR)
    mlm_model.compile(optimizer=optimizer, loss=loss_fn)
    return mlm_model
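
Per the reviewer's comment above, the explicit Reduction.NONE could likely be dropped, since compile()/fit() apply a sample-weight-aware reduction over the per-token losses anyway; a sketch of the simpler form:

# Sketch only (not part of this commit): rely on the default loss reduction.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)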


def build_classifer_model():
    tf.keras.backend.clear_session()
Contributor:
You should be able to remove this statement.

    inputs = tf.keras.layers.Input((flags.MAX_LEN,), dtype=tf.int32)
    sequence_output, pooled_output = BERT(flags)(inputs)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(pooled_output)
    classifer_model = tf.keras.Model(inputs, outputs)
    optimizer = tf.keras.optimizers.Adam(learning_rate=flags.LR)
    classifer_model.compile(
        optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
    )
    return classifer_model


def get_end_to_end(model):
    inputs_string = tf.keras.Input(shape=(1,), dtype="string")
    indices = vectorize_layer(inputs_string)
    outputs = model(indices)
    end_to_end_model = tf.keras.Model(inputs_string, outputs, name="end_to_end")
    return end_to_end_model


"""
## Build, Train, Save MLM
"""

mlm_model = build_mlm_model()
mlm_model.fit(mlm_train_data, epochs=1)
Contributor:
Any call to fit() should include a way to monitor results on an evaluation set.

Why only train for 1 epoch?

Contributor:
Note: since you're training using a generator, consider using multiple workers for better performance.

Contributor Author:
I used one epoch because of time; it is taking 22 min per epoch. I will use multiple workers for better performance and train for 3 epochs.

Contributor Author:
Actually, I am not able to install tf-nightly-gpu on Colab or Kaggle (so I am running without a GPU). I tried to install it but I was unable to get GPU support.
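
A minimal sketch (not part of this commit) of the monitored, multi-worker training the reviewer asks for above; mlm_validation_data is a hypothetical held-out IMDBReviewsDataGenerator built the same way as mlm_train_data:

# Sketch only: monitor a validation set and parallelize the Python generator.
mlm_model.fit(
    mlm_train_data,
    validation_data=mlm_validation_data,  # hypothetical held-out generator
    epochs=3,
    workers=4,
    use_multiprocessing=True,
)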

mlm_model.save("mlm_model.h5")

"""
## Build, Train, Save Fine-Tune Model
"""

classifer_model = build_classifer_model()
classifer_model.load_weights("mlm_model.h5", skip_mismatch=True, by_name=True)
classifer_model.fit(classifier_train_data, epochs=1)
Contributor:
Please add a section about model evaluation and using the model in practice on a new sample.

Contributor Author:
ok
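
A minimal sketch (not part of this commit) of the requested evaluation-on-a-new-sample section, using the get_end_to_end wrapper defined earlier; the review text is made up:

# Sketch only: wrap the trained classifier so it accepts raw strings, then
# score a new review. The sigmoid output is the predicted probability of
# positive polarity.
end_to_end_classifier = get_end_to_end(classifer_model)
probabilities = end_to_end_classifier.predict(
    tf.constant([["this movie was a complete waste of time"]])
)
print(probabilities)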
