update readme and supported tasks (mindspore-lab#660)
lvyufeng authored Sep 15, 2023
1 parent 5279a3a commit f462435
Showing 11 changed files with 95 additions and 231 deletions.
153 changes: 30 additions & 123 deletions README.md
@@ -25,159 +25,66 @@
[Tutorials](#tutorials) |
[Notes](#notes)

## News 📢
### News 📢

* 🔥 **Latest Features**
* 📃 Support PreTrained Models, including **[BERT](./mindnlp/models/bert)**, **[Roberta](./mindnlp/models/roberta)**, **[GPT2](./mindnlp/models/gpt2)** and **[T5](./mindnlp/models/t5)**.
You can use them by following code snippet:

* 📝 MindNLP supports NLP tasks such as *language model*, *machine translation*, *question answering*, *sentiment analysis*, *sequence labeling*, *summarization*, etc. You can access them through [examples](./examples/).
* 🚀 MindNLP currently supports industry-leading Large Language Models (LLMs), including **Llama**, **GLM**, **RWKV**, etc. Demo examples for ***pre-training***, ***fine-tuning***, and ***inference*** are in the ["llm" directory](./llm/).
* 🤗 Pretrained models support ***Hugging Face Transformers-like APIs***, covering **28+** models such as **[BERT](./mindnlp/models/bert)**, **[Roberta](./mindnlp/models/roberta)**, **[GPT2](./mindnlp/models/gpt2)**, **[T5](./mindnlp/models/t5)**, etc.
You can use them with the following code snippet:
```python
from mindnlp.models import BertModel

model = BertModel.from_pretrained('bert-base-cased')
```
### Installation

Version Compatibility:

| MindNLP version | MindSpore version | Supported Python version |
|-----------------|-------------------|--------------------------|
| master | daily build | >=3.7.5, <=3.9 |
| 0.1.1 | >=1.8.1, <=2.0.0 | >=3.7.5, <=3.9 |
| 0.2.0 | >=2.1.0 | >=3.7.5, <=3.9 |
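
As a rough illustration of the Python range in the table above, the check below tests whether an interpreter version falls inside `>=3.7.5, <=3.9`. The helper name `python_supported` is hypothetical and not part of MindNLP, and reading `<=3.9` as "any 3.9.x patch release" is an assumption.

```python
import sys

# Bounds copied from the compatibility table; treating "<=3.9" as
# "any 3.9.x patch release" is an assumption, not a documented rule.
MIN_PYTHON = (3, 7, 5)
MAX_PYTHON_MINOR = (3, 9)


def python_supported(version=None):
    """Return True if (major, minor, micro) lies inside the table's range.

    Hypothetical helper for illustration only; defaults to the running interpreter.
    """
    major, minor, micro = tuple(version or sys.version_info)[:3]
    return (major, minor, micro) >= MIN_PYTHON and (major, minor) <= MAX_PYTHON_MINOR


if __name__ == "__main__":
    print("Supported interpreter:", python_supported())
```

Tuple comparison keeps the check dependency-free; a packaging-aware project might prefer a proper version-specifier library instead.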

## Introduction

MindNLP is an open source NLP library based on MindSpore. It provides a platform for solving natural language processing tasks and includes many common NLP approaches. It helps researchers and developers build and train models more conveniently and quickly.
#### Daily build

The master branch works with **MindSpore master**.
You can download MindNLP daily wheel from [here](https://repo.mindspore.cn/mindspore-lab/mindnlp/newest/any/).

### Major Features

- **Comprehensive data processing**: Several classical NLP datasets, such as Multi30k, SQuAD, and CoNLL, are packaged into easy-to-use modules.
- **Friendly NLP model toolset**: MindNLP provides various configurable components, which makes it easy to customize models.
- **Easy-to-use engine**: MindNLP simplifies the complicated training process in MindSpore. It provides Trainer and Evaluator interfaces to train and evaluate models easily.

## Quick Links

- [Documentation](https://mindnlp.cqu.ai/en/latest/)
- [Examples](https://github.com/mindspore-lab/mindnlp/tree/master/examples)
- ...

## Installation

### Dependency

- mindspore >= 1.8.1

### Install from source
#### Install from source

To install MindNLP from source, please run:

```bash
pip install git+https://github.com/mindspore-lab/mindnlp.git
```

or

```bash
# or
git clone https://github.com/mindspore-lab/mindnlp.git
cd mindnlp
bash scripts/build_and_reinstall.sh
```


## Get Started

We will next quickly implement a sentiment classification task by using mindnlp.

### Define Model

```python
from mindspore import ops
from mindnlp.abc import Seq2vecModel

class SentimentClassification(Seq2vecModel):
    def construct(self, text):
        _, (hidden, _), _ = self.encoder(text)
        context = ops.concat((hidden[-2, :, :], hidden[-1, :, :]), axis=1)
        output = self.head(context)
        return output
```

### Define Hyperparameters
The following are some of the required hyperparameters in the model training process.
```python
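# define hyperparameters for the model, loss and optimizer
hidden_size = 256     # LSTM hidden state size
output_size = 1       # single output for binary sentiment
num_layers = 2        # stacked LSTM layers
bidirectional = True
dropout = 0.5         # dropout probability used by the LSTM and the head
lr = 0.001            # learning rate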
```

### Data Preprocessing
The dataset was downloaded and preprocessed by calling the interface of dataset in mindnlp.

Load dataset:
```python
from mindnlp import load_dataset

imdb_train, imdb_test = load_dataset('imdb', shuffle=True)
```
### Introduction

Initializes the vocab and tokenizer for preprocessing:
```python
from mindnlp import Vocab
from mindnlp.transforms import BasicTokenizer

tokenizer = BasicTokenizer(True)
vocab = Vocab.from_pretrained(name="glove.6B.100d")
```
MindNLP is an open source NLP library based on MindSpore. It provides a platform for solving natural language processing tasks and includes many common NLP approaches. It helps researchers and developers build and train models more conveniently and quickly.

The loaded dataset is preprocessed and divided into training and validation:
```python
from mindnlp.dataset import process
The master branch works with **MindSpore master**.

imdb_train = process('imdb', imdb_train, tokenizer=tokenizer, vocab=vocab,
                     bucket_boundaries=[400, 500], max_len=600, drop_remainder=True)
imdb_test = process('imdb', imdb_test, tokenizer=tokenizer, vocab=vocab,
                    bucket_boundaries=[400, 500], max_len=600, drop_remainder=False)
```
#### Major Features

### Instantiate Model
```python
import math

from mindspore import nn
from mindspore.common.initializer import HeUniform, Uniform
from mindnlp.modules import RNNEncoder, Glove

embedding = Glove.from_pretrained('6B', 100, special_tokens=["<unk>", "<pad>"])
# build encoder
lstm_layer = nn.LSTM(100, hidden_size, num_layers=num_layers, batch_first=True,
                     dropout=dropout, bidirectional=bidirectional)
encoder = RNNEncoder(embedding, lstm_layer)

# build head
head = nn.SequentialCell([
    nn.Dropout(p=dropout),
    nn.Sigmoid(),
    nn.Dense(hidden_size * 2, output_size,
             weight_init=HeUniform(math.sqrt(5)),
             bias_init=Uniform(1 / math.sqrt(hidden_size * 2)))
])

# build network
network = SentimentClassification(encoder, head)
loss = nn.BCELoss(reduction='mean')
optimizer = nn.Adam(network.trainable_params(), learning_rate=lr)
```
- **Comprehensive data processing**: Several classical NLP datasets, such as Multi30k, SQuAD, and CoNLL, are packaged into easy-to-use modules.
- **Friendly NLP model toolset**: MindNLP provides various configurable components, which makes it easy to customize models.
- **Easy-to-use engine**: MindNLP simplifies the complicated training process in MindSpore. It provides Trainer and Evaluator interfaces to train and evaluate models easily.

### Training Process
Now that we have completed all the preparations, we can begin to train the model.
```python
from mindnlp.engine.metrics import Accuracy
from mindnlp.engine.trainer import Trainer
### Quick Links

# define metrics
metric = Accuracy()
- [Documentation](https://mindnlp.cqu.ai/en/latest/)
- [Tutorials](./tutorials/)
- [Examples](./examples)
- [LLMs](./llm)
- ...

# define trainer
trainer = Trainer(network=network, train_dataset=imdb_train, eval_dataset=imdb_test, metrics=metric,
                  epochs=5, loss_fn=loss, optimizer=optimizer)
trainer.run(tgt_columns="label")
```

<!-- ## Tutorials
70 changes: 61 additions & 9 deletions examples/README.md
@@ -1,11 +1,63 @@
# MindNLP Examples

The examples catalogue mainly provides rich application examples covering mainstream NLP tasks to help developers accelerate problem solving.

### NLP Tasks

- [x] Sentiment Analysis
- [x] Language Model
- [x] Machine Translation
- [x] Question Answer
- [x] Sequence Labeling
MindNLP currently supports a variety of different NLP tasks and offers a wide range of state-of-the-art open-source models. We provide them in the form of examples.

## Supported Tasks in MindNLP 💡

MindNLP is a versatile repository that supports a variety of natural language processing tasks. It offers a wide array of state-of-the-art models for these tasks. Here's a brief overview:

### Classification 📊

MindNLP supports text classification tasks, including sentiment analysis, document classification, and more. You can quickly classify text into predefined categories or analyze sentiment.

| Task | Model | Dataset | Example |
|--------------------|---------|----------|---------|
| Sentiment analysis | BERT | Emotect | [Notebook](./classification/bert_emotect_finetune.ipynb) |
| | GPT | IMDB | [Notebook](./classification/gpt_imdb_finetune.ipynb) |
| | Bi-LSTM | IMDB | [Notebook](./classification/bilstm_imdb_concise.ipynb) |
| Chinese news | NeZha | THUCNews | [Notebook](./classification/nezha_classification.ipynb) |

### Language Model 🧠

MindNLP provides access to cutting-edge language models, which can be used for tasks like text generation, text completion, and text classification. These models are highly capable of understanding and generating human-like text.

| Model | Dataset | Example |
|---------|----------|---------|
| FastText | AGNews | [Script](./language_model/fasttext.py) |

### Machine Translation 🌐

MindNLP supports machine translation, allowing you to translate text from one language to another. It covers a wide range of language pairs and ensures accurate translations.

| Model | Dataset | Example |
|---------|----------|---------|
| Seq2seq(GRU) | Multi30k | [Notebook](./machine_translation/mindspore_sequence_to_sequence.ipynb) |

### Question Answer❓

You can build question answering systems using MindNLP. Given a context and a question, these models can extract answers directly from the provided text.

| Model | Dataset | Example |
|---------|----------|---------|
| BiDAF | SQuAD1 | [Notebook](./question_answer/bidaf_squad_concise.ipynb) |

### Sequence Labeling 🏷️

For tasks like named entity recognition (NER) and part-of-speech tagging, MindNLP offers sequence labeling models. These models can identify and label entities or segments within a text.

| Task | Model | Dataset | Example |
|--------------------|---------|----------|---------|
| Named Entity Recognition | Bi-LSTM+CRF | CoNLL2003 | [Notebook](./sequence_labeling/LSTM-CRF.ipynb) |
| | BERT+Bi-LSTM+CRF | CoNLL2003 | [Notebook](./sequence_labeling/Bert-LSTM-CRF.ipynb) |

### Text Generation 📝

MindNLP includes models for text generation, which can create new text based on provided prompts, generate creative content, or produce concise summaries of long documents or articles.

| Task | Model | Dataset | Example |
|--------------------|---------|----------|---------|
| Summarization | GPT2 | NLPCC2017 | [Notebook](./text_generation/gpt2_summarization.ipynb) |

<!-- ### Language Understanding 🧐
In addition to the mentioned tasks, MindNLP supports various other language understanding tasks, including text entailment, paraphrasing, and more. -->
95 changes: 0 additions & 95 deletions examples/sequence_labeling/sequence_tagging.py

This file was deleted.

6 changes: 3 additions & 3 deletions tests/ut/models/bart/test_modeling_bart.py
@@ -135,7 +135,7 @@ def test_bart_decoder(self):
         if self.use_amp:
             model = mindnlp._legacy.amp.auto_mixed_precision(model)

-        input_ids = Tensor(np.random.randn(1, 2), mindspore.int32)
+        input_ids = Tensor(np.random.randint(0, self.config.vocab_size, (1, 2)), mindspore.int32)
         outputs = model(input_ids)
         assert outputs[0].shape == (1, 2, self.config.d_model)

@@ -148,7 +148,7 @@ def test_bart_model(self):
         if self.use_amp:
             model = mindnlp._legacy.amp.auto_mixed_precision(model)

-        input_ids = Tensor(np.random.randn(1, 2), mindspore.int32)
+        input_ids = Tensor(np.random.randint(0, self.config.vocab_size, (1, 2)), mindspore.int32)
         outputs = model(input_ids)
         assert outputs[0].shape == (1, 2, self.config.d_model)

@@ -161,7 +161,7 @@ def test_bart_for_conditional_generation(self):
         if self.use_amp:
             model = mindnlp._legacy.amp.auto_mixed_precision(model)

-        input_ids = Tensor(np.random.randn(1, 2), mindspore.int32)
+        input_ids = Tensor(np.random.randint(0, self.config.vocab_size, (1, 2)), mindspore.int32)
         outputs = model(input_ids)
         assert outputs[0].shape == (1, 2, self.config.vocab_size)

2 changes: 1 addition & 1 deletion tests/ut/models/rwkv/test_modeling_rwkv.py
@@ -42,7 +42,7 @@ def test_modeling_rwkv(self):
         """
         model = RwkvModel(self.config)

-        input_ids = Tensor(np.random.randn(1, self.config.context_length), mindspore.int32)
+        input_ids = Tensor(np.random.randint(0, self.config.vocab_size, (1, self.config.context_length)), mindspore.int32)

         def forward(input_ids):
             outputs = model(input_ids)
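
The BART and RWKV test changes all replace `np.random.randn` with `np.random.randint` when building `input_ids`. The NumPy-only sketch below shows why that likely matters: `randn` produces standard-normal floats, which become mostly zeros and occasionally negative values once converted to `int32`, whereas `randint` draws valid token ids in `[0, vocab_size)`. `VOCAB_SIZE` is a hypothetical stand-in for `self.config.vocab_size`, and the sketch assumes the `Tensor(..., mindspore.int32)` conversion truncates the same way as NumPy's `astype`.

```python
import numpy as np

VOCAB_SIZE = 50  # hypothetical value; stands in for self.config.vocab_size

# Old pattern: standard-normal floats. Converting them to int32 truncates
# toward zero, so the "ids" are mostly 0 and can be negative.
old_ids = np.random.randn(1, 8).astype(np.int32)

# New pattern: integers drawn uniformly from [0, VOCAB_SIZE),
# i.e. every value is a valid index into the embedding table.
new_ids = np.random.randint(0, VOCAB_SIZE, (1, 8)).astype(np.int32)

print("randn   ->", old_ids)
print("randint ->", new_ids)
```

Negative ids in particular are out of range for an embedding lookup, so the `randint` form exercises the models with inputs closer to real tokenized text.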
