Commit

renaming stanfordnlp to stanza!!
yuhui-zh15 committed Mar 6, 2020
1 parent 289f148 commit 451a923
Showing 101 changed files with 416 additions and 416 deletions.
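For downstream code, a rename like this mostly comes down to updating import statements. Below is a minimal sketch, assuming only `import` and `from` lines need rewriting. `rename_imports` is a hypothetical helper and is not part of this commit. The GitHub organization in URLs remains `stanfordnlp`, so a blanket search-and-replace would corrupt those links.

```python
import re

# Hypothetical helper (not part of the repository): rewrite only the
# top-level package name on `import` / `from` lines.
_IMPORT_RE = re.compile(r"^([ \t]*(?:from|import)[ \t]+)stanfordnlp\b", re.MULTILINE)


def rename_imports(source: str) -> str:
    """Return `source` with `stanfordnlp` imports pointed at `stanza`."""
    return _IMPORT_RE.sub(r"\1stanza", source)


if __name__ == "__main__":
    print(rename_imports("from stanfordnlp.server import CoreNLPClient"))
```

Call sites such as `stanfordnlp.download(...)` are not touched by this sketch and would still need to be updated separately.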
14 changes: 7 additions & 7 deletions .travis.yml
@@ -10,12 +10,12 @@ install:
- wget $CORENLP_URL -O corenlp.zip
- unzip corenlp.zip
- mv $CORENLP_VERSION $CORENLP_HOME
- mkdir ~/stanfordnlp_test
- mkdir ~/stanfordnlp_test/in
- mkdir ~/stanfordnlp_test/out
- mkdir ~/stanfordnlp_test/scripts
- cp tests/data/external_server.properties ~/stanfordnlp_test/scripts
- cp tests/data/example_french.json ~/stanfordnlp_test/out
- export STANFORDNLP_TEST_HOME=~/stanfordnlp_test
- mkdir ~/stanza_test
- mkdir ~/stanza_test/in
- mkdir ~/stanza_test/out
- mkdir ~/stanza_test/scripts
- cp tests/data/external_server.properties ~/stanza_test/scripts
- cp tests/data/example_french.json ~/stanza_test/out
- export STANZA_TEST_HOME=~/stanza_test
script:
- python -m pytest -m travis tests/
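The directory setup in the install steps can be mirrored locally. Below is a minimal sketch, assuming a plain Python helper is acceptable in place of the `mkdir` calls. `make_test_home` is a hypothetical name and is not part of the repository. It does not copy the two fixture files and does not export `STANZA_TEST_HOME`.

```python
from pathlib import Path


def make_test_home(root="~/stanza_test"):
    """Create the in/out/scripts layout used by the CI install steps.

    Hypothetical helper: mirrors the `mkdir` lines only.
    """
    root = Path(root).expanduser()
    for sub in ("in", "out", "scripts"):
        # parents=True also creates the root directory itself
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

After calling it, the environment variable would still have to be set by the caller, for example with `export STANZA_TEST_HOME=~/stanza_test` as in the CI configuration.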
50 changes: 25 additions & 25 deletions README.md
@@ -1,10 +1,10 @@
# StanfordNLP: A Python NLP Library for Many Human Languages
# Stanza: A Python NLP Library for Many Human Languages

[![Travis Status](https://travis-ci.com/stanfordnlp/stanfordnlp.svg?token=RPNzRzNDQRoq2x3J2juj&branch=master)](https://travis-ci.com/stanfordnlp/stanfordnlp)
[![PyPI Version](https://img.shields.io/pypi/v/stanfordnlp.svg?colorB=blue)](https://pypi.org/project/stanfordnlp/)
![Python Versions](https://img.shields.io/pypi/pyversions/stanfordnlp.svg?colorB=blue)

The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server. For detailed information please visit our [official website](https://stanfordnlp.github.io/stanfordnlp/).
The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server. For detailed information please visit our [official website](https://stanfordnlp.github.io/stanza/).

### References

@@ -31,45 +31,45 @@ If you use the CoreNLP server, please cite the CoreNLP software package and the

## Issues and Usage Q&A

To ask questions, report issues or request features, please use the [GitHub Issue Tracker](https://github.com/stanfordnlp/stanfordnlp/issues).
To ask questions, report issues or request features, please use the [GitHub Issue Tracker](https://github.com/stanfordnlp/stanza/issues).

## Setup

StanfordNLP supports Python 3.6 or later. We strongly recommend that you install StanfordNLP from PyPI. If you already have [pip installed](https://pip.pypa.io/en/stable/installing/), simply run:
Stanza supports Python 3.6 or later. We strongly recommend that you install Stanza from PyPI. If you already have [pip installed](https://pip.pypa.io/en/stable/installing/), simply run:
```bash
pip install stanfordnlp
pip install stanza
```
this should also help resolve all of the dependencies of StanfordNLP, for instance [PyTorch](https://pytorch.org/) 1.0.0 or above.
this should also help resolve all of the dependencies of Stanza, for instance [PyTorch](https://pytorch.org/) 1.0.0 or above.

If you currently have a previous version of `stanfordnlp` installed, use:
If you currently have a previous version of `stanza` installed, use:
```bash
pip install stanfordnlp -U
pip install stanza -U
```

Alternatively, you can also install from source of this git repository, which will give you more flexibility in developing on top of StanfordNLP and training your own models. For this option, run
Alternatively, you can also install from source of this git repository, which will give you more flexibility in developing on top of Stanza and training your own models. For this option, run
```bash
git clone https://github.com/stanfordnlp/stanfordnlp.git
cd stanfordnlp
git clone https://github.com/stanfordnlp/stanza.git
cd stanza
pip install -e .
```

## Running StanfordNLP
## Running Stanza

### Getting Started with the neural pipeline

To run your first StanfordNLP pipeline, simply following these steps in your Python interactive interpreter:
To run your first Stanza pipeline, simply following these steps in your Python interactive interpreter:

```python
>>> import stanfordnlp
>>> stanfordnlp.download('en') # This downloads the English models for the neural pipeline
>>> import stanza
>>> stanza.download('en') # This downloads the English models for the neural pipeline
# IMPORTANT: The above line prompts you before downloading, which doesn't work well in a Jupyter notebook.
# To avoid a prompt when using notebooks, instead use: >>> stanfordnlp.download('en', force=True)
>>> nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
# To avoid a prompt when using notebooks, instead use: >>> stanza.download('en', force=True)
>>> nlp = stanza.Pipeline() # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()
```

The last command will print out the words in the first sentence in the input string (or `Document`, as it is represented in StanfordNLP), as well as the indices for the word that governs it in the Universal Dependencies parse of that sentence (its "head"), along with the dependency relation between the words. The output should look like:
The last command will print out the words in the first sentence in the input string (or `Document`, as it is represented in Stanza), as well as the indices for the word that governs it in the Universal Dependencies parse of that sentence (its "head"), along with the dependency relation between the words. The output should look like:

```
('Barack', '4', 'nsubj:pass')
@@ -83,13 +83,13 @@ The last command will print out the words in the first sentence in the input str

**Note:** If you are running into issues like `OSError: [Errno 22] Invalid argument`, it's very likely that you are affected by a [known Python issue](https://bugs.python.org/issue24658), and we would recommend Python 3.6.8 or later and Python 3.7.2 or later.

We also provide a multilingual [demo script](https://github.com/stanfordnlp/stanfordnlp/blob/master/demo/pipeline_demo.py) that demonstrates how one uses StanfordNLP in other languages than English, for example Chinese (traditional)
We also provide a multilingual [demo script](https://github.com/stanfordnlp/stanza/blob/master/demo/pipeline_demo.py) that demonstrates how one uses Stanza in other languages than English, for example Chinese (traditional)

```bash
python demo/pipeline_demo.py -l zh
```

See [our getting started guide](https://stanfordnlp.github.io/stanfordnlp/installation_usage.html#getting-started) for more details.
See [our getting started guide](https://stanfordnlp.github.io/stanza/installation_usage.html#getting-started) for more details.

### Access to Java Stanford CoreNLP Server

@@ -101,7 +101,7 @@ There are a few initial setup steps.
* Put the model jars in the distribution folder
* Tell the python code where Stanford CoreNLP is located: `export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05`

We provide another [demo script](https://github.com/stanfordnlp/stanfordnlp/blob/master/demo/corenlp.py) that shows how one can use the CoreNLP client and extract various annotations from it.
We provide another [demo script](https://github.com/stanfordnlp/stanza/blob/master/demo/corenlp.py) that shows how one can use the CoreNLP client and extract various annotations from it.

### Online Colab Notebooks

@@ -110,11 +110,11 @@ To get your started, we also provide interactive Jupyter notebooks in the `demo`
* Go to the [Google Colab website](https://colab.research.google.com)
* Navigate to `File` -> `Open notebook`, and choose `GitHub` in the pop-up menu
* Note that you do **not** need to give Colab access permission to your github account
* Type `stanfordnlp/stanfordnlp` in the search bar, and click enter
* Type `stanfordnlp/stanza` in the search bar, and click enter

### Trained Models for the Neural Pipeline

We currently provide models for all of the treebanks in the CoNLL 2018 Shared Task. You can find instructions for downloading and using these models [here](https://stanfordnlp.github.io/stanfordnlp/models.html).
We currently provide models for all of the treebanks in the CoNLL 2018 Shared Task. You can find instructions for downloading and using these models [here](https://stanfordnlp.github.io/stanza/models.html).

### Batching To Maximize Pipeline Speed

@@ -127,8 +127,8 @@ We are actively working on improving multi-document processing.

All neural modules in this library, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer and the dependency parser, can be trained with your own [CoNLL-U](https://universaldependencies.org/format.html) format data. Currently, we do not support model training via the `Pipeline` interface. Therefore, to train your own models, you need to clone this git repository and set up from source.

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our [training documentation](https://stanfordnlp.github.io/stanfordnlp/training.html).
For detailed step-by-step guidance on how to train and evaluate your own models, please visit our [training documentation](https://stanfordnlp.github.io/stanza/training.html).

## LICENSE

StanfordNLP is released under the Apache License, Version 2.0. See the [LICENSE](https://github.com/stanfordnlp/stanfordnlp/blob/master/LICENSE) file for more details.
Stanza is released under the Apache License, Version 2.0. See the [LICENSE](https://github.com/stanfordnlp/stanza/blob/master/LICENSE) file for more details.
2 changes: 1 addition & 1 deletion demo/corenlp.py
@@ -1,4 +1,4 @@
from stanfordnlp.server import CoreNLPClient
from stanza.server import CoreNLPClient

# example text
print('---')
10 changes: 5 additions & 5 deletions demo/pipeline_demo.py
@@ -6,14 +6,14 @@
import argparse
import os

import stanfordnlp
from stanfordnlp.utils.resources import DEFAULT_MODEL_DIR
import stanza
from stanza.utils.resources import DEFAULT_MODEL_DIR


if __name__ == '__main__':
    # get arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--models_dir', help='location of models files | default: ~/stanfordnlp_resources',
    parser.add_argument('-d', '--models_dir', help='location of models files | default: ~/stanza_resources',
                        default=DEFAULT_MODEL_DIR)
    parser.add_argument('-l', '--lang', help='Demo language',
                        default="en")
@@ -30,11 +30,11 @@
        sys.exit(1)

    # download the models
    stanfordnlp.download(args.lang, args.models_dir, confirm_if_exists=True)
    stanza.download(args.lang, args.models_dir, confirm_if_exists=True)
    # set up a pipeline
    print('---')
    print('Building pipeline...')
    pipeline = stanfordnlp.Pipeline(models_dir=args.models_dir, lang=args.lang, use_gpu=(not args.cpu))
    pipeline = stanza.Pipeline(models_dir=args.models_dir, lang=args.lang, use_gpu=(not args.cpu))
    # process the document
    doc = pipeline(example_sentences[args.lang])
    # access nlp annotations
2 changes: 1 addition & 1 deletion scripts/config.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
# Set environment variables for the training and testing of stanfordnlp modules.
# Set environment variables for the training and testing of stanza modules.

# Set UDBASE to the location of UD data folder
# The data should be CoNLL-U format
2 changes: 1 addition & 1 deletion scripts/lang2code.py
@@ -3,7 +3,7 @@
"""
import sys

from stanfordnlp.models.common.constant import lang2lcode
from stanza.models.common.constant import lang2lcode

if len(sys.argv) <= 1:
    raise Exception("Language name not provided.")
4 changes: 2 additions & 2 deletions scripts/prep_depparse_data.sh
@@ -48,15 +48,15 @@ elif [ $tag_type == 'predicted' ]; then
# run part-of-speech tagging on the train file
echo '---'
echo 'running part of speech model to generate predicted tags for train data'
train_cmd='python -m stanfordnlp.models.tagger --wordvec_dir '${WORDVEC_DIR}' --eval_file '${gold_train_file}' --gold_file '${gold_train_file}' --output_file '${train_in_file}' --lang '${original_short}' --shorthand '${original_short}' --batch_size '${batch_size}' --mode predict'
train_cmd='python -m stanza.models.tagger --wordvec_dir '${WORDVEC_DIR}' --eval_file '${gold_train_file}' --gold_file '${gold_train_file}' --output_file '${train_in_file}' --lang '${original_short}' --shorthand '${original_short}' --batch_size '${batch_size}' --mode predict'
echo ''
echo $train_cmd
echo ''
eval $train_cmd
# run part-of-speech tagging on the train file
echo '---'
echo 'running part of speech model to generate predicted tags for dev data'
dev_cmd='python -m stanfordnlp.models.tagger --wordvec_dir '${WORDVEC_DIR}' --eval_file '${gold_dev_file}' --gold_file '${gold_dev_file}' --output_file '${dev_in_file}' --lang '${original_short}' --shorthand '${original_short}' --batch_size '${batch_size}' --mode predict'
dev_cmd='python -m stanza.models.tagger --wordvec_dir '${WORDVEC_DIR}' --eval_file '${gold_dev_file}' --gold_file '${gold_dev_file}' --output_file '${dev_in_file}' --lang '${original_short}' --shorthand '${original_short}' --batch_size '${batch_size}' --mode predict'
echo ''
echo $dev_cmd
eval $dev_cmd
2 changes: 1 addition & 1 deletion scripts/prep_mwt_data.sh
@@ -37,7 +37,7 @@ fi

if [ -e $dev_conllu ]; then
echo "Preparing dev data..."
python stanfordnlp/utils/contract_mwt.py $dev_conllu $dev_in_file
python stanza/utils/contract_mwt.py $dev_conllu $dev_in_file
bash scripts/prep_tokenize_data.sh $src_treebank dev
else
touch $dev_in_file
6 changes: 3 additions & 3 deletions scripts/prep_ner_data.sh
@@ -23,17 +23,17 @@ test_json_file=$NER_DATA_DIR/${short}.test.json

# create json file if exists; otherwise create empty files
if [ -e $train_file ]; then
python stanfordnlp/utils/prepare_ner_data.py $train_file $train_json_file
python stanza/utils/prepare_ner_data.py $train_file $train_json_file
else
touch $train_json_file
fi
if [ -e $dev_file ]; then
python stanfordnlp/utils/prepare_ner_data.py $dev_file $dev_json_file
python stanza/utils/prepare_ner_data.py $dev_file $dev_json_file
else
touch $dev_json_file
fi
if [ -e $test_file ]; then
python stanfordnlp/utils/prepare_ner_data.py $test_file $test_json_file
python stanza/utils/prepare_ner_data.py $test_file $test_json_file
else
touch $test_json_file
fi
4 changes: 2 additions & 2 deletions scripts/prep_tokenize_data.sh
@@ -24,10 +24,10 @@ short=`bash scripts/treebank_to_shorthand.sh ud $treebank`

lang=`echo $short | sed -e 's#_.*##g'`
echo "Preparing tokenizer $dataset data..."
python stanfordnlp/utils/prepare_tokenizer_data.py $UDBASE/$treebank/${short}-ud-${dataset}.txt $UDBASE/$treebank/${short}-ud-${dataset}.conllu -o ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}.toklabels -m ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}-mwt.json
python stanza/utils/prepare_tokenizer_data.py $UDBASE/$treebank/${short}-ud-${dataset}.txt $UDBASE/$treebank/${short}-ud-${dataset}.conllu -o ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}.toklabels -m ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}-mwt.json
cp $UDBASE/$treebank/${short}-ud-${dataset}.conllu ${TOKENIZE_DATA_DIR}/${short}.${dataset}.gold.conllu
cp $UDBASE/$treebank/${short}-ud-${dataset}.txt ${TOKENIZE_DATA_DIR}/${short}.${dataset}.txt
# handle Vietnamese data
if [ $lang == "vi" ]; then
python stanfordnlp/utils/postprocess_vietnamese_tokenizer_data.py $UDBASE/$treebank/${short}-ud-${dataset}.txt --char_level_pred ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}.toklabels -o ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}.json
python stanza/utils/postprocess_vietnamese_tokenizer_data.py $UDBASE/$treebank/${short}-ud-${dataset}.txt --char_level_pred ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}.toklabels -o ${TOKENIZE_DATA_DIR}/${short}-ud-${dataset}.json
fi
6 changes: 3 additions & 3 deletions scripts/run_charlm.sh
@@ -21,9 +21,9 @@ dev_file=${CHARLM_DATA_DIR}/${lang}/${corpus_name}/dev.txt
test_file=${CHARLM_DATA_DIR}/${lang}/${corpus_name}/test.txt

echo "Running charlm for $lang:$corpus with $args..."
python -m stanfordnlp.models.charlm --train_dir $train_dir --eval_file $dev_file \
python -m stanza.models.charlm --train_dir $train_dir --eval_file $dev_file \
--direction $direction --lang $lang --shorthand $short --mode train $args
python -m stanfordnlp.models.charlm --eval_file $dev_file \
python -m stanza.models.charlm --eval_file $dev_file \
--direction $direction --lang $lang --shorthand $short --mode predict $args
python -m stanfordnlp.models.charlm --eval_file $test_file \
python -m stanza.models.charlm --eval_file $test_file \
--direction $direction --lang $lang --shorthand $short --mode predict $args
6 changes: 3 additions & 3 deletions scripts/run_depparse.sh
@@ -35,10 +35,10 @@ fi
echo "Using batch size $batch_size"

echo "Running parser with $args..."
python -m stanfordnlp.models.parser --wordvec_dir $WORDVEC_DIR --train_file $train_file --eval_file $eval_file \
python -m stanza.models.parser --wordvec_dir $WORDVEC_DIR --train_file $train_file --eval_file $eval_file \
--output_file $output_file --gold_file $gold_file --lang $lang --shorthand $short --batch_size $batch_size --mode train $args
python -m stanfordnlp.models.parser --wordvec_dir $WORDVEC_DIR --eval_file $eval_file \
python -m stanza.models.parser --wordvec_dir $WORDVEC_DIR --eval_file $eval_file \
--output_file $output_file --gold_file $gold_file --lang $lang --shorthand $short --mode predict $args
results=`python stanfordnlp/utils/conll18_ud_eval.py -v $gold_file $output_file | head -12 | tail -n+12 | awk '{print $7}'`
results=`python stanza/utils/conll18_ud_eval.py -v $gold_file $output_file | head -12 | tail -n+12 | awk '{print $7}'`
echo $results $args >> ${DEPPARSE_DATA_DIR}/${short}.results
echo $short $results $args
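The `head -12 | tail -n+12 | awk '{print $7}'` pipeline in `run_depparse.sh` keeps only line 12 of the evaluator output and prints its seventh whitespace-separated field. Below is a minimal Python sketch of the same selection. `field_from_line` is a hypothetical helper, and which metric sits on line 12 depends on the evaluator's table layout, which is assumed rather than shown here.

```python
def field_from_line(text: str, line_no: int = 12, field_no: int = 7) -> str:
    """Return field `field_no` of line `line_no` (both 1-based).

    Mirrors `head -N | tail -n+N | awk '{print $F}'`: a missing line or
    field yields an empty string instead of an error.
    """
    lines = text.splitlines()
    if len(lines) < line_no:
        return ""
    fields = lines[line_no - 1].split()
    if len(fields) < field_no:
        return ""
    return fields[field_no - 1]
```

Selecting by position is brittle: if the evaluator adds or reorders rows, the script silently records a different number, so matching on the row label would be a more robust design.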