update readme to support bilingual.
mymagicpower committed Mar 21, 2023
1 parent cda4bd6 commit 6e20414
Showing 59 changed files with 2,145 additions and 3,270 deletions.
97 changes: 38 additions & 59 deletions 2_nlp_sdks/embedding/sentence_encoder_100_sdk/README.md
@@ -1,75 +1,77 @@
### Official website:
[Official website link](http://www.aias.top/)

### 下载模型,放置于models目录
### Download the model and put it in the models directory
- Link: https://github.com/mymagicpower/AIAS/releases/download/apps/paraphrase-xlm-r-multilingual-v1.zip

### 句向量SDK【支持100种语言】
句向量是指将语句映射至固定维度的实数向量。
将不定长的句子用定长的向量表示,为NLP下游任务提供服务。
### Sentence Vector SDK [Supports 100 languages]

- 支持下面100种语言:
A sentence vector maps a sentence to a fixed-dimensional real-valued vector.
Representing variable-length sentences as fixed-length vectors makes them usable by downstream NLP tasks.

- Supports the following 100 languages:
![img](https://aias-home.oss-cn-beijing.aliyuncs.com/AIAS/nlp_sdks/languages_100.jpeg)

- 句向量
- Sentence Vector
![img](https://aias-home.oss-cn-beijing.aliyuncs.com/AIAS/nlp_sdks/Universal-Sentence-Encoder.png)

-

句向量应用:
- 语义搜索,通过句向量相似性,检索语料库中与query最匹配的文本
- 文本聚类,文本转为定长向量,通过聚类模型可无监督聚集相似文本
- 文本分类,表示成句向量,直接用简单分类器即训练文本分类器
Sentence Vector Applications:

- Semantic search: retrieve the text in a corpus that best matches the query, based on sentence vector similarity.
- Text clustering: convert text to fixed-length vectors and group similar texts without supervision using a clustering model.
- Text classification: represent text as sentence vectors and train a simple classifier directly on them.
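
The semantic search use case can be sketched as follows. This is a minimal illustration in plain Python, not the SDK's implementation: the `search` helper, the corpus texts, and the 3-dimensional vectors are all made up for the example (the real model produces 768-dimensional vectors).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(
    query_vec: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: rank corpus texts by similarity to the query vector."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
corpus = {
    "weather report": [0.9, 0.1, 0.0],
    "stock prices": [0.0, 0.2, 0.9],
    "sunny day": [0.8, 0.3, 0.1],
}
results = search([1.0, 0.2, 0.0], corpus, top_k=2)
for text, score in results:
    print(f"{text}: {score:.4f}")
```

For a large corpus, a linear scan like this is usually replaced by a vector index, but the ranking criterion stays the same.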

### SDK Functionality:

- Sentence vector extraction
- Similarity (cosine) calculation
- max_seq_length: 128 (counted in subword tokens; for English sentences this is roughly 60 words on average)
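
For reference, cosine similarity between two vectors can be computed as in the sketch below. It is written in plain Python for illustration and is not taken from the SDK; the function name and the toy vectors are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, made up for illustration.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing embeddings.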

### SDK功能:
- 句向量提取
- 相似度(余弦)计算
- max_seq_length: 128(subword切词,如果是英文句子,上限平均大约60个单词)
### Running example - SentenceEncoderExample

After running successfully, you should see the following information on the command line:

#### 运行例子 - SentenceEncoderExample
运行成功后,命令行应该看到下面的信息:
```text
...
# 测试语句:
# 英文一组
# Test sentences:
# A set of English sentences
[INFO ] - input Sentence1: This model generates embeddings for input sentence
[INFO ] - input Sentence2: This model generates embeddings
# 中文一组
# A set of Chinese sentences
[INFO ] - input Sentence3: 今天天气不错
[INFO ] - input Sentence4: 今天风和日丽
# 向量维度:
# Vector dimensions:
[INFO ] - Vector dimensions: 768
# 英文 - 生成向量:
# English - Generate vectors:
[INFO ] - Sentence1 embeddings: [0.10717804, 0.0023716218, ..., -0.087652676, 0.5144994]
[INFO ] - Sentence2 embeddings: [0.06960095, 0.09246655, ..., -0.06324193, 0.2669841]
#计算英文相似度:
[INFO ] - 英文 Similarity: 0.84808713
# Calculate English similarity:
[INFO ] - Similarity: 0.84808713
# 中文 - 生成向量:
# Chinese - Generate vectors:
[INFO ] - Sentence1 embeddings: [0.19896796, 0.46568888,..., 0.09489663, 0.19511698]
[INFO ] - Sentence2 embeddings: [0.1639189, 0.43350196, ..., -0.025053274, -0.121924624]
#计算中文相似度:
#由于使用了sentencepiece切词器,中文切词更准确,比15种语言的模型(只切成字,没有考虑词)精度更好。
[INFO ] - 中文 Similarity: 0.67201
# Calculate Chinese Similarity:
# Because this model uses the sentencepiece tokenizer, Chinese segmentation is more accurate, and precision is better than with the 15-language model (which only splits text into characters and does not consider words).
[INFO ] - Similarity: 0.67201
```

### 开源算法
#### 1. sdk使用的开源算法
### Open source algorithm
#### 1. Open source algorithms used by the SDK
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [预训练模型](https://www.sbert.net/docs/pretrained_models.html)
- [安装](https://www.sbert.net/docs/installation.html)
- [Pre-trained models](https://www.sbert.net/docs/pretrained_models.html)
- [Installation](https://www.sbert.net/docs/installation.html)


#### 2. 模型如何导出 ?
#### 2. How to export the model?
- [how_to_convert_your_model_to_torchscript](http://docs.djl.ai/docs/pytorch/how_to_convert_your_model_to_torchscript.html)

- 导出CPU模型(pytorch 模型特殊,CPU&GPU模型不通用。所以CPU,GPU需要分别导出)
- Export the model separately for each device (PyTorch models are special: CPU and GPU models are not interchangeable, so each has to be exported on its own)
- device = torch.device("cpu")
- device = torch.device("cuda")
- export_model_100.py
@@ -94,26 +96,3 @@ input_features = {'input_ids': input_ids, 'attention_mask': input_mask}
traced_model = torch.jit.trace(model, example_inputs=input_features,strict=False)
traced_model.save("models/paraphrase-xlm-r-multilingual-v1/paraphrase-xlm-r-multilingual-v1.pt")
```



### Other help information
http://aias.top/guides.html


### Git address:
[Github link](https://github.com/mymagicpower/AIAS)
[Gitee link](https://gitee.com/mymagicpower/AIAS)



#### Help documentation:
- http://aias.top/guides.html
- 1. Performance optimization FAQ:
- http://aias.top/AIAS/guides/performance.html
- 2. Engine configuration (including online auto-loading for CPU and GPU, and local configuration):
- http://aias.top/AIAS/guides/engine_config.html
- 3. Model loading methods (online auto-loading and local configuration):
- http://aias.top/AIAS/guides/load_model.html
- 4. Windows environment FAQ:
- http://aias.top/AIAS/guides/windows.html
90 changes: 34 additions & 56 deletions 2_nlp_sdks/embedding/sentence_encoder_15_sdk/README.md
@@ -1,72 +1,72 @@
### Official website:
[Official website link](http://www.aias.top/)

### 下载模型,放置于models目录
- 链接: https://github.com/mymagicpower/AIAS/releases/download/apps/distiluse-base-multilingual-cased-v1.zip
### Download the model and place it in the models directory
- Link: https://github.com/mymagicpower/AIAS/releases/download/apps/distiluse-base-multilingual-cased-v1.zip

### 句向量SDK【支持15种语言】
句向量是指将语句映射至固定维度的实数向量。
将不定长的句子用定长的向量表示,为NLP下游任务提供服务。
支持 15 种语言:
### Sentence Vector SDK [Supports 15 languages]
A sentence vector maps a sentence to a fixed-dimensional real-valued vector.
Representing variable-length sentences as fixed-length vectors makes them usable by downstream NLP tasks.
Supports 15 languages:
Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
- 句向量

- Sentence vector
![img](https://aias-home.oss-cn-beijing.aliyuncs.com/AIAS/nlp_sdks/Universal-Sentence-Encoder.png)


句向量应用:
- 语义搜索,通过句向量相似性,检索语料库中与query最匹配的文本
- 文本聚类,文本转为定长向量,通过聚类模型可无监督聚集相似文本
- 文本分类,表示成句向量,直接用简单分类器即训练文本分类器
Sentence vector applications:

- Semantic search: retrieve the text in the corpus that best matches the query, based on sentence vector similarity.
- Text clustering: convert text to fixed-length vectors and group similar texts without supervision using a clustering model.
- Text classification: represent text as sentence vectors and train a simple classifier directly on them.
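
The clustering use case can be sketched as follows. Note that this sketch swaps in a simple greedy, threshold-based grouping rather than a real clustering model such as k-means; the `greedy_cluster` helper, the threshold value, and the 2-dimensional vectors are assumptions made up for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def greedy_cluster(
    vectors: List[Sequence[float]], threshold: float = 0.9
) -> List[List[int]]:
    """Put each vector into the first cluster whose seed vector is similar enough.

    Returns clusters as lists of indices into `vectors`.
    """
    clusters: List[List[int]] = []
    for idx, vec in enumerate(vectors):
        for members in clusters:
            seed = vectors[members[0]]  # the first member acts as the cluster seed
            if cosine(vec, seed) >= threshold:
                members.append(idx)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([idx])
    return clusters


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
vectors = [
    [1.0, 0.0],  # index 0
    [0.0, 1.0],  # index 1
    [0.9, 0.1],  # index 2, close to index 0
    [0.1, 0.9],  # index 3, close to index 1
]
clusters = greedy_cluster(vectors, threshold=0.9)
```

The result depends on input order and on the threshold, which is why a proper clustering model is usually preferred on real data.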

### SDK functions:

### SDK功能:
- 句向量提取
- 相似度(余弦)计算
- Sentence vector extraction
- Similarity (cosine) calculation

### Running example - SentenceEncoderExample

#### 运行例子 - SentenceEncoderExample
运行成功后,命令行应该看到下面的信息:
After running successfully, you should see the following information on the command line:
```text
...
# 测试语句:
# 英文一组
# Test sentences:
# A set of English sentences
[INFO ] - input Sentence1: This model generates embeddings for input sentence
[INFO ] - input Sentence2: This model generates embeddings
# 中文一组
# A set of Chinese sentences
[INFO ] - input Sentence3: 今天天气不错
[INFO ] - input Sentence4: 今天风和日丽
# 向量维度:
# Vector dimensions:
[INFO ] - Vector dimensions: 512
# 英文 - 生成向量:
# English - Generated vectors:
[INFO ] - Sentence1 embeddings: [-0.07397884, 0.023079528, ..., -0.028247012, -0.08646198]
[INFO ] - Sentence2 embeddings: [-0.084004365, -0.021871908, ..., -0.039803937, -0.090846084]
#计算英文相似度:
[INFO ] - 英文 Similarity: 0.77445346
# Calculating English similarity:
[INFO ] - Similarity: 0.77445346
# 中文 - 生成向量:
# Chinese - Generated vectors:
[INFO ] - Sentence1 embeddings: [0.012180057, -0.035749275, ..., 0.0208446, -0.048238125]
[INFO ] - Sentence2 embeddings: [0.016560446, -0.03528302, ..., 0.023508975, -0.046362665]
#计算中文相似度:
# Calculating Chinese similarity:
[INFO ] - 中文 Similarity: 0.9972926
```

### 开源算法
#### 1. sdk使用的开源算法
### Open source algorithm
#### 1. Open source algorithms used by the SDK
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [预训练模型](https://www.sbert.net/docs/pretrained_models.html)
- [安装](https://www.sbert.net/docs/installation.html)
- [Pre-trained models](https://www.sbert.net/docs/pretrained_models.html)
- [Installation](https://www.sbert.net/docs/installation.html)


#### 2. 模型如何导出 ?
#### 2. How to export the model?
- [how_to_convert_your_model_to_torchscript](http://docs.djl.ai/docs/pytorch/how_to_convert_your_model_to_torchscript.html)

- 导出CPU模型(pytorch 模型特殊,CPU&GPU模型不通用。所以CPU,GPU需要分别导出)
- Export the model separately for each device (PyTorch models are special: CPU and GPU models are not interchangeable, so each has to be exported on its own)
- device = torch.device("cpu")
- device = torch.device("cuda")
- export_model_15.py
@@ -92,25 +92,3 @@ traced_model = torch.jit.trace(model, example_inputs=input_features,strict=False
traced_model.save("models/distiluse-base-multilingual-cased-v1/distiluse-base-multilingual-cased-v1.pt")
```



### Other help information
http://aias.top/guides.html


### Git address:
[Github link](https://github.com/mymagicpower/AIAS)
[Gitee link](https://gitee.com/mymagicpower/AIAS)


#### Help documentation:
- http://aias.top/guides.html
- 1. Performance optimization FAQ:
- http://aias.top/AIAS/guides/performance.html
- 2. Engine configuration (including online auto-loading for CPU and GPU, and local configuration):
- http://aias.top/AIAS/guides/engine_config.html
- 3. Model loading methods (online auto-loading and local configuration):
- http://aias.top/AIAS/guides/load_model.html
- 4. Windows environment FAQ:
- http://aias.top/AIAS/guides/windows.html

78 changes: 30 additions & 48 deletions 2_nlp_sdks/embedding/sentence_encoder_en_sdk/README.md
@@ -1,60 +1,60 @@
### Official website:
[Official website link](http://www.aias.top/)

### 下载模型,放置于models目录
- 链接: https://github.com/mymagicpower/AIAS/releases/download/apps/paraphrase-MiniLM-L6-v2.zip
### Download the model and place it in the models directory
- Link: https://github.com/mymagicpower/AIAS/releases/download/apps/paraphrase-MiniLM-L6-v2.zip

### 轻量句向量SDK【英文】
句向量是指将语句映射至固定维度的实数向量。
将不定长的句子用定长的向量表示,为NLP下游任务提供服务。
### Lightweight sentence vector SDK [English]

- 句向量
A sentence vector maps a sentence to a fixed-dimensional real-valued vector.
Representing variable-length sentences as fixed-length vectors makes them usable by downstream NLP tasks.

- Sentence vector
![img](https://aias-home.oss-cn-beijing.aliyuncs.com/AIAS/nlp_sdks/Universal-Sentence-Encoder.png)


句向量应用:
- 语义搜索,通过句向量相似性,检索语料库中与query最匹配的文本
- 文本聚类,文本转为定长向量,通过聚类模型可无监督聚集相似文本
- 文本分类,表示成句向量,直接用简单分类器即训练文本分类器
Applications of sentence vectors:
- Semantic search: retrieve the text in the corpus that best matches the query, based on sentence vector similarity.
- Text clustering: convert text to fixed-length vectors and group similar texts without supervision using a clustering model.
- Text classification: represent text as sentence vectors and train a simple classifier directly on them.
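
The classification use case can be sketched with a nearest-centroid classifier, one possible choice of "simple classifier". This is an illustrative sketch in plain Python, not part of the SDK: the `fit_centroids` and `predict` helpers, the labels, and the 2-dimensional vectors are all made up for the example (the real model produces 384-dimensional vectors).

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def fit_centroids(
    samples: List[Tuple[Sequence[float], str]]
) -> Dict[str, List[float]]:
    """Average the vectors of each label into one centroid per label."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for vec, label in samples:
        grouped[label].append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy labeled vectors standing in for sentence embeddings (assumption for illustration).
training = [
    ([0.9, 0.1], "weather"),
    ([0.8, 0.2], "weather"),
    ([0.1, 0.9], "finance"),
    ([0.2, 0.8], "finance"),
]
centroids = fit_centroids(training)
label = predict([0.7, 0.3], centroids)
```

Because the sentence vectors already encode meaning, even a classifier this small can separate classes; logistic regression or an SVM on the same vectors is a common next step.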

### SDK functions:

### SDK功能:
- 句向量提取
- 相似度计算
- Sentence vector extraction
- Similarity calculation

#### 运行例子 - SentenceEncoderExample
运行成功后,命令行应该看到下面的信息:
#### Running example - SentenceEncoderExample
After running successfully, you should see the following information on the command line:
```text
...
# 测试语句:
# Test sentences:
[INFO ] - input Sentence1: This model generates embeddings for input sentence
[INFO ] - input Sentence2: This model generates embeddings
# 向量维度:
# Vector dimensions:
[INFO ] - Vector dimensions: 384
# 生成向量:
# Generate vectors:
[INFO ] - Sentence1 embeddings: [-0.14147712, -0.025930656, -0.18829542,..., -0.11860573, -0.13064586]
[INFO ] - Sentence2 embeddings: [-0.43392915, -0.23374224, -0.12924, ..., 0.0916177, 0.080070406]
#计算相似度:
# Calculate Similarity:
[INFO ] - Similarity: 0.7306041
```


### 开源算法
#### 1. sdk使用的开源算法
### Open source algorithm
#### 1. Open source algorithms used by the SDK
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [预训练模型](https://www.sbert.net/docs/pretrained_models.html)
- [安装](https://www.sbert.net/docs/installation.html)
- [Pre-trained models](https://www.sbert.net/docs/pretrained_models.html)
- [Installation](https://www.sbert.net/docs/installation.html)


#### 2. 模型如何导出 ?
#### 2. How to export the model?
- [how_to_convert_your_model_to_torchscript](http://docs.djl.ai/docs/pytorch/how_to_convert_your_model_to_torchscript.html)

- 导出CPU模型(pytorch 模型特殊,CPU&GPU模型不通用。所以CPU,GPU需要分别导出)
- device='cpu'
- device='gpu'
- Export the model separately for each device (PyTorch models are special: CPU and GPU models are not interchangeable, so each has to be exported on its own)
- device = torch.device("cpu")
- device = torch.device("cuda")
- export_model.py
```text
from sentence_transformers import SentenceTransformer
@@ -76,22 +76,4 @@ input_features = {'input_ids': input_ids, 'token_type_ids': input_type_ids, 'att
# traced_model = torch.jit.trace(model, example_inputs=input_features)
traced_model = torch.jit.trace(model, example_inputs=input_features,strict=False)
traced_model.save("traced_st_model.pt")
```



### Git address:
[Github link](https://github.com/mymagicpower/AIAS)
[Gitee link](https://gitee.com/mymagicpower/AIAS)


#### Help documentation:
- http://aias.top/guides.html
- 1. Performance optimization FAQ:
- http://aias.top/AIAS/guides/performance.html
- 2. Engine configuration (including online auto-loading for CPU and GPU, and local configuration):
- http://aias.top/AIAS/guides/engine_config.html
- 3. Model loading methods (online auto-loading and local configuration):
- http://aias.top/AIAS/guides/load_model.html
- 4. Windows environment FAQ:
- http://aias.top/AIAS/guides/windows.html
```