These applications showcase inference of a causal language model (LM). They intentionally expose few configuration options, encouraging the reader to explore and modify the source code. A corresponding Jupyter notebook discusses how to create an LLM-powered chatbot: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/254-llm-chatbot.
Note: This project is not for production use.
The `greedy_causal_lm` program loads a tokenizer, a detokenizer, and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model greedily generates token by token until the special end-of-sequence (EOS) token is obtained. The predicted tokens are converted to characters and printed in a streaming fashion.
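The sample itself is implemented in C++ on top of OpenVINO, but the decoding logic it describes is easy to picture. Below is a minimal, hypothetical Python sketch of greedy decoding with a toy stand-in for the model; `next_token_logits`, the vocabulary size, and the EOS id are made up for illustration and are not the sample's actual API:

```python
import numpy as np

VOCAB_SIZE = 16
EOS_TOKEN = 0  # hypothetical end-of-sequence id for this toy demo

def next_token_logits(token_ids):
    """Toy stand-in for the model's forward pass: deterministic pseudo-logits.
    In the real sample these logits come from an OpenVINO inference request."""
    rng = np.random.default_rng(sum(token_ids))
    return rng.normal(size=VOCAB_SIZE)

def greedy_generate(prompt_ids, max_new_tokens=10):
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)
        next_id = int(np.argmax(logits))  # greedy: always pick the most likely token
        if next_id == EOS_TOKEN:          # stop once EOS is produced
            break
        token_ids.append(next_id)
        # The real sample detokenizes and prints each token as soon as it is produced.
        print(next_id, end=" ", flush=True)
    print()
    return token_ids

greedy_generate([3, 7, 5])
```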
The `beam_search_causal_lm` program loads a tokenizer, a detokenizer, and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model predicts a distribution over the next tokens, and group beam search uses that distribution to explore possible sequences. The result is converted to characters and printed.
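For comparison, here is a simplified, hypothetical Python sketch of beam search over the same kind of toy model. The real sample implements group beam search (beams are split into groups with a diversity penalty); this sketch shows plain beam search only, and all names in it are illustrative:

```python
import numpy as np

VOCAB_SIZE = 16
EOS_TOKEN = 0  # hypothetical end-of-sequence id for this toy demo

def next_token_logits(token_ids):
    """Toy stand-in for the model's forward pass: deterministic pseudo-logits."""
    rng = np.random.default_rng(sum(token_ids) + len(token_ids))
    return rng.normal(size=VOCAB_SIZE)

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def beam_search(prompt_ids, beam_width=3, max_new_tokens=8):
    # Each beam is a pair: (token ids so far, cumulative log-probability).
    beams = [(list(prompt_ids), 0.0)]
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for ids, score in beams:
            logps = log_softmax(next_token_logits(ids))
            # Expand each live beam with its beam_width best continuations.
            for tok in np.argsort(logps)[-beam_width:]:
                candidates.append((ids + [int(tok)], score + float(logps[tok])))
        # Keep only the best beam_width candidates overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:beam_width]:
            # Beams that produced EOS are done; the rest stay live.
            (finished if ids[-1] == EOS_TOKEN else beams).append((ids, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)

for ids, score in beam_search([3, 7, 5]):
    print(f"{score:.3f}: {ids}")
```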
Install OpenVINO Runtime from an archive. `<INSTALL_DIR>` below refers to the extraction location.
Linux:

```sh
git submodule update --init
source <INSTALL_DIR>/setupvars.sh
cmake -DCMAKE_BUILD_TYPE=Release -S ./ -B ./build/ && cmake --build ./build/ -j
```
Windows:

```bat
git submodule update --init
<INSTALL_DIR>\setupvars.bat
cmake -S .\ -B .\build\ && cmake --build .\build\ --config Release -j
```
Supported models:

- chatglm (refer to "chatglm2-6b - AttributeError: can't set attribute" in case of `AttributeError`)
- LLaMA 2
  - https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-13b-hf
  - https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-7b-hf
  - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-70b-hf
- Llama2-7b-WhoIsHarryPotter
- OpenLLaMA
- TinyLlama
- Qwen
  - https://huggingface.co/Qwen/Qwen-7B-Chat
  - https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 (refer to "Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled" in case of `AssertionError`)
These pipelines can work with other similar topologies produced by `optimum-intel` with the same model signature (a sketch for inspecting the signature follows the conversion commands below).
The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
Linux:

```sh
source <INSTALL_DIR>/setupvars.sh
python3 -m pip install --upgrade-strategy eager "optimum>=1.14" -r ../../../llm_bench/python/requirements.txt ../../../thirdparty/openvino_contrib/modules/custom_operations/[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
python3 ../../../llm_bench/python/convert.py --model_id meta-llama/Llama-2-7b-hf --output_dir ./Llama-2-7b-hf/ --precision FP16 --stateful
convert_tokenizer ./Llama-2-7b-hf/pytorch/dldt/FP16/ --output ./Llama-2-7b-hf/pytorch/dldt/FP16/ --with-detokenizer --trust-remote-code
```
Windows:

```bat
<INSTALL_DIR>\setupvars.bat
python -m pip install --upgrade-strategy eager "optimum>=1.14" -r ..\..\..\llm_bench\python\requirements.txt ..\..\..\thirdparty\openvino_contrib\modules\custom_operations\[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
python ..\..\..\llm_bench\python\convert.py --model_id meta-llama/Llama-2-7b-hf --output_dir .\Llama-2-7b-hf\ --precision FP16 --stateful
convert_tokenizer .\Llama-2-7b-hf\pytorch\dldt\FP16\ --output .\Llama-2-7b-hf\pytorch\dldt\FP16\ --with-detokenizer --trust-remote-code
```
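To verify that a converted model exposes the expected signature, its inputs and outputs can be listed with OpenVINO's Python API. A small sketch, assuming the converted IR is saved as `openvino_model.xml` under the output directory used above:

```python
import openvino as ov

core = ov.Core()
# Path produced by the conversion step above; adjust if your output directory differs.
model = core.read_model("./Llama-2-7b-hf/pytorch/dldt/FP16/openvino_model.xml")

# Print every input and output with its name and (possibly dynamic) shape.
print("Inputs:")
for inp in model.inputs:
    print(f"  {inp.any_name}: {inp.partial_shape}")
print("Outputs:")
for out in model.outputs:
    print(f"  {out.any_name}: {out.partial_shape}")
```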
Usage:

```sh
greedy_causal_lm <MODEL_DIR> "<PROMPT>"
beam_search_causal_lm <MODEL_DIR> "<PROMPT>"
```

Examples:

```sh
./build/greedy_causal_lm ./Llama-2-7b-hf/pytorch/dldt/FP16/ "Why is the Sun yellow?"
./build/beam_search_causal_lm ./Llama-2-7b-hf/pytorch/dldt/FP16/ "Why is the Sun yellow?"
```
To enable Unicode characters in the Windows cmd console, open Region settings from Control Panel: Administrative -> Change system locale -> Beta: Use Unicode UTF-8 for worldwide language support -> OK, then reboot.