OpenVINO™ GenAI is a flavor of OpenVINO™, aiming to simplify running inference of generative AI models. It hides the complexity of the generation process and minimizes the amount of code required.
The OpenVINO™ GenAI flavor is available for installation via Archive and PyPI distributions. To install OpenVINO™ GenAI, refer to the Install Guide.
To build OpenVINO™ GenAI library from source, refer to the Build Instructions.
-
Installed OpenVINO™ GenAI
If OpenVINO GenAI is installed via archive distribution or built from source, you will need to install additional python dependencies (e.g.
optimum-cli
for simplified model downloading and exporting, it's not required to install ./samples/requirements.txt for deployment if the model has already been exported):# (Optional) Clone OpenVINO GenAI repository if it does not exist git clone --recursive https://github.com/openvinotoolkit/openvino.genai.git cd openvino.genai # Install python dependencies python -m pip install ./thirdparty/openvino_tokenizers/[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release python -m pip install --upgrade-strategy eager -r ./samples/requirements.txt
-
A model in OpenVINO IR format
Download and convert a model with
optimum-cli
:optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
LLMPipeline
is the main object used for decoding. You can construct it straight away from the folder with the converted model. It will automatically load the main model, tokenizer, detokenizer and default generation configuration.
A simple example:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
print(pipe.generate("The Sun is yellow because"))
Calling generate with custom generation config parameters, e.g. config for grouped beam search:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
result = pipe.generate("The Sun is yellow because", max_new_tokens=30, num_beam_groups=3, num_beams=15, diversity_penalty=1.5)
print(result)
output:
'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in'
A simple chat in Python:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path)
config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)
pipe.start_chat()
while True:
print('question:')
prompt = input()
if prompt == 'Stop!':
break
print(pipe(prompt))
pipe.finish_chat()
Test to compare with Huggingface outputs
A simple example:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
std::cout << pipe.generate("The Sun is yellow because");
}
Using group beam search decoding:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
ov::genai::GenerationConfig config;
config.max_new_tokens = 256;
config.num_beam_groups = 3;
config.num_beams = 15;
config.diversity_penalty = 1.0f;
std::cout << pipe.generate("The Sun is yellow because", config);
}
A simple chat in C++ using grouped beam search decoding:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string prompt;
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
config.num_beam_groups = 3;
config.num_beams = 15;
config.diversity_penalty = 1.0f;
pipe.start_chat();
for (;;;) {
std::cout << "question:\n";
std::getline(std::cin, prompt);
if (prompt == "Stop!")
break;
std::cout << "answer:\n";
auto answer = pipe(prompt, config);
std::cout << answer << std::endl;
}
pipe.finish_chat();
}
Streaming example with lambda function:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
auto streamer = [](std::string word) {
std::cout << word << std::flush;
// Return flag corresponds whether generation should be stopped.
// false means continue generation.
return false;
};
std::cout << pipe.generate("The Sun is yellow bacause", ov::genai::streamer(streamer));
}
Streaming with a custom class:
#include "openvino/genai/streamer_base.hpp"
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
class CustomStreamer: public ov::genai::StreamerBase {
public:
bool put(int64_t token) {
bool stop_flag = false;
/*
custom decoding/tokens processing code
tokens_cache.push_back(token);
std::string text = m_tokenizer.decode(tokens_cache);
...
*/
return stop_flag; // flag whether generation should be stoped, if true generation stops.
};
void end() {
/* custom finalization */
};
};
int main(int argc, char* argv[]) {
CustomStreamer custom_streamer;
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
std::cout << pipe.generate("The Sun is yellow because", ov::genai::streamer(custom_streamer));
}
For information on how OpenVINO™ GenAI works, refer to the How It Works Section.
For a list of supported models, refer to the Supported Models Section.