LLaMA-rs

This project is a Rust port of llama.cpp 🦙🦀🚀

Just like its C++ counterpart, it is powered by the ggml tensor library, which allows running inference for Facebook's LLaMA model on a CPU with good performance using full precision, f16 or 4-bit quantized versions of the model.

Image by @darthdeus, using Stable Diffusion

Getting started

Make sure you have Rust 1.65.0 or above and a C toolchain¹ set up.
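A quick way to check these prerequisites is a small shell script along the lines of the sketch below. It is not part of this repository: the helper name `version_ge` is made up for illustration, and it assumes a `sort` that supports `-V` (as GNU coreutils does). The `cc` check is only a hint, since a C compiler may be installed under another name.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if the smaller of the two is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v rustc >/dev/null 2>&1; then
    # `rustc --version` prints "rustc <version> (<hash> <date>)"; take the second field.
    installed="$(rustc --version | awk '{print $2}')"
    if version_ge "$installed" "1.65.0"; then
        echo "rustc $installed satisfies the 1.65.0 minimum"
    else
        echo "rustc $installed is older than 1.65.0; consider running: rustup update"
    fi
else
    echo "rustc not found on PATH"
fi

if command -v cc >/dev/null 2>&1; then
    echo "found a C compiler at $(command -v cc)"
else
    echo "no 'cc' on PATH; a C compiler may still be installed under another name"
fi
```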

llm-base, gpt2, and llama are Rust libraries, while llm-cli is a CLI application that wraps gpt2 and llama and offers basic inference capabilities.

The following instructions explain how to build the CLI application.

NOTE: For best results, make sure to build and run in release mode. Debug builds are going to be very slow.

Building using `cargo`

Run

cargo install --git https://github.com/rustformers/llama-rs llm-cli

to install llm-cli to your Cargo bin directory, which rustup is likely to have added to your PATH.
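If the `llm-cli` command is not found after installing, the Cargo bin directory may be missing from your PATH. The sketch below checks for it; the helper name `path_contains` is made up for illustration, and the script assumes the default install location of `$CARGO_HOME/bin`, falling back to `$HOME/.cargo/bin` when `CARGO_HOME` is unset.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when directory $1 is an entry of the PATH-style string $2.
path_contains() {
    # Wrapping both sides in colons lets us match whole entries only.
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed default location of binaries installed by `cargo install`.
cargo_bin="${CARGO_HOME:-$HOME/.cargo}/bin"

if path_contains "$cargo_bin" "$PATH"; then
    echo "$cargo_bin is on PATH"
else
    echo "$cargo_bin is not on PATH; consider: export PATH=\"$cargo_bin:\$PATH\""
fi
```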

The CLI application can then be run through llm-cli.

Building from repository

Clone the repository and then build it with

git clone --recurse-submodules git@github.com:rustformers/llama-rs.git
cd llama-rs
cargo build --release

The resulting binary will be at target/release/llm-cli[.exe].

It can also be run directly through Cargo, using

cargo run --release --bin llm-cli -- <ARGS>

This is useful for development.

Getting LLaMA weights

To run the inference code in llama-rs, a copy of the model's weights is required.

From Hugging Face

Compatible weights - not necessarily the original LLaMA weights - can be found on Hugging Face by searching for GGML. At present, LLaMA-architecture models are supported.

LLaMA original weights

Currently, the only legal source to get the original weights is this repository. Note that the choice of words also may or may not hint at the existence of other kinds of sources.

After acquiring the weights, it is necessary to convert them into a format that is compatible with ggml. To achieve this, follow the steps outlined below:

Warning

To run the Python scripts, a Python version of 3.9 or 3.10 is required. 3.11 is unsupported at the time of writing.
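Before running the conversion script, it can help to confirm which Python version `python3` resolves to. The following is a minimal sketch, not something shipped with this repository; the helper name `python_supported` is made up for illustration and simply encodes the two versions named in the warning above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a "major.minor" string is 3.9 or 3.10.
python_supported() {
    case "$1" in
        3.9|3.10) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v python3 >/dev/null 2>&1; then
    # Ask the interpreter itself for its major.minor version.
    py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    if python_supported "$py_version"; then
        echo "python3 is $py_version, which the conversion scripts support"
    else
        echo "python3 is $py_version; the conversion scripts expect 3.9 or 3.10"
    fi
else
    echo "python3 not found on PATH"
fi
```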

# Convert the model to f16 ggml format
python3 scripts/convert-pth-to-ggml.py /path/to/your/models/7B/ 1

# Quantize the model to 4-bit ggml format
cargo run -p llm-cli quantize /path/to/your/models/7B/ggml-model-f16.bin /path/to/your/models/7B/ggml-model-q4_0.bin q4_0

Note

The llama.cpp repository has additional information on how to obtain and run specific models.

GPT2

OpenAI's GPT-2 architecture is also supported. The open-source family of Cerebras models is built on this architecture.

Support for other open-source models is planned. For models whose weights can be legally distributed, this section will be updated with scripts to make the install process as user-friendly as possible. Due to the model's legal requirements, this is currently not possible with LLaMA itself, and a lengthier setup is required.

Running

For example, try the following prompt:

llm-cli infer -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"

Some additional things to try:

Use --help to see a list of available options.

If you have the alpaca-lora weights, try repl mode!

llm-cli repl -m <path>/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt

Sessions can be loaded (--load-session) or saved (--save-session) to file. To automatically load and save the same session, use --persist-session. This can be used to cache prompts to reduce load time, too:

(This GIF shows an older version of the flags, but the mechanics are still the same.)

Docker

# To build (This will take some time, go grab some coffee):
docker build -t llama-rs .

# To run with prompt:
docker run --rm --name llama-rs -it -v ${PWD}/data:/data -v ${PWD}/examples:/examples llama-rs infer -m data/gpt4all-lora-quantized-ggml.bin -p "Tell me how cool the Rust programming language is:"

# To run with prompt file and repl (will wait for user input):
docker run --rm --name llama-rs -it -v ${PWD}/data:/data -v ${PWD}/examples:/examples llama-rs repl -m data/gpt4all-lora-quantized-ggml.bin -f examples/alpaca_prompt.txt

Q&A

Why did you do this?

It was not my choice. Ferris appeared to me in my dreams and asked me to rewrite this in the name of the Holy crab.

Seriously now.

Come on! I don't want to get into a flame war. You know how it goes, something something memory something something cargo is nice, don't make me say it, everybody knows this already.

I insist.

Sheesh! Okaaay. After seeing the huge potential for llama.cpp, the first thing I did was to see how hard it would be to turn it into a library to embed in my projects. I started digging into the code, and realized the heavy lifting is done by ggml (a C library, easy to bind to Rust) and the whole project was only about 2k lines of C++ code (not so easy to bind). After a couple of (failed) attempts to build an HTTP server into the tool, I realized I'd be much more productive if I just ported the code to Rust, where I'm more comfortable.

Is this the real reason?

Haha. Of course not. I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust.

How is this different from `llama.cpp`?

This is a reimplementation of llama.cpp that does not share any code with it outside of ggml. This was done for a variety of reasons:

llama.cpp requires a C++ compiler, which can cause problems for cross-compilation to more esoteric platforms. An example of such a platform is WebAssembly, which can require a non-standard compiler SDK.
Rust is easier to work with from a development and open-source perspective; it offers better tooling for writing "code in the large" with many other authors. Additionally, we can benefit from the larger Rust ecosystem with ease.
We would like to make ggml an optional backend (see this issue).

In general, we hope to build a solution for model inferencing that is as easy to use and deploy as any other Rust crate.

¹ A modern-ish C toolchain is required to compile ggml. A C++ toolchain should not be necessary.
