
llama.cpp chat example implementation #15

Merged (11 commits) on Apr 8, 2023

Conversation

@SagsMug (Contributor) commented Apr 3, 2023

This commit adds a port of llama.cpp's main function as an example.
It has finally reached a stage where it's readable enough for general usage and learning.
There are some differences from the original main since I wanted programmatic I/O.

Future work:

  • Implement a circular buffer
  • Context saving for chat resuming

On the first point: like the original, we just use a list and pop the first element.
Python's deque doesn't support slicing, and implementing a custom class seemed out of scope for an example.
It's not the slowest part anyway, since we're waiting for llama most of the time, so it's not a high priority.

We can say that that's left as an exercise for the reader 😋
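
A minimal sketch of that plain-list approach (illustrative only; this is not the exact code from the example):

# Keep at most max_tokens entries; evict from the front when the buffer is full.
# collections.deque(maxlen=...) would handle the eviction for us, but it doesn't
# support slicing, which the context-swapping code relies on.
max_tokens = 4
buf = []

for tok in [1, 2, 3, 4, 5, 6]:
    if len(buf) >= max_tokens:
        buf.pop(0)  # O(n) pop from the front; fine, since waiting on llama dominates
    buf.append(tok)

print(buf)       # [3, 4, 5, 6]
print(buf[-2:])  # slicing still works: [5, 6]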

On the second point, see #14.

This also solves #7, unless we also want a higher-level interactive mode.

@MillionthOdin16 (Contributor) commented:
Context saving for chat resuming

Awesome! I have a question about the context since you mentioned it. Using the llama.cpp main, I can have a continuous conversation with the model where the context is stored for up to 2048 tokens. The only time I lose the model's memory is when I terminate the executable. When I start the model again, it loads the prompt and I start fresh, just like the previous run.

Does your implementation of main still maintain context between individual messages? I ask because I was running into issues while testing. One of the big concerns I had while testing today was that I couldn't figure out why it wasn't maintaining context, and I definitely don't want to resend the full prompt and message history each time.

Other than that, I need to check in with llama.cpp to see where they are on state saving. I know lots of people are interested. Would def be a cool ability!

Instruction mode

I do think it would be a good idea to implement instruction mode before people get the example. I think it's one of the more heavily used flags between -i and -ins. Is there something that makes it difficult?

Thanks :)

@SagsMug (Contributor, Author) commented Apr 4, 2023

Does your implementation of main still maintain context between individual messages? I ask because I was running into issues while testing. One of the big concerns I had while testing today was that I couldn't figure out why it wasn't maintaining context, and I definitely don't want to resend the full prompt and message history each time.

This is the infinite text generation part of llama.cpp, and yes, it's implemented but untested.
EDIT:
It's now tested, fixed, and working.
See:

if len(self.embd) > 0:
    # infinite text generation via context swapping
    # if we run out of context:
    # - take the n_keep first tokens from the original prompt (via n_past)
    # - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch
    if (self.n_past + len(self.embd) > self.n_ctx):
        n_left = self.n_past - self.n_keep
        self.n_past = self.n_keep
        # insert n_left/2 tokens at the start of embd from last_n_tokens
        _insert = self.last_n_tokens[
            self.n_ctx - int(n_left/2) - len(self.embd):-len(self.embd)
        ]
        self.embd = _insert + self.embd

C++ confuses me sometimes
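
For anyone following along, here is a toy walk-through of that slice with made-up numbers (not values from the example):

# The context is full, so half of the non-kept tokens get recycled.
n_ctx, n_keep, n_past = 8, 2, 8
embd = [107, 108]  # tokens waiting to be evaluated
last_n_tokens = [101, 102, 103, 104, 105, 106, 107, 108]  # last n_ctx tokens seen

if n_past + len(embd) > n_ctx:
    n_left = n_past - n_keep  # 6 tokens are eligible to be dropped or recycled
    n_past = n_keep           # keep only the first n_keep prompt tokens
    # take the n_left/2 tokens that sit right before embd in last_n_tokens
    _insert = last_n_tokens[n_ctx - int(n_left / 2) - len(embd):-len(embd)]
    embd = _insert + embd

print(n_past, embd)  # 2 [104, 105, 106, 107, 108] -> 2 + 5 tokens fit in n_ctx = 8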

Other than that, I need to check in with llama.cpp to see where they are on state saving. I know lots of people are interested. Would def be a cool ability!

Definitely, that's also why I made the issue: I thought I could have state saving, but alas, it wasn't meant to be.

I do think it would be a good idea to implement instruction mode before people get the example. I think it's one of the more heavily used flags between -i and -ins. Is there something that makes it difficult?

It's now implemented via the instruct=True flag; there aren't any command-line flags right now.
One note: you should feed the initial prompt with some text that shows the model how to return to instruction mode, but you probably do that anyway for normal llama.
Otherwise it will infinitely generate nonsense, and you will have to implement a KeyboardInterrupt catcher to stop generation, and I will be sad.
EDIT:
Okay, I implemented a KeyboardInterrupt catcher.
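
Roughly along these lines (a standalone sketch; generate_tokens below is only a stand-in for the example's real generation loop):

import time

def generate_tokens():
    # Stand-in for the token stream coming out of llama.cpp (hypothetical).
    while True:
        time.sleep(0.1)
        yield "token "

try:
    for piece in generate_tokens():
        print(piece, end="", flush=True)
except KeyboardInterrupt:
    # Ctrl-C stops generation without killing the script,
    # so control can return to the user's input prompt.
    print("\n[interrupted]")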

@abetlen (Owner) commented Apr 6, 2023

Good progress so far. I would like to include this in the examples; I just request that we follow the llama.cpp examples more closely. Specifically, could you instead store and load parameters using something like the below:

import os
import argparse

from dataclasses import dataclass, field
from typing import List, Optional

# Based on https://github.com/ggerganov/llama.cpp/blob/master/examples/common.cpp


@dataclass
class GptParams:
    seed: int = -1
    n_threads: int = min(4, os.cpu_count() or 1)
    n_predict: int = 128
    repeat_last_n: int = 64
    n_parts: int = -1
    n_ctx: int = 512
    n_batch: int = 8
    n_keep: int = 0

    top_k: int = 40
    top_p: float = 0.95
    temp: float = 0.80
    repeat_penalty: float = 1.10

    model: str = "models/lamma-7B/ggml-model.bin"
    prompt: str = ""
    input_prefix: str = ""

    antiprompt: List[str] = field(default_factory=list)

    memory_f16: bool = True
    random_prompt: bool = False
    use_color: bool = False
    interactive: bool = False

    embedding: bool = False
    interactive_start: bool = False

    instruct: bool = False
    ignore_eos: bool = False
    perplexity: bool = False
    use_mlock: bool = False
    mem_test: bool = False
    verbose_prompt: bool = False


def gpt_params_parse(argv, params: Optional[GptParams] = None):
    if params is None:
        params = GptParams()

    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--seed", type=int, default=-1, help="")
    parser.add_argument("-t", "--threads", type=int, default=1, help="")
    parser.add_argument("-p", "--prompt", type=str, default="", help="")
    parser.add_argument("-f", "--file", type=str, default=None, help="")
    parser.add_argument("-c", "--context_size", type=int, default=512, help="")
    parser.add_argument("--memory_f32", action="store_true", help="")
    parser.add_argument("--top_p", type=float, default=0.9, help="")
    parser.add_argument("--temp", type=float, default=1.0, help="")
    parser.add_argument("--repeat_last_n", type=int, default=64, help="")
    parser.add_argument("--repeat_penalty", type=float, default=1.0, help="")
    parser.add_argument("-b", "--batch_size", type=int, default=8, help="")
    parser.add_argument("-m", "--model", type=str, help="")
    parser.add_argument(
        "-i", "--interactive", action="store_true", help="run in interactive mode"
    )
    parser.add_argument("--embedding", action="store_true", help="")
    parser.add_argument("--interactive-start", action="store_true", help="")
    parser.add_argument(
        "--interactive-first",
        action="store_true",
        help="run in interactive mode and wait for input right away",
    )
    parser.add_argument(
        "-ins",
        "--instruct",
        action="store_true",
        help="run in instruction mode (use with Alpaca models)",
    )
    parser.add_argument(
        "--color",
        action="store_true",
        help="colorise output to distinguish prompt and user input from generations",
    )
    parser.add_argument("--mlock", action="store_true")
    parser.add_argument("--mtest", action="store_true")
    parser.add_argument(
        "-r",
        "--reverse-prompt",
        type=str,
        default="",
        help="run in interactive mode and poll user input upon seeing PROMPT (can be\nspecified more than once for multiple prompts).",
    )
    parser.add_argument("--perplexity", action="store_true", help="")
    parser.add_argument("--ignore-eos", action="store_true", help="")
    parser.add_argument("--n_parts", type=int, default=-1, help="")
    parser.add_argument("--random-prompt", action="store_true", help="")
    parser.add_argument("--in-prefix", type=str, default="", help="")
    args = parser.parse_args(argv)
    return args

Ideally, place this into an examples/low_level_api/common.py file to follow how main.cpp works.
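
A hypothetical way the example script could consume it, assuming the GptParams and gpt_params_parse definitions above (the field mapping here is only a sketch, not part of the suggestion):

import sys

# Parse CLI flags with the helper, then copy the ones the chat example needs
# onto a GptParams instance (attribute names follow the argparse dests above).
args = gpt_params_parse(sys.argv[1:])

params = GptParams(
    seed=args.seed,
    n_threads=args.threads,
    n_ctx=args.context_size,
    n_batch=args.batch_size,
    top_p=args.top_p,
    temp=args.temp,
    repeat_last_n=args.repeat_last_n,
    repeat_penalty=args.repeat_penalty,
    model=args.model or GptParams.model,
    prompt=args.prompt,
    interactive=args.interactive,
    instruct=args.instruct,
)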

Mug added 2 commits on April 6, 2023 at 15:30:

  • Has some too-many-newlines issues, so WIP
  • Still needs shipping work so you could do "python -m llama_cpp.examples." etc.
@abetlen (Owner) left a review comment on examples/common.py (outdated, resolved):

Can you move this common.py file to just the low_level_api folder and remove the __init__.py files?

@SagsMug (Contributor, Author) commented Apr 7, 2023

Can you move this common.py file to just the low_level_api folder and remove the __init__.py files?

Done!
It should now very closely match the original implementation, arguments and all, while retaining its more library-like design.
Tested with Miku.sh from the original repo.

@abetlen merged commit 41365b0 into abetlen:main on Apr 8, 2023
@abetlen (Owner) commented Apr 8, 2023

@SagsMug thank you!

xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this pull request Jun 13, 2023