
llama.cpp chat example implementation #15

Merged (11 commits) on Apr 8, 2023

Conversation

@SagsMug (Contributor) commented Apr 3, 2023

This commit adds a port of llama.cpp's main function as an example.
It has finally reached a stage where it's readable enough for general usage and learning.
There are some differences from the original main since I wanted programmatic I/O.

Future work:

  • Implement a circular buffer
  • Context saving for chat resuming

On the first point: like the original, we just use a list and pop the first element.
Python's deque doesn't support slicing, and implementing a custom class seemed out of scope for an example.
It's not the slowest part anyway, since we're waiting for llama most of the time, so it's not a high priority.

We can say that that's left as an exercise for the reader 😋
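
A minimal sketch of that plain-list approach (illustrative only; this is not the exact code from the example):

# Keep at most max_tokens entries; evict from the front when the buffer is full.
# collections.deque(maxlen=...) would handle the eviction for us, but it doesn't
# support slicing, which the context-swapping code relies on.
max_tokens = 4
buf = []

for tok in [1, 2, 3, 4, 5, 6]:
    if len(buf) >= max_tokens:
        buf.pop(0)  # O(n) pop from the front; fine, since waiting on llama dominates
    buf.append(tok)

print(buf)       # [3, 4, 5, 6]
print(buf[-2:])  # slicing still works: [5, 6]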

On the second point, see #14.

This also solves #7, unless we also want a higher-level interactive mode.

@MillionthOdin16 (Contributor) commented:
Context saving for chat resuming

Awesome! I have a question about the context since you mentioned it. Using the llama.cpp main, I can have a continuous conversation with the model where the context is stored for up to 2048 tokens. The only time I lose the model's memory is when I terminate the executable. When I start the model again, it loads the prompt and I start fresh, just like the previous run.

Does your implementation of main still maintain context between individual messages? I ask because I was running into issues while testing. One of the big concerns I had while testing today was that I couldn't figure out why it wasn't maintaining context, and I definitely don't want to resend the full prompt and message history each time.

Other than that, I need to check in with llama.cpp to see where they are on state saving. I know lots of people are interested. Would def be a cool ability!

Instruction mode

I do think it would be a good idea to implement instruction mode before people get the example. I think it's one of the more heavily used flags between -i and -ins. Is there something that makes it difficult?

Thanks :)

@SagsMug (Contributor, Author) commented Apr 4, 2023

Does your implementation of main still maintain context between individual messages? I ask because I was running into issues while testing. One of the big concerns I had while testing today was that I couldn't figure out why it wasn't maintaining context, and I definitely don't want to resend the full prompt and message history each time.

This is the infinite text generation part of llama.cpp, and yes, it's implemented but untested.
EDIT:
It's now tested, fixed, and working.
See:

if len(self.embd) > 0:
    # infinite text generation via context swapping
    # if we run out of context:
    # - take the n_keep first tokens from the original prompt (via n_past)
    # - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch
    if (self.n_past + len(self.embd) > self.n_ctx):
        n_left = self.n_past - self.n_keep
        self.n_past = self.n_keep
        # insert n_left/2 tokens at the start of embd from last_n_tokens
        _insert = self.last_n_tokens[
            self.n_ctx - int(n_left/2) - len(self.embd):-len(self.embd)
        ]
        self.embd = _insert + self.embd

C++ confuses me sometimes
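
For anyone following along, here is a toy walk-through of that slice with made-up numbers (not values from the example):

# The context is full, so half of the non-kept tokens get recycled.
n_ctx, n_keep, n_past = 8, 2, 8
embd = [107, 108]  # tokens waiting to be evaluated
last_n_tokens = [101, 102, 103, 104, 105, 106, 107, 108]  # last n_ctx tokens seen

if n_past + len(embd) > n_ctx:
    n_left = n_past - n_keep  # 6 tokens are eligible to be dropped or recycled
    n_past = n_keep           # keep only the first n_keep prompt tokens
    # take the n_left/2 tokens that sit right before embd in last_n_tokens
    _insert = last_n_tokens[n_ctx - int(n_left / 2) - len(embd):-len(embd)]
    embd = _insert + embd

print(n_past, embd)  # 2 [104, 105, 106, 107, 108] -> 2 + 5 tokens fit in n_ctx = 8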

Other than that, I need to check in with llama.cpp to see where they are on state saving. I know lots of people are interested. Would def be a cool ability!

Definitely, that's also why I made the issue: I thought I could have state saving, but alas, it wasn't meant to be.

I do think it would be a good idea to implement instruction mode before people get the example. I think it's one of the more heavily used flags between -i and -ins. Is there something that makes it difficult?

It's now implemented via the instruct=True flag; there aren't any command-line flags right now.
One note: you should feed the initial prompt with some text that shows the model how to return to instruction mode, but you probably do that anyway for normal llama.
Otherwise it will infinitely generate nonsense, and you will have to implement a KeyboardInterrupt catcher to stop generation, and I will be sad.
EDIT:
Okay, I implemented a KeyboardInterrupt catcher.
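
Roughly along these lines (a standalone sketch; generate_tokens below is only a stand-in for the example's real generation loop):

import time

def generate_tokens():
    # Stand-in for the token stream coming out of llama.cpp (hypothetical).
    while True:
        time.sleep(0.1)
        yield "token "

try:
    for piece in generate_tokens():
        print(piece, end="", flush=True)
except KeyboardInterrupt:
    # Ctrl-C stops generation without killing the script,
    # so control can return to the user's input prompt.
    print("\n[interrupted]")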

@abetlen (Owner) commented Apr 6, 2023

Good progress so far. I would like to include this in the examples; I just request that we follow the llama.cpp examples more closely. Specifically, could you instead store and load parameters using something like the below:

import os
import argparse

from dataclasses import dataclass, field
from typing import List, Optional

# Based on https://github.com/ggerganov/llama.cpp/blob/master/examples/common.cpp


@dataclass
class GptParams:
    seed: int = -1
    n_threads: int = min(4, os.cpu_count() or 1)
    n_predict: int = 128
    repeat_last_n: int = 64
    n_parts: int = -1
    n_ctx: int = 512
    n_batch: int = 8
    n_keep: int = 0

    top_k: int = 40
    top_p: float = 0.95
    temp: float = 0.80
    repeat_penalty: float = 1.10

    model: str = "models/lamma-7B/ggml-model.bin"
    prompt: str = ""
    input_prefix: str = ""

    antiprompt: List[str] = field(default_factory=list)

    memory_f16: bool = True
    random_prompt: bool = False
    use_color: bool = False
    interactive: bool = False

    embedding: bool = False
    interactive_start: bool = False

    instruct: bool = False
    ignore_eos: bool = False
    perplexity: bool = False
    use_mlock: bool = False
    mem_test: bool = False
    verbose_prompt: bool = False


def gpt_params_parse(argv, params: Optional[GptParams] = None):
    if params is None:
        params = GptParams()

    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--seed", type=int, default=-1, help="")
    parser.add_argument("-t", "--threads", type=int, default=1, help="")
    parser.add_argument("-p", "--prompt", type=str, default="", help="")
    parser.add_argument("-f", "--file", type=str, default=None, help="")
    parser.add_argument("-c", "--context_size", type=int, default=512, help="")
    parser.add_argument("--memory_f32", action="store_true", help="")
    parser.add_argument("--top_p", type=float, default=0.9, help="")
    parser.add_argument("--temp", type=float, default=1.0, help="")
    parser.add_argument("--repeat_last_n", type=int, default=64, help="")
    parser.add_argument("--repeat_penalty", type=float, default=1.0, help="")
    parser.add_argument("-b", "--batch_size", type=int, default=8, help="")
    parser.add_argument("-m", "--model", type=str, help="")
    parser.add_argument(
        "-i", "--interactive", action="store_true", help="run in interactive mode"
    )
    parser.add_argument("--embedding", action="store_true", help="")
    parser.add_argument("--interactive-start", action="store_true", help="")
    parser.add_argument(
        "--interactive-first",
        action="store_true",
        help="run in interactive mode and wait for input right away",
    )
    parser.add_argument(
        "-ins",
        "--instruct",
        action="store_true",
        help="run in instruction mode (use with Alpaca models)",
    )
    parser.add_argument(
        "--color",
        action="store_true",
        help="colorise output to distinguish prompt and user input from generations",
    )
    parser.add_argument("--mlock", action="store_true")
    parser.add_argument("--mtest", action="store_true")
    parser.add_argument(
        "-r",
        "--reverse-prompt",
        type=str,
        default="",
        help="run in interactive mode and poll user input upon seeing PROMPT (can be\nspecified more than once for multiple prompts).",
    )
    parser.add_argument("--perplexity", action="store_true", help="")
    parser.add_argument("--ignore-eos", action="store_true", help="")
    parser.add_argument("--n_parts", type=int, default=-1, help="")
    parser.add_argument("--random-prompt", action="store_true", help="")
    parser.add_argument("--in-prefix", type=str, default="", help="")
    args = parser.parse_args(argv)
    return args

Ideally, place this into an examples/low_level_api/common.py file to follow how main.cpp works.
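
A hypothetical way the example script could consume it, assuming the GptParams and gpt_params_parse definitions above (the field mapping here is only a sketch, not part of the suggestion):

import sys

# Parse CLI flags with the helper, then copy the ones the chat example needs
# onto a GptParams instance (attribute names follow the argparse dests above).
args = gpt_params_parse(sys.argv[1:])

params = GptParams(
    seed=args.seed,
    n_threads=args.threads,
    n_ctx=args.context_size,
    n_batch=args.batch_size,
    top_p=args.top_p,
    temp=args.temp,
    repeat_last_n=args.repeat_last_n,
    repeat_penalty=args.repeat_penalty,
    model=args.model or GptParams.model,
    prompt=args.prompt,
    interactive=args.interactive,
    instruct=args.instruct,
)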

Mug added 2 commits on April 6, 2023 at 15:30:

  • Has some too-many-newlines issues, so WIP
  • Still needs shipping work so you could do "python -m llama_cpp.examples." etc.
@abetlen (Owner) left a review comment on examples/common.py (outdated, resolved):

Can you move this common.py file to just the low_level_api folder and remove the __init__.py files?

@SagsMug (Contributor, Author) commented Apr 7, 2023

Can you move this common.py file to just the low_level_api folder and remove the __init__.py files?

Done!
It should now very closely match the original implementation, arguments and all, while retaining its more library-like design.
Tested with Miku.sh from the original repo.

@abetlen merged commit 41365b0 into abetlen:main on Apr 8, 2023
@abetlen (Owner) commented Apr 8, 2023

@SagsMug thank you!

xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this pull request Jun 13, 2023