server: add options to dry run and debug for chat and generate #8165
base: main
Conversation
force-pushed from 151663e to 1d529d8
```go
// Warn user if messages are truncated from the input
if numTruncatedMessages := len(msgs[0:currMsgIdx]); numTruncatedMessages > 0 {
	slog.Warn("truncated first messages from input", "num_truncated", numTruncatedMessages)
}
```
- OpenAI returns an error on exceeding the context length and tells the user the max context length.
- We currently emit a `slog.Warn` in `runner.go` when a single message's content is truncated.
- This block adds a warning for dropping whole messages as well (a standalone sketch of the pattern follows below).
- I don't think it makes sense to return an error, as that could be a really breaking experience, but this information should at minimum be surfaced through the warning.
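To make the discussion concrete, here is a minimal, self-contained sketch of the pattern, not the actual `chatPrompt` code; `truncateToWindow`, `countTokens`, and the `message` type are stand-ins for illustration only:

```go
package main

import (
	"log/slog"
	"strings"
)

type message struct {
	Role    string
	Content string
}

// truncateToWindow keeps the newest messages that fit inside numCtx tokens and
// warns when whole messages are dropped, mirroring the warn added in this PR.
func truncateToWindow(msgs []message, numCtx int, countTokens func(string) int) []message {
	currMsgIdx := 0
	total := 0
	// Walk backwards from the newest message and find the oldest one that still fits.
	for i := len(msgs) - 1; i >= 0; i-- {
		total += countTokens(msgs[i].Content)
		if total > numCtx {
			currMsgIdx = i + 1
			break
		}
	}

	// Warn user if messages are truncated from the input.
	if numTruncatedMessages := len(msgs[0:currMsgIdx]); numTruncatedMessages > 0 {
		slog.Warn("truncated first messages from input", "num_truncated", numTruncatedMessages)
	}
	return msgs[currMsgIdx:]
}

func main() {
	msgs := []message{{"user", "first"}, {"user", "second"}, {"user", "third"}}
	kept := truncateToWindow(msgs, 2, func(s string) int { return len(strings.Fields(s)) })
	slog.Info("kept", "n", len(kept))
}
```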
server/routes.go
Outdated
```diff
@@ -1539,6 +1572,18 @@ func (s *Server) ChatHandler(c *gin.Context) {
		return
	}

	if req.DryRun {
```
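For reference, a rough idea of what a dry-run branch could do once the prompt has been rendered and tokenized: terminate the request before any completion call reaches the runner. This is a hedged sketch; `writeDryRun`, `dryRunResponse`, and `promptTokens` are hypothetical names, and the real handler would presumably reuse `api.ChatResponse` and its metrics instead.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// dryRunResponse mirrors the subset of fields a dry run might return.
type dryRunResponse struct {
	Model           string    `json:"model"`
	CreatedAt       time.Time `json:"created_at"`
	Done            bool      `json:"done"`
	PromptEvalCount int       `json:"prompt_eval_count"`
}

// writeDryRun short-circuits the handler after the prompt has been rendered and
// tokenized, so no completion request is ever forwarded to the runner.
func writeDryRun(w http.ResponseWriter, model string, promptTokens []int) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(dryRunResponse{
		Model:           model,
		CreatedAt:       time.Now().UTC(),
		Done:            true,
		PromptEvalCount: len(promptTokens),
	})
}

func main() {
	rec := httptest.NewRecorder()
	writeDryRun(rec, "llama3.2", make([]int, 26))
	fmt.Print(rec.Body.String())
}
```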
First pass at this and have some thoughts:
- I'm wondering if we could wrap this under the existing "options" parameter, although that is meant for the model options and I'm not a fan of conflating the two - it's something I'll try out.
- With this method we have to load the full model into VRAM and use the scheduler, because we need the tokenizer as well as the truncated content of the messages.
- There is a world where we can side-load the model without VRAM (like in the tokenize draft PR).
- We'd still have to refactor the context-length truncation out of the runner, and I think that's where we start bleeding scope.

My take is:
- Clean up this PR - keep the scope small and stick to this pattern of just using the scheduler for now.
- Figure out longer-term model loading and swapping for quick interactions vs. loading into VRAM, and have a shared interface for those common patterns (a rough sketch of what such an interface might look like follows below).
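Purely hypothetical sketch of that shared interface; none of these names exist in the codebase. The idea is that the scheduler-backed runner and a lightweight (no-VRAM) loader could both satisfy one interface, so tokenize and prompt-rendering requests share a code path with generation.

```go
// Package sketch is illustrative only.
package sketch

import "context"

// Message stands in for api.Message.
type Message struct {
	Role    string
	Content string
}

// PromptRenderer is the hypothetical shared interface.
type PromptRenderer interface {
	// Tokenize returns token IDs using the model's tokenizer.
	Tokenize(ctx context.Context, text string) ([]int, error)
	// RenderPrompt applies the chat template and context-window truncation.
	RenderPrompt(ctx context.Context, msgs []Message) (string, error)
}
```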
(also need to address streaming)
Rather than shoot from the hip here and design by gut, I thought it might be helpful to draft some thoughts on a proposal, so we can ensure we're not hurting ourselves and users later. Here it is:

## Proposal: Debug and Dry Run Modes for Ollama Prompt Generation

We propose adding debug and dry run capabilities to Ollama's prompt generation system. These features would help users understand, test, and verify how their inputs are transformed into final prompts for LLM inference.

### Background

Ollama uses prompt templates to convert user messages into final prompts for LLM inference. Currently, users cannot inspect this transformation process or verify token counts without running a full generation.

### Proposal

This proposal introduces two new optional request parameters:

- `debug`: an object selecting extra information (such as the final rendered prompt) to include in responses
- `dry`: a boolean that skips generation and returns a single final response with token counts
This is intentionally designed as an opt-in feature to maintain compatibility with existing clients while providing valuable debugging capabilities when needed. Separating the two concerns allows users to perform generation in these ways:

- neither field set: normal generation, unchanged from today
- `debug` only: normal generation plus debug information (e.g. the final rendered prompt)
- `dry` only: no completion is generated; a single final response with token counts is returned
- both `debug` and `dry`: inspect the final prompt and counts without generating anything

(A rough sketch of the corresponding request fields follows below.)
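A sketch of how the two request fields could be modeled, assuming the names used in the examples below; the shapes are illustrative, not final:

```go
// Package sketch shows the two proposed opt-in request fields.
package sketch

// Message stands in for api.Message.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// DebugOptions is the object form of the "debug" parameter.
type DebugOptions struct {
	// Mode selects what to include, e.g. "prompt".
	Mode string `json:"mode,omitempty"`
}

// ChatRequest sketches how the new fields would sit next to existing ones.
// Both default to their zero values, so existing clients are unaffected.
type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`

	// Debug, when set, asks the server to echo extra information such as the
	// final rendered prompt.
	Debug *DebugOptions `json:"debug,omitempty"`

	// Dry, when true, stops after prompt rendering and token counting.
	Dry bool `json:"dry,omitempty"`
}
```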
The combinations above have many powerful use cases.

### Rationale

The ability to inspect prompt generation serves several key needs, chief among them verifying how templates render messages and checking token counts before committing to a full generation.
### Compatibility

This proposal maintains compatibility by keeping both fields optional and off by default, so existing clients and responses are unchanged.
### Examples

#### Debug Prompt

Standard chat request with prompt debugging:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "debug": { "mode": "prompt" },
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'
```

The debug information comes back across the streamed responses: the first with the prompt, the second with generation, and the third with final counts:

```json
{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "debug": {
    "prompt": "<|start_header_id|>user<|end_header_id|>Given the following functions, please respond with a JSON, ..."
  },
  "done": false
}
```

```json
{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "message": {
    "role": "assistant",
    "content": "The sky is blue because",
    "images": null
  },
  "done": false
}
```

```json
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}
```

#### Dry Run Mode

Request with dry run enabled:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "dry": true,
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'
```

The response from a dry run is a single, final response:

```json
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}
```

### Debug Modes

The `debug` field accepts the following modes:

- `"prompt"`: echo the final rendered prompt back in the response
- `"default"` (or an empty mode): standard behavior with no extra payload
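One possible way a handler could act on these modes; the type and mode names follow the proposal's examples and are in no way final:

```go
// Package sketch translates the requested debug mode into a response payload.
package sketch

// DebugInfo is the payload attached to responses when debugging is requested.
type DebugInfo struct {
	// Prompt is the final templated prompt sent to the model.
	Prompt string `json:"prompt,omitempty"`
}

// debugInfoFor returns the payload for the requested mode, or nil when the
// default mode (no extra payload) is in effect.
func debugInfoFor(mode, renderedPrompt string) *DebugInfo {
	switch mode {
	case "prompt":
		return &DebugInfo{Prompt: renderedPrompt}
	default: // "default" or empty
		return nil
	}
}
```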
### Token Counts

Users will not need to change how or where they get token counts in either mode.

Additional considerations: it may be useful to echo back the `dry` and `debug` parameters in the response so clients can confirm which options were applied.
NOTE: While it was considered that providing a
I'm now also considering: `"debug": { "prompt": true }`, and later we could add
The design doc is nice, here's my feedback:

Gotta balance not making things too complicated here for the actual use-cases, but this is feeling like a good direction.
I like the idea of having a meta field as @BruceMacD mentioned, and I suppose it's also true that it might be used alongside a generation. This in conjunction with the split fields would be really useful for testing and debugging.
I went back and forth on it. Maybe we consider:
which would eliminate the proposal for
I think in this pattern something like that could work. The reason I decided to make it an object was, I thought, clear: we're cornering ourselves by using more restrictive types or structures. Maybe:
or
I like the nested `debug` -> `include` -> list. I think that's the clearest.
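A small sketch of that nested `include` shape; the field name and accepted values are purely illustrative:

```go
// Package sketch models the nested debug -> include -> list shape.
package sketch

// DebugOptions selects which pieces of debug information to return.
type DebugOptions struct {
	// Include lists what to echo back alongside (or instead of) a normal
	// generation, e.g. ["prompt"]; unknown entries could be rejected or ignored.
	Include []string `json:"include,omitempty"`
}
```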
force-pushed from 7ca7798 to cce57e4
force-pushed from cce57e4 to 6556540
After playing around with this a bit - I think the splitting of responses like this is okay if streaming is set to true.
```diff
@@ -103,10 +103,18 @@ type ChatRequest struct {
	// Tools is an optional list of tools the model has access to.
	Tools `json:"tools,omitempty"`

	Debug *Debug `json:"debug,omitempty"`
```
What does this do?
```diff
@@ -190,6 +198,8 @@ type ChatResponse struct {
	Message    Message `json:"message"`
	DoneReason string  `json:"done_reason,omitempty"`

	Debug map[string]any `json:"debug,omitempty"`
```
The name `debug` isn't a good choice here, since we aren't really debugging something as much as looking to have the final prompt echoed back.
@jmorganca There are some explanations and alternative proposals in the proposal here.
@ParthSareen Please be more judicious with your uses of `map[string]any`. That is to say, it should be so rare that you forget it is an option most of the time.

To implement the proposal, which states this is an object, always use a struct with fields of specific types, not `any`.

This also allows for docs, which we need much more of in ollama.
```go
// ...

	// Debug, if set, adds additional context to generate responses as
	// specified by its fields.
	Debug *DebugOptions `json:"debug,omitempty"`
}

// DebugOptions defines options available for generate requests.
type DebugOptions struct {
	// Mode specifies which mode of debugging is requested of the server. The
	// available modes are "default" and "prompt". If empty, "default" is used.
	// [... more words about modes]
	Mode string `json:"mode"`
}
```
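A quick self-contained check of the wire shape such a struct would produce; the `DebugOptions` type is copied from the suggestion above so the example compiles on its own:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DebugOptions is copied from the suggestion above for self-containment.
type DebugOptions struct {
	Mode string `json:"mode"`
}

func main() {
	// Marshal a minimal request to confirm the wire shape matches the
	// proposal's examples: {"model":"llama3.2","debug":{"mode":"prompt"}}.
	b, err := json.Marshal(struct {
		Model string        `json:"model"`
		Debug *DebugOptions `json:"debug,omitempty"`
	}{Model: "llama3.2", Debug: &DebugOptions{Mode: "prompt"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```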
Did we consider using a single field in the API? (vs needing to set two?)
The design has two distinct concerns:

- whether to run generation at all (`dry`)
- what extra information to return (`debug`)
This creates a 2x2 matrix. A single field could have values for each combination, but two separate fields yield the same configurations more clearly and with better documentation. The question is: must we add two fields? Can we use fewer without complex schemes? Yes. In our discussion, @jmorganca proposed an example: getting the token count without a completion. Using that, for returning the generated prompt we need only one new field.
`chatPrompt` function

Precursor to enabling tokenization endpoints: #8106