
server: add options to dry run and debug for chat and generate #8165

Draft

ParthSareen wants to merge 5 commits into main from parth/templating

Conversation

Contributor

@ParthSareen ParthSareen commented Dec 18, 2024

  • Doesn't actually load the model
  • No tokenization or context length clipping
  • Barebones implementation of the chatPrompt function

Precursor to enabling tokenization endpoints: #8106

@ParthSareen ParthSareen self-assigned this Dec 18, 2024
Comment on lines +85 to +88
// Warn user if messages are truncated from the input
if numTruncatedMessages := len(msgs[0:currMsgIdx]); numTruncatedMessages > 0 {
slog.Warn("truncated first messages from input", "num_truncated", numTruncatedMessages)
}
Contributor Author

@ParthSareen ParthSareen Dec 19, 2024


  • OpenAI returns an error on exceeding context length and tells the user the max context length.
  • We are currently doing a slog.Warn in runner.go for when a single message's content is truncated.
  • This block adds a warn on dropping whole messages as well.
  • I don't think it makes sense to return an error, as that might be a really breaking experience, but this information should definitely be surfaced, at minimum through the warning.

server/routes.go Outdated
@@ -1539,6 +1572,18 @@ func (s *Server) ChatHandler(c *gin.Context) {
return
}

if req.DryRun {
Contributor Author

@ParthSareen ParthSareen Dec 19, 2024


First pass at this; a few thoughts:

  • I'm wondering if we could wrap this under the existing "options" parameter, although that is meant for model options and I'm not a fan of conflating the two - it's something I'll try out.
  • With this method we have to load the full model into vRAM and use the scheduler, since we need the tokenizer as well as the truncated content of the messages.
  • There is a world where we can side-load the model without using vRAM (like in the tokenize draft PR).
  • We'd still have to refactor so that truncation based on context length happens outside the runner, and I think that's where we start bleeding scope.

My take is:

  • Clean up this PR - keep the scope small and stick to this pattern of just using the scheduler for now (roughly the shape sketched below)
  • Figure out longer-term model loading + swapping for quick interactions vs. loading into vRAM, and have a shared interface for those common patterns
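
To make the pattern concrete, below is a minimal, self-contained sketch of the control flow being discussed: when the dry-run flag is set, render the final prompt and return it without generating a completion. The types, field names, and the render callback are simplified stand-ins for the server's real ChatHandler, chatPrompt, and api types, so treat this as an illustration of the shape rather than the PR's actual code.

package main

import (
    "encoding/json"
    "fmt"
)

// Simplified stand-ins for the api types touched by this PR; only the
// fields relevant to the sketch are included.
type ChatRequest struct {
    Model    string              `json:"model"`
    Messages []map[string]string `json:"messages"`
    DryRun   bool                `json:"dry_run,omitempty"`
}

type ChatResponse struct {
    Model string         `json:"model"`
    Debug map[string]any `json:"debug,omitempty"`
    Done  bool           `json:"done"`
}

// handleChat sketches the branch under discussion: on a dry run, render the
// prompt via the model's template and return early, skipping generation.
func handleChat(req ChatRequest, render func([]map[string]string) (string, error)) (ChatResponse, error) {
    if req.DryRun {
        prompt, err := render(req.Messages)
        if err != nil {
            return ChatResponse{}, err
        }
        return ChatResponse{
            Model: req.Model,
            Debug: map[string]any{"prompt": prompt},
            Done:  true,
        }, nil
    }
    // ...the normal generation path would continue here...
    return ChatResponse{Model: req.Model, Done: true}, nil
}

func main() {
    resp, _ := handleChat(
        ChatRequest{
            Model:    "llama3.2",
            DryRun:   true,
            Messages: []map[string]string{{"role": "user", "content": "why is the sky blue?"}},
        },
        func(msgs []map[string]string) (string, error) {
            // Fake renderer standing in for the model's prompt template.
            return fmt.Sprintf("<rendered prompt over %d message(s)>", len(msgs)), nil
        },
    )
    out, _ := json.Marshal(resp)
    fmt.Println(string(out))
}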

Contributor Author

@ParthSareen ParthSareen Dec 19, 2024


(also need to address streaming)

@ParthSareen ParthSareen changed the title from "Add /template endpoint" to "server: add option to dry run a prompt" on Dec 19, 2024
@bmizerany
Contributor

bmizerany commented Dec 19, 2024

Rather than shoot from the hip here and design-by-gut, I thought it might be helpful to draft some thoughts on a proposal, so we can ensure we're not hurting ourselves and users later. Here it is:


Proposal: Debug and Dry Run Modes for Ollama Prompt Generation

We propose adding debug and dry run capabilities to Ollama's prompt generation system. These features would help users understand, test, and verify how their inputs are transformed into final prompts for LLM inference.

Background

Ollama uses prompt templates to convert user messages into final prompts for LLM inference. Currently, users cannot inspect this transformation process or verify token counts without running a full generation.

Proposal

This proposal introduces two new optional request parameters:

  • A debug mode for exposing prompt generation details
  • A dry run mode for previewing prompts and/or token counts without generation

This is intentionally designed as an opt-in feature to maintain compatibility with existing clients while providing valuable debugging capabilities when needed.

Separating the two concerns allows users to perform generation in these ways:

  • normal chat response (with token count)
  • normal chat response with final prompt generated for generation (with token count)
  • no chat response (with token count)
  • no chat response with final prompt (with token count)

The combinations above have many powerful use cases.

Rationale

The ability to inspect prompt generation serves several key needs:

  • Debug support for prompt engineering
  • Regression testing across runtime versions
  • Token count estimation without generation
  • Verification of template behavior

Compatibility

This proposal maintains compatibility by:

  • Making all new fields optional
  • Preserving existing behavior when fields are omitted
  • Minimizing performance impact on non-debug flows

Examples

Debug Prompt

Standard chat request with prompt debugging:

curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "debug": { "mode": "prompt" },
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?"
        }
    ]
}'

Response with debug information delivered across three streamed objects: the first with the prompt, the second with the generation, and the third with the final counts:

{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "debug": {
    "prompt": "<|start_header_id|>user<|end_header_id|>Given the following functions, please respond with a JSON, ..."
  },
  "done": false
}
{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "message": {
    "role": "assistant",
    "content": "The sky is blue because",
    "images": null
  },
  "done": false
}
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}

Dry Run Mode

Request with dry run enabled:

curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "dry": true,
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?",
        }
    ]
}'

Response from dry run is a single, final response:

{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}

Debug Modes

The debug field accepts the following modes:

  1. "default" (or omitted): Standard behavior with no debug output
  2. "prompt": Includes the complete prompt template expansion in the response. In stream mode it is only included in the first json object of the stream, without messages.

Token Counts

Users will not need to change how or where they get token counts in either of
the above modes. The token counts will be included in the response as they are
now. The only difference is that the token counts will be included in the
response even when no completion is generated.

Additional considerations:

It may be useful to echo back the dry and debug parameters in the response so clients
can verify that the server received the request as intended. It would also be
helpful for SDKs to have a way to perform specific actions based on the debug
and dry parameters. For example, a Python SDK could automatically print the
prompt template when debug mode is enabled.
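
As a sketch of that last idea in Go rather than Python, the snippet below sends the proposal's hypothetical debug request and prints the echoed prompt if the server returns one. The debug field on the request and the debug object on the response are the shapes proposed above, not part of today's shipped API.

package main

import (
    "bytes"
    "encoding/json"
    "errors"
    "fmt"
    "io"
    "log"
    "net/http"
)

// chatResponse mirrors only the fields this sketch reads; the debug echo is
// the hypothetical field from the proposal above, not today's API.
type chatResponse struct {
    Debug map[string]any `json:"debug,omitempty"`
    Done  bool           `json:"done"`
}

func main() {
    body, _ := json.Marshal(map[string]any{
        "model": "llama3.2",
        "debug": map[string]any{"mode": "prompt"}, // hypothetical, per the proposal
        "messages": []map[string]string{
            {"role": "user", "content": "why is the sky blue?"},
        },
    })

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Streaming responses arrive as one JSON object per line; print the
    // prompt from whichever object carries the debug echo.
    dec := json.NewDecoder(resp.Body)
    for {
        var r chatResponse
        if err := dec.Decode(&r); errors.Is(err, io.EOF) {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        if p, ok := r.Debug["prompt"]; ok {
            fmt.Println("final prompt:", p)
        }
    }
}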

@bmizerany
Contributor

bmizerany commented Dec 19, 2024

NOTE: While it was considered that "debug": { "mode": "prompt" } could just be "debug": true, making it an object now keeps it open for future additions we may want to include, and in a way prevents us from doing silly things in JSON like saying that a field can be a boolean or a string or an object, which makes writing clients a pain.

Providing a "mode" also makes it explicit what the user is debugging.

@bmizerany
Contributor

I'm now also considering:

"debug": { "prompt": true }

and later we could add { "prompt": true, "shields": true, ...

@BruceMacD
Contributor

BruceMacD commented Dec 19, 2024

The design doc is nice, my feedback.

  • I'm not sure debug is the right name for the field. The returned values may be used in normal operations, like a client with a long chat that wants to manually manage the context length. I'd suggest meta:
curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "meta": { "prompt": true },
    "dry": true,
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?"
        }
    ]
}'
  • I also like the boolean fields that Blake suggested in a follow-up to allow for getting multiple debug/meta fields back.

Gotta balance not making things too complicated here for the actual use-cases, but this is feeling like a good direction.

@ParthSareen
Contributor Author

I like the idea of having a meta field as @BruceMacD mentioned, and I suppose it's also true that it might be used alongside a generation. This, in conjunction with the split fields, would be really useful for testing and debugging.

@bmizerany
Contributor

bmizerany commented Dec 19, 2024

  • I'm not sure debug is the right name for the field. The returned values may be used in normal operations, like a client with a long chat that wants to manually manage the context length. I'd suggest meta:

I went back and forth on "include": { ... } too. I'm not sure that is descriptive enough, though. Are we saying "include a prompt in generation" or "include a prompt in response"...

Maybe we consider:

...
"response": {
    "include": ["prompt", ...]
}

which would eliminate the proposal for {"prompt": true}, which I like less after this came to mind.

@ParthSareen
Contributor Author

ParthSareen commented Dec 19, 2024

I went back and forth on "include": { ... } too. I'm not sure that is descriptive enough, though. Are we saying "include a prompt in generation" or "include a prompt in response"...

Maybe we consider:

...
"response": {
    "include": ["prompt", ...]
}

which would eliminate the proposal for {"prompt": true}, which I like less after this came to mind.

I think in this pattern something like debug: ["prompt"] is more descriptive, as it's all the fields the user could want round-tripped, together with everything we add to it.

The reason I decided to make it an object was, I thought, clear: we'd be cornering ourselves by using more restrictive types or structures.

Maybe:

"debug": {
    "include": ["prompt", "shields", ...]
}

or

"debug:" { "show": [...] },

I like the nested debug -> include -> list. I think that's the clearest.
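
For illustration, the nested shape could land in api/types.go roughly as below, hung off ChatRequest the same way the PR's current Debug field is; the names here are assumptions drawn from this thread, not the PR's code.

// DebugOptions sketches the nested "debug": { "include": [...] } shape
// discussed above; names are illustrative only.
type DebugOptions struct {
    // Include lists the extra fields to echo back in responses,
    // e.g. "prompt" (and later perhaps "shields").
    Include []string `json:"include,omitempty"`
}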

@bmizerany
Contributor

I like the idea of having a meta field as @BruceMacD mentioned, and I suppose it's also true that it might be used alongside a generation. This, in conjunction with the split fields, would be really useful for testing and debugging.

meta doesn't tell the user much IMO. Am I sending meta? Why would I send some meta like "prompt": true? It feels like it will require having to look it up in docs, often.

@ParthSareen ParthSareen force-pushed the parth/templating branch 2 times, most recently from 7ca7798 to cce57e4 on December 20, 2024 at 00:38
@ParthSareen
Contributor Author

Response with debug information delivered across three streamed objects: the first with the prompt, the second with the generation, and the third with the final counts:

After playing around with this a bit - I think splitting the responses like this is okay if streaming is set to true.
Otherwise we'd have to change the logic of how we handle streaming vs. non-streaming responses, and it also breaks the user's expectation of what they're getting back.

@ParthSareen ParthSareen changed the title from "server: add option to dry run a prompt" to "server: add options to dry run and debug" on Dec 20, 2024
@ParthSareen ParthSareen changed the title from "server: add options to dry run and debug" to "server: add options to dry run and debug for chat and generate" on Dec 20, 2024
@@ -103,10 +103,18 @@ type ChatRequest struct {
// Tools is an optional list of tools the model has access to.
Tools `json:"tools,omitempty"`

Debug *Debug `json:"debug,omitempty"`
Member


What does this do?

@@ -190,6 +198,8 @@ type ChatResponse struct {
Message Message `json:"message"`
DoneReason string `json:"done_reason,omitempty"`

Debug map[string]any `json:"debug,omitempty"`
Member


The name debug isn't a good choice here since we aren't really debugging something as much as looking to have the final prompt echoed back.

Contributor


@jmorganca There is some explanation and some alternative proposals in the proposal above.

Contributor

@bmizerany bmizerany Dec 20, 2024


@ParthSareen Please be more judicious with your uses of map[string]any. That is to say, it should be so rare you forget it is an option most of the time.

To implement the proposal, which states this is always an object, use a struct with fields of specific types, not any.

This also allows for docs, which we need much more of in ollama.

    // ...

    // Debug, if set, adds additional context to generate responses as
    // specified by its fields.
    Debug *DebugOptions `json:"debug,omitempty"`
}

// DebugOptions defines options available for generate requests.
type DebugOptions struct {
    // Mode specifies which mode of debugging is requested of the server. The
    // available modes are "default" and "prompt". If empty, "default" is used.
    // [... more words about modes]
    Mode string `json:"mode"`
}

@jmorganca
Member

Did we consider using a single field in the API? (vs needing to set two?)

@bmizerany
Contributor

bmizerany commented Dec 20, 2024

Consider using a single field in the API instead of two?

The design has two distinct concerns:

  1. Retrieving the final prompt used for generation, with or without generation
  2. Retrieving the tokenized prompt, with or without generation

This creates a 2x2 matrix. A single field could have values like "prompt-with-response", "prompt-no-response", "response", "no-response". These names are illustrative, not proposed. They demonstrate the complexity of this approach.

Two separate fields yield the same configurations more clearly and with better documentation.

The question is: Must we add two fields? Can we use fewer without complex schemes? Yes.

In our discussion, @jmorganca proposed num_predict. This elegantly removes one control.

Example: Getting token count without completion:

# Request
curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "num_predict": 0,       # 0 means "no generation"
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?",
        }
    ]
}'

# Response
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 342546000,
  "load_duration": 0,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000, # equals total_duration; only tokenization done
  "eval_count": 0, # no generation performed
  "eval_duration": 0
}

Using num_predict lets clients explicitly request zero generation overhead.
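
From the existing Go client, where num_predict already lives under the options map, that request could look roughly like the sketch below; note that the tokenize-only semantics of num_predict: 0 (no load, no generation) are what is being proposed here, not necessarily today's behavior.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ollama/ollama/api"
)

func main() {
    client, err := api.ClientFromEnvironment()
    if err != nil {
        log.Fatal(err)
    }

    stream := false
    req := &api.ChatRequest{
        Model: "llama3.2",
        Messages: []api.Message{
            {Role: "user", Content: "why is the sky blue?"},
        },
        // num_predict sits under options in the current API; 0 is assumed
        // here to mean "tokenize and count, but generate nothing".
        Options: map[string]any{"num_predict": 0},
        Stream:  &stream,
    }

    err = client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
        fmt.Println("prompt tokens:", resp.PromptEvalCount)
        fmt.Println("generated tokens:", resp.EvalCount) // expected to be 0 under the proposal
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}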

For returning the generated prompt, we need one new field. While debug saw little use, "response": { "include": [...] } remains viable. See earlier comments. Naming is still being fleshed out on this.

num_predict and SQL query planning

The approach parallels SQL's limit 0. Examples in PostgreSQL:

No limit:

EXPLAIN ANALYZE SELECT generate_series(0, 100);
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..0.52 rows=101 width=4) (actual time=0.008..0.017 rows=101 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 0.066 ms

Limit 1:

EXPLAIN ANALYZE SELECT generate_series(0, 100) LIMIT 1;
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
   ->  ProjectSet  (cost=0.00..0.52 rows=101 width=4) (actual time=0.006..0.007 rows=1 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Planning Time: 0.056 ms
 Execution Time: 0.042 ms

With limit 0:

EXPLAIN ANALYZE SELECT generate_series(0, 100) LIMIT 0;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=0 loops=1)
   ->  ProjectSet  (cost=0.00..0.52 rows=101 width=4) (never executed)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (never executed)

"Never executed" matches expected behavior with num_predict: 0. It parses and plans but never executes.

Model loading and num_predict

Setting num_predict: 0 should not affect model loading or eviction. An unloaded model stays unloaded. The eviction timer continues; only non-zero num_predict requests prevent eviction.

To warm up a model, request one token (use num_predict: 1). This is analogous to SQL's SELECT 1 ping.
