
server: add options to dry run and debug for chat and generate #8165

Draft

ParthSareen wants to merge 5 commits into main from parth/templating

Conversation

Contributor

@ParthSareen ParthSareen commented Dec 18, 2024

  • Doesn't actually load the model
  • No tokenization or context length clipping
  • Barebones implementation of the chatPrompt function

Precursor to enabling tokenization endpoints: #8106

@ParthSareen ParthSareen self-assigned this Dec 18, 2024
Comment on lines +85 to +88
// Warn user if messages are truncated from the input
if numTruncatedMessages := len(msgs[0:currMsgIdx]); numTruncatedMessages > 0 {
slog.Warn("truncated first messages from input", "num_truncated", numTruncatedMessages)
}
Contributor Author

@ParthSareen ParthSareen Dec 19, 2024


  • OpenAI returns an error on exceeding context length and tells the user the max context length.
  • We are currently doing a slog.Warn in runner.go for when a single message's content is truncated.
  • This block adds a warn on dropping whole messages as well.
  • I don't think it makes sense to return an error, as that might be a really breaking experience, but this information should definitely be surfaced, at minimum through the warning.

server/routes.go Outdated
@@ -1539,6 +1572,18 @@ func (s *Server) ChatHandler(c *gin.Context) {
return
}

if req.DryRun {
Contributor Author

@ParthSareen ParthSareen Dec 19, 2024


First pass at this; a few thoughts:

  • I'm wondering if we could wrap this under the existing "options" parameter, although that is meant for model options and I'm not a fan of conflating the two - it's something I'll try out.
  • With this method we have to load the full model into vRAM and use the scheduler, since we need the tokenizer as well as the truncated content of the messages.
  • There is a world where we can side-load the model without using vRAM (like in the tokenize draft PR).
  • We'd still have to refactor so that truncation based on context length happens outside the runner, and I think that's where we start bleeding scope.

My take is:

  • Clean up this PR - keep the scope small and stick to this pattern of just using the scheduler for now (roughly the shape sketched below)
  • Figure out longer-term model loading + swapping for quick interactions vs. loading into vRAM, and have a shared interface for those common patterns
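
To make the pattern concrete, below is a minimal, self-contained sketch of the control flow being discussed: when the dry-run flag is set, render the final prompt and return it without generating a completion. The types, field names, and the render callback are simplified stand-ins for the server's real ChatHandler, chatPrompt, and api types, so treat this as an illustration of the shape rather than the PR's actual code.

package main

import (
    "encoding/json"
    "fmt"
)

// Simplified stand-ins for the api types touched by this PR; only the
// fields relevant to the sketch are included.
type ChatRequest struct {
    Model    string              `json:"model"`
    Messages []map[string]string `json:"messages"`
    DryRun   bool                `json:"dry_run,omitempty"`
}

type ChatResponse struct {
    Model string         `json:"model"`
    Debug map[string]any `json:"debug,omitempty"`
    Done  bool           `json:"done"`
}

// handleChat sketches the branch under discussion: on a dry run, render the
// prompt via the model's template and return early, skipping generation.
func handleChat(req ChatRequest, render func([]map[string]string) (string, error)) (ChatResponse, error) {
    if req.DryRun {
        prompt, err := render(req.Messages)
        if err != nil {
            return ChatResponse{}, err
        }
        return ChatResponse{
            Model: req.Model,
            Debug: map[string]any{"prompt": prompt},
            Done:  true,
        }, nil
    }
    // ...the normal generation path would continue here...
    return ChatResponse{Model: req.Model, Done: true}, nil
}

func main() {
    resp, _ := handleChat(
        ChatRequest{
            Model:    "llama3.2",
            DryRun:   true,
            Messages: []map[string]string{{"role": "user", "content": "why is the sky blue?"}},
        },
        func(msgs []map[string]string) (string, error) {
            // Fake renderer standing in for the model's prompt template.
            return fmt.Sprintf("<rendered prompt over %d message(s)>", len(msgs)), nil
        },
    )
    out, _ := json.Marshal(resp)
    fmt.Println(string(out))
}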

Contributor Author

@ParthSareen ParthSareen Dec 19, 2024


(also need to address streaming)

@ParthSareen ParthSareen changed the title from "Add /template endpoint" to "server: add option to dry run a prompt" on Dec 19, 2024
@bmizerany
Contributor

bmizerany commented Dec 19, 2024

Rather than shoot from the hip here and design-by-gut, I thought it might be helpful to draft some thoughts on a proposal, so we can ensure we're not hurting ourselves and users later. Here it is:


Proposal: Debug and Dry Run Modes for Ollama Prompt Generation

We propose adding debug and dry run capabilities to Ollama's prompt generation system. These features would help users understand, test, and verify how their inputs are transformed into final prompts for LLM inference.

Background

Ollama uses prompt templates to convert user messages into final prompts for LLM inference. Currently, users cannot inspect this transformation process or verify token counts without running a full generation.

Proposal

This proposal introduces two new optional request parameters:

  • A debug mode for exposing prompt generation details
  • A dry run mode for previewing prompts and/or token counts without generation

This is intentionally designed as an opt-in feature to maintain compatibility with existing clients while providing valuable debugging capabilities when needed.

Separating the two concerns allows users to perform generation in these ways:

  • normal chat response (with token count)
  • normal chat response with final prompt generated for generation (with token count)
  • no chat response (with token count)
  • no chat response with final prompt (with token count)

The combinations above have many powerful use cases.

Rationale

The ability to inspect prompt generation serves several key needs:

  • Debug support for prompt engineering
  • Regression testing across runtime versions
  • Token count estimation without generation
  • Verification of template behavior

Compatibility

This proposal maintains compatibility by:

  • Making all new fields optional
  • Preserving existing behavior when fields are omitted
  • Minimizing performance impact on non-debug flows

Examples

Debug Prompt

Standard chat request with prompt debugging:

curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "debug": { "mode": "prompt" },
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?"
        }
    ]
}'

Response with debug information delivered across three streamed objects: the first with the prompt, the second with the generation, and the third with the final counts:

{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "debug": {
    "prompt": "<|start_header_id|>user<|end_header_id|>Given the following functions, please respond with a JSON, ..."
  },
  "done": false
}
{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "message": {
    "role": "assistant",
    "content": "The sky is blue because",
    "images": null
  },
  "done": false
}
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}

Dry Run Mode

Request with dry run enabled:

curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "dry": true,
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?",
        }
    ]
}'

Response from dry run is a single, final response:

{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}

Debug Modes

The debug field accepts the following modes:

  1. "default" (or omitted): Standard behavior with no debug output
  2. "prompt": Includes the complete prompt template expansion in the response. In stream mode it is only included in the first json object of the stream, without messages.

Token Counts

Users will not need to change how or where they get token counts in either of
the above modes. The token counts will be included in the response as they are
now. The only difference is that the token counts will be included in the
response even when no completion is generated.

Additional considerations:

It may be useful to echo back the dry and debug parameters in the response so clients
can verify that the server received the request as intended. It would also be
helpful for SDKs to have a way to perform specific actions based on the debug
and dry parameters. For example, a Python SDK could automatically print the
prompt template when debug mode is enabled.
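
As a sketch of that last idea in Go rather than Python, the snippet below sends the proposal's hypothetical debug request and prints the echoed prompt if the server returns one. The debug field on the request and the debug object on the response are the shapes proposed above, not part of today's shipped API.

package main

import (
    "bytes"
    "encoding/json"
    "errors"
    "fmt"
    "io"
    "log"
    "net/http"
)

// chatResponse mirrors only the fields this sketch reads; the debug echo is
// the hypothetical field from the proposal above, not today's API.
type chatResponse struct {
    Debug map[string]any `json:"debug,omitempty"`
    Done  bool           `json:"done"`
}

func main() {
    body, _ := json.Marshal(map[string]any{
        "model": "llama3.2",
        "debug": map[string]any{"mode": "prompt"}, // hypothetical, per the proposal
        "messages": []map[string]string{
            {"role": "user", "content": "why is the sky blue?"},
        },
    })

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Streaming responses arrive as one JSON object per line; print the
    // prompt from whichever object carries the debug echo.
    dec := json.NewDecoder(resp.Body)
    for {
        var r chatResponse
        if err := dec.Decode(&r); errors.Is(err, io.EOF) {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        if p, ok := r.Debug["prompt"]; ok {
            fmt.Println("final prompt:", p)
        }
    }
}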

@bmizerany
Contributor

bmizerany commented Dec 19, 2024

NOTE: While it was considered that "debug": { "mode": "prompt" } could just be "debug": true, making it an object now keeps it open for future additions we may want to include, and in a way prevents us from doing silly things in JSON like saying that a field can be a boolean or a string or an object, which makes writing clients a pain.

Providing a "mode" also makes it explicit what the user is debugging.

@bmizerany
Contributor

I'm now also considering:

"debug": { "prompt": true }

and later we could add { "prompt": true, "shields": true, ...

@BruceMacD
Contributor

BruceMacD commented Dec 19, 2024

The design doc is nice, my feedback.

  • I'm not sure debug is the right name for the field. The returned values may be used in normal operations, like a client with a long chat that wants to manually manage the context length. I'd suggest meta:
curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "meta": { "prompt": true },
    "dry": true,
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?"
        }
    ]
}'
  • I also like the boolean fields that Blake suggested in a follow-up to allow for getting multiple debug/meta fields back.

Gotta balance not making things too complicated here for the actual use-cases, but this is feeling like a good direction.

@ParthSareen
Contributor Author

I like the idea of having a meta field as @BruceMacD mentioned, and I suppose it's also true that it might be used alongside a generation. This, in conjunction with the split fields, would be really useful for testing and debugging.

@bmizerany
Contributor

bmizerany commented Dec 19, 2024

  • I'm not sure debug is the right name for the field. The returned values may be used in normal operations, like a client with a long chat that wants to manually manage the context length. I'd suggest meta:

I went back and forth on "include": { ... } too. I'm not sure that is descriptive enough, though. Are we saying "include a prompt in generation" or "include a prompt in response"...

Maybe we consider:

...
"response": {
    "include": ["prompt", ...]
}

which would eliminate the proposal for {"prompt": true}, which I like less after this came to mind.

@ParthSareen
Contributor Author

ParthSareen commented Dec 19, 2024

I went back and forth on "include": { ... } too. I'm not sure that is descriptive enough, though. Are we saying "include a prompt in generation" or "include a prompt in response"...

Maybe we consider:

...
"response": {
    "include": ["prompt", ...]
}

which would eliminate the proposal for {"prompt": true}, which I like less after this came to mind.

I think in this pattern something like debug: ["prompt"] is more descriptive, as it's all the fields the user could want round-tripped, together with everything we add to it.

The reason I decided to make it an object was, I thought, clear: we'd be cornering ourselves by using more restrictive types or structures.

Maybe:

"debug": {
    "include": ["prompt", "shields", ...]
}

or

"debug:" { "show": [...] },

I like the nested debug -> include -> list. I think that's the clearest.
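
For illustration, the nested shape could land in api/types.go roughly as below, hung off ChatRequest the same way the PR's current Debug field is; the names here are assumptions drawn from this thread, not the PR's code.

// DebugOptions sketches the nested "debug": { "include": [...] } shape
// discussed above; names are illustrative only.
type DebugOptions struct {
    // Include lists the extra fields to echo back in responses,
    // e.g. "prompt" (and later perhaps "shields").
    Include []string `json:"include,omitempty"`
}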

@bmizerany
Contributor

I like the idea of having a meta field as @BruceMacD mentioned, and I suppose it's also true that it might be used alongside a generation. This, in conjunction with the split fields, would be really useful for testing and debugging.

meta doesn't tell the user much IMO. Am I sending meta? Why would I send some meta like "prompt": true? It feels like it will require having to look it up in docs, often.

@ParthSareen ParthSareen force-pushed the parth/templating branch 2 times, most recently from 7ca7798 to cce57e4 on December 20, 2024 at 00:38
@ParthSareen
Contributor Author

Response with debug information delivered across three streamed objects: the first with the prompt, the second with the generation, and the third with the final counts:

After playing around with this a bit - I think splitting the responses like this is okay if streaming is set to true.
Otherwise we'd have to change the logic of how we handle streaming vs. non-streaming responses, and it also breaks the user's expectation of what they're getting back.

@ParthSareen ParthSareen changed the title from "server: add option to dry run a prompt" to "server: add options to dry run and debug" on Dec 20, 2024
@ParthSareen ParthSareen changed the title from "server: add options to dry run and debug" to "server: add options to dry run and debug for chat and generate" on Dec 20, 2024
@@ -103,10 +103,18 @@ type ChatRequest struct {
// Tools is an optional list of tools the model has access to.
Tools `json:"tools,omitempty"`

Debug *Debug `json:"debug,omitempty"`
Member


What does this do?

@@ -190,6 +198,8 @@ type ChatResponse struct {
Message Message `json:"message"`
DoneReason string `json:"done_reason,omitempty"`

Debug map[string]any `json:"debug,omitempty"`
Member


The name debug isn't a good choice here since we aren't really debugging something as much as looking to have the final prompt echoed back.

Contributor


@jmorganca There is some explanation and some alternative proposals in the proposal above.

Contributor

@bmizerany bmizerany Dec 20, 2024


@ParthSareen Please be more judicious with your uses of map[string]any. That is to say, it should be so rare you forget it is an option most of the time.

To implement the proposal, which states this is always an object, use a struct with fields of specific types, not any.

This also allows for docs, which we need much more of in ollama.

    // ...

    // Debug, if set, adds additional context to generate responses as
    // specified by its fields.
    Debug *DebugOptions `json:"debug,omitempty"`
}

// DebugOptions defines options available for generate requests.
type DebugOptions struct {
    // Mode specifies which mode of debugging is requested of the server. The
    // available modes are "default" and "prompt". If empty, "default" is used.
    // [... more words about modes]
    Mode string `json:"mode"`
}

@jmorganca
Member

Did we consider using a single field in the API? (vs needing to set two?)

@bmizerany
Contributor

bmizerany commented Dec 20, 2024

Consider using a single field in the API instead of two?

The design has two distinct concerns:

  1. Retrieving the final prompt used for generation, with or without generation
  2. Retrieving the tokenized prompt, with or without generation

This creates a 2x2 matrix. A single field could have values like "prompt-with-response", "prompt-no-response", "response", "no-response". These names are illustrative, not proposed. They demonstrate the complexity of this approach.

Two separate fields yield the same configurations more clearly and with better documentation.

The question is: Must we add two fields? Can we use fewer without complex schemes? Yes.

In our discussion, @jmorganca proposed num_predict. This elegantly removes one control.

Example: Getting token count without completion:

# Request
curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "num_predict": 0,       # 0 means "no generation"
    "messages": [
        {
            "role": "user",
            "content": "why is the sky blue?",
        }
    ]
}'

# Response
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "done": true,
  "total_duration": 342546000,
  "load_duration": 0,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000, # equals total_duration; only tokenization done
  "eval_count": 0, # no generation performed
  "eval_duration": 0
}

Using num_predict lets clients explicitly request zero generation overhead.
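
From the existing Go client, where num_predict already lives under the options map, that request could look roughly like the sketch below; note that the tokenize-only semantics of num_predict: 0 (no load, no generation) are what is being proposed here, not necessarily today's behavior.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ollama/ollama/api"
)

func main() {
    client, err := api.ClientFromEnvironment()
    if err != nil {
        log.Fatal(err)
    }

    stream := false
    req := &api.ChatRequest{
        Model: "llama3.2",
        Messages: []api.Message{
            {Role: "user", Content: "why is the sky blue?"},
        },
        // num_predict sits under options in the current API; 0 is assumed
        // here to mean "tokenize and count, but generate nothing".
        Options: map[string]any{"num_predict": 0},
        Stream:  &stream,
    }

    err = client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
        fmt.Println("prompt tokens:", resp.PromptEvalCount)
        fmt.Println("generated tokens:", resp.EvalCount) // expected to be 0 under the proposal
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}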

For returning the generated prompt, we need one new field. While debug saw little use, "response": { "include": [...] } remains viable. See earlier comments. Naming is still being fleshed out on this.

num_predict and SQL query planning

The approach parallels SQL's limit 0. Examples in PostgreSQL:

No limit:

EXPLAIN ANALYZE SELECT generate_series(0, 100);
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..0.52 rows=101 width=4) (actual time=0.008..0.017 rows=101 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 0.066 ms

Limit 1:

EXPLAIN ANALYZE SELECT generate_series(0, 100) LIMIT 1;
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
   ->  ProjectSet  (cost=0.00..0.52 rows=101 width=4) (actual time=0.006..0.007 rows=1 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Planning Time: 0.056 ms
 Execution Time: 0.042 ms

With limit 0:

EXPLAIN ANALYZE SELECT generate_series(0, 100) LIMIT 0;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=0 loops=1)
   ->  ProjectSet  (cost=0.00..0.52 rows=101 width=4) (never executed)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (never executed)

"Never executed" matches expected behavior with num_predict: 0. It parses and plans but never executes.

Model loading and num_predict

Setting num_predict: 0 should not affect model loading or eviction. An unloaded model stays unloaded. The eviction timer continues; only non-zero num_predict requests prevent eviction.

To warm up a model, request one token (use num_predict: 1). This is analogous to SQL's SELECT 1 ping.
