Update perf_infer_gpu_one.md: fix a typo (huggingface#35441)
martin0258 authored and AlanPonnachan committed Jan 1, 2025
1 parent ef34214 commit 9d359e8
Showing 1 changed file with 1 addition and 1 deletion.
docs/source/en/perf_infer_gpu_one.md (+1, -1)

@@ -462,7 +462,7 @@ generated_ids = model.generate(**inputs)
 outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
 ```
 
-To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU:
+To load a model in 8-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU:
 
 ```py
 max_memory_mapping = {0: "1GB", 1: "2GB"}
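For readers landing on this commit without the surrounding docs: a minimal, self-contained sketch of how such a `max_memory` mapping is passed to `from_pretrained` alongside 8-bit quantization. The `bigscience/bloom-3b` checkpoint is illustrative (it is not taken from this commit), and the snippet assumes `bitsandbytes` and `accelerate` are installed.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Cap the weights placed on GPU 0 at 1GB and on GPU 1 at 2GB;
# device_map="auto" distributes layers within these per-device budgets.
max_memory_mapping = {0: "1GB", 1: "2GB"}

model_name = "bigscience/bloom-3b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    max_memory=max_memory_mapping,
)

inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

Keys in `max_memory` are device indices (a `"cpu"` key is also accepted for CPU offload), and the string values are parsed by accelerate, so units such as `"1GB"` or `"2GiB"` both work.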
