
Rate limit download speed on pulling new models #2006

Open
donuts-are-good opened this issue Jan 15, 2024 · 58 comments
Assignees
Labels
networking Issues relating to ollama pull and push

Comments

@donuts-are-good

Is there interest in implementing a rate limiter in the pull command? I'm open to working on this; this is the syntax I have in mind for now:

ollama pull modelname --someflagname 1024 <-- this would limit to 1024 kbps

I took a look at the code in server/download.go, and I think I can do this with the x/time/rate applied to the downloadChunk method of the blob downloader.

This feature, or something like it, would be quite useful for me. Ollama is able to saturate my network faster than BitTorrent or anything else I've tried.

@tkafka

tkafka commented Jan 23, 2024

Yes, definitely! Same here: when I download models, everyone in the office gets really slow internet.

How about a --rate-limit flag?

@jukofyork

Yeah, same here!

I'm finding ollama pull is really killing my connection, and I have to limit myself to just using it at night now...

I assume it's using multiple threads to download multiple chunks at the same time, as it seems a lot more lag-inducing than either wget or curl. If so, it might be good to have control over those parameters too.

@escaroda

I would do the same as wget:

‘--limit-rate=amount’

Limit the download speed to amount bytes per second. Amount may be expressed in bytes, kilobytes with the ‘k’ suffix, or megabytes with the ‘m’ suffix. For example, ‘--limit-rate=20k’ will limit the retrieval rate to 20KB/s. This is useful when, for whatever reason, you don’t want Wget to consume the entire available bandwidth.

This option allows the use of decimal numbers, usually in conjunction with power suffixes; for example, ‘--limit-rate=2.5k’ is a legal value.

Note that Wget implements the limiting by sleeping the appropriate amount of time after a network read that took less time than specified by the rate. Eventually this strategy causes the TCP transfer to slow down to approximately the specified rate. However, it may take some time for this balance to be achieved, so don’t be surprised if limiting the rate doesn’t work well with very small files.

@easp
Contributor

easp commented Feb 1, 2024

I think this is coming. I saw either a branch or a pull request to provide rate limiting by one of the maintainers.

@akulbe

akulbe commented Feb 16, 2024

I would LOVE to see this implemented. It reliably and repeatedly kills my connection on anything larger than a 13b model. I think it's the sustained speed (I have a 1G/1G connection, and downloads get up to 115MB/s) when it happens.

@BruceMacD
Contributor

Behavior here will be improved by #2221; working on getting that unblocked now.

@donuts-are-good
Author

We want to define an arbitrary download speed limit. It'd be great if #2221 could address that somehow.

@pablo-01

I've had this issue every day since I started using Ollama a few days ago. In my case (Pop!_OS 22.04 LTS), the system freezes randomly and eventually locks up completely, leaving no choice but to hard reset.

@simmonsm

I agree. This is definitely a good idea, as, for instance, pulling down the 7b 39Gb model without rate limiting is very antisocial network behaviour. I did install and play with the trickle command, but couldn't figure out how to use it with ollama run, as it isn't that process that needs limiting.

@fermuch

fermuch commented Mar 20, 2024

@simmonsm I think trickle wouldn't work anyway, since Go doesn't use libc (and trickle uses LD_PRELOAD for its magic).

@simmonsm

@simmonsm I think trickle wouldn't work anyway, since Go doesn't use libc (and trickle uses LD_PRELOAD for its magic).

Fair enough. In the meantime I'm using a VM connected via a virtual traffic shaping network switch.

@LagSlug

LagSlug commented Apr 14, 2024

You might be able to accomplish this with a docker container

https://stackoverflow.com/questions/25497523/how-can-i-rate-limit-network-traffic-on-a-docker-container

@supercurio

I'm downloading a bunch of Llama 3 models at the moment, and last night my upstairs neighbor, with whom I'm sharing a 300/100 fiber connection, asked for help because he couldn't use the internet anymore. Indeed, I ran a speedtest on another machine connected over Ethernet, and the bandwidth left was 1.6Mbit/s download, with a whopping ~1000ms of ping latency.

My Ollama instance is running on macOS as a native app.
For now I found a workaround for my neighbor using Xcode's Network Link Conditioner, but I'm still essentially unable to browse the web on my primary machine when pulling models.

I appreciate that Ollama maximizes bandwidth to download large models as quickly as possible, but the default behavior does not use sane parameters at all.

My suggestion as a simple solution that can be implemented quickly:

  • run by default with conservative settings to be a good network citizen: 2, maybe 3 concurrent connections, not more than that.
  • offer a more aggressive preset as a command-line argument when downloading via ollama pull or ollama run.
  • expose a custom "max concurrent download connections" parameter on the command line and API.

Then later, sure, an adaptive algorithm could try to optimize the concurrent connection count based on latency and throughput. But it might never work that well on shared and mobile connections, where the available bandwidth and latency vary based on external factors.
This GitHub issue suggests a rate limit, which would be helpful as well, but selecting an appropriate number of concurrent connections should do the trick just fine without resorting to manual tuning.
If Ollama is competing with something else for bandwidth, like a neighbor trying to watch Netflix, it should respect TCP's congestion control mechanisms rather than gaming them to grab all the bandwidth.

I hope this can be addressed shortly.

@mcraveiro

I've been hit by this issue when downloading large models; it literally hogs the entire device. It would be nice to be able to limit the rate in some way, à la BitTorrent clients.

@strangehelix

Same issue. The downloader uses up the entire bandwidth (the internet becomes unusable) and eventually crashes because of timeouts. A rate limiter seems like a critically important feature.

@FeyNyXx

FeyNyXx commented May 28, 2024

Same issue. I'm killing my (not exactly mine, but I have to deal with it...) router using ollama pull :<

@metamec

metamec commented May 30, 2024

Love Ollama, but this is murdering the end-user experience for me. I'm having to ctrl+c just to post this comment. I have a 2Gb/s connection too, so it's not a limited-bandwidth issue. It's simply downloading too many chunks simultaneously, deprioritising internet bandwidth for every other process on the system. (Just realised it's network-wide, not system-wide. When downloading large models, it feels like my home network is being DDoSed.)

@mcraveiro

Yes, same here, I can only download models at night. Machine is unusable.

@MihailCosmin

I had the same problem at work: not only did my computer get slow, but so did the internet for all my colleagues.
I had to find a way to rate-limit the download speed. Older solutions (wondershaper, trickle, tc) did not work for me; the only one that worked was FireQOS, in case anyone else needs it.

@LutzFassl

+1

@robins

robins commented Jul 6, 2024

My Linux box (i5) got reliably stuck every single time I pulled a model... so +1 for the --rate-limit feature.

Two workarounds that helped me limp along for now:

  1. As soon as I started the fetch, I used iotop to change the ionice priority (using i) to idle. That made the issue completely go away: although the downloads were still fast, the Linux system was quite usable. However, this was still frustrating, since one had to type in the PIDs when setting ionice for them (and there were a few)!

  2. Since Ollama spun up multiple downloads, and ionice needs to be run for each process, it ended up being far simpler to just get the parent PID and then set ionice for each of the child threads, each time I was downloading a model:

# grab the ollama PID (the [o] trick excludes the grep process itself),
# then set idle I/O priority on every thread of that process
pid=$(ps -ef | grep "[o]llama run" | awk '{print $2}')
sudo ionice -c3 -p $(ps -T -p "$pid" | awk 'NR>1 {print $2}' | tr '\n' ' ')

@treibholz

I use this horrible "workaround" to avoid consuming the whole internet bandwidth, so I can still work on my other machine while pulling a model:

$ sudo ethtool -s eth0  autoneg on speed 10 duplex full

This renegotiates the link speed of my network interface down to 10Mbit.

Yes, the pulling machine itself is not usable either, and sometimes the download is interrupted because (and I'm not kidding you!) there is not enough bandwidth left for DNS. But at least others on my local network are not angry anymore.

@supercurio

I'm preparing a patch and will submit a PR to address this soon.

@Netzvamp

My solution for now; it works fine. This docker-tc can also simulate packet loss 😂

version: '3'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - 11434:11434
    restart: unless-stopped
    labels:
      - "com.docker-tc.enabled=1"
      - "com.docker-tc.limit=30mbit"

  docker-tc:
    image: lukaszlach/docker-tc
    cap_add:
      - NET_ADMIN
    network_mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/docker-tc:/var/docker-tc

supercurio added a commit to supercurio/ollama that referenced this issue Jul 13, 2024
…connections

The Ollama server now downloads models using a single connection. This change
addresses the root cause of issue ollama#2006 by following best practices instead of
relying on workarounds. Users have been reporting problems associated with
model downloads since January 2024, describing issues such as "hogging the
entire device", "reliably and repeatedly kills my connection", "freezes
completely leaving no choice but to hard reset", "when I download models,
everyone in the office gets a really slow internet", and "when downloading
large models, it feels like my home network is being DDoSed."

The environment variable `OLLAMA_DOWNLOAD_CONN` can be set to control the
number of concurrent connections with a maximum value of 64 (the previous
default, an aggressive value - unsafe in some conditions). The new default
value is 1, ensuring each Ollama download is given the same priority as other
network activities.

An entry in the FAQ describes how to use `OLLAMA_DOWNLOAD_CONN` for different
use cases. This patch comes with a safe and unproblematic default value.

Changes include updates to the `envconfig/config.go`, `cmd/cmd.go`,
`server/download.go`, and `docs/faq.md` files.
supercurio added a commit to supercurio/ollama that referenced this issue Jul 13, 2024
…ONN setting

(commit message identical to the previous commit)
@igorschlum

@joelanman I agree with you. From the same location, I was able to download a 3K model but not the 405b model. For the 3K model, the 3GB downloaded without any connection issues, whereas with the 405b model the download kept stopping after about every 200MB.

@ShayBox

ShayBox commented Aug 17, 2024

This reliably crashes my router and causes it to restart; it's too fast.

@numbermaniac

I'm having to ctrl+c just to post this comment.

Same here. I literally can't even search Google while it's downloading something. I can download files that are multiple gigabytes in my web browser or in macOS Homebrew and still use the internet just fine, but when Ollama is downloading a model, my entire internet becomes unusable.

It doesn't even work well for Ollama itself, because the download speed starts at around 6MB/s and then keeps gradually dropping: down to 1MB/s, then 500KB/s, then 350KB/s, slower and slower. Ollama is practically sabotaging itself with the way it downloads files.

@supercurio

I wonder how to get some progress on this issue: the descriptions of how badly it affects the user experience should bump its priority to critical, in my opinion, and I already submitted a PR with a fix a month ago.

After testing Ollama on a high-bandwidth VPS, I appreciated that its aggressive strategy led to 300MB/s (megabytes) downloads, so I get the point of keeping the many-concurrent-connections capability available, which my patch does.

How do we move forward from here?
Routers crashing, people unable to use their computers: that's not the expected result of using any kind of software.

@igorschlum

Hi @supercurio (bonjour François), your patch is here:
#5683
I know there were more important issues during those weeks, like CUDA fixes, memory, and function calling.
Ollama can easily run without this patch, but if it's just a matter of approving your patch, @jmorganca could decide to do it.

@supercurio

supercurio commented Aug 19, 2024

Salut @igorschlum 😌
All of Ollama's core functionality is important, that's for sure.
Downloading model(s) is still the first action every Ollama user will take.

I've solved it for my individual use case already, but I'm hoping to use Ollama as the LLM runtime for the app I'm developing at the moment. Sadly, that's a non-starter until this issue is solved.
Fortunately, llamafile provides a good enough alternative in that case.

@joelanman

Is it fixed by this?

@Fluffkin

Fluffkin commented Aug 20, 2024

It'll help in many cases, but it's probably still too high for some domestic broadband users with low bandwidth. ¯\_(ツ)_/¯

@igorschlum

@Fluffkin I think it would work for any type of connection, because currently Ollama downloads 64 files simultaneously, leaving very little bandwidth for other computers. With this modification it will use only one download at a time.

@joelanman

@igorschlum no, it's changed from 64 to 16, so still a lot of connections

@robins

robins commented Aug 20, 2024

My 2c: unless there's a way for users to request more speed, the default shouldn't hurt low-end users.

I'd take Git's example here. Granted, its downloads are not split into chunks like this tool's, but when churning through a repo, even on a 20-core machine, it doesn't spin up an obscene number of processes: only 4.

So no, we shouldn't cut down from 64 to 1 just to accommodate everyone, but settling on a saner default of 4 threads sounds like a good middle ground here... If/when there's a feature to request more threads, users with more resources can always go back to 64 concurrent connections!

@mrtysn

mrtysn commented Sep 2, 2024

Reporting from Türkiye, I am unable to run ollama pull during the day due to it causing nearly all other connections on my shared Wi-Fi network to almost come to a halt. A download speed rate limit would be greatly appreciated.

@igorschlum

@mrtysn what version of Ollama are you using, and on which OS? I agree with @supercurio that a parameter could be added to set the number of concurrent downloads.

@mrtysn

mrtysn commented Sep 2, 2024

@mrtysn what version of Ollama are you using, and on which OS? I agree with @supercurio that a parameter could be added to set the number of concurrent downloads.

@igorschlum Apologies for the lack of details.

  • I am on a MacBook Pro M2 Max with Sonoma 14.6.1.
  • My Ollama is installed from Homebrew and is currently on version 0.3.9.

However, I believe I did ~90% of my model pulls while on version 0.3.8; the Homebrew formula was updated to 0.3.9 ~2 days ago.

[screenshot attachment]

@igorschlum

@mrtysn I installed Go on my Mac and was able to build Ollama from source. If you'd like, I can create a tutorial on how to do this from scratch on a Mac. Then you could change the 4 simultaneous downloads to 1 to see if it fixes the issue you're facing.

@joelanman

@igorschlum it's 16, which is still very high

@igorschlum

igorschlum commented Sep 2, 2024

@joelanman Sorry, I assumed it might be 4 :-)

@mrtysn

mrtysn commented Sep 3, 2024

@mrtysn I installed Go on my Mac and was able to build Ollama from the source. If you'd like, I can create a tutorial on how to do this from scratch on a Mac. Then, you could change the 4 simultaneous downloads to 1 to see if this fixes the issue you're facing.

I've installed the prerequisites. Unfortunately, however, I won't be able to give this issue more attention with my current workload. Scheduling model downloads for the night has been an okay workaround so far, and I currently have most of the models I'd like to use.

Nevertheless, I still think a CLI flag with reasonable defaults to control the number of concurrent download threads would be helpful for all Ollama users, especially new users downloading models for the first time or users in regions with slower internet speeds.

@donuts-are-good
Author

Hello all,

I appreciate everyone's input in this thread, and I do hope this eventually gets solved. When I initially created this issue as a place to discuss and outline a solution to the network saturation problem, I assumed it would be welcomed with open arms. In the time since, the software has grown bigger and more complex; it's no longer a simple fix, and it doesn't appear to be a priority.

With that in mind, life must go on. I'm withdrawing my offer to implement this, as I no longer have the resources to write and test a feature that isn't a concern of the developers. I don't think anybody's going to worry about it, but I wanted to be clear about my intentions. I look forward to seeing this issue fixed some day.

@ShayBox

ShayBox commented Sep 5, 2024

The issue was fixed 2 weeks ago in 0.3.7...

@mdlmarkham

I'm still having the issue.

@devrandom

It would be best if the number of threads were configurable. Most of these issues can be mitigated by setting the number of concurrent downloads to 1.

@augusto-rehfeldt

augusto-rehfeldt commented Dec 18, 2024

I'm still having this issue. I'm in Argentina, and 16 concurrent connections kill my internet connection, hang my machine, and downloads over 1GB never finish.

Any idea how to rate-limit this on Windows?

@jimbothegrey

I used wondershare to slow down the connection. Looks like it is working....

@mrtysn

mrtysn commented Jan 10, 2025

used wondershare to slow down the connection. looks like it is working....

@jimbothegrey what might wondershare be? all I'm finding is a file converter

@TiddlyWiddly

Still experiencing this on a pretty quick connection; it knocks all my devices off.
