Describe the bug
When I try to serve Llama 3.1 8B (4-bit) with OpenLLM, requests fail with "This model's maximum context length is 2048 tokens".
According to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B, the maximum context length is 128k tokens.
Why the difference?
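For what it's worth, the checkpoint itself seems to advertise the larger window. A minimal check (my assumption being that max_position_embeddings in the Hub config reflects the native context length):

from transformers import AutoConfig

# Read the checkpoint's config.json straight from the Hub (no weights downloaded).
config = AutoConfig.from_pretrained("hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4")

# For Llama 3.1 I would expect this to print 131072 (128k),
# which is why the 2048-token cap at serving time is surprising.
print(config.max_position_embeddings)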
To reproduce
openllm serve llama3.1:8b-4bit
Then, in a Python console with the openai client installed:
from openai import OpenAI

openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")
openai_client.chat.completions.create(
    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    messages=[{"role": "user", "content": "This is a test"}],
    presence_penalty=0.,
    frequency_penalty=0.,
    stream=False,
    temperature=0.,
    max_tokens=2048,
)
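(Just a guess at the cause, not something I have confirmed: the 2048 cap looks like an engine-side setting, e.g. vLLM's max_model_len, rather than a property of the checkpoint. For comparison, loading the same checkpoint with vLLM directly lets the window be set explicitly; a rough sketch, assuming the AWQ repo loads with quantization="awq":)

from vllm import LLM, SamplingParams

# Hypothetical comparison run with an explicit context window,
# well above 2048 but small enough to fit in GPU memory.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    max_model_len=16384,
)
print(llm.generate(["This is a test"], SamplingParams(max_tokens=64)))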
Logs
On the client side:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_utils\_utils.py", line 277, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\resources\chat\completions.py", line 590, in create
return self._post(
^^^^^^^^^^^
File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1240, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 921, in request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1020, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 2048 tokens. However, you requested 2087 tokens (39 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
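For completeness, the numbers in the error add up: 39 prompt tokens + 2048 requested completion tokens = 2087 > 2048, so any request whose total stays at or below 2048 passes the check. A workaround sketch against the same server (it only avoids the error, it does not restore the 128k window):

from openai import OpenAI

openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")

# Keep prompt + completion within the 2048-token cap reported by the server:
# 39 prompt tokens + 2000 completion tokens = 2039 <= 2048.
openai_client.chat.completions.create(
    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    messages=[{"role": "user", "content": "This is a test"}],
    max_tokens=2000,
    temperature=0.,
)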
Environment
System information
bentoml: 1.3.5
python: 3.11.8
platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
uid_gid: 1000:1000
conda: 24.3.0
in_conda_env: True
conda_packages
name: pytorch
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- aom=3.9.1=hac33072_0
- bzip2=1.0.8=h5eee18b_5
- ca-certificates=2024.8.30=hbcca054_0
- cairo=1.18.0=hebfffa5_3
- dav1d=1.2.1=hd590300_0
- expat=2.6.3=h5888daf_0
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=h77eed37_2
- fontconfig=2.14.2=h14ed4e7_0
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- freetype=2.12.1=h267a509_2
- fribidi=1.0.10=h36c2ea0_0
- gettext=0.22.5=he02047a_3
- gettext-tools=0.22.5=he02047a_3
- gmp=6.3.0=hac33072_2
- gnutls=3.8.7=h32866dd_0
- graphite2=1.3.13=h59595ed_1003
- harfbuzz=9.0.0=hda332d3_1
- icu=75.1=he02047a_0
- lame=3.100=h166bdaf_1003
- ld_impl_linux-64=2.38=h1181459_1
- libabseil=20240116.2=cxx17_he02047a_1
- libasprintf=0.22.5=he8f35ee_3
- libasprintf-devel=0.22.5=he8f35ee_3
- libass=0.17.3=h1dc1e6a_0
- libdrm=2.4.123=hb9d3cd8_0
- libexpat=2.6.3=h5888daf_0
- libffi=3.4.4=h6a678d5_0
- libgcc=14.1.0=h77fa898_1
- libgcc-ng=14.1.0=h69a702a_1
- libgettextpo=0.22.5=he02047a_3
- libgettextpo-devel=0.22.5=he02047a_3
- libglib=2.80.3=h315aac3_2
- libgomp=14.1.0=h77fa898_1
- libhwloc=2.11.1=default_hecaa2ac_1000
- libiconv=1.17=hd590300_2
- libidn2=2.3.7=hd590300_0
- libnsl=2.0.1=hd590300_0
- libopenvino=2024.3.0=h2da1b83_0
- libopenvino-auto-batch-plugin=2024.3.0=hb045406_0
- libopenvino-auto-plugin=2024.3.0=hb045406_0
- libopenvino-hetero-plugin=2024.3.0=h5c03a75_0
- libopenvino-intel-cpu-plugin=2024.3.0=h2da1b83_0
- libopenvino-intel-gpu-plugin=2024.3.0=h2da1b83_0
- libopenvino-intel-npu-plugin=2024.3.0=h2da1b83_0
- libopenvino-ir-frontend=2024.3.0=h5c03a75_0
- libopenvino-onnx-frontend=2024.3.0=h07e8aee_0
- libopenvino-paddle-frontend=2024.3.0=h07e8aee_0
- libopenvino-pytorch-frontend=2024.3.0=he02047a_0
- libopenvino-tensorflow-frontend=2024.3.0=h39126c6_0
- libopenvino-tensorflow-lite-frontend=2024.3.0=he02047a_0
- libopus=1.3.1=h7f98852_1
- libpciaccess=0.18=hd590300_0
- libpng=1.6.44=hadc24fc_0
- libprotobuf=4.25.3=h08a7969_0
- libsqlite=3.45.2=h2797004_0
- libstdcxx=14.1.0=hc0a3c3a_1
- libstdcxx-ng=14.1.0=h4852527_1
- libtasn1=4.19.0=h166bdaf_0
- libunistring=0.9.10=h7f98852_0
- libuuid=2.38.1=h0b41bf4_0
- libva=2.22.0=hb711507_0
- libvpx=1.14.1=hac33072_0
- libxcb=1.16=hb9d3cd8_1
- libxcrypt=4.4.36=hd590300_1
- libxml2=2.12.7=he7c6b58_4
- libzlib=1.3.1=h4ab18f5_1
- ncurses=6.4=h6a678d5_0
- nettle=3.9.1=h7ab15ed_0
- ocl-icd=2.3.2=hd590300_1
- openh264=2.4.1=h59595ed_0
- openssl=3.3.2=hb9d3cd8_0
- p11-kit=0.24.1=hc5aa10d_0
- pcre2=10.44=hba22ea6_2
- pip=23.3.1=py311h06a4308_0
- pixman=0.43.2=h59595ed_0
- pthread-stubs=0.4=h36c2ea0_1001
- pugixml=1.14=h59595ed_0
- python=3.11.8=hab00c5b_0_cpython
- readline=8.2=h5eee18b_0
- setuptools=68.2.2=py311h06a4308_0
- snappy=1.2.1=ha2e4443_0
- sqlite=3.45.2=h2c6b66d_0
- svt-av1=2.2.1=h5888daf_0
- tbb=2021.13.0=h84d6215_0
- tk=8.6.13=noxft_h4845f30_101
- wayland=1.23.1=h3e06ad9_0
- wayland-protocols=1.37=hd8ed1ab_0
- wheel=0.41.2=py311h06a4308_0
- x264=1!164.3095=h166bdaf_2
- x265=3.5=h924138e_3
- xorg-fixesproto=5.0=h7f98852_1002
- xorg-kbproto=1.0.7=h7f98852_1002
- xorg-libice=1.1.1=hd590300_0
- xorg-libsm=1.2.4=h7391055_0
- xorg-libx11=1.8.9=hb711507_1
- xorg-libxau=1.0.11=hd590300_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xorg-libxext=1.3.4=h0b41bf4_2
- xorg-libxfixes=5.0.3=h7f98852_1004
- xorg-libxrender=0.9.11=hd590300_0
- xorg-renderproto=0.11.1=h7f98852_1002
- xorg-xextproto=7.3.0=h0b41bf4_1003
- xorg-xproto=7.0.31=h7f98852_1007
- xz=5.4.6=h5eee18b_0
- zlib=1.3.1=h4ab18f5_1
- pip:
- accelerate==0.34.1
- aiohappyeyeballs==2.4.0
- aiohttp==3.10.5
- aiosignal==1.3.1
- aiosqlite==0.20.0
- aniso8601==9.0.1
- annotated-types==0.7.0
- ansi2html==1.9.1
- anyio==4.4.0
- appdirs==1.4.4
- arrow==1.3.0
- asgiref==3.8.1
- attrs==24.2.0
- bentoml==1.3.5
- bitsandbytes==0.43.3
- blinker==1.7.0
- cattrs==23.1.2
- certifi==2024.2.2
- charset-normalizer==3.3.2
- circus==0.18.0
- click==8.1.7
- click-option-group==0.5.6
- cloudpickle==3.0.0
- ctranslate2==4.1.0
- cuda-python==12.6.0
- datasets==3.0.0
- deepmerge==2.0
- deprecated==1.2.14
- dill==0.3.8
- diskcache==5.6.3
- distro==1.9.0
- dulwich==0.22.1
- einops==0.8.0
- enum-compat==0.0.3
- fastapi==0.115.0
- fastcore==1.7.8
- ffmpeg==1.4
- filelock==3.13.4
- flask==3.0.3
- flask-restful==0.3.10
- frozenlist==1.4.1
- fs==2.4.16
- fsspec==2024.3.1
- gguf==0.9.1
- ghapi==1.0.6
- h11==0.14.0
- httpcore==1.0.5
- httptools==0.6.1
- httpx==0.27.2
- httpx-ws==0.6.0
- huggingface-hub==0.24.6
- idna==3.7
- importlib-metadata==6.11.0
- inflection==0.5.1
- inquirerpy==0.3.4
- interegular==0.3.3
- itsdangerous==2.2.0
- jinja2==3.1.2
- jiter==0.5.0
- jsonschema==4.23.0
- jsonschema-specifications==2023.12.1
- lark==1.2.2
- llvmlite==0.43.0
- lm-format-enforcer==0.10.6
- markdown-it-py==3.0.0
- markupsafe==2.1.3
- mdurl==0.1.2
- mistral-common==1.4.1
- mpmath==1.3.0
- msgpack==1.1.0
- msgspec==0.18.6
- multidict==6.1.0
- multiprocess==0.70.16
- mypy-extensions==1.0.0
- nest-asyncio==1.6.0
- networkx==3.2.1
- ninja==1.11.1.1
- numba==0.60.0
- numpy==1.26.4
- nvgpu==0.10.0
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
- nvidia-cuda-runtime-cu12==12.1.105
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.0.2.54
- nvidia-curand-cu12==10.3.2.106
- nvidia-cusolver-cu12==11.4.5.107
- nvidia-cusparse-cu12==12.1.0.106
- nvidia-ml-py==11.525.150
- nvidia-nccl-cu12==2.20.5
- nvidia-nvjitlink-cu12==12.1.105
- nvidia-nvtx-cu12==12.1.105
- openai==1.41.0
- opencv-python-headless==4.10.0.84
- openllm==0.6.10
- openllm-client==0.5.7
- openllm-core==0.5.7
- opentelemetry-api==1.20.0
- opentelemetry-instrumentation==0.41b0
- opentelemetry-instrumentation-aiohttp-client==0.41b0
- opentelemetry-instrumentation-asgi==0.41b0
- opentelemetry-sdk==1.20.0
- opentelemetry-semantic-conventions==0.41b0
- opentelemetry-util-http==0.41b0
- orjson==3.10.7
- outlines==0.0.46
- packaging==24.0
- pandas==2.2.2
- partial-json-parser==0.2.1.1.post4
- pathlib==1.0.1
- pathspec==0.12.1
- pfzy==0.3.4
- pillow==10.4.0
- pip-requirements-parser==32.0.1
- prometheus-client==0.20.0
- prometheus-fastapi-instrumentator==7.0.0
- prompt-toolkit==3.0.36
- protobuf==5.28.1
- psutil==5.9.8
- py-cpuinfo==9.0.0
- pyairports==2.1.1
- pyaml==24.7.0
- pyarrow==17.0.0
- pycountry==24.6.1
- pydantic==2.9.2
- pydantic-core==2.23.4
- pygments==2.18.0
- pynvml==11.5.0
- pyparsing==3.1.4
- python-dateutil==2.9.0.post0
- python-dotenv==1.0.1
- python-json-logger==2.0.7
- python-multipart==0.0.9
- pytz==2024.1
- pyyaml==6.0.1
- pyzmq==26.2.0
- questionary==2.0.1
- ray==2.36.0
- referencing==0.35.1
- regex==2024.4.16
- requests==2.32.3
- rich==13.8.1
- rpds-py==0.20.0
- safetensors==0.4.3
- schema==0.7.7
- scipy==1.14.1
- sentencepiece==0.2.0
- shellingham==1.5.4
- simple-di==0.1.5
- six==1.16.0
- sniffio==1.3.1
- starlette==0.38.5
- sympy==1.12
- tabulate==0.9.0
- termcolor==2.4.0
- tiktoken==0.7.0
- tokenizers==0.19.1
- tomli-w==1.0.0
- torch==2.4.1
- torch-model-archiver==0.10.0
- torchaudio==2.4.1
- torchserve==0.11.1
- torchvision==0.19.0
- tornado==6.4.1
- tqdm==4.66.5
- transformers==4.44.2
- triton==3.0.0
- typer==0.12.5
- types-python-dateutil==2.9.0.20240316
- typing-extensions==4.11.0
- tzdata==2024.1
- urllib3==2.2.1
- uv==0.4.11
- uvicorn==0.30.6
- uvloop==0.20.0
- vllm==0.6.1.post2
- vllm-flash-attn==2.6.1
- watchfiles==0.24.0
- wcwidth==0.2.13
- websockets==13.0.1
- werkzeug==3.0.2
- wrapt==1.16.0
- wsproto==1.2.0
- xformers==0.0.27.post2
- xxhash==3.5.0
- yarl==1.11.1
- zipp==3.20.2
prefix: /home/ubuntu/miniconda3/envs/pytorch
pip_packages
accelerate==0.34.1
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
aiosqlite==0.20.0
aniso8601==9.0.1
annotated-types==0.7.0
ansi2html==1.9.1
anyio==4.4.0
appdirs==1.4.4
arrow==1.3.0
asgiref==3.8.1
attrs==24.2.0
bentoml==1.3.5
bitsandbytes==0.43.3
blinker==1.7.0
cattrs==23.1.2
certifi==2024.2.2
charset-normalizer==3.3.2
circus==0.18.0
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
ctranslate2==4.1.0
cuda-python==12.6.0
datasets==3.0.0
deepmerge==2.0
deprecated==1.2.14
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
dulwich==0.22.1
einops==0.8.0
enum-compat==0.0.3
fastapi==0.115.0
fastcore==1.7.8
ffmpeg==1.4
filelock==3.13.4
flask==3.0.3
flask-restful==0.3.10
frozenlist==1.4.1
fs==2.4.16
fsspec==2024.3.1
gguf==0.9.1
ghapi==1.0.6
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.2
httpx-ws==0.6.0
huggingface-hub==0.24.6
idna==3.7
importlib-metadata==6.11.0
inflection==0.5.1
inquirerpy==0.3.4
interegular==0.3.3
itsdangerous==2.2.0
jinja2==3.1.2
jiter==0.5.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
lark==1.2.2
llvmlite==0.43.0
lm-format-enforcer==0.10.6
markdown-it-py==3.0.0
markupsafe==2.1.3
mdurl==0.1.2
mistral-common==1.4.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.18.6
multidict==6.1.0
multiprocess==0.70.16
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
numba==0.60.0
numpy==1.26.4
nvgpu==0.10.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.525.150
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.1.105
nvidia-nvtx-cu12==12.1.105
openai==1.41.0
opencv-python-headless==4.10.0.84
openllm==0.6.10
openllm-client==0.5.7
openllm-core==0.5.7
opentelemetry-api==1.20.0
opentelemetry-instrumentation==0.41b0
opentelemetry-instrumentation-aiohttp-client==0.41b0
opentelemetry-instrumentation-asgi==0.41b0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opentelemetry-util-http==0.41b0
orjson==3.10.7
outlines==0.0.46
packaging==24.0
pandas==2.2.2
partial-json-parser==0.2.1.1.post4
pathlib==1.0.1
pathspec==0.12.1
pfzy==0.3.4
pillow==10.4.0
pip==23.3.1
pip-requirements-parser==32.0.1
prometheus-client==0.20.0
prometheus-fastapi-instrumentator==7.0.0
prompt-toolkit==3.0.36
protobuf==5.28.1
psutil==5.9.8
py-cpuinfo==9.0.0
pyairports==2.1.1
pyaml==24.7.0
pyarrow==17.0.0
pycountry==24.6.1
pydantic==2.9.2
pydantic-core==2.23.4
pygments==2.18.0
pynvml==11.5.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
pyyaml==6.0.1
pyzmq==26.2.0
questionary==2.0.1
ray==2.36.0
referencing==0.35.1
regex==2024.4.16
requests==2.32.3
rich==13.8.1
rpds-py==0.20.0
safetensors==0.4.3
schema==0.7.7
scipy==1.14.1
sentencepiece==0.2.0
setuptools==68.2.2
shellingham==1.5.4
simple-di==0.1.5
six==1.16.0
sniffio==1.3.1
starlette==0.38.5
sympy==1.12
tabulate==0.9.0
termcolor==2.4.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli-w==1.0.0
torch==2.4.1
torch-model-archiver==0.10.0
torchaudio==2.4.1
torchserve==0.11.1
torchvision==0.19.0
tornado==6.4.1
tqdm==4.66.5
transformers==4.44.2
triton==3.0.0
typer==0.12.5
types-python-dateutil==2.9.0.20240316
typing-extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
uv==0.4.11
uvicorn==0.30.6
uvloop==0.20.0
vllm==0.6.1.post2
vllm-flash-attn==2.6.1
watchfiles==0.24.0
wcwidth==0.2.13
websockets==13.0.1
werkzeug==3.0.2
wheel==0.41.2
wrapt==1.16.0
wsproto==1.2.0
xformers==0.0.27.post2
xxhash==3.5.0
yarl==1.11.1
zipp==3.20.2
transformers
- transformers version: 4.44.2
- Platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
- Python version: 3.11.8
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.3
- Accelerate version: 0.34.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: Yes
- GPU type: NVIDIA L4
System information (Optional)
No response