Model warmup fails after adding Triton indexing kernels #2838

Open
YaserJaradeh opened this issue Dec 13, 2024 · 10 comments

@YaserJaradeh
System Info

I was using v2.3.1 via Docker and everything was working. When I updated to later versions, including the latest, TGI no longer starts due to an error:

2024-12-12T14:26:52.973549Z  INFO hf_hub: Token file not found "/data/token"    
2024-12-12T14:26:54.846408Z  INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
2024-12-12T14:26:54.846426Z  INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
2024-12-12T14:26:54.846433Z  INFO text_generation_launcher: Sharding model on 2 processes
2024-12-12T14:26:54.931439Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 30821
2024-12-12T14:26:54.931470Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-12T14:26:54.931727Z  INFO download: text_generation_launcher: Starting check and download process for microsoft/Phi-3.5-mini-instruct
2024-12-12T14:26:57.914690Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-12T14:26:58.250499Z  INFO download: text_generation_launcher: Successfully downloaded weights for microsoft/Phi-3.5-mini-instruct
2024-12-12T14:26:58.251011Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-12T14:26:58.251055Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-12-12T14:27:00.870304Z  INFO text_generation_launcher: Using prefix caching = False
2024-12-12T14:27:00.870362Z  INFO text_generation_launcher: Using Attention = flashdecoding
2024-12-12T14:27:06.425419Z  INFO text_generation_launcher: Using prefill chunking = True
2024-12-12T14:27:06.535239Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-12T14:27:06.536669Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-12-12T14:27:06.572585Z  INFO shard-manager: text_generation_launcher: Shard ready in 8.307980962s rank=0
2024-12-12T14:27:06.578046Z  INFO shard-manager: text_generation_launcher: Shard ready in 8.308372036s rank=1
2024-12-12T14:27:06.657793Z  INFO text_generation_launcher: Starting Webserver
2024-12-12T14:27:06.739409Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-12T14:27:06.863722Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-12T14:27:07.034243Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 132, in Warmup
    batch = self.model.batch_type.from_pb(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 495, in from_pb
    return cls.from_tokenized(pb, tokenizer, batch_tokenized_inputs, dtype, device)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 427, in from_tokenized
    block_tables_to_padded(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/metadata_kernels.py", line 42, in block_tables_to_padded
    triton_block_tables_to_padded[grid](
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/opt/conda/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpx2wgfsg0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpx2wgfsg0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpx2wgfsg0', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.034772Z ERROR warmup{max_input_length=None max_prefill_tokens=30821 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Command '['/usr/bin/gcc', '/tmp/tmpx2wgfsg0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpx2wgfsg0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpx2wgfsg0', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.078269Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 132, in Warmup
    batch = self.model.batch_type.from_pb(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 495, in from_pb
    return cls.from_tokenized(pb, tokenizer, batch_tokenized_inputs, dtype, device)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 427, in from_tokenized
    block_tables_to_padded(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/metadata_kernels.py", line 42, in block_tables_to_padded
    triton_block_tables_to_padded[grid](
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/opt/conda/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.078823Z ERROR warmup{max_input_length=None max_prefill_tokens=30821 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
Error: Backend(Warmup(Generation("Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.")))
2024-12-12T14:27:07.117285Z ERROR text_generation_launcher: Webserver Crashed
2024-12-12T14:27:07.117316Z  INFO text_generation_launcher: Shutting down shards
2024-12-12T14:27:07.173251Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-12T14:27:07.173312Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-12T14:27:07.178761Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-12-12T14:27:07.178820Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-12-12T14:27:08.279806Z  INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: WebserverFailed
2024-12-12T14:27:08.474404Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0

This is my nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:0D:00.0 Off |                    0 |
| N/A   54C    P0             30W /   72W |    1557MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      On  |   00000000:37:00.0 Off |                    0 |
| N/A   55C    P0             28W /   72W |   21989MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      On  |   00000000:4A:00.0 Off |                    0 |
| N/A   39C    P0             27W /   72W |   21659MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      On  |   00000000:61:00.0 Off |                    0 |
| N/A   37C    P0             27W /   72W |   19965MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA L4                      On  |   00000000:A0:00.0 Off |                    0 |
| N/A   46C    P8             17W /   72W |       4MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA L4                      On  |   00000000:B5:00.0 Off |                    0 |
| N/A   48C    P0             22W /   72W |     193MiB /  23034MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA L4                      On  |   00000000:CA:00.0 Off |                    0 |
| N/A   28C    P8             12W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA L4                      On  |   00000000:E1:00.0 Off |                    0 |
| N/A   26C    P8             12W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    137174      C   /app/.venv/bin/python                        1548MiB |
|    1   N/A  N/A     13513      C   /opt/conda/bin/python3.11                   21980MiB |
|    2   N/A  N/A     13518      C   /opt/conda/bin/python3.11                   21650MiB |
|    3   N/A  N/A     13523      C   /opt/conda/bin/python3.11                   19956MiB |
|    5   N/A  N/A   2150019      C   /opt/conda/bin/python3.11                     184MiB |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Here is the TGI env:

{
	model_id: "microsoft/Phi-3.5-mini-instruct",
	revision: None,
	validation_workers: 2,
	sharded: None,
	num_shard: Some(
	    2,
	),
	quantize: None,
	speculate: None,
	dtype: None,
	kv_cache_dtype: None,
	trust_remote_code: false,
	max_concurrent_requests: 128,
	max_best_of: 2,
	max_stop_sequences: 4,
	max_top_n_tokens: 5,
	max_input_tokens: None,
	max_input_length: None,
	max_total_tokens: None,
	waiting_served_ratio: 0.3,
	max_batch_prefill_tokens: None,
	max_batch_total_tokens: None,
	max_waiting_tokens: 20,
	max_batch_size: None,
	cuda_graphs: None,
	hostname: "06ee66ffa08d",
	port: 3000,
	shard_uds_path: "/tmp/text-generation-server",
	master_addr: "localhost",
	master_port: 29500,
	huggingface_hub_cache: None,
	weights_cache_override: None,
	disable_custom_kernels: false,
	cuda_memory_fraction: 1.0,
	rope_scaling: None,
	rope_factor: None,
	json_output: false,
	otlp_endpoint: None,
	otlp_service_name: "text-generation-inference.router",
	cors_allow_origin: [],
	api_key: None,
	watermark_gamma: None,
	watermark_delta: None,
	ngrok: false,
	ngrok_authtoken: None,
	ngrok_edge: None,
	tokenizer_config_path: None,
	disable_grammar_support: false,
	env: false,
	max_client_batch_size: 4,
	lora_adapters: None,
	usage_stats: On,
	payload_limit: 2000000,
	enable_prefill_logprobs: false,
}

And here is how I'm running the container (via podman):

podman create --name=tgi_container --security-opt label=disable --label io.podman.compose.config-hash=XXXXXXXX --label io.podman.compose.project=some-deployment --label io.podman.compose.version=1.0.6 --label PODMAN_SYSTEMD_UNIT=podman-compose@some-deployment.service --label com.docker.compose.project=some-deployment --label com.docker.compose.project.working_dir=/data/some-deployment --label com.docker.compose.project.config_files=docker-compose.yml --label com.docker.compose.container-number=1 --label com.docker.compose.service=tgi --device nvidia.com/gpu=4 --device nvidia.com/gpu=5 -e HUGGING_FACE_HUB_TOKEN=hf_XXXXXXX -e FLASH_DECODING=1 -e PREFILL_CHUNKING=1 -e NCCL_DEBUG=INFO -v /data/tgi/data:/data --net some-deployment_api --network-alias tgi --expose 3000 -p 3000:3000 --shm-size 10gb --restart on-failure ghcr.io/huggingface/text-generation-inference:3.0.1 --port 3000 --model-id microsoft/Phi-3.5-mini-instruct --num-shard 2

This command is generated on my system from a docker compose file.

Expected behavior

The TGI server should start correctly and normally, as it did before the Triton indexing kernels were added.

@KreshLaDoge

KreshLaDoge commented Dec 13, 2024

I have the same problem. I assume that you and the others who reported the issues below are using the Docker image, which now uses the Triton indexing kernels. Since loading them involves compiling C files that call into Python, but the Python headers are not available to the compiler, we get this error.

In simple terms, I assume that Python.h is not available when compiling these shared object files. After reviewing the Dockerfile, it appears that Python3.11-dev is not included in the final image, which is why Python.h is missing.

Just guessing, my "sure" value is about 0.6 😁🤷🏼‍♂️
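One way to check this hypothesis, assuming bash is available in the stock image, is to look for the header that the failing gcc command is told to include (it passes -I/opt/conda/include/python3.11):

podman run --rm --entrypoint bash \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  -c 'ls -l /opt/conda/include/python3.11/Python.h; gcc --version | head -n1'

If Python.h is missing from that directory, the Triton JIT build cannot succeed regardless of the GPU setup.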

It seems to be the same issue as the following issues:

@phymbert

It seems to be the same issue as the following issues:

#2835 is not related... it's about a GPU split going from 2 to 4 H100s, with no Python stack trace.

But thanks @KreshLaDoge

@YaserJaradeh
Author

Update: I was able to get it working by changing the base image to the devel variant so it matches the builder image, i.e. changing this line in the Dockerfile to FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base.

I have to rebuild the image, which takes time and increases the image size, but now it works!
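A rough sketch of that rebuild (assuming a local checkout of the repository and that a v3.0.1 release tag exists; the output tag name is arbitrary):

git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
git checkout v3.0.1
# Edit the Dockerfile so the runtime base stage reads:
#   FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
# then rebuild and point the compose file at the new tag.
podman build -t tgi:3.0.1-devel-base .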

@Narsil
Collaborator

Narsil commented Dec 16, 2024

I don't know how to reproduce the issue; Phi-3.5 works perfectly under 3.0.1.

Is everyone here using podman? I don't see why it should make any difference though.

@Narsil
Collaborator

Narsil commented Dec 16, 2024

Can everyone also confirm they are using 3.0.1 and not latest?

@YaserJaradeh
Author

Can everyone also confirm they are using 3.0.1 and not latest?

I can confirm the issue for me with 3.0.1, 3.0, and 2.4.0

@KreshLaDoge

I also had issues with 3.0.1

I suspect it's the missing Python.h, which would also explain why it worked for @YaserJaradeh after he changed the Ubuntu base image to the devel variant. But it could be something else.

Currently, I'm forced to assign GPUs to the container manually rather than through the NVIDIA Container Toolkit, so it might be related if others experiencing the same issue are using vGPUs, for example 🤷

@Narsil
Collaborator

Narsil commented Dec 16, 2024

Currently, I'm forced to assign GPUs to the container manually rather than through the NVIDIA Container Toolkit,

Can you elaborate? It might be a potential culprit.

@KreshLaDoge

Currently, I'm forced to assign GPUs to the container manually rather than through the NVIDIA Container Toolkit,

Can you elaborate? It might be a potential culprit.

Here is an example of my docker compose and the way we assign GPUs. Don't judge me; there are reasons why I can't use the container toolkit 🤷

Anyway, I doubt that anyone else experiencing this issue has a similar configuration.

(screenshot of the docker compose file showing the GPU assignment)

@YaserJaradeh
Author

I also assign the GPUs manually to my container
(screenshot: Screenshot 2024-12-16 at 9 10 36 PM)
