Does this project run on an RTX Titan? #14

Open · freemansoft opened this issue Jul 22, 2024 · 0 comments

freemansoft commented Jul 22, 2024

I've tried running this project on a Titan RTX on both Windows and Linux, and it fails to run on either operating system. The llama3-ChatQA-1.5-8B model should work on Tesla hardware because its torch_dtype is float16.
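For reference, the float16 claim can be checked directly from the published model config. A minimal sketch, assuming the transformers library is installed and the config for nvidia/Llama3-ChatQA-1.5-8B is reachable from the Hugging Face Hub (this is not part of the project code):

# Check the dtype declared in the model's config.json
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")
print(config.torch_dtype)  # expected: torch.float16, as noted in this issue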

System

  • GPU: Titan RTX
  • OS: Linux hp-z820 6.5.0-44-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Model: NVIDIA/llama3-ChatQA-1.5-8B
  • torch_dtype: float16

Steps

  1. Start RAG
  2. Enable the vector database
  3. Load the model
  4. Start the server (a smoke-test request against the running server is sketched below)
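Once polling returns status 200, a request like the following should exercise the model. This is a minimal smoke-test sketch, assuming the standard text-generation-inference /generate endpoint on the port shown in the launcher args below (9090); in this case the server never becomes healthy, so this point is never reached.

# Hypothetical smoke test against the TGI endpoint once the server is up.
import requests

resp = requests.post(
    "http://localhost:9090/generate",
    json={
        "inputs": "What is retrieval-augmented generation?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
print(resp.status_code, resp.json())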

Logs

2024-07-30T12:25:39.085749Z  INFO text_generation_launcher: Args {
    model_id: "nvidia/Llama3-ChatQA-1.5-8B",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        4000,
    ),
    max_total_tokens: Some(
        5000,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "project-hybrid-rag",
    port: 9090,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data/",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.85,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}

and, from a later run:

Polling inference server. Awaiting status 200; trying again in 5s. 
2024-08-04T02:44:46.723599Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4050
2024-08-04T02:44:46.723632Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-08-04T02:44:46.723644Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `nvidia/Llama3-ChatQA-1.5-8B` do not contain malicious code.
2024-08-04T02:44:46.723960Z  INFO download: text_generation_launcher: Starting check and download process for nvidia/Llama3-ChatQA-1.5-8B
2024-08-04T02:44:51.496829Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
Polling inference server. Awaiting status 200; trying again in 5s. 
2024-08-04T02:44:52.230949Z  INFO download: text_generation_launcher: Successfully downloaded weights for nvidia/Llama3-ChatQA-1.5-8B
2024-08-04T02:44:52.231629Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
Polling inference server. Awaiting status 200; trying again in 5s. 
2024-08-04T02:44:57.336095Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: System cpu doesn't support flash/paged attention
2024-08-04T02:44:57.834095Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/core.py", line 723, in main
    return _main(
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 109, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 274, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 229, in serve_inner
    model = get_model_with_lora_adapters(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 1149, in get_model_with_lora_adapters
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 733, in get_model
    return CausalLM.fallback(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 600, in fallback
    raise ValueError("quantization is not available on CPU")
ValueError: quantization is not available on CPU
2024-08-04T02:44:59.241700Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2024-08-04 02:44:54.845 | INFO     | text_generation_server.utils.import_utils:<module>:73 - Detected system cpu
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.
  warnings.warn("Could not import SGMV kernel from Punica, falling back to loop.")
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py:159: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py:232: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, dout):
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py:508: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py:567: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, dout, *args):
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py:109 in │
│ serve                                                                        │
│                                                                              │
│   106 │   │   raise RuntimeError(                                            │
│   107 │   │   │   "Only 1 can be set between `dtype` and `quantize`, as they │
│   108 │   │   )                                                              │
│ ❱ 109 │   server.serve(                                                      │
│   110 │   │   model_id,                                                      │
│   111 │   │   lora_adapters,                                                 │
│   112 │   │   revision,                                                      │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │             dtype = None                                                 │ │
│ │       json_output = True                                                 │ │
│ │      logger_level = 'INFO'                                               │ │
│ │     lora_adapters = []                                                   │ │
│ │  max_input_tokens = 4000                                                 │ │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B'                        │ │
│ │     otlp_endpoint = None                                                 │ │
│ │ otlp_service_name = 'text-generation-inference.router'                   │ │
│ │          quantize = 'bitsandbytes-nf4'                                   │ │
│ │          revision = None                                                 │ │
│ │            server = <module 'text_generation_server.server' from         │ │
│ │                     '/opt/conda/lib/python3.10/site-packages/text_gener… │ │
│ │     setup_tracing = <function setup_tracing at 0x7bd2d19a3130>           │ │
│ │           sharded = False                                                │ │
│ │         speculate = None                                                 │ │
│ │ trust_remote_code = True                                                 │ │
│ │          uds_path = PosixPath('/tmp/text-generation-server')             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/server.py:274 │
│ in serve                                                                     │
│                                                                              │
│   271 │   │   while signal_handler.KEEP_PROCESSING:                          │
│   272 │   │   │   await asyncio.sleep(0.5)                                   │
│   273 │                                                                      │
│ ❱ 274 │   asyncio.run(                                                       │
│   275 │   │   serve_inner(                                                   │
│   276 │   │   │   model_id,                                                  │
│   277 │   │   │   lora_adapters,                                             │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │             dtype = None                                                 │ │
│ │     lora_adapters = []                                                   │ │
│ │  max_input_tokens = 4000                                                 │ │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B'                        │ │
│ │          quantize = 'bitsandbytes-nf4'                                   │ │
│ │          revision = None                                                 │ │
│ │       serve_inner = <function serve.<locals>.serve_inner at              │ │
│ │                     0x7bd402694430>                                      │ │
│ │           sharded = False                                                │ │
│ │         speculate = None                                                 │ │
│ │ trust_remote_code = True                                                 │ │
│ │          uds_path = PosixPath('/tmp/text-generation-server')             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.10/asyncio/runners.py:44 in run                       │
│                                                                              │
│   41 │   │   events.set_event_loop(loop)                                     │
│   42 │   │   if debug is not None:                                           │
│   43 │   │   │   loop.set_debug(debug)                                       │
│ ❱ 44 │   │   return loop.run_until_complete(main)                            │
│   45 │   finally:                                                            │
│   46 │   │   try:                                                            │
│   47 │   │   │   _cancel_all_tasks(loop)                                     │
│                                                                              │
│ ╭──────────────────────────────── locals ─────────────────────────────────╮  │
│ │ debug = None                                                            │  │
│ │  loop = <_UnixSelectorEventLoop running=False closed=True debug=False>  │  │
│ │  main = <coroutine object serve.<locals>.serve_inner at 0x7bd2c4f1cb30> │  │
│ ╰─────────────────────────────────────────────────────────────────────────╯  │
│                                                                              │
│ /opt/conda/lib/python3.10/asyncio/base_events.py:649 in run_until_complete   │
│                                                                              │
│    646 │   │   if not future.done():                                         │
│    647 │   │   │   raise RuntimeError('Event loop stopped before Future comp │
│    648 │   │                                                                 │
│ ❱  649 │   │   return future.result()                                        │
│    650 │                                                                     │
│    651 │   def stop(self):                                                   │
│    652 │   │   """Stop running the event loop.                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │   future = <Task finished name='Task-1'                                  │ │
│ │            coro=<serve.<locals>.serve_inner() done, defined at           │ │
│ │            /opt/conda/lib/python3.10/site-packages/text_generation_serv… │ │
│ │            exception=ValueError('quantization is not available on CPU')> │ │
│ │ new_task = True                                                          │ │
│ │     self = <_UnixSelectorEventLoop running=False closed=True             │ │
│ │            debug=False>                                                  │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/server.py:229 │
│ in serve_inner                                                               │
│                                                                              │
│   226 │   │   │   server_urls = [local_url]                                  │
│   227 │   │                                                                  │
│   228 │   │   try:                                                           │
│ ❱ 229 │   │   │   model = get_model_with_lora_adapters(                      │
│   230 │   │   │   │   model_id,                                              │
│   231 │   │   │   │   lora_adapters,                                         │
│   232 │   │   │   │   revision,                                              │
│                                                                              │
│ ╭──────────────────────────── locals ─────────────────────────────╮          │
│ │     adapter_to_index = {}                                       │          │
│ │                dtype = None                                     │          │
│ │            local_url = 'unix:///tmp/text-generation-server-0'   │          │
│ │        lora_adapters = []                                       │          │
│ │     max_input_tokens = 4000                                     │          │
│ │             model_id = 'nvidia/Llama3-ChatQA-1.5-8B'            │          │
│ │             quantize = 'bitsandbytes-nf4'                       │          │
│ │             revision = None                                     │          │
│ │          server_urls = ['unix:///tmp/text-generation-server-0'] │          │
│ │              sharded = False                                    │          │
│ │            speculate = None                                     │          │
│ │    trust_remote_code = True                                     │          │
│ │             uds_path = PosixPath('/tmp/text-generation-server') │          │
│ │ unix_socket_template = 'unix://{}-{}'                           │          │
│ ╰─────────────────────────────────────────────────────────────────╯          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init │
│ __.py:1149 in get_model_with_lora_adapters                                   │
│                                                                              │
│   1146 │   adapter_to_index: Dict[str, int],                                 │
│   1147 ):                                                                    │
│   1148 │   lora_adapter_ids = [adapter.id for adapter in lora_adapters]      │
│ ❱ 1149 │   model = get_model(                                                │
│   1150 │   │   model_id,                                                     │
│   1151 │   │   lora_adapter_ids,                                             │
│   1152 │   │   revision,                                                     │
│                                                                              │
│ ╭───────────────────── locals ──────────────────────╮                        │
│ │  adapter_to_index = {}                            │                        │
│ │             dtype = None                          │                        │
│ │  lora_adapter_ids = []                            │                        │
│ │     lora_adapters = []                            │                        │
│ │  max_input_tokens = 4000                          │                        │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B' │                        │
│ │          quantize = 'bitsandbytes-nf4'            │                        │
│ │          revision = None                          │                        │
│ │           sharded = False                         │                        │
│ │         speculate = None                          │                        │
│ │ trust_remote_code = True                          │                        │
│ ╰───────────────────────────────────────────────────╯                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init │
│ __.py:733 in get_model                                                       │
│                                                                              │
│    730 │   │   elif sharded:                                                 │
│    731 │   │   │   raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format( │
│    732 │   │   else:                                                         │
│ ❱  733 │   │   │   return CausalLM.fallback(                                 │
│    734 │   │   │   │   model_id,                                             │
│    735 │   │   │   │   revision,                                             │
│    736 │   │   │   │   quantize=quantize,                                    │
│                                                                              │
│ ╭───────────────────────────── locals ─────────────────────────────╮         │
│ │                   _ = {}                                         │         │
│ │         config_dict = {                                          │         │
│ │                       │   'architectures': [                     │         │
│ │                       │   │   'LlamaForCausalLM'                 │         │
│ │                       │   ],                                     │         │
│ │                       │   'attention_bias': False,               │         │
│ │                       │   'attention_dropout': 0.0,              │         │
│ │                       │   'bos_token_id': 128000,                │         │
│ │                       │   'eos_token_id': 128001,                │         │
│ │                       │   'hidden_act': 'silu',                  │         │
│ │                       │   'hidden_size': 4096,                   │         │
│ │                       │   'initializer_range': 0.02,             │         │
│ │                       │   'intermediate_size': 14336,            │         │
│ │                       │   'max_position_embeddings': 8192,       │         │
│ │                       │   ... +14                                │         │
│ │                       }                                          │         │
│ │               dtype = None                                       │         │
│ │    lora_adapter_ids = []                                         │         │
│ │    max_input_tokens = 4000                                       │         │
│ │              method = 'n-gram'                                   │         │
│ │            model_id = 'nvidia/Llama3-ChatQA-1.5-8B'              │         │
│ │          model_type = 'llama'                                    │         │
│ │ quantization_config = None                                       │         │
│ │            quantize = 'bitsandbytes-nf4'                         │         │
│ │            revision = None                                       │         │
│ │             sharded = False                                      │         │
│ │      sliding_window = -1                                         │         │
│ │           speculate = 0                                          │         │
│ │          speculator = None                                       │         │
│ │   trust_remote_code = True                                       │         │
│ ╰──────────────────────────────────────────────────────────────────╯         │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal │
│ _lm.py:600 in fallback                                                       │
│                                                                              │
│   597 │   │   │   dtype = torch.float16 if dtype is None else dtype          │
│   598 │   │   else:                                                          │
│   599 │   │   │   if quantize:                                               │
│ ❱ 600 │   │   │   │   raise ValueError("quantization is not available on CPU │
│   601 │   │   │                                                              │
│   602 │   │   │   device = torch.device("cpu")                               │
│   603 │   │   │   dtype = torch.float32 if dtype is None else dtype          │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │               cls = <class                                               │ │
│ │                     'text_generation_server.models.causal_lm.CausalLM'>  │ │
│ │             dtype = None                                                 │ │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B'                        │ │
│ │          quantize = 'bitsandbytes-nf4'                                   │ │
│ │          revision = None                                                 │ │
│ │        speculator = None                                                 │ │
│ │ trust_remote_code = True                                                 │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: quantization is not available on CPU rank=0
Error: ShardCannotStart
2024-08-04T02:44:59.336935Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-04T02:44:59.336981Z  INFO text_generation_launcher: Shutting down shards
Polling
...
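The failure pattern in these logs is consistent: TGI reports "Detected system cpu", bitsandbytes loads without GPU support, and the server then raises ValueError("quantization is not available on CPU") because quantize='bitsandbytes-nf4' is still requested. A quick way to confirm whether the GPU is visible from inside the environment that runs text-generation-server; a minimal sketch, assuming the same PyTorch install TGI uses:

# GPU-visibility check from inside the TGI container/environment.
import torch

print(torch.cuda.is_available())           # False would explain "Detected system cpu"
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA TITAN RTX" if the Titan is visible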
