Does this project run on an RTX Titan? #14

Open · freemansoft opened this issue Jul 22, 2024 · 0 comments

freemansoft commented Jul 22, 2024

I've tried running this project on a Titan RTX on both Windows and Linux, and it fails to run on either operating system. The llama3-ChatQA-1.5-8B model should work on Tesla hardware because its torch_dtype is float16.
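For reference, the float16 claim can be checked directly from the published model config. A minimal sketch, assuming the transformers library is installed and the config for nvidia/Llama3-ChatQA-1.5-8B is reachable from the Hugging Face Hub (this is not part of the project code):

# Check the dtype declared in the model's config.json
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")
print(config.torch_dtype)  # expected: torch.float16, as noted in this issue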

System

  • GPU: Titan RTX
  • OS: Linux hp-z820 6.5.0-44-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Model: NVIDIA/llama3-ChatQA-1.5-8B
  • torch_dtype: float16

Steps

  1. Start RAG
  2. Enable the vector database
  3. Load the model
  4. Start the server (a smoke-test request against the running server is sketched below)
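Once polling returns status 200, a request like the following should exercise the model. This is a minimal smoke-test sketch, assuming the standard text-generation-inference /generate endpoint on the port shown in the launcher args below (9090); in this case the server never becomes healthy, so this point is never reached.

# Hypothetical smoke test against the TGI endpoint once the server is up.
import requests

resp = requests.post(
    "http://localhost:9090/generate",
    json={
        "inputs": "What is retrieval-augmented generation?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
print(resp.status_code, resp.json())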

Logs

2024-07-30T12:25:39.085749Z  INFO text_generation_launcher: Args {
    model_id: "nvidia/Llama3-ChatQA-1.5-8B",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        4000,
    ),
    max_total_tokens: Some(
        5000,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "project-hybrid-rag",
    port: 9090,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data/",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.85,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}

and, from a later run:

Polling inference server. Awaiting status 200; trying again in 5s. 
2024-08-04T02:44:46.723599Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4050
2024-08-04T02:44:46.723632Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-08-04T02:44:46.723644Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `nvidia/Llama3-ChatQA-1.5-8B` do not contain malicious code.
2024-08-04T02:44:46.723960Z  INFO download: text_generation_launcher: Starting check and download process for nvidia/Llama3-ChatQA-1.5-8B
2024-08-04T02:44:51.496829Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
Polling inference server. Awaiting status 200; trying again in 5s. 
2024-08-04T02:44:52.230949Z  INFO download: text_generation_launcher: Successfully downloaded weights for nvidia/Llama3-ChatQA-1.5-8B
2024-08-04T02:44:52.231629Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
Polling inference server. Awaiting status 200; trying again in 5s. 
2024-08-04T02:44:57.336095Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: System cpu doesn't support flash/paged attention
2024-08-04T02:44:57.834095Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/core.py", line 723, in main
    return _main(
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/workbench/.local/lib/python3.10/site-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 109, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 274, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 229, in serve_inner
    model = get_model_with_lora_adapters(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 1149, in get_model_with_lora_adapters
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 733, in get_model
    return CausalLM.fallback(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 600, in fallback
    raise ValueError("quantization is not available on CPU")
ValueError: quantization is not available on CPU
2024-08-04T02:44:59.241700Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2024-08-04 02:44:54.845 | INFO     | text_generation_server.utils.import_utils:<module>:73 - Detected system cpu
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.
  warnings.warn("Could not import SGMV kernel from Punica, falling back to loop.")
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py:159: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py:232: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, dout):
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py:508: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/conda/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py:567: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, dout, *args):
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py:109 in │
│ serve                                                                        │
│                                                                              │
│   106 │   │   raise RuntimeError(                                            │
│   107 │   │   │   "Only 1 can be set between `dtype` and `quantize`, as they │
│   108 │   │   )                                                              │
│ ❱ 109 │   server.serve(                                                      │
│   110 │   │   model_id,                                                      │
│   111 │   │   lora_adapters,                                                 │
│   112 │   │   revision,                                                      │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │             dtype = None                                                 │ │
│ │       json_output = True                                                 │ │
│ │      logger_level = 'INFO'                                               │ │
│ │     lora_adapters = []                                                   │ │
│ │  max_input_tokens = 4000                                                 │ │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B'                        │ │
│ │     otlp_endpoint = None                                                 │ │
│ │ otlp_service_name = 'text-generation-inference.router'                   │ │
│ │          quantize = 'bitsandbytes-nf4'                                   │ │
│ │          revision = None                                                 │ │
│ │            server = <module 'text_generation_server.server' from         │ │
│ │                     '/opt/conda/lib/python3.10/site-packages/text_gener… │ │
│ │     setup_tracing = <function setup_tracing at 0x7bd2d19a3130>           │ │
│ │           sharded = False                                                │ │
│ │         speculate = None                                                 │ │
│ │ trust_remote_code = True                                                 │ │
│ │          uds_path = PosixPath('/tmp/text-generation-server')             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/server.py:274 │
│ in serve                                                                     │
│                                                                              │
│   271 │   │   while signal_handler.KEEP_PROCESSING:                          │
│   272 │   │   │   await asyncio.sleep(0.5)                                   │
│   273 │                                                                      │
│ ❱ 274 │   asyncio.run(                                                       │
│   275 │   │   serve_inner(                                                   │
│   276 │   │   │   model_id,                                                  │
│   277 │   │   │   lora_adapters,                                             │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │             dtype = None                                                 │ │
│ │     lora_adapters = []                                                   │ │
│ │  max_input_tokens = 4000                                                 │ │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B'                        │ │
│ │          quantize = 'bitsandbytes-nf4'                                   │ │
│ │          revision = None                                                 │ │
│ │       serve_inner = <function serve.<locals>.serve_inner at              │ │
│ │                     0x7bd402694430>                                      │ │
│ │           sharded = False                                                │ │
│ │         speculate = None                                                 │ │
│ │ trust_remote_code = True                                                 │ │
│ │          uds_path = PosixPath('/tmp/text-generation-server')             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.10/asyncio/runners.py:44 in run                       │
│                                                                              │
│   41 │   │   events.set_event_loop(loop)                                     │
│   42 │   │   if debug is not None:                                           │
│   43 │   │   │   loop.set_debug(debug)                                       │
│ ❱ 44 │   │   return loop.run_until_complete(main)                            │
│   45 │   finally:                                                            │
│   46 │   │   try:                                                            │
│   47 │   │   │   _cancel_all_tasks(loop)                                     │
│                                                                              │
│ ╭──────────────────────────────── locals ─────────────────────────────────╮  │
│ │ debug = None                                                            │  │
│ │  loop = <_UnixSelectorEventLoop running=False closed=True debug=False>  │  │
│ │  main = <coroutine object serve.<locals>.serve_inner at 0x7bd2c4f1cb30> │  │
│ ╰─────────────────────────────────────────────────────────────────────────╯  │
│                                                                              │
│ /opt/conda/lib/python3.10/asyncio/base_events.py:649 in run_until_complete   │
│                                                                              │
│    646 │   │   if not future.done():                                         │
│    647 │   │   │   raise RuntimeError('Event loop stopped before Future comp │
│    648 │   │                                                                 │
│ ❱  649 │   │   return future.result()                                        │
│    650 │                                                                     │
│    651 │   def stop(self):                                                   │
│    652 │   │   """Stop running the event loop.                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │   future = <Task finished name='Task-1'                                  │ │
│ │            coro=<serve.<locals>.serve_inner() done, defined at           │ │
│ │            /opt/conda/lib/python3.10/site-packages/text_generation_serv… │ │
│ │            exception=ValueError('quantization is not available on CPU')> │ │
│ │ new_task = True                                                          │ │
│ │     self = <_UnixSelectorEventLoop running=False closed=True             │ │
│ │            debug=False>                                                  │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/server.py:229 │
│ in serve_inner                                                               │
│                                                                              │
│   226 │   │   │   server_urls = [local_url]                                  │
│   227 │   │                                                                  │
│   228 │   │   try:                                                           │
│ ❱ 229 │   │   │   model = get_model_with_lora_adapters(                      │
│   230 │   │   │   │   model_id,                                              │
│   231 │   │   │   │   lora_adapters,                                         │
│   232 │   │   │   │   revision,                                              │
│                                                                              │
│ ╭──────────────────────────── locals ─────────────────────────────╮          │
│ │     adapter_to_index = {}                                       │          │
│ │                dtype = None                                     │          │
│ │            local_url = 'unix:///tmp/text-generation-server-0'   │          │
│ │        lora_adapters = []                                       │          │
│ │     max_input_tokens = 4000                                     │          │
│ │             model_id = 'nvidia/Llama3-ChatQA-1.5-8B'            │          │
│ │             quantize = 'bitsandbytes-nf4'                       │          │
│ │             revision = None                                     │          │
│ │          server_urls = ['unix:///tmp/text-generation-server-0'] │          │
│ │              sharded = False                                    │          │
│ │            speculate = None                                     │          │
│ │    trust_remote_code = True                                     │          │
│ │             uds_path = PosixPath('/tmp/text-generation-server') │          │
│ │ unix_socket_template = 'unix://{}-{}'                           │          │
│ ╰─────────────────────────────────────────────────────────────────╯          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init │
│ __.py:1149 in get_model_with_lora_adapters                                   │
│                                                                              │
│   1146 │   adapter_to_index: Dict[str, int],                                 │
│   1147 ):                                                                    │
│   1148 │   lora_adapter_ids = [adapter.id for adapter in lora_adapters]      │
│ ❱ 1149 │   model = get_model(                                                │
│   1150 │   │   model_id,                                                     │
│   1151 │   │   lora_adapter_ids,                                             │
│   1152 │   │   revision,                                                     │
│                                                                              │
│ ╭───────────────────── locals ──────────────────────╮                        │
│ │  adapter_to_index = {}                            │                        │
│ │             dtype = None                          │                        │
│ │  lora_adapter_ids = []                            │                        │
│ │     lora_adapters = []                            │                        │
│ │  max_input_tokens = 4000                          │                        │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B' │                        │
│ │          quantize = 'bitsandbytes-nf4'            │                        │
│ │          revision = None                          │                        │
│ │           sharded = False                         │                        │
│ │         speculate = None                          │                        │
│ │ trust_remote_code = True                          │                        │
│ ╰───────────────────────────────────────────────────╯                        │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init │
│ __.py:733 in get_model                                                       │
│                                                                              │
│    730 │   │   elif sharded:                                                 │
│    731 │   │   │   raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format( │
│    732 │   │   else:                                                         │
│ ❱  733 │   │   │   return CausalLM.fallback(                                 │
│    734 │   │   │   │   model_id,                                             │
│    735 │   │   │   │   revision,                                             │
│    736 │   │   │   │   quantize=quantize,                                    │
│                                                                              │
│ ╭───────────────────────────── locals ─────────────────────────────╮         │
│ │                   _ = {}                                         │         │
│ │         config_dict = {                                          │         │
│ │                       │   'architectures': [                     │         │
│ │                       │   │   'LlamaForCausalLM'                 │         │
│ │                       │   ],                                     │         │
│ │                       │   'attention_bias': False,               │         │
│ │                       │   'attention_dropout': 0.0,              │         │
│ │                       │   'bos_token_id': 128000,                │         │
│ │                       │   'eos_token_id': 128001,                │         │
│ │                       │   'hidden_act': 'silu',                  │         │
│ │                       │   'hidden_size': 4096,                   │         │
│ │                       │   'initializer_range': 0.02,             │         │
│ │                       │   'intermediate_size': 14336,            │         │
│ │                       │   'max_position_embeddings': 8192,       │         │
│ │                       │   ... +14                                │         │
│ │                       }                                          │         │
│ │               dtype = None                                       │         │
│ │    lora_adapter_ids = []                                         │         │
│ │    max_input_tokens = 4000                                       │         │
│ │              method = 'n-gram'                                   │         │
│ │            model_id = 'nvidia/Llama3-ChatQA-1.5-8B'              │         │
│ │          model_type = 'llama'                                    │         │
│ │ quantization_config = None                                       │         │
│ │            quantize = 'bitsandbytes-nf4'                         │         │
│ │            revision = None                                       │         │
│ │             sharded = False                                      │         │
│ │      sliding_window = -1                                         │         │
│ │           speculate = 0                                          │         │
│ │          speculator = None                                       │         │
│ │   trust_remote_code = True                                       │         │
│ ╰──────────────────────────────────────────────────────────────────╯         │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal │
│ _lm.py:600 in fallback                                                       │
│                                                                              │
│   597 │   │   │   dtype = torch.float16 if dtype is None else dtype          │
│   598 │   │   else:                                                          │
│   599 │   │   │   if quantize:                                               │
│ ❱ 600 │   │   │   │   raise ValueError("quantization is not available on CPU │
│   601 │   │   │                                                              │
│   602 │   │   │   device = torch.device("cpu")                               │
│   603 │   │   │   dtype = torch.float32 if dtype is None else dtype          │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │               cls = <class                                               │ │
│ │                     'text_generation_server.models.causal_lm.CausalLM'>  │ │
│ │             dtype = None                                                 │ │
│ │          model_id = 'nvidia/Llama3-ChatQA-1.5-8B'                        │ │
│ │          quantize = 'bitsandbytes-nf4'                                   │ │
│ │          revision = None                                                 │ │
│ │        speculator = None                                                 │ │
│ │ trust_remote_code = True                                                 │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: quantization is not available on CPU rank=0
Error: ShardCannotStart
2024-08-04T02:44:59.336935Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-04T02:44:59.336981Z  INFO text_generation_launcher: Shutting down shards
Polling
...
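The failure pattern in these logs is consistent: TGI reports "Detected system cpu", bitsandbytes loads without GPU support, and the server then raises ValueError("quantization is not available on CPU") because quantize='bitsandbytes-nf4' is still requested. A quick way to confirm whether the GPU is visible from inside the environment that runs text-generation-server; a minimal sketch, assuming the same PyTorch install TGI uses:

# GPU-visibility check from inside the TGI container/environment.
import torch

print(torch.cuda.is_available())           # False would explain "Detected system cpu"
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA TITAN RTX" if the Titan is visible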
