[BUG] Can't use lighteval to evaluate the nanotron #395

Open
alexchen4ai opened this issue Nov 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@alexchen4ai

alexchen4ai commented Nov 19, 2024

Describe the bug

lighteval nanotron --checkpoint_config_path ../nexatron/examples/tiny_llama3/checkpoints/100000/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/flash_attn/ops/triton/layer_norm.py:984: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/flash_attn/ops/triton/layer_norm.py:1043: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
WARNING:lighteval.logging.hierarchical_logger:main: (0, '../nexatron/examples/tiny_llama3_nanoset/checkpoints/100000/config.yaml'), (1, 'examples/nanotron/lighteval_config_override_template.yaml'), (2, '/data/.cache/huggingface'), {
WARNING:lighteval.logging.hierarchical_logger: Load nanotron config {
skip_unused_config_keys set
Skip_null_keys set
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.005991]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:00.006073]
Traceback (most recent call last):
File "/opt/anaconda3/envs/lighteval/bin/lighteval", line 8, in
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/lighteval/main.py", line 67, in cli_evaluate
main_nanotron(args.checkpoint_config_path, args.lighteval_config_path, args.cache_dir)
File "/data/alex_dev/lighteval/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/lighteval/main_nanotron.py", line 57, in main
model_config = get_config_from_file(
^^^^^^^^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/nanotron/src/nanotron/config/config.py", line 403, in get_config_from_file
config = get_config_from_dict(
^^^^^^^^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/nanotron/src/nanotron/config/config.py", line 364, in get_config_from_dict
return from_dict(
^^^^^^^^^^
File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 64, in from_dict
value = build_value(type=field_type, data=field_data, config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 99, in build_value
data = from_dict(data_class=type_, data=data, config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 58, in from_dict
raise UnexpectedDataError(keys=extra_fields)
dacite.exceptions.UnexpectedDataError: can not match "tp_recompute_allgather", "recompute_layer" to any data class field
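
For context, a minimal sketch of the failure mode (the dataclass below is an illustrative stand-in, not nanotron's real config class): nanotron parses the checkpoint YAML into dataclasses with dacite, and with strict matching any key that has no corresponding dataclass field raises UnexpectedDataError.

# Illustrative stand-in dataclass: if it lacks fields for keys present in the
# YAML, dacite's strict matching raises the same error seen in the traceback.
from dataclasses import dataclass

from dacite import Config, from_dict
from dacite.exceptions import UnexpectedDataError


@dataclass
class ParallelismSection:
    dp: int
    pp: int
    tp: int


data = {"dp": 2, "pp": 2, "tp": 2, "tp_recompute_allgather": True}

try:
    from_dict(data_class=ParallelismSection, data=data, config=Config(strict=True))
except UnexpectedDataError as err:
    print(err)  # can not match "tp_recompute_allgather" to any data class field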

To Reproduce

lighteval nanotron --checkpoint_config_path ../nanotron/examples/tiny_llama3/checkpoints/100000/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml

Version info

I am using the latest nanotron and lighteval, installed with pip install lighteval[nanotron]

@alexchen4ai alexchen4ai added the bug Something isn't working label Nov 19, 2024
@clefourrier
Member

Hi @alexchen4ai , thanks for the issue! Could you provide your config.yaml file?
From a first glance, this does not seem related to lighteval. @3outeille or @eliebak , is this an error you have already encountered? Did something change in nanotron's config recently?

@alexchen4ai
Author

alexchen4ai commented Nov 19, 2024

Thanks for the reply. This is the config of my current checkpoint:

checkpoints:
  checkpoint_interval: 100000
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder:
      - /dataset1
      dataset_weights:
      - 1
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage 1
  start_training_step: 1
- data:
    dataset:
      dataset_folder:
      - /dataset2
      dataset_weights:
      - 1
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage 2
  start_training_step: 1797727
general:
  benchmark_csv_path: null
  consumed_train_samples: 6400000
  ignore_sanity_checks: true
  project: llama3-tiny-training
  run: tiny_llama_debug
  seed: 42
  step: 100000
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 1
    hidden_act: silu
    hidden_size: 576
    initializer_range: 0.02
    intermediate_size: 1536
    is_llama_config: true
    max_position_embeddings: 2048
    num_attention_heads: 8
    num_hidden_layers: 30
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 100000
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0008
    lr_decay_starting_step: null
    lr_decay_steps: 4497000
    lr_decay_style: cosine
    lr_warmup_steps: 3000
    lr_warmup_style: linear
    min_decay_lr: 8.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  recompute_layer: false
  tp: 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 10
  micro_batch_size: 32
  sequence_length: 2048
  train_steps: 4500000
  val_check_interval: 10000

And here is the lighteval config:

batch_size: 8
generation: null
logging:
  output_dir: "outputs"
  save_details: false
  push_results_to_hub: false
  push_details_to_hub: false
  push_results_to_tensorboard: false
  public_run: false
  results_org: null
  tensorboard_metric_prefix: "eval"
parallelism:
  dp: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: false
  tp_mode: ALL_REDUCE
tasks:
  dataset_loading_processes: 8
  max_samples: 10
  multichoice_continuations_start_space: null
  num_fewshot_seeds: null
  tasks: leaderboard|hellaswag|0|0

@NathanHB
Member

NathanHB commented Nov 19, 2024

Hi @alexchen4ai, I would suggest installing nanotron from source.
tp_recompute_allgather and recompute_layer are not found in the config of your installed version of nanotron (version 0.4).
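
A quick way to confirm what the installed nanotron accepts is to inspect its parallelism config dataclass. This is a sketch; it assumes the class is exposed as nanotron.config.ParallelismArgs, which may differ between versions.

# List the fields the installed nanotron parallelism config actually accepts.
# The import path / class name is an assumption; adjust it to your install.
import dataclasses

from nanotron.config import ParallelismArgs

print(sorted(f.name for f in dataclasses.fields(ParallelismArgs)))
# If "recompute_layer" and "tp_recompute_allgather" are missing from the output,
# the installed release predates those options; installing nanotron from source,
# as suggested above, should provide config dataclasses that include them.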
