[BUG] Can't use lighteval to evaluate the nanotron #395

Open
alexchen4ai opened this issue Nov 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@alexchen4ai

alexchen4ai commented Nov 19, 2024

Describe the bug

lighteval nanotron --checkpoint_config_path ../nexatron/examples/tiny_llama3/checkpoints/100000/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/flash_attn/ops/triton/layer_norm.py:984: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/flash_attn/ops/triton/layer_norm.py:1043: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
WARNING:lighteval.logging.hierarchical_logger:main: (0, '../nexatron/examples/tiny_llama3_nanoset/checkpoints/100000/config.yaml'), (1, 'examples/nanotron/lighteval_config_override_template.yaml'), (2, '/data/.cache/huggingface'), {
WARNING:lighteval.logging.hierarchical_logger: Load nanotron config {
skip_unused_config_keys set
Skip_null_keys set
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.005991]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:00.006073]
Traceback (most recent call last):
File "/opt/anaconda3/envs/lighteval/bin/lighteval", line 8, in
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/lighteval/main.py", line 67, in cli_evaluate
main_nanotron(args.checkpoint_config_path, args.lighteval_config_path, args.cache_dir)
File "/data/alex_dev/lighteval/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/lighteval/main_nanotron.py", line 57, in main
model_config = get_config_from_file(
^^^^^^^^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/nanotron/src/nanotron/config/config.py", line 403, in get_config_from_file
config = get_config_from_dict(
^^^^^^^^^^^^^^^^^^^^^
File "/data/alex_dev/lighteval/src/nanotron/src/nanotron/config/config.py", line 364, in get_config_from_dict
return from_dict(
^^^^^^^^^^
File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 64, in from_dict
value = build_value(type=field_type, data=field_data, config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 99, in build_value
data = from_dict(data_class=type_, data=data, config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 58, in from_dict
raise UnexpectedDataError(keys=extra_fields)
dacite.exceptions.UnexpectedDataError: can not match "tp_recompute_allgather", "recompute_layer" to any data class field
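
For context, a minimal sketch of the failure mode (the dataclass below is an illustrative stand-in, not nanotron's real config class): nanotron parses the checkpoint YAML into dataclasses with dacite, and with strict matching any key that has no corresponding dataclass field raises UnexpectedDataError.

# Illustrative stand-in dataclass: if it lacks fields for keys present in the
# YAML, dacite's strict matching raises the same error seen in the traceback.
from dataclasses import dataclass

from dacite import Config, from_dict
from dacite.exceptions import UnexpectedDataError


@dataclass
class ParallelismSection:
    dp: int
    pp: int
    tp: int


data = {"dp": 2, "pp": 2, "tp": 2, "tp_recompute_allgather": True}

try:
    from_dict(data_class=ParallelismSection, data=data, config=Config(strict=True))
except UnexpectedDataError as err:
    print(err)  # can not match "tp_recompute_allgather" to any data class field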

To Reproduce

lighteval nanotron --checkpoint_config_path ../nanotron/examples/tiny_llama3/checkpoints/100000/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml

Version info

I am using the latest nanotron and lighteval, installed with pip install lighteval[nanotron]

@alexchen4ai alexchen4ai added the bug Something isn't working label Nov 19, 2024
@clefourrier
Member

Hi @alexchen4ai , thanks for the issue! Could you provide your config.yaml file?
From a first glance, this does not seem related to lighteval. @3outeille or @eliebak , is this an error you have already encountered? Did something change in nanotron's config recently?

@alexchen4ai
Author

alexchen4ai commented Nov 19, 2024

Thanks for the reply. This is the config of my current checkpoint:

checkpoints:
  checkpoint_interval: 100000
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder:
      - /dataset1
      dataset_weights:
      - 1
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage 1
  start_training_step: 1
- data:
    dataset:
      dataset_folder:
      - /dataset2
      dataset_weights:
      - 1
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage 2
  start_training_step: 1797727
general:
  benchmark_csv_path: null
  consumed_train_samples: 6400000
  ignore_sanity_checks: true
  project: llama3-tiny-training
  run: tiny_llama_debug
  seed: 42
  step: 100000
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 1
    hidden_act: silu
    hidden_size: 576
    initializer_range: 0.02
    intermediate_size: 1536
    is_llama_config: true
    max_position_embeddings: 2048
    num_attention_heads: 8
    num_hidden_layers: 30
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 100000
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0008
    lr_decay_starting_step: null
    lr_decay_steps: 4497000
    lr_decay_style: cosine
    lr_warmup_steps: 3000
    lr_warmup_style: linear
    min_decay_lr: 8.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  recompute_layer: false
  tp: 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 10
  micro_batch_size: 32
  sequence_length: 2048
  train_steps: 4500000
  val_check_interval: 10000

And here is the lighteval config:

batch_size: 8
generation: null
logging:
  output_dir: "outputs"
  save_details: false
  push_results_to_hub: false
  push_details_to_hub: false
  push_results_to_tensorboard: false
  public_run: false
  results_org: null
  tensorboard_metric_prefix: "eval"
parallelism:
  dp: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: false
  tp_mode: ALL_REDUCE
tasks:
  dataset_loading_processes: 8
  max_samples: 10
  multichoice_continuations_start_space: null
  num_fewshot_seeds: null
  tasks: leaderboard|hellaswag|0|0

@NathanHB
Member

NathanHB commented Nov 19, 2024

Hi @alexchen4ai, I would suggest installing nanotron from source.
tp_recompute_allgather and recompute_layer are not found in the config of your installed version of nanotron (version 0.4).
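
A quick way to confirm what the installed nanotron accepts is to inspect its parallelism config dataclass. This is a sketch; it assumes the class is exposed as nanotron.config.ParallelismArgs, which may differ between versions.

# List the fields the installed nanotron parallelism config actually accepts.
# The import path / class name is an assumption; adjust it to your install.
import dataclasses

from nanotron.config import ParallelismArgs

print(sorted(f.name for f in dataclasses.fields(ParallelismArgs)))
# If "recompute_layer" and "tp_recompute_allgather" are missing from the output,
# the installed release predates those options; installing nanotron from source,
# as suggested above, should provide config dataclasses that include them.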
