I am trying to fine-tune LLaVA-NeXT 34B.

Hardware: 4x NVIDIA A100 80GB
Issue: The training process is OOMing. Am I missing something in setting up distributed training?
I am using the following training config:
```shell
export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eno1
export NCCL_DEBUG=INFO

LLM_VERSION="Qwen/Qwen1.5-32B-Chat"
# for 7b model we recommend bs=1, accum=2, 16 nodes, 128 gpus, lr=1e-5, warmup=0.03
# for 72b model we recommend bs=1, accum=1, 32 nodes, 256 gpus, lr=1e-5, warmup=0.03
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
#VISION_MODEL_VERSION="/home/ashutosh/.cache/huggingface/hub/models--google--siglip-so400m-patch14-384/snapshots/9fdffc58afc957d1a03a25b10dba0329ab15c2a3"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################
BASE_RUN_NAME="llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-34B-Instruct-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

############### Finetune ################
# Stage 2
PROMPT_VERSION="qwen_1_5"
RUN_NAME="llava-next-34b-qwen-aux-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-si_stage_am9-3-epoch-full-tune"
PREV_STAGE_CHECKPOINT="lmms-lab/llava-next-qwen-32b"
echo "PREV_STAGE_CHECKPOINT: ${PREV_STAGE_CHECKPOINT}"
echo "MID_RUN_NAME: ${RUN_NAME}"

RANK=${RANK:-0}
ADDR=${ADDR:-"127.0.0.1"}
PORT=${PORT:-"29502"}
NNODES=${NNODES:-4}
NUM_GPUS=${NUM_GPUS:-4}

ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
    llava/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path $PREV_STAGE_CHECKPOINT \
    --version $PROMPT_VERSION \
    --data_path="/path/to/data/llava_instruct/llava1_6mix.json" \
    --image_folder /path/to/data/llava_data \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --run_name $RUN_NAME \
    --output_dir /home/anamika/LLaVA-NeXT/output/$RUN_NAME \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 5 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --quant_type "nf4" \
    --lora_enable True \
    --lora_r 64 \
    --lora_alpha 16

exit 0;
```
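If ZeRO-3 partitioning alone does not fit on 4 GPUs, one common mitigation (a suggestion on my part, not something the report tried) is to swap `scripts/zero3.json` for a ZeRO-3 config with CPU offload of optimizer states and parameters. A minimal DeepSpeed config of this shape:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Offload trades GPU memory for host RAM and PCIe traffic, so expect slower steps; it is a way to get the run to fit, not to make it fast.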
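A back-of-the-envelope check (my own sketch, not part of the original report) of why the parameter-related state alone may not fit: the script defaults to `NNODES=4` x `NUM_GPUS=4` (16 GPUs), but the hardware listed is a single node with 4 GPUs. Under ZeRO-3, weights, gradients, and optimizer states are partitioned across all participating GPUs, so the per-GPU footprint scales with 1/world size:

```python
# Rough per-GPU memory for full fine-tuning under DeepSpeed ZeRO-3.
# Assumptions (mine, not from the script): bf16 weights and gradients
# (2 bytes each per parameter), Adam states kept in fp32 (~12 bytes per
# parameter: master copy + two moments), everything evenly partitioned.
# Activations and CUDA overhead are ignored, so this is a lower bound.

def zero3_params_gib(n_params: float, n_gpus: int) -> float:
    bytes_per_param = 2 + 2 + 12  # weights + grads + optimizer states
    return n_params * bytes_per_param / n_gpus / 2**30

# ~32B-parameter LLM on 4 GPUs vs. the intended 16 (4 nodes x 4 GPUs):
print(round(zero3_params_gib(32e9, 4)))   # -> 119 GiB, over an 80 GiB A100
print(round(zero3_params_gib(32e9, 16)))  # -> 30 GiB
```

So if the launch ends up running on only the 4 local GPUs, an OOM is expected before activations are even counted.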