System Info
- Google Colab (Linux), 80GB A100 GPU
- transformers 4.57.0.dev0
- bitsandbytes 0.48.1
Reproduction
While fine-tuning Qwen3-VL-30B-A3B on Google Colab, I ran into something very strange: loading the model in 4-bit actually raises a VRAM OOM error, while loading it in plain bf16 runs fine.
Concretely, with the script below, passing quantization_config=bnb_config to from_pretrained causes the OOM; with that argument commented out (as shown), training works.
```python
import torch
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from qwen_vl_utils import process_vision_info

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# processor = AutoProcessor.from_pretrained("google/gemma-3-270m")
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # quantization_config=bnb_config,  # uncommenting this triggers the OOM
)

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.0,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

from trl import SFTConfig, SFTTrainer  # SFTTrainer import added; it was missing

training_args = SFTConfig(
    output_dir="qwen_anti_aesthetics_3b",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    max_length=None,
    optim="adamw_torch_fused",
    learning_rate=2e-5,
    weight_decay=0.001,
    logging_steps=10,
    eval_steps=500,
    logging_strategy="steps",
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    bf16=True,
    warmup_ratio=0.02,
    push_to_hub=True,
    report_to="wandb",
    remove_unused_columns=False,
    dataloader_num_workers=12,
    dataloader_prefetch_factor=4,
    dataloader_pin_memory=True,
    completion_only_loss=True,
    lr_scheduler_type="cosine",
)

# train_ds / test_ds are my own image-text datasets
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
```

Expected behavior
The 4-bit model should use less VRAM than the bf16 model, or at the very least not more.
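For reference, here is a minimal sketch of how I understand a 4-bit QLoRA load is normally set up (same model ID and BitsAndBytesConfig as in the reproduction; the prepare_model_for_kbit_training call is the standard peft helper for k-bit training and is included as an assumption, I have not verified that it avoids the OOM here):

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# The quantized load that triggers the OOM in my setup.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Usual k-bit preparation step before attaching LoRA adapters
# (casts norm layers, enables input grads for checkpointing).
# Assumption: this is the conventional QLoRA setup, not a confirmed fix.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```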