
Cautious Optimizers (C-Optim): Improving Training with One Line of Code

AdamW has long been the go-to optimizer for transformer pretraining. For years, the research community has searched for faster and more stable optimizers, with only limited success. In this work, we introduce a simple, single-line modification in PyTorch for any momentum-based optimizer. This modification, termed the Cautious Optimizer (e.g., C-AdamW and C-Lion), opens the door to improved training performance.

Our theoretical findings reveal that this modification preserves Adam’s Hamiltonian function and retains its convergence guarantees under Lyapunov analysis. Additionally, a new family of optimizers emerges from this insight. Among these, we select the simplest for empirical experiments, achieving up to 1.47x speed-up on Llama and MAE pretraining.
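Concretely, in our own notation (summarizing the description above; the exact rescaling used in the code may differ), a cautious optimizer takes the update $u_t$ proposed by the base optimizer and applies only the coordinates whose sign agrees with the current gradient $g_t$:

$$ w_{t+1} = w_t - \eta_t \, \big( u_t \odot \mathbb{1}[\, u_t \odot g_t > 0 \,] \big) $$

where $\odot$ is the elementwise product and $\mathbb{1}[\cdot]$ the elementwise indicator. In practice the masked update is also rescaled (e.g., by the inverse fraction of surviving coordinates) so the step size is not systematically shrunk.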


🌟 News


🚀 Implementation

Generic Single-Line Implementation for C-Optim

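As a rough sketch of the idea (our own helper name, variable names, and normalization, not necessarily the exact code used in this repo), the cautious mask can be applied to any precomputed update like this in PyTorch:

import torch

def cautious(update: torch.Tensor, grad: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # Keep only the coordinates where the proposed update and the current
    # gradient agree in sign; zero out the rest.
    mask = (update * grad > 0).to(grad.dtype)
    # Rescale the surviving coordinates so the average step size is preserved
    # (one possible normalization; the repo's exact scaling may differ).
    mask = mask / mask.mean().clamp(min=eps)
    return update * mask

# Illustrative use inside a hand-rolled Adam-style step (names are ours):
#   u = exp_avg / (exp_avg_sq.sqrt() + 1e-8)   # base optimizer's update direction
#   p.data.add_(cautious(u, p.grad), alpha=-lr)

Because the mask depends only on the already-computed update and gradient, the same wrapper applies to any momentum-based optimizer (AdamW, Lion, etc.), which is what makes the change effectively one line.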

Pretraining Results


Post Training Results


PPO


📦 Installation

Install Experiment Dependencies

pip install -r requirements.txt

🛠️ Usage

Pretraining Llama on C4

torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 0.001 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 10000 \
    --warmup_steps 1000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer c-adamw \
    --max_length 1024

Pretraining MAE on ImageNet 1K (50 Epochs)

torchrun --standalone --nproc_per_node 4 run_mae.py \
    --dataset_name ILSVRC/imagenet-1k \
    --output_dir ./vit-mae-c \
    --remove_unused_columns False \
    --label_names pixel_values \
    --mask_ratio 0.75 \
    --norm_pix_loss \
    --do_train \
    --do_eval \
    --base_learning_rate 1.5e-4 \
    --lr_scheduler_type cosine \
    --weight_decay 0.05 \
    --num_train_epochs 50 \
    --warmup_ratio 0.05 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 8 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --save_total_limit 3 \
    --seed 1337 \
    --custom_optim c-adamw \
    --trust_remote_code \
    --gradient_accumulation_steps 4

Post Training Qwen2.5

torchrun \
    --rdzv_id=$JOB_ID \
    --rdzv-backend=c10d \
    --nnodes=1:8 \
    --nproc-per-node=1 \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    post_training.py \
    --model "Qwen/Qwen2.5-1.5B-Instruct" \
    --output_dir cautious_1.5b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --max_length 8192 \
    --cautious

PPO

accelerate launch ppo_tldr.py \
    --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
    --dataset_test_split validation \
    --output_dir models/minimal/ppo_tldr \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --missing_eos_penalty 1.0 \
    --stop_token eos \
    --eval_strategy steps \
    --eval_steps 100 \
    --custom_optim c_adamw \
    --num_gpus 8

📖 Citation

@article{liang2024cautious,
  title={Cautious optimizers: Improving training with one line of code},
  author={Liang, Kaizhao and Chen, Lizhang and Liu, Bo and Liu, Qiang},
  journal={arXiv preprint arXiv:2411.16085},
  year={2024}
}

About

When it comes to optimizers, it's always better to be safe than sorry
