This is the official implementation of the paper: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
If you would like to build on top of this project, refer to sglang_soft_thinking_pkg/README.md, or review the differences from SGLang v0.4.6.post1 in sglang_soft_thinking_pkg/change_0.4.6.post1.diff.
Our implementation now includes support for Dirichlet and Gumbel-Softmax noise in Soft Thinking sampling, as detailed in the study LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking. For more details, see scripts/st/qwq32b_gumble.sh.
Relevant parameters:
```bash
--add_noise_gumbel_softmax \
--gumbel_softmax_temperature 0.5 \
--add_noise_dirichlet \
--dirichlet_temperature 1.0 \
```
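For intuition, here is a minimal sketch of how such noise could be injected into the Soft Thinking sampling distribution. The function names and shapes are illustrative assumptions, not the actual implementation inside sglang_soft_thinking_pkg:

```python
import torch
import torch.nn.functional as F

def add_gumbel_softmax_noise(probs: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Sketch: perturb a probability vector with standard Gumbel noise,
    then renormalize with a temperature-scaled softmax."""
    gumbel = -torch.log(-torch.log(torch.rand_like(probs) + 1e-10) + 1e-10)
    return F.softmax((torch.log(probs + 1e-10) + gumbel) / temperature, dim=-1)

def add_dirichlet_noise(probs: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sketch: resample the distribution from a Dirichlet centered on it;
    lower temperature -> larger concentration -> less noise."""
    concentration = probs / temperature + 1e-6  # keep concentration strictly positive
    return torch.distributions.Dirichlet(concentration).sample()
```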
To set up the virtual environment for SGLang Soft Thinking inference, execute each line in configure.sh:

```bash
conda create -n st python=3.11 -y && conda activate st
pip install --upgrade pip
pip install torch transformers accelerate jsonlines math_verify openai torch_memory_saver
pip install flash_attn --no-build-isolation  # may take a while (~20 min); try `pip install flash_attn==2.7.3 --no-build-isolation` if you hit an undefined-symbol bug

# Install SGLang (0.4.6.post1) tailored for Soft Thinking
cd sglang_soft_thinking_pkg
pip install -e "python[all]"
cd ..
```

We find it hard to reproduce some results across different devices due to precision issues. We recommend installing the environment with Docker by following docker.sh:
```bash
# For Docker
docker run -it --name h100_st --gpus all \
    --shm-size 32g \
    --network host \
    -v /.cache:/root/.cache \
    -v <path_to_your_workspace>:/workspace \
    --env "HF_TOKEN=<huggingface_token>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    /bin/bash

docker start -i h100_st

cd /workspace/Soft-Thinking
bash xx.sh
```

- Clone the repository:
```bash
git clone https://github.com/your-repo/soft_thinking.git
cd soft_thinking
```

- Set up the environment: Follow the Environment Setup instructions (Docker is recommended).
- Run a baseline test:
```bash
bash scripts/baseline/qwq32b.sh
```
Warning
- Soft Thinking yields suboptimal results on smaller models (≤7B, or even ≤14B). We attribute this to their limited hidden sizes, which cause the last hidden state to lie close to unrelated embeddings and thereby introduce substantial noise during probability weighting (see the sketch after this list).
- Additionally, precision issues may arise from insufficient VRAM, which can lower performance. Try gradually decreasing `--max_running_requests` (from 400 down to 1) to keep token usage below 90% during inference, so that no offloading occurs and the same operators are used (see https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). Please refer to #17 if you have limited GPU memory. If you would like to contribute to deterministic Soft Thinking, please feel free to contact me!
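To make the probability-weighting step above concrete, here is a minimal sketch of how a Soft Thinking "concept token" embedding can be formed as a probability-weighted mixture of the top-k token embeddings. The names and shapes are illustrative assumptions, not the actual SGLang implementation:

```python
import torch

def concept_token_embedding(logits: torch.Tensor,
                            embedding_table: torch.Tensor,
                            max_topk: int = 10) -> torch.Tensor:
    """Sketch: mix the embeddings of the top-k tokens, weighted by their
    renormalized probabilities, to form a continuous 'concept token'.

    logits:           [vocab_size] next-token logits
    embedding_table:  [vocab_size, hidden_size] input embedding matrix
    """
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(max_topk)
    topk_probs = topk_probs / topk_probs.sum()     # renormalize over the top-k
    return topk_probs @ embedding_table[topk_ids]  # [hidden_size]
```

With a small hidden size, unrelated tokens can sit close to the mixed embedding, which is the noise source described in the warning above.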
Use your own OpenAI key in each script.
```bash
export OPENAI_API_KEY=""
```

We use gpt-4.1-2025-04-14 as the LLM judge.
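For reference, an LLM-judge call in this setup might look like the following sketch. The prompt and function are illustrative assumptions, not the project's actual judging code (passing `--use_llm_judge` handles this internally):

```python
import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is exported as above.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_answer(question: str, reference: str, prediction: str) -> bool:
    """Sketch: ask the judge model whether the prediction matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nReference answer: {reference}\n"
                        f"Predicted answer: {prediction}\n"
                        "Reply with exactly 'yes' if the prediction is equivalent "
                        "to the reference, otherwise 'no'."),
        }],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```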
Run the baseline script:
```bash
bash scripts/baseline/qwq32b.sh
```

First, download the model to the models/ directory:
```bash
python ./models/download.py --model_name "Qwen/QwQ-32B"
```
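If you want to see what this download step involves, a minimal equivalent using huggingface_hub could look like the sketch below; the actual ./models/download.py may differ:

```python
import argparse
from huggingface_hub import snapshot_download

# Illustrative stand-in for ./models/download.py; the real script may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)  # e.g. "Qwen/QwQ-32B"
args = parser.parse_args()

# Download all model files into ./models/<model_name>.
snapshot_download(repo_id=args.model_name, local_dir=f"./models/{args.model_name}")
```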
Then, run the baseline inference:

```bash
export OPENAI_API_KEY=""
# you can use Qwen/QwQ-32B directly without downloading to ./models
python run_sglang_softthinking.py \
    --dataset "aime2024" \
    --model_name "./models/Qwen/QwQ-32B" \
    --max_topk 10 \
    --max_generated_tokens 32768 \
    --temperature 0.6 \
    --top_p 0.95 \
    --top_k 30 \
    --min_p 0.0 \
    --after_thinking_temperature 0.6 \
    --after_thinking_top_p 0.95 \
    --after_thinking_top_k 30 \
    --after_thinking_min_p 0.0 \
    --early_stopping_entropy_threshold 0.0 \
    --early_stopping_length_threshold 256 \
    --mem_fraction_static 0.8 \
    --start_idx 0 \
    --end_idx 100000 \
    --num_gpus 8 \
    --num_samples 16 \
    --use_llm_judge \
    --judge_model_name "gpt-4.1-2025-04-14"
```

Run the Soft Thinking script:
```bash
bash scripts/st/qwq32b_st_math.sh
```

Or directly execute:
```bash
export OPENAI_API_KEY=""
python run_sglang_softthinking.py \
    --dataset "aime2024" \
    --model_name "./models/Qwen/QwQ-32B" \
    --max_topk 10 \
    --max_generated_tokens 32768 \
    --temperature 0.6 \
    --top_p 0.95 \
    --top_k 30 \
    --min_p 0.001 \
    --after_thinking_temperature 0.6 \
    --after_thinking_top_p 0.95 \
    --after_thinking_top_k 30 \
    --after_thinking_min_p 0.0 \
    --early_stopping_entropy_threshold 0.01 \
    --early_stopping_length_threshold 256 \
    --mem_fraction_static 0.8 \
    --start_idx 0 \
    --end_idx 100000 \
    --num_gpus 8 \
    --num_samples 1 \
    --enable_soft_thinking \
    --use_llm_judge \
    --judge_model_name "gpt-4.1-2025-04-14"
```

When running coding benchmarks (HumanEval, MBPP, and LiveCodeBench), start by executing without the `--reeval` flag. Then, run it again with the `--reeval` flag for evaluation. This is due to a multiprocessing bug.
We have uploaded results in ./results for reproduction. We use the following hyperparameters:
- max_topk: 10
- min_p: 0.001
- early_stopping_entropy_threshold: 0.01
- early_stopping_length_threshold: 256
For optimal results on each benchmark, adjust the following hyperparameters within these ranges:
- max_topk: between 5 and 20
- min_p: between 0.0 and 0.005
- early_stopping_entropy_threshold: between 0.0 and 0.1
- early_stopping_length_threshold: between 256 and 1024
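As a rough illustration of what the early-stopping thresholds control, the generation loop could end the thinking phase once the sampling distribution stays low-entropy for a run of steps. This is a hedged sketch under that assumed semantics, with made-up names, not the actual SGLang logic:

```python
import torch

def should_stop_thinking(probs: torch.Tensor,
                         low_entropy_run: int,
                         entropy_threshold: float = 0.01,
                         length_threshold: int = 256) -> tuple[bool, int]:
    """Sketch: stop once the next-token distribution has had entropy below
    `entropy_threshold` for `length_threshold` consecutive steps."""
    entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)
    low_entropy_run = low_entropy_run + 1 if entropy.item() < entropy_threshold else 0
    return low_entropy_run >= length_threshold, low_entropy_run
```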
Note:
- Results may vary across different devices even with the same hyperparameters, due to differences in computation precision. We use NVIDIA H100 GPUs for all experiments. We recommend using Docker for reproduction.
This project utilizes a modified version of the SGLang library. The licensing structure is as follows:
- Our Original Code: The code original to this project (i.e., all code outside the `./sglang_soft_thinking_pkg` directory) is licensed under the MIT License. A copy of the MIT License can be found in the root `LICENSE` file.
- Modified SGLang: The code within the `./sglang_soft_thinking_pkg` directory is a derivative work of SGLang (version 0.4.6.post1) and is therefore licensed under the Apache License 2.0. The original Apache 2.0 license is included in the `./sglang_soft_thinking_pkg/LICENSE` file. We have provided a `changes_0.4.6.post1.diff` file in that directory to show our modifications.
If you use this code, please cite our paper:
```bibtex
@article{zhang2025soft,
  title={Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space},
  author={Zhang, Zhen and He, Xuehai and Yan, Weixiang and Shen, Ao and Zhao, Chenyang and Wang, Shuohang and Shen, Yelong and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2505.15778},
  year={2025}
}
```