This is the official implementation of the paper: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
If you would like to build on top of this project, refer to sglang_soft_thinking_pkg/README.md, or review the differences from SGLang v0.4.6.post1 in sglang_soft_thinking_pkg/change_0.4.6.post1.diff.
Our implementation now includes support for Dirichlet and Gumbel-Softmax noise in Soft Thinking sampling, as detailed in the study LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking. For more details, see scripts/st/qwq32b_gumble.sh.
Relevant parameters:
```bash
--add_noise_gumbel_softmax \
--gumbel_softmax_temperature 0.5 \
--add_noise_dirichlet \
--dirichlet_temperature 1.0 \
```
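For intuition, here is a minimal sketch of how such noise could be injected into the Soft Thinking sampling distribution. The function names and shapes are illustrative assumptions, not the actual implementation inside sglang_soft_thinking_pkg:

```python
import torch
import torch.nn.functional as F

def add_gumbel_softmax_noise(probs: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Sketch: perturb a probability vector with standard Gumbel noise,
    then renormalize with a temperature-scaled softmax."""
    gumbel = -torch.log(-torch.log(torch.rand_like(probs) + 1e-10) + 1e-10)
    return F.softmax((torch.log(probs + 1e-10) + gumbel) / temperature, dim=-1)

def add_dirichlet_noise(probs: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sketch: resample the distribution from a Dirichlet centered on it;
    lower temperature -> larger concentration -> less noise."""
    concentration = probs / temperature + 1e-6  # keep concentration strictly positive
    return torch.distributions.Dirichlet(concentration).sample()
```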
To set up the virtual environment for SGLang Soft Thinking inference, execute each line in configure.sh:

```bash
conda create -n st python=3.11 -y && conda activate st
pip install --upgrade pip
pip install torch transformers accelerate jsonlines math_verify openai torch_memory_saver
pip install flash_attn --no-build-isolation  # may take a while (~20 min); try `pip install flash_attn==2.7.3 --no-build-isolation` if you hit an undefined-symbol bug

# Install SGLang (0.4.6.post1) tailored for Soft Thinking
cd sglang_soft_thinking_pkg
pip install -e "python[all]"
cd ..
```

We find it hard to reproduce some results across different devices due to precision issues. We recommend installing the environment with Docker by following docker.sh:
```bash
# For Docker
docker run -it --name h100_st --gpus all \
    --shm-size 32g \
    --network host \
    -v /.cache:/root/.cache \
    -v <path_to_your_workspace>:/workspace \
    --env "HF_TOKEN=<huggingface_token>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    /bin/bash

docker start -i h100_st

cd /workspace/Soft-Thinking
bash xx.sh
```

- Clone the repository:
```bash
git clone https://github.com/your-repo/soft_thinking.git
cd soft_thinking
```

- Set up the environment: Follow the Environment Setup instructions (Docker is recommended).
- Run a baseline test:
```bash
bash scripts/baseline/qwq32b.sh
```
Warning
- Soft Thinking yields suboptimal results on smaller models (≤7B, or even ≤14B). We attribute this to their limited hidden sizes, which cause the last hidden state to lie close to unrelated embeddings and thereby introduce substantial noise during probability weighting (see the sketch after this list).
- Additionally, precision issues may arise from insufficient VRAM, which can lower performance. Try gradually decreasing `--max_running_requests` (from 400 down to 1) to keep token usage below 90% during inference, so that no offloading occurs and the same operators are used (see https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). Please refer to #17 if you have limited GPU memory. If you would like to contribute to deterministic Soft Thinking, please feel free to contact me!
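To make the probability-weighting step above concrete, here is a minimal sketch of how a Soft Thinking "concept token" embedding can be formed as a probability-weighted mixture of the top-k token embeddings. The names and shapes are illustrative assumptions, not the actual SGLang implementation:

```python
import torch

def concept_token_embedding(logits: torch.Tensor,
                            embedding_table: torch.Tensor,
                            max_topk: int = 10) -> torch.Tensor:
    """Sketch: mix the embeddings of the top-k tokens, weighted by their
    renormalized probabilities, to form a continuous 'concept token'.

    logits:           [vocab_size] next-token logits
    embedding_table:  [vocab_size, hidden_size] input embedding matrix
    """
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(max_topk)
    topk_probs = topk_probs / topk_probs.sum()     # renormalize over the top-k
    return topk_probs @ embedding_table[topk_ids]  # [hidden_size]
```

With a small hidden size, unrelated tokens can sit close to the mixed embedding, which is the noise source described in the warning above.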
Use your own OpenAI key in each script.
```bash
export OPENAI_API_KEY=""
```

We use gpt-4.1-2025-04-14 as the LLM judge.
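For reference, an LLM-judge call in this setup might look like the following sketch. The prompt and function are illustrative assumptions, not the project's actual judging code (passing `--use_llm_judge` handles this internally):

```python
import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is exported as above.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_answer(question: str, reference: str, prediction: str) -> bool:
    """Sketch: ask the judge model whether the prediction matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nReference answer: {reference}\n"
                        f"Predicted answer: {prediction}\n"
                        "Reply with exactly 'yes' if the prediction is equivalent "
                        "to the reference, otherwise 'no'."),
        }],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```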
Run the baseline script:
```bash
bash scripts/baseline/qwq32b.sh
```

First, download the model to the models/ directory:
```bash
python ./models/download.py --model_name "Qwen/QwQ-32B"
```
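If you want to see what this download step involves, a minimal equivalent using huggingface_hub could look like the sketch below; the actual ./models/download.py may differ:

```python
import argparse
from huggingface_hub import snapshot_download

# Illustrative stand-in for ./models/download.py; the real script may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)  # e.g. "Qwen/QwQ-32B"
args = parser.parse_args()

# Download all model files into ./models/<model_name>.
snapshot_download(repo_id=args.model_name, local_dir=f"./models/{args.model_name}")
```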
Then, run the baseline inference:

```bash
export OPENAI_API_KEY=""
# you can use Qwen/QwQ-32B directly without downloading to ./models
python run_sglang_softthinking.py \
    --dataset "aime2024" \
    --model_name "./models/Qwen/QwQ-32B" \
    --max_topk 10 \
    --max_generated_tokens 32768 \
    --temperature 0.6 \
    --top_p 0.95 \
    --top_k 30 \
    --min_p 0.0 \
    --after_thinking_temperature 0.6 \
    --after_thinking_top_p 0.95 \
    --after_thinking_top_k 30 \
    --after_thinking_min_p 0.0 \
    --early_stopping_entropy_threshold 0.0 \
    --early_stopping_length_threshold 256 \
    --mem_fraction_static 0.8 \
    --start_idx 0 \
    --end_idx 100000 \
    --num_gpus 8 \
    --num_samples 16 \
    --use_llm_judge \
    --judge_model_name "gpt-4.1-2025-04-14"
```

Run the Soft Thinking script:
```bash
bash scripts/st/qwq32b_st_math.sh
```

Or directly execute:
```bash
export OPENAI_API_KEY=""
python run_sglang_softthinking.py \
    --dataset "aime2024" \
    --model_name "./models/Qwen/QwQ-32B" \
    --max_topk 10 \
    --max_generated_tokens 32768 \
    --temperature 0.6 \
    --top_p 0.95 \
    --top_k 30 \
    --min_p 0.001 \
    --after_thinking_temperature 0.6 \
    --after_thinking_top_p 0.95 \
    --after_thinking_top_k 30 \
    --after_thinking_min_p 0.0 \
    --early_stopping_entropy_threshold 0.01 \
    --early_stopping_length_threshold 256 \
    --mem_fraction_static 0.8 \
    --start_idx 0 \
    --end_idx 100000 \
    --num_gpus 8 \
    --num_samples 1 \
    --enable_soft_thinking \
    --use_llm_judge \
    --judge_model_name "gpt-4.1-2025-04-14"
```

When running coding benchmarks (HumanEval, MBPP, and LiveCodeBench), start by executing without the `--reeval` flag. Then, run it again with the `--reeval` flag for evaluation. This is due to a multiprocessing bug.
We have uploaded results in ./results for reproduction. We use the following hyperparameters:
- max_topk: 10
- min_p: 0.001
- early_stopping_entropy_threshold: 0.01
- early_stopping_length_threshold: 256
For optimal results on each benchmark, adjust the following hyperparameters within these ranges:
- max_topk: between 5 and 20
- min_p: between 0.0 and 0.005
- early_stopping_entropy_threshold: between 0.0 and 0.1
- early_stopping_length_threshold: between 256 and 1024
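As a rough illustration of what the early-stopping thresholds control, the generation loop could end the thinking phase once the sampling distribution stays low-entropy for a run of steps. This is a hedged sketch under that assumed semantics, with made-up names, not the actual SGLang logic:

```python
import torch

def should_stop_thinking(probs: torch.Tensor,
                         low_entropy_run: int,
                         entropy_threshold: float = 0.01,
                         length_threshold: int = 256) -> tuple[bool, int]:
    """Sketch: stop once the next-token distribution has had entropy below
    `entropy_threshold` for `length_threshold` consecutive steps."""
    entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)
    low_entropy_run = low_entropy_run + 1 if entropy.item() < entropy_threshold else 0
    return low_entropy_run >= length_threshold, low_entropy_run
```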
Note:
- Results may vary across different devices even with the same hyperparameters, due to differences in computation precision. We use NVIDIA H100 GPUs for all experiments. We recommend using Docker for reproduction.
This project utilizes a modified version of the SGLang library. The licensing structure is as follows:
- Our Original Code: The code original to this project (i.e., all code outside the `./sglang_soft_thinking_pkg` directory) is licensed under the MIT License. A copy of the MIT License can be found in the root `LICENSE` file.
- Modified SGLang: The code within the `./sglang_soft_thinking_pkg` directory is a derivative work of SGLang (version 0.4.6.post1) and is therefore licensed under the Apache License 2.0. The original Apache 2.0 license is included in the `./sglang_soft_thinking_pkg/LICENSE` file. We have provided a `changes_0.4.6.post1.diff` file in that directory to show our modifications.
If you use this code, please cite our paper:
```bibtex
@article{zhang2025soft,
  title={Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space},
  author={Zhang, Zhen and He, Xuehai and Yan, Weixiang and Shen, Ao and Zhao, Chenyang and Wang, Shuohang and Shen, Yelong and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2505.15778},
  year={2025}
}
```