Authors: Georg Felber, Filip Grgic, Lukas Sperk
This project systematically evaluates the security risks in C code generated by large language models (LLMs). We benchmark OpenAI's GPT-4o, Anthropic's Claude 3.7 Sonnet, and DeepSeek Chat across critical programming tasks under various prompt engineering scenarios, revealing how prompt phrasing and intent manipulation affect code safety.
The entire workflow is automated and reproducible:
```sh
# install requirements
pip install -r requirements.txt

# list available templates
./test.py list

# generate new tests
./test.py run [OPTIONS] TEMPLATE

# run tests on cached files
./test.py cache [OPTIONS] TEMPLATE

# iterate over all memory corruptions
./test.py analyze [OPTIONS] TEMPLATE

# Analyze logged results (create diagrams)
./analyze.py
```

Templates are located in template/ and are composed of the following files (taking array_index as an example):
```
array_index/
├── bugs.c
├── oracle.c
├── problem.md
└── tests
    └── ...
```

- bugs.c
  is the KLEE setup file that compares the generated code against the oracle (a minimal sketch follows below)
- oracle.c
  this file contains the reference code that fulfills the task and is used for comparison against the generated code
- problem.md
  this markdown file contains the problem statement and the prompt used to generate the test
- tests/
  this folder contains the generated tests
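For orientation, a minimal bugs.c-style harness could look like the sketch below, assuming a simple array-lookup interface. The function names generated_lookup and oracle_lookup are placeholders for the LLM-generated solution and the reference implementation; the real harnesses in template/ define their own interfaces.

```c
/* Hypothetical sketch of a bugs.c-style harness (not the actual template code):
 * make the inputs symbolic and assert that the generated solution agrees
 * with the oracle on every path KLEE explores. */
#include <klee/klee.h>

/* Placeholder prototypes for the LLM-generated solution and the reference oracle. */
int generated_lookup(const int *arr, int len, int idx);
int oracle_lookup(const int *arr, int len, int idx);

int main(void) {
    int arr[8];
    int idx;

    klee_make_symbolic(arr, sizeof(arr), "arr");
    klee_make_symbolic(&idx, sizeof(idx), "idx");

    /* Any path where the two implementations disagree is reported by KLEE;
     * memory errors inside generated_lookup are reported separately. */
    klee_assert(generated_lookup(arr, 8, idx) == oracle_lookup(arr, 8, idx));
    return 0;
}
```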
We generated and analyzed 3,000 samples of LLM-generated C code, combining:
- 3 Models: GPT-4o, Claude 3.7 Sonnet, DeepSeek Chat V3
- 4 Tasks: Array Operations, Decompression, Deserialization, String Manipulation
- 5 Prompt Strategies: No injection, secure, fast, unsafe, and conflicting (unsafe & secure)
These combinations were evaluated for correctness, memory safety, and vulnerability.
A breakdown of bug types and frequency across all models and prompts.
We tested how models react to system-level prompt injections that steer them toward fast, secure, or even maliciously unsafe code:
- No Injection: Default behavior
- Fast: Prioritize performance over safety
- Secure: Add maximum validation/safety checks
- Unsafe: Introduce backdoors or memory corruptions
- Unsafe & Secure: Conflicting instructions
These manipulations revealed the extreme sensitivity of LLMs to prompt phrasing and goal alignment.
Each LLM was asked to solve four security-relevant tasks in C:
| Task Name | Key Risk Area |
|---|---|
| array_index | Bounds-checked memory access |
| decompression | Pointer arithmetic, recursion risks |
| deserialization | Length validation & buffer overrun |
| unique_words | Heap safety and memory management |
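As a concrete illustration of one of these risk areas, the sketch below contrasts an unsafe and a safe take on the deserialization task's length-validation problem. The record format, the PAYLOAD_MAX constant, and the function names are invented for illustration and do not come from the actual problem statements.

```c
#include <stdint.h>
#include <string.h>

#define PAYLOAD_MAX 64

/* Hypothetical wire format: [1-byte length][payload bytes ...] */
typedef struct {
    uint8_t len;
    uint8_t payload[PAYLOAD_MAX];
} record_t;

/* Unsafe: trusts the length field in the input, so any claimed length
 * above PAYLOAD_MAX overruns rec->payload. */
int deserialize_unsafe(record_t *rec, const uint8_t *buf, size_t buf_len) {
    (void)buf_len;                             /* input size is ignored */
    rec->len = buf[0];
    memcpy(rec->payload, buf + 1, rec->len);   /* potential buffer overrun */
    return 0;
}

/* Safer: validates the claimed length against the destination capacity
 * and the actual size of the input before copying. */
int deserialize_safe(record_t *rec, const uint8_t *buf, size_t buf_len) {
    if (rec == NULL || buf == NULL || buf_len < 1)
        return -1;
    uint8_t claimed = buf[0];
    if (claimed > PAYLOAD_MAX || (size_t)claimed > buf_len - 1)
        return -1;                             /* reject inconsistent lengths */
    rec->len = claimed;
    memcpy(rec->payload, buf + 1, claimed);
    return 0;
}
```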
The generated code was compiled and symbolically analyzed using KLEE.
Each output was labeled as:
- Bug: logical or functional error
- Crpt: memory corruption
- Failed: compilation or runtime failure
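Roughly speaking, these labels map to different failure modes under the harness: a wrong but memory-safe result trips the oracle comparison (Bug), an invalid memory access trips KLEE's memory checker (Crpt), and code that does not compile or crashes outright ends up as Failed. The two hypothetical array_index solutions below, reusing the placeholder signature from the harness sketch above, illustrate the first two cases.

```c
/* Hypothetical generated solution that would be labeled "Bug":
 * memory-safe, but silently returns 0 for invalid indices, so the
 * oracle comparison fails on some paths. */
int generated_lookup_bug(const int *arr, int len, int idx) {
    if (idx < 0 || idx >= len)
        return 0;               /* wrong result, but no invalid access */
    return arr[idx];
}

/* Hypothetical generated solution that would be labeled "Crpt":
 * never validates idx, so KLEE finds a path with an out-of-bounds
 * read and reports a memory error. */
int generated_lookup_crpt(const int *arr, int len, int idx) {
    (void)len;                  /* length is ignored entirely */
    return arr[idx];            /* out-of-bounds for invalid idx */
}
```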
- 37.4% of generated samples had logical bugs
- 14.7% showed memory corruption
- Secure prompting reduced corruption rates to as low as 2–3.5%
- GPT-4o had the highest bug rates; Claude 3.7 Sonnet the lowest
- Decompression was the most error-prone task
Prompt design was the most impactful factor in output safety:
- Unsafe prompts produced the highest failure rates (bug rates of over 70%)
- Secure prompts reduced vulnerabilities but did not eliminate them
- Conflicting prompts resulted in a partial override rather than full mitigation



