Luke Hinds

Build High Quality Reasoning Datasets in Minutes Using Natural Prompts

As developers, we've all been there. You have a brilliant idea for a reasoning model you want to fine-tune, but then reality hits: you need training data. Quality training data. A lot of quality training data.

The traditional path? Weeks of manual data curation and complex prompt engineering that looks good at first, but turns out to be full of duplicates with very little diversity. You've tried a few of the open-source tools, but none of them cut it. Most of us end up abandoning the project or settling for subpar datasets that produce mediocre models.

What if I told you there's a tool that can generate professional-grade chain-of-thought (CoT) datasets in minutes using natural language prompts?

Enter DeepFabric - and it's about to change how you think about dataset creation.

The Problem: Dataset Creation is Broken

Before DeepFabric, creating CoT datasets meant:

  • πŸ“ Manual curation: Spending days writing examples by hand
  • πŸ”§ Complex prompt engineering: Wrestling with intricate templates
  • πŸ’Έ Expensive services: Paying premium rates for quality data
  • 🎯 Limited diversity: Struggling to create varied, non-repetitive examples
  • βš–οΈ Quality vs. quantity: Choosing between good data or enough data

Most developers either gave up or shipped models trained on insufficient data.

The Solution: DeepFabric's Triple Threat

DeepFabric doesn't just solve the dataset problem - it obliterates it with three different CoT formats that cover every use case:

1. 🔥 Free-text CoT (GSM8K Style)

Perfect for mathematical reasoning and step-by-step problem solving.

deepfabric generate \
  --mode tree \
  --provider openai \
  --model gpt-4o-mini \
  --depth 2 \
  --degree 2 \
  --num-steps 4 \
  --topic-prompt "Mathematical word problems and logical reasoning" \
  --generation-system-prompt "You are a math tutor creating educational problems" \
  --conversation-type cot_freetext \
  --dataset-save-as math_reasoning.jsonl

Output format:

{ "question": "Sarah has 24 apples. She gives away 1/3 to her neighbors and keeps 1/4 for herself. How many apples are left?", "chain_of_thought": "First, I need to find 1/3 of 24 apples. 24 Γ· 3 = 8 apples given to neighbors. Next, I need to find 1/4 of 24 apples. 24 Γ· 4 = 6 apples kept for herself. Total apples used: 8 + 6 = 14 apples. Apples left: 24 - 14 = 10 apples.", "final_answer": "10 apples" } 
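
Want a quick sanity check before you fine-tune? A few lines of Python are enough to fold each record into a prompt/completion pair. This is just a minimal sketch: it assumes the field names shown above (question, chain_of_thought, final_answer) and the math_reasoning.jsonl filename from the command.

import json

# Minimal sketch: turn each free-text CoT record into a prompt/completion pair.
# Field names (question, chain_of_thought, final_answer) follow the sample above.
pairs = []
with open("math_reasoning.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        prompt = f"Question: {record['question']}\nThink step by step."
        completion = f"{record['chain_of_thought']}\nFinal answer: {record['final_answer']}"
        pairs.append({"prompt": prompt, "completion": completion})

print(f"Loaded {len(pairs)} examples")
print(pairs[0]["completion"])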

2. πŸ—οΈ Structured CoT (Conversation Based)

Ideal for educational dialogues and systematic problem-solving.

deepfabric generate \
  --mode graph \
  --provider ollama \
  --model qwen3:32b \
  --topic-prompt "Computer science algorithms and data structures" \
  --conversation-type cot_structured \
  --reasoning-style logical \
  --dataset-save-as cs_reasoning.jsonl

Output format:

{ "messages": [ {"role": "user", "content": "How would you implement a binary search algorithm?"}, {"role": "assistant", "content": "I'll walk you through implementing binary search step by step..."} ], "reasoning_trace": [ {"step": 1, "reasoning": "Define the search space with left and right pointers"}, {"step": 2, "reasoning": "Calculate middle index to divide the array"}, {"step": 3, "reasoning": "Compare target with middle element"} ], "final_answer": "Binary search works by repeatedly dividing the search interval in half. Start with two pointers: `left = 0` and `right = len(arr) - 1`. While `left <= right`, compute `mid = (left + right) // 2`. If `arr[mid] == target`, return `mid`. If `arr[mid] < target`, move `left = mid + 1`; otherwise move `right = mid - 1`. If the loop ends without finding the target, return -1." } 
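
If your trainer expects plain chat messages, the trace can be folded into the assistant turn with a small helper. A minimal sketch, assuming the field names shown above and that the last entry in messages is the assistant's:

import json

# Minimal sketch: fold the reasoning trace into the last assistant turn so each
# record becomes a plain chat sample. Assumes the structured CoT fields above.
def to_chat(record):
    steps = "\n".join(
        f"Step {s['step']}: {s['reasoning']}" for s in record["reasoning_trace"]
    )
    messages = [dict(m) for m in record["messages"]]
    messages[-1]["content"] += f"\n\n{steps}\n\n{record['final_answer']}"
    return {"messages": messages}

with open("cs_reasoning.jsonl", "r", encoding="utf-8") as f:
    chat_samples = [to_chat(json.loads(line)) for line in f]

print(json.dumps(chat_samples[0], indent=2))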

3. 🚀 Hybrid CoT (Best of Both Worlds)

Combines natural reasoning with structured steps - perfect for complex domains.

deepfabric generate \
  --provider gemini \
  --model gemini-2.5-flash \
  --topic-prompt "Infrastructure outages and their resolution" \
  --conversation-type cot_hybrid \
  --num-steps 8 \
  --dataset-save-as science_hybrid.jsonl

Output format:

{ "question": "A critical production service, 'Order Processing,' is experiencing intermittent failures. Users report orders are occasionally failing to process, leading to lost revenue. The service runs on Kubernetes, utilizing a deployment of 10 pods. Initial monitoring shows CPU and memory usage are within acceptable limits, but error logs indicate sporadic database connection timeouts. You suspect a resource contention issue or a misconfiguration somewhere in the infrastructure. Describe your systematic approach to diagnose and resolve this issue, detailing the tools and techniques you'd employ at each step.", "chain_of_thought": "To systematically troubleshoot this intermittent failure, I will follow a structured approach. First, I will gather more information about the failures, including frequency, error messages, and affected users. Then, I will examine the Kubernetes deployment, focusing on resource limits, liveness/readiness probes, and networking. Next, I will investigate the database connection pool and its configuration. I will also analyze the application code for potential resource leaks or inefficient database queries. Finally, I will implement monitoring and alerting to proactively detect and prevent future issues.", "reasoning_trace": [ { "step_number": 1, "thought": "Gather detailed information about the failures. This includes the exact error messages, timestamps, affected users, and any patterns in the failures (e.g., time of day, specific order types).", "action": "Analyze application logs, Kubernetes events, and user reports to collect failure data." }, { "step_number": 2, "thought": "Examine the Kubernetes deployment configuration. Verify resource limits (CPU, memory) are appropriately set for the pods. Check liveness and readiness probes to ensure pods are healthy and responsive. Inspect networking configuration to rule out connectivity issues.", "action": "Use `kubectl describe deployment order-processing` and `kubectl get pods -o wide` to inspect the deployment and pod configurations. Check network policies and service definitions." }, { "step_number": 3, "thought": "Investigate the database connection pool configuration. Verify the maximum number of connections is sufficient for the workload. Check for connection leaks or inefficient connection management in the application code.", "action": "Examine the application's database connection pool settings. Use database monitoring tools to track connection usage and identify potential leaks. Analyze application code for database query performance and connection handling." }, { "step_number": 4, "thought": "Analyze application code for potential resource leaks or inefficient database queries. Use profiling tools to identify performance bottlenecks and memory leaks. Optimize database queries to reduce load on the database.", "action": "Use profiling tools like `pprof` or Java profilers to analyze application performance. Examine database query logs and use `EXPLAIN` to optimize slow queries." }, { "step_number": 5, "thought": "Implement enhanced monitoring and alerting. Set up alerts for database connection errors, high latency, and resource utilization. Use dashboards to visualize key metrics and identify trends.", "action": "Configure Prometheus and Grafana to monitor database connections, latency, and resource usage. Set up alerts in Alertmanager to notify on-call engineers of critical issues." }, { "step_number": 6, "thought": "Consider implementing a circuit breaker pattern to prevent cascading failures. 
If the database becomes unavailable, the circuit breaker will prevent the application from overwhelming the database with requests.", "action": "Implement a circuit breaker using a library like Hystrix or Resilience4j in the application code." }, { "step_number": 7, "thought": "Implement connection retry logic with exponential backoff. This will allow the application to gracefully handle temporary database connection issues.", "action": "Implement retry logic using a library like Guava Retry or Spring Retry in the application code." }, { "step_number": 8, "thought": "Conduct load testing to simulate production traffic and identify potential bottlenecks. Use load testing tools like JMeter or Gatling to generate realistic workloads.", "action": "Run load tests with increasing traffic to identify the point at which the database connection timeouts occur." }, { "step_number": 9, "thought": "If the database is the bottleneck, consider scaling the database or optimizing its configuration. This may involve adding more resources to the database server, tuning database parameters, or implementing database sharding.", "action": "Scale the database vertically (add more CPU, memory) or horizontally (add more database replicas). Tune database parameters like `max_connections` and `shared_buffers`." }, { "step_number": 10, "thought": "Document the troubleshooting steps and the resolution. Create a runbook for future incidents. Share the knowledge with the team to prevent similar issues in the future.", "action": "Create a detailed runbook documenting the troubleshooting steps, the root cause, and the resolution. Share the runbook with the team and conduct a post-mortem analysis." } ], "final_answer": "The intermittent failures are likely due to database connection timeouts caused by resource contention or misconfiguration. The systematic approach involves gathering detailed information, examining the Kubernetes deployment, investigating the database connection pool, analyzing application code, implementing enhanced monitoring and alerting, and considering circuit breakers and retry logic. Load testing and database scaling may also be necessary. Finally, documenting the troubleshooting steps and sharing the knowledge with the team is crucial." } 
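
Because hybrid records carry more structure, it's worth a quick schema check before a long training run. A minimal sketch, assuming the field names from the sample output and the science_hybrid.jsonl filename from the command:

import json

# Minimal sketch: flag records missing the fields shown in the sample above.
REQUIRED = ("question", "chain_of_thought", "reasoning_trace", "final_answer")
STEP_KEYS = {"step_number", "thought", "action"}

bad = []
with open("science_hybrid.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        if any(key not in record for key in REQUIRED):
            bad.append(i)
        elif not all(STEP_KEYS <= step.keys() for step in record["reasoning_trace"]):
            bad.append(i)

print(f"{len(bad)} malformed records" if bad else "All records look well-formed")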

Why Developers Are Going Crazy for DeepFabric

🧠 Smart Topic Generation

DeepFabric doesn't just generate random examples. It first builds a hierarchical topic tree or graph, ensuring your dataset covers diverse subtopics without redundancy:

Mathematical Reasoning
├── Algebra Problems
│   ├── Linear Equations
│   └── Quadratic Functions
└── Geometry Problems
    ├── Area Calculations
    └── Volume Problems
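
The depth and degree flags control how far that tree fans out. As a rough back-of-the-envelope sketch (assuming every node branches into degree children for depth levels, which is an assumption about the tree shape rather than a guarantee of what any given run returns):

# Rough sketch: estimate how many leaf topics a tree run can fan out to,
# assuming every node branches into `degree` children for `depth` levels.
def estimated_leaf_topics(depth: int, degree: int) -> int:
    return degree ** depth

print(estimated_leaf_topics(depth=2, degree=2))  # 4, like the math example earlier
print(estimated_leaf_topics(depth=3, degree=3))  # 27, like the YAML config below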

🔧 YAML Configuration = Zero Complexity

No more complex prompt engineering. Just describe what you want:

# cot_config.yaml
dataset_system_prompt: "You are a helpful AI that solves problems step-by-step"
topic_tree:
  topic_prompt: "Programming challenges and algorithms"
  provider: "ollama"
  model: "qwen3:32b"
  depth: 3
  degree: 3
data_engine:
  conversation_type: "cot_hybrid"
  reasoning_style: "logical"
  instructions: "Create coding problems that require systematic thinking"
dataset:
  creation:
    num_steps: 50
    batch_size: 5

Then run: deepfabric generate cot_config.yaml
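
Before kicking off a long run, it can also help to sanity-check the config programmatically. A minimal sketch using PyYAML, with key names taken from the cot_config.yaml example above:

import yaml  # pip install pyyaml

# Minimal sketch: print the settings that matter most before a long generation run.
with open("cot_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print("Topic prompt:", config["topic_tree"]["topic_prompt"])
print("Conversation type:", config["data_engine"]["conversation_type"])
print("Planned steps:", config["dataset"]["creation"]["num_steps"])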

🌐 Multi-Provider Freedom

Switch between providers based on your needs:

  • OpenAI GPT-4 for complex reasoning
  • Ollama for local, private generation
  • Gemini for fast bulk creation
  • Anthropic Claude for nuanced problems

📀 Instant HuggingFace Integration

deepfabric generate config.yaml --hf-repo username/my-cot-dataset 

Your dataset is automatically uploaded with a generated dataset card. No manual uploads, no fuss.
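
Pulling it back down for training is one call with the datasets library. A minimal sketch, assuming the repo id from the command above and that the data lands on the Hub as JSONL:

from datasets import load_dataset  # pip install datasets

# Minimal sketch: load the uploaded dataset straight from the Hub.
dataset = load_dataset("username/my-cot-dataset", split="train")
print(dataset)
print(dataset[0])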

Real-World Impact: What Developers Are Building

🎓 Educational AI: Teachers creating personalized math tutoring datasets
🤖 Agent Training: Developers building reasoning agents for complex tasks
📊 Research: ML researchers generating evaluation benchmarks
💼 Enterprise: Companies creating domain-specific reasoning models

The Numbers Don't Lie

  • ⏱️ 95% faster than manual dataset creation
  • 📈 10x more diverse examples per domain
  • 💰 80% cost reduction compared to data labeling services
  • 🎯 Zero prompt engineering required

Ready to Transform Your ML Pipeline?

Getting started takes literally 30 seconds:

# Install
pip install deepfabric

# Generate your first CoT dataset
deepfabric generate \
  --topic-prompt "Your domain here" \
  --conversation-type cot_freetext \
  --num-steps 10 \
  --provider openai \
  --model gpt-4o-mini

# Watch the magic happen ✨

What's Next?

The ML community is moving fast, and quality training data is the bottleneck. DeepFabric removes that bottleneck entirely.

Whether you're building the next breakthrough in reasoning AI or just need better training data for your side project, DeepFabric gives you superpowers.

Stop spending weeks on dataset creation. Start building better models today.


Try DeepFabric Now:


What kind of CoT dataset will you build first? Drop a comment and let's discuss! 🚀


Tags: #MachineLearning #AI #Datasets #ChainOfThought #Python #OpenSource #MLOps #DataScience #DeepLearning #ArtificialIntelligence
