text2qna is a Python toolkit and CLI for turning raw documents into instruction-style Q&A datasets for LLM fine-tuning.
- 📑 Chunk PDFs / TXT / HTML / MD into semantically split Markdown (sentence-aware or word-windowed).
- ❓ Generate Q&A pairs per section (supports positive and negative/trick pairs).
- 🎛 Steer style/coverage with `--num-pairs`, `--negative-ratio`, and `--extra-prompt`.
Where many tools stop at chunking, text2qna goes further: it systematically distills as many Q&A pairs as you need from each section, helping you build robust instruction datasets quickly.
1) Chunk a document to Markdown
```bash
text2qna chunk ./paper.pdf \
  --backend local \
  --embed-model nomic-embed-text \
  --api-key ollama \
  --embeddings-url http://localhost:11434/api/embeddings \
  --window 500 \
  --step 400 \
  --threshold 0.7 \
  --sentence-split false \
  --output ./chunks.md

# --backend          Embedding backend: local | openai | ollama
# --embed-model      Embedding model name
# --api-key          OpenAI-compatible API key
# --embeddings-url   Only needed for the Ollama backend
# --window           Word window size
# --step             Word stride / overlap
# --threshold        Cosine similarity threshold for breaks
# --sentence-split   Use sentence splitting (requires nltk)
# --output           Output file
```

2) Generate a Q&A dataset from the sections
```bash
# Environment defaults (optional but convenient)
export TEXT2QNA_API_KEY=your-api-key                # OpenAI-compatible API key
export TEXT2QNA_BASE_URL=https://api.openai.com/v1  # or http://localhost:11434/v1 (Ollama/other)
export TEXT2QNA_MODEL=llama3.2

# Command with all commonly used flags
text2qna qna ./chunks.md \
  --model llama3.2 \
  --num-pairs 5 \
  --negative-ratio 0.3 \
  --extra-prompt "Keep questions concise; answers ≤3 sentences." \
  --output ./dataset.jsonl \
  --api-key "$TEXT2QNA_API_KEY" \
  --base-url "$TEXT2QNA_BASE_URL"

# --model           Chat model (overrides TEXT2QNA_MODEL)
# --num-pairs       Q&A pairs per section
# --negative-ratio  Fraction of negative/trick pairs (here 30%)
# --extra-prompt    Style/constraints
# --output          Output JSONL path
# --api-key         Explicit API key (overrides env)
# --base-url        Explicit endpoint (overrides env)
```

Output (`dataset.jsonl`)
{"prompt":"What is X...?", "response":"X is ..."} {"prompt":"Is Mars the closest planet to the Sun?", "response":"No. Mars is fourth; Mercury is closest."}pip install text2qnagit clone https://github.com/nikosgiov/text2qna.git cd text2qna pip install -e .pip install "text2qna[pdf]" # PDF text extraction (pdfplumber) pip install "text2qna[local]" # Local sentence-transformers backend pip install "text2qna[nltk]" # Sentence tokenization (punkt data)Combine as needed, e.g.:
pip install "text2qna[pdf,local,nltk]"Python: 3.9–3.12 supported.
You can run text2qna using Docker. We provide two versions of the Docker image:
- Full image (`nikosgiov/text2qna`): includes all dependencies (PDF support, local embeddings, NLTK)
- Lite image (`nikosgiov/text2qna-lite`): lightweight version with PDF support but without local embeddings and NLTK. Perfect for API-based usage.
You can find the pre-built Docker images on Docker Hub:
- Full version: nikosgiov/text2qna
- Lite version: nikosgiov/text2qna-lite
```bash
# Pull and use the full version
docker pull nikosgiov/text2qna:latest

# Or pull and use the lite version
docker pull nikosgiov/text2qna-lite:latest
```

Or build the images yourself:

```bash
# Build the full version
docker build -t text2qna:local .

# Build the lite version
docker build -f Dockerfile.lite -t text2qna:local-lite .
```

Create input/output directories in your current directory and place your input files in the `input` directory:

```bash
mkdir -p input output
```

Examples:
Running the chunking functionality:
```bash
docker run --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  nikosgiov/text2qna \
  chunk /app/input/sample.txt \
  --backend ollama \
  --embed-model nomic-embed-text \
  --embeddings-url http://host.docker.internal:11434/api/embeddings \
  --output /app/output/chunks.md
```

Running the Q&A generation:
```bash
docker run --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  -e TEXT2QNA_API_KEY=your-api-key \
  -e TEXT2QNA_BASE_URL=https://api.openai.com/v1 \
  text2qna:local \
  qna /app/input/chunks.md \
  --model gpt-4 \
  --output /app/output/dataset.jsonl
```

The Docker setup includes:
- All optional dependencies (pdf, local, nltk)
- Volume mounts for input/output files
- Environment variable passing for API keys and URLs
If you want to use Ollama's API instead of OpenAI, you'll need to:
- Start Ollama separately: `ollama serve`
- Run text2qna with Ollama configuration: just change the URLs to point at your Ollama server. For example:
  - Embeddings: `--embeddings-url http://host.docker.internal:11434/api/embeddings`
  - Q&A: `--base-url http://host.docker.internal:11434/v1`
Note: There are two ways to connect to Ollama from the Docker container:
- Using `host.docker.internal` (works on Docker Desktop for Windows/Mac by default)
- Using `localhost` with the `--network host` flag

For Q&A with Ollama, you must also set a dummy API key (e.g. `--api-key ollama`) since the OpenAI client requires one.
- End-to-end: Go from raw documents → structured sections → instruction-style Q&A.
- Robustness: Support for negative pairs (misleading questions with corrective answers) helps fine-tuning resist falsehoods.
- Flexible embeddings: Choose local (sentence-transformers), OpenAI, or Ollama backends for chunking.
- Configurable: Tune chunk size, overlap, and break sensitivity; steer Q&A tone and constraints with `--extra-prompt`.
- CLI + API: Use in pipelines or as a library.
- Input is normalized to Markdown.
- Text is split into word windows (`window`, `step`) or sentence groups (`--sentence-split true`, requires NLTK).
- Each adjacent pair of windows is embedded; cosine similarity determines section boundaries.
  - Break when `similarity < threshold`.
  - Higher `threshold` ⇒ more breaks (smaller sections).
- Very short sections are merged forward to avoid tiny fragments.

Defaults: `window=500`, `step=400`, `threshold=0.70`. Implementation detail: `min_section_words=60` prevents tiny sections (configurable in code).
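The boundary rule is easy to picture in code. Below is a minimal sketch of the windowing-and-threshold idea, not the package's actual internals: `embed` stands in for whichever backend you configured, and the function names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def split_points(words, embed, window=500, step=400, threshold=0.7):
    # Overlapping word windows: words[0:500], words[400:900], words[800:1300], ...
    starts = range(0, max(len(words) - window, 0) + 1, step)
    vecs = [embed(" ".join(words[s:s + window])) for s in starts]
    # A section break falls wherever two adjacent windows drift apart.
    return [(i + 1) * step                      # word index where the next window starts
            for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]
```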
- Markdown → HTML → sections (by `h1`/`h2`/`h3`).
- For each section:
  - Generate N positive Q&A pairs (covering uncaptured aspects).
  - Optionally generate M negative pairs (plausible but wrong questions; correct answers explain the error).
- Duplicate question texts are filtered out; basic retry logic included.
- Output is JSONL: `{"prompt": "...", "response": "..."}` per line.
Section boundaries
When generating Q&A, text2qna treats each Markdown heading (#, ##, ###) as the start of a new section.
A section consists of the heading text plus all following content, stopping just before the next heading of equal or higher level.
Child subsections are not merged into their parent — each heading defines its own independent section.
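For example, given this input:

```markdown
# Alpha
Intro text.
## Beta
Beta details.
## Gamma
Gamma details.
```

Q&A generation sees three sections: `# Alpha` containing only "Intro text." (it stops before `## Beta`), `## Beta` containing "Beta details.", and `## Gamma` containing "Gamma details.".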
All commands support `--quiet` / `--verbose` for logging.
Convert raw input to Markdown and split into semantically coherent sections.
Usage
```bash
text2qna [--quiet|--verbose] chunk <input> [options]
```

Key options

- `--output <path>`: Output Markdown (default: `<input>.md`)
- `--backend <local|openai|ollama>`: Embedding backend (default: `local`)
- `--embed-model <name>`: Embedding model (backend-specific)
- `--embeddings-url <url>`: Custom embeddings endpoint (Ollama; default `/api/embeddings`)
- `--window <int>`: Word window size (default: `500`)
- `--step <int>`: Word stride/overlap (default: `400`)
- `--threshold <float>`: Break when adjacent similarity `< threshold` (default: `0.70`)
- `--sentence-split <true|false>`: Group by sentences (requires `nltk` punkt)
Notes
- Units for `window`/`step` are words, not characters.
- Higher `threshold` → more, smaller sections; lower `threshold` → fewer, larger sections.
- PDF parsing uses `pdfplumber`; image-only PDFs may need OCR beforehand.
Create Q&A pairs per Markdown section using an OpenAI-compatible chat API.
Usage
```bash
text2qna [--quiet|--verbose] qna <markdown> [options]
```

Key options

- `--output <path>`: JSONL output (default: `dataset.jsonl`)
- `--model <name>`: Chat model (e.g., `gpt-4o-mini`, `llama3.2`)
- `--base-url <url>`: OpenAI-compatible base URL (e.g., `http://localhost:11434/v1`)
- `--api-key <key>`: API key (or set `TEXT2QNA_API_KEY`)
- `--num-pairs <int>`: Pairs per section (default: `3`)
- `--negative-ratio <float>`: Fraction of negative/trick pairs (e.g., `0.3`)
- `--extra-prompt <text>`: Style/constraints (tone, length caps, etc.)
- `--quiet`/`--verbose`: Adjust logging.
- `-h`, `--help`: Command help.
Version helper:
```bash
python -c "import text2qna; print(text2qna.__version__)"
```
Everything has CLI flags; env vars provide convenient defaults.
- `TEXT2QNA_API_KEY` — API key for any OpenAI-compatible API (OpenAI, Claude, local models)
- `TEXT2QNA_BASE_URL` — Base URL for OpenAI-compatible API endpoint
- `TEXT2QNA_MODEL` — Default chat model for Q&A (default: `llama3.2`)
- `TEXT2QNA_EMBED_BACKEND` — `openai` | `ollama` | `local` (default: `local`)
- `TEXT2QNA_EMBED_MODEL` — Embedding model for the chosen backend
- `TEXT2QNA_EMBED_URL` — Embeddings URL for `ollama` (e.g., `http://localhost:11434/api/embeddings`)
- `TEXT2QNA_DEVICE` — Device for local embeddings (`cpu` | `cuda` | `mps`)
Note: For backward compatibility, `OPENAI_API_KEY` and `OPENAI_BASE_URL` are also supported, but `TEXT2QNA_API_KEY` and `TEXT2QNA_BASE_URL` are preferred as they better reflect that any OpenAI-compatible API can be used (OpenAI, Claude, local models, etc.).
- local (default) — `sentence-transformers` models on your machine. Install: `pip install "text2qna[local]"`
- openai — Uses the official `openai` SDK's embeddings endpoint. Requires `OPENAI_API_KEY` and optionally `OPENAI_BASE_URL`.
- ollama — Calls a local `/api/embeddings` endpoint (JSON body: `{"model": "...", "prompt": "..."}`). Example: `http://localhost:11434/api/embeddings`
```python
from text2qna.chunker import load_file, to_markdown, semantic_split_markdown
from text2qna.embeddings import OllamaEmbeddings

raw = load_file("./paper.pdf")
md_text = to_markdown(raw)

embedder = OllamaEmbeddings(
    model="mxbai-embed-large",
    base_url="http://localhost:11434/api/embeddings",
)

chunks_md = semantic_split_markdown(
    md_text,
    embedder=embedder,
    window=500,
    step=400,
    threshold=0.7,
    sentence_split=False,
)

with open("chunks.md", "w", encoding="utf-8") as f:
    f.write(chunks_md)
```

```python
from text2qna.qna import create_dataset_from_markdown, save_dataset_jsonl
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="http://localhost:11434/v1")

with open("chunks.md", "r", encoding="utf-8") as f:
    md = f.read()

dataset = create_dataset_from_markdown(
    md_text=md,
    client=client,
    model="llama3.2",
    num_pairs_per_section=5,
    negative_ratio=0.3,
    extra_prompt="Keep questions concise and answers under 3 sentences.",
)

save_dataset_jsonl(dataset, "dataset.jsonl")
```

Rick & Morty tone
Respond exactly like Rick from Rick and Morty: be rude, impatient, sarcastic, and brutally honest. Keep content factually correct; use informal, unfiltered tone.

Use via `--extra-prompt` or the `extra_prompt` parameter.
OpenAI embeddings
```bash
export TEXT2QNA_API_KEY=your-api-key

text2qna chunk notes.txt \
  --backend openai \
  --embed-model text-embedding-3-small
```

Ollama embeddings
```bash
ollama pull mxbai-embed-large

text2qna chunk page.html \
  --backend ollama \
  --embed-model mxbai-embed-large \
  --embeddings-url http://localhost:11434/api/embeddings
```

Local sentence-transformers
pip install "text2qna[local]" text2qna chunk notes.md \ --backend local \ --embed-model sentence-transformers/all-MiniLM-L6-v2 \ --device cpu- Start simple:
window=500,step=400,threshold=0.7. - Smaller sections (for dense/varied content): increase
threshold(e.g.,0.8–0.9) or reducewindow. - Larger sections (for long uniform prose): decrease
threshold(e.g.,0.5–0.6) or increasewindow. - Sentence alignment:
--sentence-split truecan improve boundaries; requiresnltk+punktdata. - PDFs: Image-only or messy PDFs benefit from OCR first;
pdfplumberreads text layers only. - Style control: Use
--extra-promptto enforce tone, length caps, format (bullets, JSON, etc.).
- NLTK sentence splitting error: run `python -c "import nltk; nltk.download('punkt')"`.
- Ollama embeddings not found: Pull the model locally (e.g., `ollama pull mxbai-embed-large`) and ensure you use the `/api/embeddings` route.
- Poor PDF text: Run OCR; bad text layers cause garbage input.
- Missing deps: If you skipped extras, install what you need: `pip install "text2qna[pdf,local,nltk]"`.
- No Q&A output: Ensure your model name and `--base-url` are correct and accessible; check `--verbose` logs.
How do window, step, and threshold interact? `window`/`step` slide over words to make adjacent spans, which are embedded and compared. A break occurs when `cosine_similarity(span_i, span_{i+1}) < threshold`. Thus, higher threshold ⇒ more breaks (smaller chunks); lower ⇒ fewer breaks. For example, with `window=500` and `step=400`, spans cover words 0–500, 400–900, 800–1300, and so on, so each pair of adjacent spans overlaps by 100 words.
Does the tool require OpenAI? No. You can chunk with local or Ollama embeddings. Q&A generation does require an OpenAI-compatible chat API, but that can be a local server (e.g., Ollama) as long as it exposes the OpenAI-compatible endpoints.
What’s in the JSONL? Lines like `{"prompt": "...", "response": "..."}`. Internal fields (e.g., `is_negative`) are stripped before writing.
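If you want to sanity-check a generated dataset, it loads back with nothing more than the standard library:

```python
import json

# Each line of dataset.jsonl is one {"prompt": ..., "response": ...} object.
with open("dataset.jsonl", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f]

print(dataset[0]["prompt"], "->", dataset[0]["response"])
```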
How do I get the version?
```bash
python -c "import text2qna; print(text2qna.__version__)"
```

Contributions are welcome!
- Fork & create a feature branch
- Install in editable mode with extras: `pip install -e ".[pdf,local,nltk]"`
- Add tests (if applicable)
- Open a PR with a clear description and examples
Bug reports & feature requests: Issues
MIT © Nikolaos Giovanopoulos
- Keep your API keys secret. Prefer environment variables over hardcoding.
- Review your documents for sensitive information before generating datasets.