Posted on Sep 18

AI: RAG Python Problem

Problem Statement:
Current State: CHAOS
500GB of documents
Hours to find answers
Losing $10K/day in productivity
ChatGPT can't access our private data

Your Solution: RAG System
Instant answers (< 1 second)
100% accurate responses
Secure, private data
Save $300K/year

Your RAG Toolkit
Retrieval: Semantic search
Augmentation: Context injection
Generation: Smart responses

Task 1: Set Up Development Environment

Installing Python Libraries ChromaDB - Vector DB Transformers - ML Models Flask - Web Server OpenAI - LLM API

Purpose: Install all dependencies required for building RAG

Steps:

cd /root && mkdir -p rag-project && cd rag-project
python3 -m venv venv && source venv/bin/activate
pip install uv && uv pip install chromadb sentence-transformers openai flask
echo "READY" > /root/rag-setup-complete.txt

Explainations:

python3 -m venv venv: create a virtual environment in a folder named venv. A venv is a self-contained Python environment so dependencies don’t leak into the system Python.
source venv/bin/activate: activate that environment, so pip and python now refer to the virtual environment instead of the global system install.
pip install uv: installs uv, a modern Python package installer and resolver (much faster than regular pip).
uv pip install ...: uses uv as a drop-in replacement for pip to install packages into the virtual environment:
chromadb: vector database for embeddings (used in RAG pipelines).
sentence-transformers: pretrained transformer models for turning text into embeddings.
openai: OpenAI’s official Python client library.
flask: lightweight web framework for serving APIs.

Task 2: Explore TechCorp's Document Vault

 employee-handbook/ pet-policy.md (CEO's dog!) remote-work-policy.md benefits-overview.md product-specs/ cloudsync-pro.md ($1M product) datavault.md meeting-notes/ q3-planning-meeting.md product-launch-review.md customer-faqs/ general-faqs.md Total: 500GB simulated as focused docs

Purpose: Review all the documents before building RAG system

Steps:

cd /root/techcorp-docsls -lafind . -name "*.md" | wc -lfind . -name "*.md" | wc -l > /root/doc-count.txt

Task 3: Initialize Vector Database

ChromaDB Architecture
Documents → Vectors → Semantic Space
"pet policy" → [0.2,-0.5...]
"remote work" → [0.1,0.8...]
"product" → [0.9,0.3...]
384-dimensional semantic understanding

Purpose: Create AI brain for storing document vectors

Steps:

Create init_vectordb.py

import chromadb from chromadb.config import Settings print(" Initializing AI Brain...") client = chromadb.PersistentClient( path="./chroma_db", settings=Settings(anonymized_telemetry=False) ) collection = client.get_or_create_collection( name="techcorp_docs", metadata={"hnsw:space": "cosine"} ) print(f" Brain Created: {collection.name}") print(f" Memories: {collection.count()}") print(" AI Brain Ready!")

Run it: python init_vectordb.py

Task 4: Learn Document Chunking Strategy

Smart Chunking Strategy
Original Document (2000 chars)
Chunked (500 chars, 100 overlap)
↑ Overlaps preserve context = 40% better accuracy

Purpose: Learn optimal chunking strategy BEFORE processing real documents

Steps:

Create test_chunking.py:

import os print(" DOCUMENT CHUNKING ENGINE") print("="*40) def chunk_text(text, size=500, overlap=100): """Smart chunking with overlap for context preservation""" chunks = [] start = 0 while start < len(text): end = min(start + size, len(text)) chunk = text[start:end] chunks.append(chunk) if end >= len(text): break start += size - overlap return chunks # Process sample document sample_doc = """TechCorp Pet Policy: Employees may bring pets to the office on Fridays. Dogs must be well-behaved and vaccinated. The CEO's golden retriever is the office mascot. Remote Work Policy: Employees can work remotely up to 3 days per week. Core hours are 10 AM - 3 PM in your local timezone. All meetings should be recorded for async collaboration. Benefits Overview: Comprehensive health insurance including dental and vision. 401k matching up to 6% of salary. Unlimited PTO after first year. Annual learning budget of $2,000.""" print(f" Original document: {len(sample_doc)} characters") print("-"*40) chunks = chunk_text(sample_doc, size=500, overlap=100) print(f" Created {len(chunks)} chunks") print("-"*40) for i, chunk in enumerate(chunks, 1): print(f"\nChunk {i} ({len(chunk)} chars):") print(f"Preview: {chunk[:60]}...") # Save verification with open('/root/chunk-test.txt', 'w') as f: f.write(f"CHUNKS:{len(chunks)}") print("\n" + "="*40) print(" Chunking complete!") print(f" Stats: {len(chunks)} chunks from {len(sample_doc)} chars") print(" Ready for vectorization!")

python test_chunking.py

Task 5: Understand How Embeddings Work

Semantic Embedding Transformation
"Dogs allowed Fridays" → AI Model → 384D Vector
[0.23, -0.45, 0.67, ..., 0.12]
Semantic Similarity:
"Pets permitted" ↔ "Dogs allowed" = 92%
"Remote work" ↔ "Dogs allowed" = 18%

Purpose: Learn how AI converts text to math BEFORE processing real documents in Task 6

Steps:

test_embeddings.py:

from sentence_transformers import SentenceTransformer import numpy as np print(" Loading Google's AI Brain (all-MiniLM-L6-v2)...") model = SentenceTransformer('all-MiniLM-L6-v2') print(" Brain loaded! 90M parameters ready!\n") # TechCorp test sentences sentences = [ "Dogs are allowed in the office on Fridays", "Pets can come to work on Furry Fridays", "Remote work policy allows 3 days from home" ] print(" Converting text to vectors...") embeddings = model.encode(sentences) print(f" Created {len(embeddings)} vectors of {len(embeddings[0])} dimensions each!\n") # Calculate semantic similarities sim_1_2 = np.dot(embeddings[0], embeddings[1]) sim_1_3 = np.dot(embeddings[0], embeddings[2]) print(" Semantic Similarity Analysis:") print("="*50) print(f"'Dogs allowed' ←→ 'Pets permitted'") print(f"Similarity: {sim_1_2:.3f} (Very Related! )\n") print(f"'Dogs allowed' ←→ 'Remote work'") print(f"Similarity: {sim_1_3:.3f} (Not Related )\n") # Visualization print(" Similarity Scale:") print("0.0 1.0") print(f" Remote {'' * int(sim_1_3*20)}") print(f" Pets {'' * int(sim_1_2*20)}") # Save results with open('/root/embedding-test.txt', 'w') as f: f.write(f"SIM_PET:{sim_1_2:.3f},SIM_REMOTE:{sim_1_3:.3f}") print("\n You've unlocked semantic understanding!")

python test_embeddings.py

Task 6: Feed the AI Brain

Purpose: Process ALL documents using chunking (Task 4) and embeddings (Task 5) into database (Task 3)

Steps:
ingest_documents.py

import os import chromadb from sentence_transformers import SentenceTransformer from pathlib import Path print("TECHCORP KNOWLEDGE INGESTION SYSTEM") print("="*50) # Initialize systems print("Connecting to AI Brain (from Task 3)...") client = chromadb.PersistentClient(path="./chroma_db") collection = client.get_collection("techcorp_docs") print("Loading Semantic Processor (from Task 5)...") model = SentenceTransformer('all-MiniLM-L6-v2') print("All systems online!\n") # Process documents print("Beginning knowledge transfer...") doc_count = 0 total_chunks = 0 for category in Path('/root/techcorp-docs').iterdir(): if category.is_dir(): print(f"\nProcessing {category.name}:") for doc in category.glob('*.md'): print(f" {doc.name}", end="") with open(doc, 'r') as f: content = f.read() # Apply chunking strategy from Task 4! chunks = [content[i:i+500] for i in range(0, len(content), 400)] for i, chunk in enumerate(chunks): doc_id = f"{doc.stem}_{i}" # Apply embedding from Task 5! embedding = model.encode(chunk).tolist() # Store in database from Task 3! collection.add( ids=[doc_id], embeddings=[embedding], documents=[chunk], metadatas={"file": doc.name, "category": category.name} ) total_chunks += 1 doc_count += 1 print(f" ({len(chunks)} chunks)") print("\n" + "="*50) print(f"INGESTION COMPLETE!") print(f"Statistics:") print(f" • Documents processed: {doc_count}") print(f" • Knowledge chunks: {total_chunks}") print(f" • AI IQ increased: +{doc_count*10} points") print(f"\nValue delivered: $500K in searchable knowledge!") # Save results with open('/root/ingest-complete.txt', 'w') as f: f.write(f"DOCS:{doc_count},CHUNKS:{collection.count()}")

python ingest_documents.py

Task 7: Activate Semantic Search Superpowers

Semantic Search in Action
"Can I bring my dog to work?"
↓
Vector Encoding → [0.23, -0.45, 0.67, ...]
↓
Searching 384D Space...

Top Results (by meaning, not keywords!):

pet-policy.md (95% match) "Dogs allowed on Fridays..."
employee-handbook.md (67% match) "Office policies include..."
benefits.md (23% match) "Health benefits for..." Search time: 0.003 seconds

Purpose: Build semantic search that understands MEANING, not just keywords

Steps:

test_search.py

import chromadb from sentence_transformers import SentenceTransformer print(" TECHCORP SEMANTIC SEARCH ENGINE") print("="*50) # Initialize print(" Connecting to Knowledge Base...") client = chromadb.PersistentClient(path="./chroma_db") collection = client.get_collection("techcorp_docs") print(" Loading AI Understanding...") model = SentenceTransformer('all-MiniLM-L6-v2') print(" Search Engine Ready!\n") # CEO's test queries queries = [ "What is the pet policy at TechCorp?", "Tell me about CloudSync Pro features", "How many days of remote work are allowed?" ] results_file = open('/root/search-results.txt', 'w') for query in queries: print(f" Query: '{query}'") print("-" * 50) results_file.write(f"QUERY:{query}\n") # Convert question to vector query_embedding = model.encode(query).tolist() # Semantic search! results = collection.query( query_embeddings=[query_embedding], n_results=3 ) # Display results print(" Top Results (by semantic similarity):") for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])): relevance = 100 - (i * 15) # Simulated relevance print(f"\n {i+1}. [{meta['category']}] {meta['file']} ({relevance}% match)") print(f" Preview: '{doc[:80]}...'") results_file.write(f"RESULT:{meta['category']}/{meta['file']}\n") print("\n" + "="*50 + "\n") results_file.close() print(" SEARCH TEST COMPLETE!") print(" Notice: Found 'pet policy' even when searching 'bring my dog'!") print(" This is the power of semantic understanding!")

python test_search.py

Task 8: Complete RAG Pipeline Test

Complete RAG Pipeline Flow

RETRIEVAL "Benefits?" → [0.3,-0.2,...] → Top 3 Docs
AUGMENTATION Context + Question → Prompt Engineering "Based on: [docs]... Answer: [question]"
GENERATION LLM + Context → Accurate Answer "TechCorp offers healthcare, 401k..." Total Time: < 1 second | Accuracy: 100%

Purpose: Test all three phases of your RAG pipeline working together

Steps:

test_rag_pipeline.py

import chromadb from sentence_transformers import SentenceTransformer import openai import os print(" TECHCORP RAG PIPELINE TEST") print("="*50) # Initialize all systems print(" Initializing RAG Components...") client = chromadb.PersistentClient(path="./chroma_db") collection = client.get_collection("techcorp_docs") model = SentenceTransformer('all-MiniLM-L6-v2') print(" All systems operational!\n") def test_rag_pipeline(question): """Test the complete RAG Pipeline""" print(f" Question: '{question}'") print("-" * 50) # 1. RETRIEVAL PHASE print("\n PHASE 1: RETRIEVAL") print(" Converting question to vector...") query_embedding = model.encode(question).tolist() print(" Searching knowledge base...") results = collection.query( query_embeddings=[query_embedding], n_results=3 ) print(f" Found {len(results['documents'][0])} relevant documents!") # 2. AUGMENTATION PHASE print("\n PHASE 2: AUGMENTATION") print(" Preparing context for AI...") context = "\n\n".join(results['documents'][0]) # 3. GENERATION PHASE (Simulated) print("\n PHASE 3: GENERATION") print(" AI processing with context...") # Simulated response if "benefits" in question.lower(): answer = "Based on TechCorp documents: Employees enjoy comprehensive health insurance, 401k matching up to 6%, unlimited PTO, and professional development budgets." else: answer = f"Based on the retrieved TechCorp documents, here's the answer to '{question}'..." print(" Response generated!") return { 'question': question, 'sources_used': len(results['documents'][0]), 'answer': answer } # Test the pipeline print("\n" + "="*50) print(" TESTING COMPLETE PIPELINE") print("="*50) test_question = "What are the benefits of working at TechCorp?" result = test_rag_pipeline(test_question) print("\n" + "="*50) print(" PIPELINE RESULTS") print("="*50) print(f" Question: {result['question']}") print(f" Sources Used: {result['sources_used']} documents") print(f" Answer: {result['answer']}") # Performance metrics print("\n PERFORMANCE METRICS:") print(" • Retrieval: 0.012 seconds") print(" • Augmentation: 0.003 seconds") print(" • Generation: 0.234 seconds") print(" • Total: 0.249 seconds") # Save pipeline verification with open('/root/rag-pipeline-test.txt', 'w') as f: f.write(f"PIPELINE:COMPLETE,SOURCES:{result['sources_used']}") print("\n" + "="*50) print(" SUCCESS! RAG Pipeline Working!") print("="*50)

python test_rag_pipeline.py

Task 9: Launch Your AI Assistant

Purpose: Deploy and interact with your complete RAG system via web interface

DEV Community

AI: RAG Python Problem

Top comments (0)