Retrieval-Augmented Generation (RAG) lets you bring your own data to LLMs—and get real answers. I’ll show how I used the open-source nomic-embed-text-v2-moe model for semantic search in a Rails app, while still using OpenAI for generation.
🧠 What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by feeding them relevant chunks of your data before generating a response. Instead of fine-tuning, we give the model useful context.
Here's the basic pipeline:
```
[ User Question ]
        ↓
[ Embed the Question (Nomic) ]
        ↓
[ Vector Search in PgVector ]
        ↓
[ Retrieve Relevant Chunks ]
        ↓
[ Assemble Prompt ]
        ↓
[ Generate Answer with OpenAI ]
```
🧰 The Stack
- Rails – Backend framework, routes, controllers, and persistence
- Nomic Embedding Model – For semantic understanding of data
- FastAPI – Lightweight Python server to serve embeddings
- PgVector – PostgreSQL extension to store and query vector data
- OpenAI GPT-4 / GPT-3.5 – For the final response generation
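In Gemfile terms, that stack roughly comes down to a handful of gems. This is a sketch of one possible selection (Faraday for calling the embedding service, ruby-openai for generation, neighbor for the pgvector column type), not the only way to wire it up:

```ruby
# Gemfile (excerpt) -- one possible gem selection for this stack
gem "pg"          # PostgreSQL adapter (pgvector lives in the database)
gem "neighbor"    # vector column type + nearest-neighbor helpers for ActiveRecord
gem "faraday"     # HTTP client used to call the local embedding service
gem "ruby-openai" # OpenAI chat completions for the generation step
```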
🛠 Step 1: Run the Nomic Model Locally (Optional but Fast)
You can run the nomic-embed-text-v2-moe model using sentence-transformers in a Python FastAPI app:
```python
from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Depending on your sentence-transformers version, loading this model
# may require trust_remote_code=True.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe")

@app.post("/embed")
async def embed(req: Request):
    data = await req.json()
    input_text = data["input"]
    embedding = model.encode(input_text).tolist()
    return {"embedding": embedding}
```
This becomes your internal embedding API, replacing OpenAI's /embeddings endpoint.
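Before wiring it into Rails, it's worth a quick sanity check that the endpoint returns vectors of the size the migration below expects. A throwaway script like this works, assuming the FastAPI app is running on localhost:8000:

```ruby
# sanity_check_embedding.rb -- hit the local /embed endpoint once
require "faraday"
require "json"

response = Faraday.post(
  "http://localhost:8000/embed",
  { input: "Hello from Rails land" }.to_json,
  "Content-Type" => "application/json"
)

embedding = JSON.parse(response.body)["embedding"]
puts "Got #{embedding.size} dimensions" # expect 768 for nomic-embed-text-v2-moe
```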
📄 Step 2: Chunk and Store Your Data
Split your content into short passages (~100–300 words), embed them via your FastAPI endpoint, and store the results in PostgreSQL with pgvector; a sketch of the full chunk-and-store loop follows the migration below.
Enable the pgvector extension and add a vector column:
```bash
psql -d your_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
```ruby
class AddEmbeddingToDocuments < ActiveRecord::Migration[7.1]
  def change
    # The :vector column type isn't built into Rails; the neighbor gem
    # (commonly paired with pgvector) registers it.
    add_column :documents, :embedding, :vector, limit: 768 # Nomic v2-moe size
  end
end
```
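And here is a minimal sketch of the chunk-and-store loop itself. It assumes a Document model with a content column, the neighbor gem with has_neighbors :embedding declared (so a plain Ruby array can be assigned to the vector column), the get_embedding helper defined in Step 3, and a naive word-based splitter; adapt the chunking strategy to your own content.

```ruby
# Naive word-count chunking; swap in a smarter splitter for production use.
def chunk_text(text, max_words: 200)
  text.split(/\s+/).each_slice(max_words).map { |words| words.join(" ") }
end

def index_document(raw_text)
  chunk_text(raw_text).each do |chunk|
    Document.create!(
      content: chunk,
      embedding: get_embedding(chunk) # same Nomic model used for queries later
    )
  end
end
```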
🤖 Step 3: Embed User Queries via Nomic
In your Rails controller:
```ruby
def get_embedding(text)
  response = Faraday.post(
    "http://localhost:8000/embed",
    { input: text }.to_json,
    "Content-Type" => "application/json"
  )
  JSON.parse(response.body)["embedding"]
end
```
Use the same model for both document and query embeddings.
🔍 Step 4: Perform Vector Search with PgVector
Search your documents for the closest matches using cosine distance:
```ruby
# <=> is pgvector's cosine-distance operator (<-> is Euclidean/L2)
Document.order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'")).limit(5)
```
These top chunks become the context for the LLM.
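Putting Steps 3 and 4 together, a controller action could look like this sketch. It assumes the get_embedding helper above, a content column on Document, and a generate_answer helper that wraps the Step 5 call; the names are illustrative.

```ruby
def ask
  user_input = params[:question]

  # 1. Embed the question with the same Nomic model used at indexing time
  query_vector = get_embedding(user_input)

  # 2. Retrieve the closest chunks by cosine distance
  top_chunks = Document
    .order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'"))
    .limit(5)
    .pluck(:content)

  # 3. Hand the chunks to the generation step (Step 5)
  render json: { answer: generate_answer(user_input, top_chunks) }
end
```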
🧾 Step 5: Build a Smart Prompt for OpenAI
Concatenate the top passages and feed them into OpenAI’s chat API:
```ruby
client.chat(
  parameters: {
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are an assistant answering based on the provided context." },
      { role: "user", content: build_contextual_prompt(user_input, top_chunks) }
    ]
  }
)
```
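build_contextual_prompt can be as simple as numbering the retrieved chunks and appending the question. Here's a minimal sketch; the exact wording of the instructions is up to you:

```ruby
def build_contextual_prompt(user_input, top_chunks)
  context = top_chunks.map.with_index(1) { |chunk, i| "[#{i}] #{chunk}" }.join("\n\n")

  <<~PROMPT
    Answer the question using only the context below.
    If the context doesn't contain the answer, say so.

    Context:
    #{context}

    Question: #{user_input}
  PROMPT
end
```

Numbering the chunks also makes it easy to ask the model to cite which passage it relied on.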
✅ Why Use Nomic for Embeddings?
- High-quality, open-source, multilingual
- No per-token API costs or rate limits: runs locally or self-hosted
- Zero vendor lock-in at the embedding layer
- Great performance on MTEB and real-world retrieval
💡 Why I Still Use OpenAI for the LLM
The generation step is where OpenAI shines. Instead of replacing it prematurely, I decoupled the embedding stage. Now I can experiment, optimize, and even switch LLMs later if needed.
🧠 Takeaways
- RAG doesn’t need to be a heavyweight system.
- Open-source embeddings + OpenAI generation = powerful, flexible hybrid.
- PgVector + Rails makes vector search feel native and hackable.