Retrieval-Augmented Generation (RAG) lets you bring your own data to LLMs—and get real answers. I’ll show how I used the open-source nomic-embed-text-v2-moe model for semantic search in a Rails app, while still using OpenAI for generation.
🧠 What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by feeding them relevant chunks of your data before generating a response. Instead of fine-tuning, we give the model useful context.
Here's the basic pipeline:
```
[ User Question ]
        ↓
[ Embed the Question (Nomic) ]
        ↓
[ Vector Search in PgVector ]
        ↓
[ Retrieve Relevant Chunks ]
        ↓
[ Assemble Prompt ]
        ↓
[ Generate Answer with OpenAI ]
```
🧰 The Stack
- Rails – Backend framework, routes, controllers, and persistence
- Nomic Embedding Model – For semantic understanding of data
- FastAPI – Lightweight Python server to serve embeddings
- PgVector – PostgreSQL extension to store and query vector data
- OpenAI GPT-4 / GPT-3.5 – For the final response generation
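In Gemfile terms, that stack roughly comes down to a handful of gems. This is a sketch of one possible selection (Faraday for calling the embedding service, ruby-openai for generation, neighbor for the pgvector column type), not the only way to wire it up:

```ruby
# Gemfile (excerpt) -- one possible gem selection for this stack
gem "pg"          # PostgreSQL adapter (pgvector lives in the database)
gem "neighbor"    # vector column type + nearest-neighbor helpers for ActiveRecord
gem "faraday"     # HTTP client used to call the local embedding service
gem "ruby-openai" # OpenAI chat completions for the generation step
```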
🛠 Step 1: Run the Nomic Model Locally (Optional but Fast)
You can run the nomic-embed-text-v2-moe model using sentence-transformers in a Python FastAPI app:
```python
from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Depending on your sentence-transformers version, loading this model
# may require trust_remote_code=True.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe")

@app.post("/embed")
async def embed(req: Request):
    data = await req.json()
    input_text = data["input"]
    embedding = model.encode(input_text).tolist()
    return {"embedding": embedding}
```
This becomes your internal embedding API, replacing OpenAI's /embeddings endpoint.
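Before wiring it into Rails, it's worth a quick sanity check that the endpoint returns vectors of the size the migration below expects. A throwaway script like this works, assuming the FastAPI app is running on localhost:8000:

```ruby
# sanity_check_embedding.rb -- hit the local /embed endpoint once
require "faraday"
require "json"

response = Faraday.post(
  "http://localhost:8000/embed",
  { input: "Hello from Rails land" }.to_json,
  "Content-Type" => "application/json"
)

embedding = JSON.parse(response.body)["embedding"]
puts "Got #{embedding.size} dimensions" # expect 768 for nomic-embed-text-v2-moe
```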
📄 Step 2: Chunk and Store Your Data
Split your content into short passages (~100–300 words), embed them via your FastAPI endpoint, and store the results in PostgreSQL with pgvector; a sketch of the full chunk-and-store loop follows the migration below.
Enable the pgvector extension and add a vector column:
```bash
psql -d your_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
```ruby
class AddEmbeddingToDocuments < ActiveRecord::Migration[7.1]
  def change
    # The :vector column type isn't built into Rails; the neighbor gem
    # (commonly paired with pgvector) registers it.
    add_column :documents, :embedding, :vector, limit: 768 # Nomic v2-moe size
  end
end
```
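And here is a minimal sketch of the chunk-and-store loop itself. It assumes a Document model with a content column, the neighbor gem with has_neighbors :embedding declared (so a plain Ruby array can be assigned to the vector column), the get_embedding helper defined in Step 3, and a naive word-based splitter; adapt the chunking strategy to your own content.

```ruby
# Naive word-count chunking; swap in a smarter splitter for production use.
def chunk_text(text, max_words: 200)
  text.split(/\s+/).each_slice(max_words).map { |words| words.join(" ") }
end

def index_document(raw_text)
  chunk_text(raw_text).each do |chunk|
    Document.create!(
      content: chunk,
      embedding: get_embedding(chunk) # same Nomic model used for queries later
    )
  end
end
```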
🤖 Step 3: Embed User Queries via Nomic
In your Rails controller:
```ruby
def get_embedding(text)
  response = Faraday.post(
    "http://localhost:8000/embed",
    { input: text }.to_json,
    "Content-Type" => "application/json"
  )
  JSON.parse(response.body)["embedding"]
end
```
Use the same model for both document and query embeddings.
🔍 Step 4: Perform Vector Search with PgVector
Search your documents for the closest matches using cosine distance:
```ruby
# <=> is pgvector's cosine-distance operator (<-> is Euclidean/L2)
Document.order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'")).limit(5)
```
These top chunks become the context for the LLM.
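Putting Steps 3 and 4 together, a controller action could look like this sketch. It assumes the get_embedding helper above, a content column on Document, and a generate_answer helper that wraps the Step 5 call; the names are illustrative.

```ruby
def ask
  user_input = params[:question]

  # 1. Embed the question with the same Nomic model used at indexing time
  query_vector = get_embedding(user_input)

  # 2. Retrieve the closest chunks by cosine distance
  top_chunks = Document
    .order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'"))
    .limit(5)
    .pluck(:content)

  # 3. Hand the chunks to the generation step (Step 5)
  render json: { answer: generate_answer(user_input, top_chunks) }
end
```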
🧾 Step 5: Build a Smart Prompt for OpenAI
Concatenate the top passages and feed them into OpenAI’s chat API:
```ruby
client.chat(
  parameters: {
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are an assistant answering based on the provided context." },
      { role: "user", content: build_contextual_prompt(user_input, top_chunks) }
    ]
  }
)
```
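build_contextual_prompt can be as simple as numbering the retrieved chunks and appending the question. Here's a minimal sketch; the exact wording of the instructions is up to you:

```ruby
def build_contextual_prompt(user_input, top_chunks)
  context = top_chunks.map.with_index(1) { |chunk, i| "[#{i}] #{chunk}" }.join("\n\n")

  <<~PROMPT
    Answer the question using only the context below.
    If the context doesn't contain the answer, say so.

    Context:
    #{context}

    Question: #{user_input}
  PROMPT
end
```

Numbering the chunks also makes it easy to ask the model to cite which passage it relied on.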
✅ Why Use Nomic for Embeddings?
- High-quality, open-source, multilingual
- No per-token API costs or rate limits: runs locally or self-hosted
- Zero vendor lock-in at the embedding layer
- Great performance on MTEB and real-world retrieval
💡 Why I Still Use OpenAI for the LLM
The generation step is where OpenAI shines. Instead of replacing it prematurely, I decoupled the embedding stage. Now I can experiment, optimize, and even switch LLMs later if needed.
🧠 Takeaways
- RAG doesn’t need to be a heavyweight system.
- Open-source embeddings + OpenAI generation = powerful, flexible hybrid.
- PgVector + Rails makes vector search feel native and hackable.