In this Quickstart you'll learn how to build a RAG workflow using Together AI in 6 quick steps that can be run in under 5 minutes! We will leverage the embedding, reranking, and inference endpoints.

1. Register for an account

First, register for an account to get an API key. Once you've registered, store your account's API key in an environment variable named TOGETHER_API_KEY:
Shell
export TOGETHER_API_KEY=xxxxx 
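You can confirm the variable is set before moving on:
Shell
echo $TOGETHER_API_KEY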

2. Install your preferred library

Together provides an official library for Python:
Shell
pip install together --upgrade 
Python
import os

from together import Together

client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))
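To confirm the client is configured correctly, you can make a quick test call (a minimal sketch; the model name here is just an example, and any chat model you have access to will work):
Python
# Quick sanity check that the API key and client are working
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(response.choices[0].message.content)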

3. Data Processing and Chunking

We will RAG over Paul Graham's essay titled Founder Mode. The code below scrapes the essay and loads it into memory.
Python
import requests
from bs4 import BeautifulSoup


def scrape_pg_essay():
    url = "https://paulgraham.com/foundermode.html"

    try:
        # Send GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes

        # Parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")

        # Paul Graham's essays typically have the main content in a font tag
        # You might need to adjust this selector based on the actual HTML structure
        content = soup.find("font")

        if content:
            # Extract and clean the text
            text = content.get_text()
            # Remove extra whitespace and normalize line breaks
            text = " ".join(text.split())
            return text
        else:
            return "Could not find the main content of the essay."

    except requests.RequestException as e:
        return f"Error fetching the webpage: {e}"


# Scrape the essay
pg_essay = scrape_pg_essay()
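Before chunking, it's worth sanity-checking what came back:
Python
# Inspect the scraped text before processing it further
print(f"Scraped {len(pg_essay)} characters")
print(pg_essay[:250])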
Chunk the essay:
Python
# Naive fixed-size chunking with overlaps
def create_chunks(document, chunk_size=300, overlap=50):
    return [
        document[i : i + chunk_size]
        for i in range(0, len(document), chunk_size - overlap)
    ]


chunks = create_chunks(pg_essay, chunk_size=250, overlap=30)
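With chunk_size=250 and overlap=30, each chunk starts 220 characters after the previous one, so neighboring chunks share 30 characters of context. You can verify the result:
Python
# Check how many chunks we produced and what one looks like
print(f"{len(chunks)} chunks")
print(chunks[0])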

4. Generate Vector Index and Perform Retrieval

We will now use bge-large-en-v1.5 to embed the chunks above into a vector index.
Python
from typing import List

import numpy as np


def generate_embeddings(
    input_texts: List[str],
    model_api_string: str,
) -> np.ndarray:
    """Generate embeddings from Together python library.

    Args:
        input_texts: a list of string input texts.
        model_api_string: str. An API string for a specific embedding model of your choice.

    Returns:
        embeddings_list: a list of embeddings. Each element corresponds to each input text.
    """
    outputs = client.embeddings.create(
        input=input_texts,
        model=model_api_string,
    )
    return np.array([x.embedding for x in outputs.data])


embeddings = generate_embeddings(chunks, "BAAI/bge-large-en-v1.5")
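Each chunk is now a dense vector. The second dimension of the array is the embedding size, which is a property of the model you chose (1024 for bge-large-en-v1.5):
Python
# One row per chunk, one column per embedding dimension
print(embeddings.shape)  # (num_chunks, 1024) for bge-large-en-v1.5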
The function below will help us perform vector search:
Python
def vector_retrieval(
    query: str,
    top_k: int = 5,
    vector_index: np.ndarray = None,
) -> List[int]:
    """
    Retrieve the top-k most similar items from an index based on a query.

    Args:
        query (str): The query string to search for.
        top_k (int, optional): The number of top similar items to retrieve. Defaults to 5.
        vector_index (np.ndarray, optional): The index array containing embeddings to search against. Defaults to None.

    Returns:
        List[int]: A list of indices corresponding to the top-k most similar items in the index.
    """
    query_embedding = np.array(
        generate_embeddings([query], "BAAI/bge-large-en-v1.5")[0]
    )

    similarity_scores = np.dot(query_embedding, vector_index.T)

    return list(np.argsort(-similarity_scores)[:top_k])


top_k_indices = vector_retrieval(
    query="What are 'skip-level' meetings?",
    top_k=5,
    vector_index=embeddings,
)
top_k_chunks = [chunks[i] for i in top_k_indices]
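The dot-product scoring above implicitly treats the embeddings as if they were unit length. If you want to make that assumption explicit, you can normalize both sides and compute true cosine similarities instead (a minimal sketch):
Python
def cosine_scores(query_embedding: np.ndarray, vector_index: np.ndarray) -> np.ndarray:
    # Normalize both sides so the dot products are cosine similarities,
    # independent of vector magnitude
    q = query_embedding / np.linalg.norm(query_embedding)
    index = vector_index / np.linalg.norm(vector_index, axis=1, keepdims=True)
    return index @ q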
We now have a way to retrieve from the vector index given a query.

5. Rerank To Improve Quality

We will use a reranker model to improve the relevance quality of the retrieved chunks:
Python
def rerank(query: str, chunks: List[str], top_k=3) -> List[int]:
    response = client.rerank.create(
        model="Salesforce/Llama-Rank-V1",
        query=query,
        documents=chunks,
        top_n=top_k,
    )

    return [result.index for result in response.results]


rerank_indices = rerank(
    "What are 'skip-level' meetings?",
    chunks=top_k_chunks,
    top_k=3,
)

reranked_chunks = ""

for index in rerank_indices:
    reranked_chunks += top_k_chunks[index] + "\n\n"

print(reranked_chunks)
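If you also want the reranker's confidence in each match, the response carries a score per result. Here's a variant that returns them (a sketch assuming each result exposes a relevance_score field alongside its index):
Python
def rerank_with_scores(query: str, chunks: List[str], top_k=3):
    response = client.rerank.create(
        model="Salesforce/Llama-Rank-V1",
        query=query,
        documents=chunks,
        top_n=top_k,
    )
    # Each result pairs a document index with a relevance score
    return [(r.index, r.relevance_score) for r in response.results]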

6. Call Generative Model - Llama 405b

We will pass the final 3 concatenated chunks into an LLM to get our final answer.
Python
query = "What are 'skip-level' meetings?"  response = client.chat.completions.create(  model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",  messages=[  {"role": "system", "content": "You are a helpful chatbot."},  {  "role": "user",  "content": f"Answer the question: {query}. Use only information provided here: {reranked_chunks}",  },  ], )  response.choices[0].message.content 
If you want to learn more about how to best use open models, refer to our docs!