
Surya

How I Built a Natural Language to SQL Translator Using Ruby and GPT-4

Introduction

Building dynamic reports often requires writing SQL queries, which can be inefficient and difficult to scale in a production environment. To streamline this process, I built a solution using Large Language Models (LLMs) that generates reports from natural language queries.

I implemented this in Ruby on Rails 💎, my preferred language, but the approach can be applied in any language, such as Python 🐍, Java ☕, or JavaScript 📜.


Overall Workflow

Here's the overall workflow:

User Query ➡️ Table Identification ➡️ SQL Generation ➡️ SQL Execution ➡️ Result Delivery

We separate the process into two clear steps:

  1. Table Identification
  2. SQL Generation and Execution

The Problem

Users need to retrieve data insights from databases without writing SQL queries manually. Such a system should:

  1. Understand user intent 🧠 – Identify the relevant database table.
  2. Generate the appropriate SQL query 📝 – Convert the natural language query into executable SQL.
  3. Fetch and return results in a structured format 📈 – Output the data in an easy-to-consume format.

This approach ensures a seamless experience for non-technical users while leveraging the power of LLMs to dynamically generate reports.


The Solution

I built a modular system that consists of three primary components:

1. Llm::Chat - The OpenAI Chat API Wrapper

This class handles communication with OpenAI's Chat API.

```ruby
require 'net/http'
require 'json'

module Llm
  class Chat
    OPENAI_API_URL = "https://api.openai.com/v1/chat/completions"
    API_KEY = ENV['OPENAI_API_KEY']

    def initialize(payload:)
      @payload = payload
    end

    def call
      response = request_openai
      parse_response(response)
    end

    private

    def request_openai
      uri = URI(OPENAI_API_URL)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true

      request = Net::HTTP::Post.new(uri.path, headers)
      request.body = @payload.to_json
      http.request(request)
    end

    def headers
      {
        "Content-Type" => "application/json",
        "Authorization" => "Bearer #{API_KEY}"
      }
    end

    # Extracts choices[0].message.content; returns nil on malformed JSON
    # or an unexpected response shape instead of raising.
    def parse_response(response)
      JSON.parse(response.body).dig("choices", 0, "message", "content")&.strip
    rescue JSON::ParserError
      nil
    end
  end
end
```
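The parsing assumes the standard Chat Completions response shape. A quick way to sanity-check that extraction offline is to feed it a stubbed response body; the JSON below is a hand-written sample, not a real API reply:

```ruby
require 'json'

# Hand-written sample in the Chat Completions response shape.
body = {
  "choices" => [
    { "message" => { "role" => "assistant", "content" => "  users  " } }
  ]
}.to_json

# Same extraction the wrapper performs: choices[0].message.content, stripped.
content = JSON.parse(body).dig("choices", 0, "message", "content")&.strip
puts content # => users
```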

Prompts Used

We use predefined prompts to maintain consistency across interactions with the LLM.

```ruby
module Prompts
  TABLE_IDENTIFICATION = "Given a user query, determine the most relevant table or tables from [users, departments, tickets]. If the query involves multiple tables (e.g., grouping users by department), return a comma-separated list of table names. Only return the table name(s) with no extra text."

  SQL_GENERATION = "Generate a MySQL query based on the table structure: %{table_structure}. Support queries involving multiple tables where applicable (e.g., grouping users by department). Only return the SQL query as plain text with no formatting, explanations, or markdown."
end
```
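The `%{table_structure}` placeholder in SQL_GENERATION is filled in with Ruby's format operator before the prompt is sent. A minimal illustration (the table DDL here is a shortened stand-in):

```ruby
# String#% with a keyword hash substitutes each %{...} placeholder.
template = "Generate a MySQL query based on the table structure: %{table_structure}."
prompt = template % { table_structure: "CREATE TABLE users (id INT PRIMARY KEY)" }

puts prompt
# => Generate a MySQL query based on the table structure: CREATE TABLE users (id INT PRIMARY KEY).
```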

2. Identifying the Relevant Table

The first step is to determine the database table that best matches the user's query. This is handled by TableIdentifier.

```ruby
class TableIdentifier
  def initialize(query:)
    @query = query
  end

  def call
    chat
  end

  private

  def chat
    Llm::Chat.new(payload: chat_payload).call
  end

  def chat_payload
    {
      model: "gpt-4",
      messages: [
        { role: "system", content: Prompts::TABLE_IDENTIFICATION },
        { role: "user", content: @query }
      ],
      max_tokens: 100
    }
  end
end
```
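Because the prompt may return a comma-separated list (e.g. "users, departments"), it can help to normalize the response into an array before using it. A hypothetical helper, not part of the original code:

```ruby
# Hypothetical helper: normalizes the LLM's table response into an array.
# "users, departments" -> ["users", "departments"]
def parse_table_names(response)
  response.to_s.split(",").map(&:strip).reject(&:empty?)
end

parse_table_names("users, departments") # => ["users", "departments"]
parse_table_names("tickets")            # => ["tickets"]
```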

Example Table Structures

For reference, here are example table structures:

Users Table 👥

```sql
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  email VARCHAR(255),
  status VARCHAR(50),
  department_id INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
```

Departments Table 🏢

```sql
CREATE TABLE departments (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  manager_id INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
```

Tickets Table 🎫

```sql
CREATE TABLE tickets (
  id INT PRIMARY KEY,
  user_id INT,
  subject VARCHAR(255),
  status VARCHAR(50),
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
```

3. Generating and Executing the SQL Query

Once the table is identified, the next step is generating a valid SQL query. This is done using ReportGenerator.

```ruby
require "#{Rails.root}/lib/llm"

class ReportGenerator
  def initialize(query:)
    @query = query
  end

  def call
    report
  end

  private

  def report
    [
      { type: "text", data: "Here is your report" },
      { type: "table", data: ActiveRecord::Base.connection.select_all(query).to_a }
    ]
  end

  # The identifier may return several comma-separated tables, so fetch
  # the structure of each one and join them for the prompt.
  def table_structure
    table_names.map do |name|
      ActiveRecord::Base.connection.execute("SHOW CREATE TABLE #{name}").first[1]
    end.join("\n\n")
  end

  def table_names
    TableIdentifier.new(query: @query).call.split(",").map(&:strip)
  end

  def query
    Llm::Chat.new(payload: query_payload).call
  end

  def query_payload
    {
      model: "gpt-4",
      messages: [
        { role: "system", content: Prompts::SQL_GENERATION % { table_structure: table_structure } },
        { role: "user", content: @query }
      ],
      max_tokens: 1000
    }
  end
end
```

Example Usage 🛠️

With this setup, generating a report is as simple as making a method call:

```ruby
ReportGenerator.new(query: "count of inactive users").call
ReportGenerator.new(query: "list of active users").call
ReportGenerator.new(query: "number of users per department").call
```
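Each call returns the array built in ReportGenerator's report method: a text entry followed by a table entry whose data is an array of row hashes, as select_all returns them. A small sketch of rendering such a result, using a hand-written sample report rather than a live database:

```ruby
# Sample report in the shape the generator returns; the row data here is
# hand-written, not fetched from a database.
report = [
  { type: "text",  data: "Here is your report" },
  { type: "table", data: [{ "COUNT(*)" => 42 }] }
]

report.each do |part|
  case part[:type]
  when "text"
    puts part[:data]
  when "table"
    part[:data].each { |row| puts row.map { |k, v| "#{k}: #{v}" }.join(", ") }
  end
end
```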

For a query like "count of inactive users", the system will generate:

```sql
SELECT COUNT(*) FROM users WHERE status = 'inactive';
```

For a query like "number of users per department", the system will generate:

```sql
SELECT d.name, COUNT(u.id)
FROM users u
JOIN departments d ON u.department_id = d.id
GROUP BY d.name;
```

Disclaimer ⚠️

The prompts may require tweaking through trial and error to achieve optimal results, and you will likely need to adjust them for your database schema and specific reporting requirements.


Benefits of This Approach 🚀

  1. No manual SQL writing ✅ – Users can generate reports using natural language.
  2. Adaptability 🔄 – The prompts can be extended to support new tables or more complex queries.
  3. Security 🔒 – Restricting the prompt to a known list of tables limits what the model can query, though generated SQL should still be validated before execution.
  4. Scalability 📈 – Works across multiple datasets without custom development for each request.
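Since the generated SQL is executed directly against the database, a simple guard helps enforce the security point above. A hypothetical read-only check, not part of the original code, that rejects anything other than a single SELECT statement:

```ruby
# Hypothetical guard: allow only single-statement SELECT queries through.
FORBIDDEN = /\b(insert|update|delete|drop|alter|truncate|grant)\b/i

def safe_select?(sql)
  statement = sql.to_s.strip
  statement.match?(/\Aselect\b/i) &&
    !statement.match?(FORBIDDEN) &&
    !statement.sub(/;\s*\z/, "").include?(";") # reject stacked statements
end

safe_select?("SELECT COUNT(*) FROM users")  # => true
safe_select?("DROP TABLE users")            # => false
safe_select?("SELECT 1; DELETE FROM users") # => false
```

A keyword denylist like this is a coarse filter; running reports through a read-only database user is a stronger complement.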

Conclusion 🎯

By leveraging LLMs, we can automate the translation of user intent into SQL queries, making data retrieval seamless and efficient. This approach removes the need for manual query writing while staying adaptable as schemas and reporting needs evolve.

Would you implement a similar solution in your application? Let me know your thoughts!
