
Commit 5bf8772

add option to use user-defined SQL table info (langchain-ai#1347)
Currently, table information is gathered through SQLAlchemy as the complete table DDL plus a user-selected number of sample rows from each table. This PR adds the option to use user-defined table information instead of collecting it automatically. The provided table information is used where available, and automatic gathering remains the fallback for any table the user didn't provide information for.

Off the top of my head, there are a few cases where this can be quite useful:

- The first n rows of a table are uninformative, or very similar to one another. In this case, hand-crafting example rows so that they provide good, diverse information can be very helpful. Another approach we can think about later is taking a random sample of n rows instead of the first n rows, but there are some performance considerations to weigh there. Even so, hand-crafting the sample rows is useful and can guarantee the model sees informative data.
- The user doesn't want every column to be available to the model. This is not an elegant way to fulfill this specific need, since the user has to provide the full table definition instead of a simple list of columns to include or ignore, but it does work for this purpose.
- For developers, this makes it a lot easier to compare/benchmark the performance of different prompting structures for providing table information in the prompt.

These are cases I've run into myself (particularly cases 1 and 3), and I've found these changes useful. Personally, I keep custom table info for a few tables in a YAML file for versioning and easy loading. Definitely open to other opinions/approaches though!
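
The commit message mentions keeping custom table info in a file for versioning and easy loading. A minimal sketch of that workflow follows. It swaps in JSON for YAML so that it needs only the standard library; the helper names `save_table_info` and `load_table_info`, the file name, and the example data are all hypothetical and not part of this commit.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def save_table_info(path: Path, table_info: Dict[str, str]) -> None:
    """Write the table-name -> table-info mapping to a JSON file."""
    path.write_text(json.dumps(table_info, indent=2), encoding="utf-8")


def load_table_info(path: Path) -> Dict[str, str]:
    """Read the mapping back and check that it has the expected shape."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict) or not all(
        isinstance(key, str) and isinstance(value, str)
        for key, value in data.items()
    ):
        raise TypeError("expected a JSON object mapping table names to strings")
    return data


# Hypothetical example data; the file name is an assumption.
example = {"Track": 'CREATE TABLE Track (\n\t"TrackId" INTEGER NOT NULL\n)'}
with tempfile.TemporaryDirectory() as tmp:
    info_path = Path(tmp) / "custom_table_info.json"
    save_table_info(info_path, example)
    custom_table_info = load_table_info(info_path)
```

The loaded dictionary has the shape that the new `custom_table_info` parameter expects: table names as keys and table info strings as values.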
1 parent 924bba5 commit 5bf8772

2 files changed: +146 -1 lines changed

docs/modules/chains/examples/sqlite.ipynb

Lines changed: 126 additions & 1 deletion
@@ -434,6 +434,131 @@
 "db_chain.run(\"What are some example tracks by Bach?\")"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "ef94e948",
+"metadata": {},
+"source": [
+"### Custom Table Info\n",
+"In some cases, it can be useful to provide custom table information instead of using the automatically generated table definitions and the first `sample_rows_in_table_info` sample rows. For example, if you know that the first few rows of a table are uninformative, it could help to manually provide example rows that are more diverse or provide more information to the model. It is also possible to limit the columns that will be visible to the model if there are unnecessary columns. \n",
+"\n",
+"This information can be provided as a dictionary with table names as the keys and table information as the values. For example, let's provide a custom definition and sample rows for the Track table with only a few columns:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 16,
+"id": "2ad33ab1",
+"metadata": {},
+"outputs": [],
+"source": [
+"custom_table_info = {\n",
+" \"Track\": \"\"\"CREATE TABLE Track (\n",
+"\t\"TrackId\" INTEGER NOT NULL, \n",
+"\t\"Name\" NVARCHAR(200) NOT NULL,\n",
+"\t\"Composer\" NVARCHAR(220),\n",
+"\tPRIMARY KEY (\"TrackId\")\n",
+")\n",
+"\n",
+"SELECT * FROM 'Track' LIMIT 3;\n",
+"TrackId Name Composer\n",
+"1 For Those About To Rock (We Salute You) Angus Young, Malcolm Young, Brian Johnson\n",
+"2 Balls to the Wall None\n",
+"3 My favorite song ever The coolest composer of all time\"\"\"\n",
+"}"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 17,
+"id": "db144352",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"\n",
+"CREATE TABLE \"Playlist\" (\n",
+"\t\"PlaylistId\" INTEGER NOT NULL, \n",
+"\t\"Name\" NVARCHAR(120), \n",
+"\tPRIMARY KEY (\"PlaylistId\")\n",
+")\n",
+"\n",
+"SELECT * FROM 'Playlist' LIMIT 2;\n",
+"PlaylistId Name\n",
+"1 Music\n",
+"2 Movies\n",
+"\n",
+"CREATE TABLE Track (\n",
+"\t\"TrackId\" INTEGER NOT NULL, \n",
+"\t\"Name\" NVARCHAR(200) NOT NULL,\n",
+"\t\"Composer\" NVARCHAR(220),\n",
+"\tPRIMARY KEY (\"TrackId\")\n",
+")\n",
+"\n",
+"SELECT * FROM 'Track' LIMIT 3;\n",
+"TrackId Name Composer\n",
+"1 For Those About To Rock (We Salute You) Angus Young, Malcolm Young, Brian Johnson\n",
+"2 Balls to the Wall None\n",
+"3 My favorite song ever The coolest composer of all time\n"
+]
+}
+],
+"source": [
+"db = SQLDatabase.from_uri(\n",
+" \"sqlite:///../../../../notebooks/Chinook.db\",\n",
+" include_tables=['Track', 'Playlist'],\n",
+" sample_rows_in_table_info=2,\n",
+" custom_table_info=custom_table_info)\n",
+"\n",
+"print(db.table_info)"
+]
+},
+{
+"cell_type": "markdown",
+"id": "5fc6f507",
+"metadata": {},
+"source": [
+"Note how our custom table definition and sample rows for `Track` overrides the `sample_rows_in_table_info` parameter. Tables that are not overriden by `custom_table_info`, in this example `Playlist`, will have their table info gathered automatically as usual."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 18,
+"id": "dfbda4e6",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"\n",
+"\n",
+"\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n",
+"What are some example tracks by Bach? \n",
+"SQLQuery:\u001b[32;1m\u001b[1;3m SELECT Name, Composer FROM Track WHERE Composer LIKE '%Bach%' LIMIT 5;\u001b[0m\n",
+"SQLResult: \u001b[33;1m\u001b[1;3m[('American Woman', 'B. Cummings/G. Peterson/M.J. Kale/R. Bachman'), ('Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Johann Sebastian Bach'), ('Aria Mit 30 Veränderungen, BWV 988 \"Goldberg Variations\": Aria', 'Johann Sebastian Bach'), ('Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude', 'Johann Sebastian Bach'), ('Toccata and Fugue in D Minor, BWV 565: I. Toccata', 'Johann Sebastian Bach')]\u001b[0m\n",
+"Answer:\u001b[32;1m\u001b[1;3m Some example tracks by Bach are 'American Woman', 'Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Aria Mit 30 Veränderungen, BWV 988 \"Goldberg Variations\": Aria', 'Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude', and 'Toccata and Fugue in D Minor, BWV 565: I. Toccata'.\u001b[0m\n",
+"\u001b[1m> Finished chain.\u001b[0m\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"' Some example tracks by Bach are \\'American Woman\\', \\'Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace\\', \\'Aria Mit 30 Veränderungen, BWV 988 \"Goldberg Variations\": Aria\\', \\'Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude\\', and \\'Toccata and Fugue in D Minor, BWV 565: I. Toccata\\'.'"
+]
+},
+"execution_count": 18,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)\n",
+"db_chain.run(\"What are some example tracks by Bach?\")"
+]
+},
 {
 "cell_type": "markdown",
 "id": "c12ae15a",
@@ -542,7 +667,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.1"
+"version": "3.10.9"
 }
 },
 "nbformat": 4,

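The notebook cell in this diff writes the `Track` entry by hand as one long string. As a rough sketch of how such an entry could be assembled from a table definition and hand-picked rows, the hypothetical helper `format_table_info` below (not part of this commit) follows the layout shown in the notebook output: the DDL, a `SELECT * FROM ... LIMIT n;` line, a header row, and the sample rows. Separating columns with tabs is an assumption.

```python
from typing import Sequence


def format_table_info(
    ddl: str,
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence[object]],
) -> str:
    """Join a CREATE TABLE statement with hand-picked sample rows.

    Hypothetical helper: the tab separator is an assumption about the layout.
    """
    header = "\t".join(columns)
    body = "\n".join("\t".join(str(value) for value in row) for row in rows)
    return f"{ddl}\n\nSELECT * FROM '{table}' LIMIT {len(rows)};\n{header}\n{body}"


# Reduced Track definition with only the columns the model should see.
track_ddl = (
    "CREATE TABLE Track (\n"
    '\t"TrackId" INTEGER NOT NULL,\n'
    '\t"Name" NVARCHAR(200) NOT NULL,\n'
    '\t"Composer" NVARCHAR(220),\n'
    '\tPRIMARY KEY ("TrackId")\n'
    ")"
)

custom_table_info = {
    "Track": format_table_info(
        track_ddl,
        "Track",
        ["TrackId", "Name", "Composer"],
        [
            (1, "For Those About To Rock (We Salute You)",
             "Angus Young, Malcolm Young, Brian Johnson"),
            (2, "Balls to the Wall", None),
        ],
    )
}
```

Generating the string this way keeps the `LIMIT` value consistent with the number of rows supplied, which is easy to get wrong when editing the text by hand.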
langchain/sql_database.py

Lines changed: 20 additions & 0 deletions
@@ -20,6 +20,7 @@ def __init__(
         ignore_tables: Optional[List[str]] = None,
         include_tables: Optional[List[str]] = None,
         sample_rows_in_table_info: int = 3,
+        custom_table_info: Optional[dict] = None,
     ):
         """Create engine from database URI."""
         self._engine = engine
@@ -49,6 +50,21 @@ def __init__(

         self._sample_rows_in_table_info = sample_rows_in_table_info

+        self._custom_table_info = custom_table_info
+        if self._custom_table_info:
+            if not isinstance(self._custom_table_info, dict):
+                raise TypeError(
+                    "table_info must be a dictionary with table names as keys and the "
+                    "desired table info as values"
+                )
+            # only keep the tables that are also present in the database
+            intersection = set(self._custom_table_info).intersection(self._all_tables)
+            self._custom_table_info = dict(
+                (table, self._custom_table_info[table])
+                for table in self._custom_table_info
+                if table in intersection
+            )
+
         self._metadata = metadata or MetaData()
         self._metadata.reflect(bind=self._engine)

@@ -99,6 +115,10 @@ def get_table_info(self, table_names: Optional[List[str]] = None) -> str:

         tables = []
         for table in meta_tables:
+            if self._custom_table_info and table.name in self._custom_table_info:
+                tables.append(self._custom_table_info[table.name])
+                continue
+
             # add create table command
             create_table = str(CreateTable(table).compile(self._engine))

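The two hunks above can be read as a filter step followed by an override-with-fallback step. The standalone sketch below mirrors that logic without SQLAlchemy so it can be run in isolation. The function names, the `auto_info` callback, and the blank-line join are illustrative assumptions, not code from `sql_database.py`.

```python
from typing import Callable, Dict, Iterable, Optional


def filter_custom_info(
    custom: Optional[Dict[str, str]], all_tables: Iterable[str]
) -> Dict[str, str]:
    """Keep only the custom entries whose table exists in the database."""
    if not custom:
        # As in the diff, falsy values skip the type check entirely.
        return {}
    if not isinstance(custom, dict):
        raise TypeError("custom table info must map table names to info strings")
    known = set(all_tables)
    return {name: info for name, info in custom.items() if name in known}


def build_table_info(
    table_names: Iterable[str],
    custom: Dict[str, str],
    auto_info: Callable[[str], str],
) -> str:
    """Use custom info where provided, otherwise fall back to auto_info."""
    parts = [
        custom[name] if name in custom else auto_info(name)
        for name in table_names
    ]
    # Joining with a blank line is an assumption about the surrounding code.
    return "\n\n".join(parts)


# Hypothetical data: "Missing" is not in the database, so it is dropped.
custom = filter_custom_info(
    {"Track": "<custom Track info>", "Missing": "<ignored>"},
    all_tables=["Track", "Playlist"],
)
result = build_table_info(
    ["Playlist", "Track"],
    custom,
    auto_info=lambda name: f"<auto {name} info>",
)
```

One consequence worth noting is that entries for tables absent from the database are dropped silently rather than reported, so a misspelled table name falls back to automatic gathering without any warning.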