Commit a990df7

feat(samples): provide guide for ANN vector store end-to-end usage in Jupyter Notebook
This change provides a copy+pastable guide for using the ANN algorithms inside the Jupyter Notebook. Updates #94
1 parent 5a25f91 commit a990df7

File tree

2 files changed: +148 −17 lines changed


noxfile.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
 
 DEFAULT_PYTHON_VERSION = "3.10"
 CURRENT_DIRECTORY = pathlib.Path(__file__).parent.absolute()
-LINT_PATHS = ["src", "tests", "noxfile.py"]
+LINT_PATHS = ["samples", "src", "tests", "noxfile.py"]
 
 
 nox.options.sessions = [
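
The hunk above only touches LINT_PATHS. For context, here is a hedged sketch of how a nox lint session typically consumes that list; the session body below is an assumption for illustration, not code from this repository:

import nox

LINT_PATHS = ["samples", "src", "tests", "noxfile.py"]

@nox.session(python="3.10")
def lint(session: nox.Session) -> None:
    # Hypothetical session body: with "samples" added to LINT_PATHS, the
    # new notebook guide under samples/ is linted alongside src and tests.
    session.install("black", "isort")
    session.run("black", "--check", *LINT_PATHS)
    session.run("isort", "--check-only", *LINT_PATHS)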

samples/langchain_quick_start.ipynb

Lines changed: 147 additions & 16 deletions
@@ -194,7 +194,10 @@
 {
  "cell_type": "markdown",
  "metadata": {
-  "collapsed": false
+  "collapsed": false,
+  "jupyter": {
+   "outputs_hidden": false
+  }
 },
 "source": [
  "2. Enable the APIs for Spanner and Vertex AI within your project."
@@ -512,16 +515,12 @@
   "id": "z9uLV3bs4noo"
  },
  "source": [
-  "# **Use case 2: Spanner as Vector Store**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
-  "id": "duVsSeMcgEWl"
- },
- "source": [
-  "Now, let's learn how to put all of the documents we just loaded into a vector store so that we can use vector search to answer our user's questions!"
+  "# **Use case 2: Spanner as Vector Store**\n",
+  "Google Cloud Spanner supports 2 different nearest-neighbor algorithms for which we have added vector store capabilities:\n",
+  "* K-Nearest Neighbors (KNN)\n",
+  "* Approximate Nearest Neighbors (ANN)\n",
+  "\n",
+  "When your dataset is small, the K-Nearest Neighbors (KNN) algorithm works well, but with large datasets the latency and cost of a KNN search increase, so you will want to use Approximate Nearest Neighbors (ANN).\nWe will demonstrate how to use both!"
  ]
 },
 {
@@ -530,9 +529,10 @@
   "id": "jfH8oQJ945Ko"
  },
  "source": [
-  "### Create Your Vector Store table\n",
+  "### K-Nearest Neighbors (KNN) based vector store\n",
   "\n",
-  "Based on the documents that we loaded before, we want to create a table with a vector column as our vector store. We will start it by intializing a vector table by calling the `init_vectorstore_table` function from our `SpannerVectorStore`. As you can see we list all of the columns for our metadata.\n"
+  "This algorithm is ideal when the dataset is small.\n",
+  "Based on the documents that we loaded before, we want to create a table with a vector column as our vector store using the KNN algorithm. We will start by initializing a vector table with the `init_vector_store_table` function from our `SpannerVectorStore`. As you can see, we list all of the columns for our metadata.\n"
  ]
 },
 {
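
The KNN code cells themselves are unchanged by this commit, so they don't appear in the diff. For reference, here is a minimal sketch of the KNN flow the cell above describes. It assumes the notebook's instance_id, database_id, and embeddings_service variables are already defined; the table name is hypothetical, it relies on the library's default content/embedding column names, and it assumes that omitting query_parameters leaves SpannerVectorStore on its exact (KNN) search default:

from langchain_google_spanner.vector_store import SpannerVectorStore, TableColumn

knn_table = "articles_knn"  # hypothetical table for this sketch

SpannerVectorStore.init_vector_store_table(
    instance_id=instance_id,
    database_id=database_id,
    table_name=knn_table,
    vector_size=768,  # matches the textembedding-gecko@003 output size
    id_column="row_id",
    metadata_columns=[
        TableColumn(name="title", type="STRING(MAX)", is_null=False),
    ],
)

db = SpannerVectorStore(
    instance_id=instance_id,
    database_id=database_id,
    table_name=knn_table,
    id_column="row_id",
    embedding_service=embeddings_service,
)

# With no explicit query_parameters, searches use the exact (KNN) default,
# comparing the query vector against every stored vector.
docs = db.similarity_search("Open source software", k=2)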
@@ -695,6 +695,136 @@
   " )"
  ]
 },
+{
+ "cell_type": "markdown",
+ "metadata": {
+  "id": "jfH8oQJ945Ko"
+ },
+ "source": [
+  "### Approximate Nearest Neighbors (ANN) based vector store\n",
+  "\n",
+  "For this task, we will pull in documents from a popular HackerNews post, insert them into our ANN-based vector store, and then use ANN to find the most relevant comments/content.\n\nTo create vector embeddings, we will use Google's Vertex AI textembedding-gecko@003 model; each query is vectorized with the same embedding service before performing the search.\n\n",
+  "Cloud Spanner allows the vector search index to be created with, and correspondingly searched by, 3 different algorithms:\n",
+  "* APPROX_COSINE\n",
+  "* APPROX_DOT_PRODUCT\n",
+  "* APPROX_EUCLIDEAN_DISTANCE\n",
+  "\n\nIn this example, we will be using `APPROX_COSINE`.\n",
+  "Our steps are:\n* Creating the text embedding service\n* Initializing the ANN vector store\n* Loading data from a popular HackerNews post\n* Adding the documents to the vector store\n* Searching via similarity_search, similarity_search_by_vector, and max_marginal_relevance_search_with_score_by_vector\n* Deleting the inserted documents\n\nAll of the above use the LangChain `VectorStore` interface.\n\n"
+ ]
+},
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+  "import os\n",
+  "import uuid\n",
+  "\n",
+  "from langchain_community.document_loaders import HNLoader\n",
+  "from langchain_google_vertexai.embeddings import VertexAIEmbeddings\n",
+  "from langchain_google_spanner.vector_store import (\n",
+  "    DistanceStrategy,\n",
+  "    QueryParameters,\n",
+  "    SpannerVectorStore,\n",
+  "    TableColumn,\n",
+  "    VectorSearchIndex,\n",
+  ")\n",
+  "\n",
+  "embeddings_service = VertexAIEmbeddings(\n",
+  "    model_name=\"textembedding-gecko@003\", project=project_id\n",
+  ")\n",
+  "table_name_ANN = \"hnn_articles\"\n",
+  "embedding_vector_size = 768\n",
+  "vector_index_name = \"titles_index\"\n",
+  "title_embedding_column = TableColumn(\n",
+  "    name=\"title_embedding\", type=\"ARRAY<FLOAT64>\", is_null=True\n",
+  ")\n",
+  "\n",
+  "\n",
+  "def main():\n",
+  "    SpannerVectorStore.init_vector_store_table(\n",
+  "        instance_id=instance_id,\n",
+  "        database_id=database_id,\n",
+  "        table_name=table_name_ANN,\n",
+  "        vector_size=embedding_vector_size,\n",
+  "        id_column=\"row_id\",\n",
+  "        metadata_columns=[\n",
+  "            TableColumn(name=\"metadata\", type=\"JSON\", is_null=True),\n",
+  "            TableColumn(name=\"title\", type=\"STRING(MAX)\", is_null=False),\n",
+  "        ],\n",
+  "        embedding_column=title_embedding_column,\n",
+  "        secondary_indexes=[\n",
+  "            VectorSearchIndex(\n",
+  "                index_name=vector_index_name,\n",
+  "                columns=[title_embedding_column.name],\n",
+  "                nullable_column=True,\n",
+  "                num_branches=1000,\n",
+  "                tree_depth=3,\n",
+  "                distance_type=DistanceStrategy.COSINE,\n",
+  "                num_leaves=100000,\n",
+  "            ),\n",
+  "        ],\n",
+  "    )\n",
+  "\n",
+  "    # 0. Create the handle to the vector store.\n",
+  "    db = SpannerVectorStore(\n",
+  "        instance_id=instance_id,\n",
+  "        database_id=database_id,\n",
+  "        table_name=table_name_ANN,\n",
+  "        id_column=\"row_id\",\n",
+  "        ignore_metadata_columns=[],\n",
+  "        embedding_service=embeddings_service,\n",
+  "        embedding_column=title_embedding_column,\n",
+  "        metadata_json_column=\"metadata\",\n",
+  "        vector_index_name=vector_index_name,\n",
+  "        query_parameters=QueryParameters(\n",
+  "            algorithm=QueryParameters.NearestNeighborsAlgorithm.APPROXIMATE_NEAREST_NEIGHBOR,\n",
+  "            distance_strategy=DistanceStrategy.COSINE,\n",
+  "        ),\n",
+  "    )\n",
+  "\n",
+  "    # 1. Add the documents, loaded in from the HackerNews post.\n",
+  "    loader = HNLoader(\"https://news.ycombinator.com/item?id=42797260\")\n",
+  "    inserted_docs = loader.load()\n",
+  "    docs = inserted_docs.copy()\n",
+  "    ids = [str(uuid.uuid4()) for _ in range(len(docs))]\n",
+  "    db.add_documents(documents=docs, ids=ids)\n",
+  "    print(\"n_docs\", len(docs))\n",
+  "\n",
+  "    # 2. Use similarity_search.\n",
+  "    docs = db.similarity_search(\n",
+  "        \"Open source software\",\n",
+  "        k=2,\n",
+  "    )\n",
+  "    print(\"by similarity_search\", docs)\n",
+  "\n",
+  "    # 3. Search by vector similarity.\n",
+  "    embeds = embeddings_service.embed_query(\n",
+  "        \"Open source software\",\n",
+  "    )\n",
+  "    docs = db.similarity_search_by_vector(\n",
+  "        embeds,\n",
+  "        k=3,\n",
+  "    )\n",
+  "    print(\"by direct vector_search\", docs)\n",
+  "\n",
+  "    # 4. Search by max_marginal_relevance_search_with_score_by_vector.\n",
+  "    docs = db.max_marginal_relevance_search_with_score_by_vector(\n",
+  "        embeds,\n",
+  "        k=3,\n",
+  "    )\n",
+  "    print(\"by max_marginal_relevance_search\", docs)\n",
+  "\n",
+  "    # 5. Delete the inserted docs.\n",
+  "    deleted = db.delete(documents=inserted_docs)\n",
+  "    print(\"deleted\", deleted)\n",
+  "\n",
+  "\n",
+  "if __name__ == \"__main__\":\n",
+  "    main()"
+ ]
+},
 {
  "cell_type": "markdown",
  "metadata": {
@@ -1058,7 +1188,8 @@
   "toc_visible": true
  },
  "kernelspec": {
-  "display_name": "Python 3",
+  "display_name": "Python 3 (ipykernel)",
+  "language": "python",
   "name": "python3"
  },
  "language_info": {
@@ -1071,9 +1202,9 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.11.7"
+ "version": "3.10.11"
  }
 },
 "nbformat": 4,
-"nbformat_minor": 0
+"nbformat_minor": 4
}
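
For intuition about the VectorSearchIndex arguments in the new cell (num_branches, tree_depth, distance_type, num_leaves): they map onto Spanner's vector index DDL. Below is a sketch of the statement the helper is expected to issue, inferred from the shape of the API rather than taken from this commit; verify against the Spanner ANN documentation before relying on it:

# Assumed shape of the DDL behind the VectorSearchIndex above; the WHERE
# clause reflects nullable_column=True (rows with NULL embeddings are excluded).
ddl = """
CREATE VECTOR INDEX titles_index
    ON hnn_articles(title_embedding)
    WHERE title_embedding IS NOT NULL
    OPTIONS (distance_type = 'COSINE', tree_depth = 3,
             num_branches = 1000, num_leaves = 100000)
"""
print(ddl)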
