|
194 | 194 | { |
195 | 195 | "cell_type": "markdown", |
196 | 196 | "metadata": { |
197 | | - "collapsed": false |
| 197 | + "collapsed": false, |
| 198 | + "jupyter": { |
| 199 | + "outputs_hidden": false |
| 200 | + } |
198 | 201 | }, |
199 | 202 | "source": [ |
200 | 203 | "2. Enable the APIs for Spanner and Vertex AI within your project." |
|
512 | 515 | "id": "z9uLV3bs4noo" |
513 | 516 | }, |
514 | 517 | "source": [ |
515 | | - "# **Use case 2: Spanner as Vector Store**" |
516 | | - ] |
517 | | - }, |
518 | | - { |
519 | | - "cell_type": "markdown", |
520 | | - "metadata": { |
521 | | - "id": "duVsSeMcgEWl" |
522 | | - }, |
523 | | - "source": [ |
524 | | - "Now, let's learn how to put all of the documents we just loaded into a vector store so that we can use vector search to answer our user's questions!" |
| 518 | + "# **Use case 2: Spanner as Vector Store**\n", |
| 519 | + "Google Cloud Spanner supports 2 different algorithms that we have added vector store capabilities to:\n", |
| 520 | + "* K-Nearest Neighbors (KNN)\n", |
| 521 | + "* Approximate Nearest Neighbors (ANN)\n", |
| 522 | + "\n", |
| 523 | + "When your dataset is small, the K-Nearest Neighbors (KNN) algorithm works well, but with large datasets the latency and cost of a KNN search grow, so you will need to use Approximate Nearest Neighbors (ANN) instead. We will show how to use both!" |
525 | 524 | ] |
526 | 525 | }, |
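Conceptually, KNN is an exact, brute-force search: it computes the distance from the query vector to every stored vector and keeps the k closest, which is why it is fine for small datasets but scales poorly. A minimal pure-Python sketch of that idea (illustrative only, not using Spanner or this library):

```python
import math


def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_search(query, vectors, k):
    # Exact KNN: score every stored vector, then take the k closest.
    scored = sorted(vectors.items(), key=lambda kv: cosine_distance(query, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]


vectors = {
    "doc1": [1.0, 0.0],
    "doc2": [0.9, 0.1],
    "doc3": [0.0, 1.0],
}
print(knn_search([1.0, 0.05], vectors, k=2))  # ['doc1', 'doc2']
```

Every query touches every row, so cost is linear in the dataset size; ANN indexes avoid that full scan at the price of approximate results.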
527 | 526 | { |
|
530 | 529 | "id": "jfH8oQJ945Ko" |
531 | 530 | }, |
532 | 531 | "source": [ |
533 | | - "### Create Your Vector Store table\n", |
| 532 | + "### K-Nearest Neighbors (KNN) based vector store\n", |
534 | 533 | "\n", |
535 | | - "Based on the documents that we loaded before, we want to create a table with a vector column as our vector store. We will start it by intializing a vector table by calling the `init_vectorstore_table` function from our `SpannerVectorStore`. As you can see we list all of the columns for our metadata.\n" |
| 534 | + "KNN performs an exact search, which is ideal when the dataset is small.\n", |
| 535 | + "Based on the documents that we loaded before, we want to create a table with a vector column as our vector store using the KNN algorithm. We will start by initializing a vector table with the `init_vector_store_table` function on our `SpannerVectorStore`. Note that we list all of the columns for our metadata.\n" |
536 | 536 | ] |
537 | 537 | }, |
538 | 538 | { |
|
695 | 695 | " )" |
696 | 696 | ] |
697 | 697 | }, |
| 698 | + { |
| 699 | + "cell_type": "markdown", |
| 700 | + "metadata": { |
| 701 | + "id": "jfH8oQJ945Ko" |
| 702 | + }, |
| 703 | + "source": [ |
| 704 | + "### Approximate Nearest Neighbors (ANN) based vector store\n", |
| 705 | + "\n", |
| 706 | + "For this task, we will pull in documents from a popular HackerNews post, insert them into our ANN-based vector store, and then use ANN to find the most relevant comments.\n\nTo create vector embeddings, we will use Google's Vertex AI `textembedding-gecko@003` model; each query is also vectorized with the same embedding service before the search is performed.\n\n", |
| 707 | + "Cloud Spanner supports 3 different distance functions that a vector search index can be created with and queried by:\n", |
| 708 | + "* APPROX_COSINE\n", |
| 709 | + "* APPROX_DOT_PRODUCT\n", |
| 710 | + "* APPROX_EUCLIDEAN_DISTANCE\n", |
| 711 | + "\nIn this example, we will use `APPROX_COSINE`.\n", |
| 712 | + "Our steps are:\n* Creating the text embedding service\n* Initializing the ANN vector store\n* Loading data from a popular HackerNews post\n* Adding the documents to the vector store\n* Searching with `similarity_search`, `similarity_search_by_vector`, and `max_marginal_relevance_search_with_score_by_vector`\n* Deleting the inserted documents\n\nAll of the above uses the langchain `VectorStore` interface.\n" |
| 713 | + ] |
| 714 | + }, |
| 715 | + { |
| 716 | + "cell_type": "code", |
| 717 | + "execution_count": null, |
| 718 | + "metadata": {}, |
| 719 | + "outputs": [], |
| 720 | + "source": [ |
| 721 | + "import os\n", |
| 722 | + "import uuid\n", |
| 723 | + "\n", |
| 724 | + "from langchain_community.document_loaders import HNLoader\n", |
| 725 | + "from langchain_google_vertexai.embeddings import VertexAIEmbeddings\n", |
| 726 | + "from langchain_google_spanner.vector_store import (\n", |
| 727 | + " DistanceStrategy,\n", |
| 728 | + " QueryParameters,\n", |
| 729 | + " SpannerVectorStore,\n", |
| 730 | + " TableColumn,\n", |
| 731 | + " VectorSearchIndex,\n", |
| 732 | + ")\n", |
| 733 | + "\n", |
| 734 | + "embeddings_service = VertexAIEmbeddings(\n", |
| 735 | + " model_name=\"textembedding-gecko@003\", project=project_id\n", |
| 736 | + ")\n", |
| 737 | + "table_name_ANN = \"hnn_articles\"\n", |
| 738 | + "embedding_vector_size = 768\n", |
| 739 | + "vector_index_name = \"titles_index\"\n", |
| 740 | + "title_embedding_column = TableColumn(\n", |
| 741 | + " name=\"title_embedding\", type=\"ARRAY<FLOAT64>\", is_null=True\n", |
| 742 | + ")\n", |
| 743 | + "\n", |
| 744 | + "\n", |
| 745 | + "def main():\n", |
| 746 | + " SpannerVectorStore.init_vector_store_table(\n", |
| 747 | + " instance_id=instance_id,\n", |
| 748 | + " database_id=database_id,\n", |
| 749 | + " table_name=table_name_ANN,\n", |
| 750 | + " vector_size=embedding_vector_size,\n", |
| 751 | + " id_column=\"row_id\",\n", |
| 752 | + " metadata_columns=[\n", |
| 753 | + " TableColumn(name=\"metadata\", type=\"JSON\", is_null=True),\n", |
| 754 | + " TableColumn(name=\"title\", type=\"STRING(MAX)\", is_null=False),\n", |
| 755 | + " ],\n", |
| 756 | + " embedding_column=title_embedding_column,\n", |
| 757 | + " secondary_indexes=[\n", |
| 758 | + " VectorSearchIndex(\n", |
| 759 | + " index_name=vector_index_name,\n", |
| 760 | + " columns=[title_embedding_column.name],\n", |
| 761 | + " nullable_column=True,\n", |
| 762 | + " num_branches=1000,\n", |
| 763 | + " tree_depth=3,\n", |
| 764 | + " distance_type=DistanceStrategy.COSINE,\n", |
| 765 | + " num_leaves=100000,\n", |
| 766 | + " ),\n", |
| 767 | + " ],\n", |
| 768 | + " )\n", |
| 769 | + "\n", |
| 770 | + " # 0. Create the handle to the vector store.\n", |
| 771 | + " db = SpannerVectorStore(\n", |
| 772 | + " instance_id=instance_id,\n", |
| 773 | + "        database_id=database_id,\n", |
| 774 | + " table_name=table_name_ANN,\n", |
| 775 | + " id_column=\"row_id\",\n", |
| 776 | + " ignore_metadata_columns=[],\n", |
| 777 | + " embedding_service=embeddings_service,\n", |
| 778 | + " embedding_column=title_embedding_column,\n", |
| 779 | + " metadata_json_column=\"metadata\",\n", |
| 780 | + " vector_index_name=vector_index_name,\n", |
| 781 | + " query_parameters=QueryParameters(\n", |
| 782 | + " algorithm=QueryParameters.NearestNeighborsAlgorithm.APPROXIMATE_NEAREST_NEIGHBOR,\n", |
| 783 | + " distance_strategy=DistanceStrategy.COSINE,\n", |
| 784 | + " ),\n", |
| 785 | + " )\n", |
| 786 | + "\n", |
| 787 | + " # 1. Add the documents, loaded in from the HackerNews post.\n", |
| 788 | + " loader = HNLoader(\"https://news.ycombinator.com/item?id=42797260\")\n", |
| 789 | + " inserted_docs = loader.load()\n", |
| 790 | + " docs = inserted_docs.copy()\n", |
| 791 | + " ids = [str(uuid.uuid4()) for _ in range(len(docs))]\n", |
| 792 | + " db.add_documents(documents=docs, ids=ids)\n", |
| 793 | + " print(\"n_docs\", len(docs))\n", |
| 794 | + "\n", |
| 795 | + " # 2. Use similarity_search.\n", |
| 796 | + " docs = db.similarity_search(\n", |
| 797 | + " \"Open source software\",\n", |
| 798 | + " k=2,\n", |
| 799 | + " )\n", |
| 800 | + " print(\"by similarity_search\", docs)\n", |
| 801 | + "\n", |
| 802 | + " # 3. Search by vector similarity.\n", |
| 803 | + " embeds = embeddings_service.embed_query(\n", |
| 804 | + " \"Open source software\",\n", |
| 805 | + " )\n", |
| 806 | + " docs = db.similarity_search_by_vector(\n", |
| 807 | + " embeds,\n", |
| 808 | + " k=3,\n", |
| 809 | + " )\n", |
| 810 | + " print(\"by direct vector_search\", docs)\n", |
| 811 | + "\n", |
| 812 | + " # 4. Search by max_marginal_relevance_search_with_score_by_vector.\n", |
| 813 | + " docs = db.max_marginal_relevance_search_with_score_by_vector(\n", |
| 814 | + " embeds,\n", |
| 815 | + " k=3,\n", |
| 816 | + " )\n", |
| 817 | + " print(\"by max_marginal_relevance_search\", docs)\n", |
| 818 | + "\n", |
| 819 | + " # 5. Delete the inserted docs.\n", |
| 820 | + " deleted = db.delete(documents=inserted_docs)\n", |
| 821 | + " print(\"deleted\", deleted)\n", |
| 822 | + "\n", |
| 823 | + "\n", |
| 824 | + "if __name__ == \"__main__\":\n", |
| 825 | + " main()" |
| 826 | + ] |
| 827 | + }, |
698 | 828 | { |
699 | 829 | "cell_type": "markdown", |
700 | 830 | "metadata": { |
|
1058 | 1188 | "toc_visible": true |
1059 | 1189 | }, |
1060 | 1190 | "kernelspec": { |
1061 | | - "display_name": "Python 3", |
| 1191 | + "display_name": "Python 3 (ipykernel)", |
| 1192 | + "language": "python", |
1062 | 1193 | "name": "python3" |
1063 | 1194 | }, |
1064 | 1195 | "language_info": { |
|
1071 | 1202 | "name": "python", |
1072 | 1203 | "nbconvert_exporter": "python", |
1073 | 1204 | "pygments_lexer": "ipython3", |
1074 | | - "version": "3.11.7" |
| 1205 | + "version": "3.10.11" |
1075 | 1206 | } |
1076 | 1207 | }, |
1077 | 1208 | "nbformat": 4, |
1078 | | - "nbformat_minor": 0 |
| 1209 | + "nbformat_minor": 4 |
1079 | 1210 | } |