|
194 | 194 | { |
195 | 195 | "cell_type": "markdown", |
196 | 196 | "metadata": { |
197 | | - "collapsed": false |
| 197 | + "collapsed": false, |
| 198 | + "jupyter": { |
| 199 | + "outputs_hidden": false |
| 200 | + } |
198 | 201 | }, |
199 | 202 | "source": [ |
200 | 203 | "2. Enable the APIs for Spanner and Vertex AI within your project." |
|
512 | 515 | "id": "z9uLV3bs4noo" |
513 | 516 | }, |
514 | 517 | "source": [ |
515 | | - "# **Use case 2: Spanner as Vector Store**" |
516 | | - ] |
517 | | - }, |
518 | | - { |
519 | | - "cell_type": "markdown", |
520 | | - "metadata": { |
521 | | - "id": "duVsSeMcgEWl" |
522 | | - }, |
523 | | - "source": [ |
524 | | - "Now, let's learn how to put all of the documents we just loaded into a vector store so that we can use vector search to answer our user's questions!" |
| 518 | + "# **Use case 2: Spanner as Vector Store**\n", |
| 519 | + "Google Cloud Spanner supports 2 different algorithms that we have added vector store capabilities to:\n", |
| 520 | + "* K-Nearest Neighbors (KNN)\n", |
| 521 | + "* Approximate Nearest Neighbors (ANN)\n", |
| 522 | + "\n", |
| 523 | + "When your dataset is small, the K-Nearest Neighbors (KNN) algorithm works well, but with large datasets the latency and cost of a KNN search grow, so you will need to use Approximate Nearest Neighbors (ANN) instead. We will show how to use both!" |
525 | 524 | ] |
526 | 525 | }, |
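Conceptually, KNN is an exact, brute-force search: it computes the distance from the query vector to every stored vector and keeps the k closest, which is why it is fine for small datasets but scales poorly. A minimal pure-Python sketch of that idea (illustrative only, not using Spanner or this library):

```python
import math


def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_search(query, vectors, k):
    # Exact KNN: score every stored vector, then take the k closest.
    scored = sorted(vectors.items(), key=lambda kv: cosine_distance(query, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]


vectors = {
    "doc1": [1.0, 0.0],
    "doc2": [0.9, 0.1],
    "doc3": [0.0, 1.0],
}
print(knn_search([1.0, 0.05], vectors, k=2))  # ['doc1', 'doc2']
```

Every query touches every row, so cost is linear in the dataset size; ANN indexes avoid that full scan at the price of approximate results.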
527 | 526 | { |
|
530 | 529 | "id": "jfH8oQJ945Ko" |
531 | 530 | }, |
532 | 531 | "source": [ |
533 | | - "### Create Your Vector Store table\n", |
| 532 | + "### K-Nearest Neighbors (KNN) based vector store\n", |
534 | 533 | "\n", |
535 | | - "Based on the documents that we loaded before, we want to create a table with a vector column as our vector store. We will start it by intializing a vector table by calling the `init_vectorstore_table` function from our `SpannerVectorStore`. As you can see we list all of the columns for our metadata.\n" |
| 534 | + "KNN performs an exact search, which is ideal when the dataset is small.\n", |
| 535 | + "Based on the documents that we loaded before, we want to create a table with a vector column as our vector store using the KNN algorithm. We will start by initializing a vector table with the `init_vector_store_table` function on our `SpannerVectorStore`. Note that we list all of the columns for our metadata.\n" |
536 | 536 | ] |
537 | 537 | }, |
538 | 538 | { |
|
695 | 695 | " )" |
696 | 696 | ] |
697 | 697 | }, |
| 698 | + { |
| 699 | + "cell_type": "markdown", |
| 700 | + "metadata": { |
| 701 | + "id": "jfH8oQJ945Ko" |
| 702 | + }, |
| 703 | + "source": [ |
| 704 | + "### Approximate Nearest Neighbors (ANN) based vector store\n", |
| 705 | + "\n", |
| 706 | + "For this task, we will pull in documents from a popular HackerNews post, insert them into our ANN-based vector store, and then use ANN to find the most relevant comments.\n\nTo create vector embeddings, we will use Google's Vertex AI `textembedding-gecko@003` model; each query is also vectorized with the same embedding service before the search is performed.\n\n", |
| 707 | + "Cloud Spanner supports 3 different distance functions that a vector search index can be created with and queried by:\n", |
| 708 | + "* APPROX_COSINE\n", |
| 709 | + "* APPROX_DOT_PRODUCT\n", |
| 710 | + "* APPROX_EUCLIDEAN_DISTANCE\n", |
| 711 | + "\nIn this example, we will use `APPROX_COSINE`.\n", |
| 712 | + "Our steps are:\n* Creating the text embedding service\n* Initializing the ANN vector store\n* Loading data from a popular HackerNews post\n* Adding the documents to the vector store\n* Searching with `similarity_search`, `similarity_search_by_vector`, and `max_marginal_relevance_search_with_score_by_vector`\n* Deleting the inserted documents\n\nAll of the above uses the langchain `VectorStore` interface.\n" |
| 713 | + ] |
| 714 | + }, |
| 715 | + { |
| 716 | + "cell_type": "code", |
| 717 | + "execution_count": null, |
| 718 | + "metadata": {}, |
| 719 | + "outputs": [], |
| 720 | + "source": [ |
| 721 | + "import os\n", |
| 722 | + "import uuid\n", |
| 723 | + "\n", |
| 724 | + "from langchain_community.document_loaders import HNLoader\n", |
| 725 | + "from langchain_google_vertexai.embeddings import VertexAIEmbeddings\n", |
| 726 | + "from langchain_google_spanner.vector_store import (\n", |
| 727 | + " DistanceStrategy,\n", |
| 728 | + " QueryParameters,\n", |
| 729 | + " SpannerVectorStore,\n", |
| 730 | + " TableColumn,\n", |
| 731 | + " VectorSearchIndex,\n", |
| 732 | + ")\n", |
| 733 | + "\n", |
| 734 | + "embeddings_service = VertexAIEmbeddings(\n", |
| 735 | + " model_name=\"textembedding-gecko@003\", project=project_id\n", |
| 736 | + ")\n", |
| 737 | + "table_name_ANN = \"hnn_articles\"\n", |
| 738 | + "embedding_vector_size = 768\n", |
| 739 | + "vector_index_name = \"titles_index\"\n", |
| 740 | + "title_embedding_column = TableColumn(\n", |
| 741 | + " name=\"title_embedding\", type=\"ARRAY<FLOAT64>\", is_null=True\n", |
| 742 | + ")\n", |
| 743 | + "\n", |
| 744 | + "\n", |
| 745 | + "def main():\n", |
| 746 | + " SpannerVectorStore.init_vector_store_table(\n", |
| 747 | + " instance_id=instance_id,\n", |
| 748 | + " database_id=database_id,\n", |
| 749 | + " table_name=table_name_ANN,\n", |
| 750 | + " vector_size=embedding_vector_size,\n", |
| 751 | + " id_column=\"row_id\",\n", |
| 752 | + " metadata_columns=[\n", |
| 753 | + " TableColumn(name=\"metadata\", type=\"JSON\", is_null=True),\n", |
| 754 | + " TableColumn(name=\"title\", type=\"STRING(MAX)\", is_null=False),\n", |
| 755 | + " ],\n", |
| 756 | + " embedding_column=title_embedding_column,\n", |
| 757 | + " secondary_indexes=[\n", |
| 758 | + " VectorSearchIndex(\n", |
| 759 | + " index_name=vector_index_name,\n", |
| 760 | + " columns=[title_embedding_column.name],\n", |
| 761 | + " nullable_column=True,\n", |
| 762 | + " num_branches=1000,\n", |
| 763 | + " tree_depth=3,\n", |
| 764 | + " distance_type=DistanceStrategy.COSINE,\n", |
| 765 | + " num_leaves=100000,\n", |
| 766 | + " ),\n", |
| 767 | + " ],\n", |
| 768 | + " )\n", |
| 769 | + "\n", |
| 770 | + " # 0. Create the handle to the vector store.\n", |
| 771 | + " db = SpannerVectorStore(\n", |
| 772 | + " instance_id=instance_id,\n", |
| 773 | + "        database_id=database_id,\n", |
| 774 | + " table_name=table_name_ANN,\n", |
| 775 | + " id_column=\"row_id\",\n", |
| 776 | + " ignore_metadata_columns=[],\n", |
| 777 | + " embedding_service=embeddings_service,\n", |
| 778 | + " embedding_column=title_embedding_column,\n", |
| 779 | + " metadata_json_column=\"metadata\",\n", |
| 780 | + " vector_index_name=vector_index_name,\n", |
| 781 | + " query_parameters=QueryParameters(\n", |
| 782 | + " algorithm=QueryParameters.NearestNeighborsAlgorithm.APPROXIMATE_NEAREST_NEIGHBOR,\n", |
| 783 | + " distance_strategy=DistanceStrategy.COSINE,\n", |
| 784 | + " ),\n", |
| 785 | + " )\n", |
| 786 | + "\n", |
| 787 | + " # 1. Add the documents, loaded in from the HackerNews post.\n", |
| 788 | + " loader = HNLoader(\"https://news.ycombinator.com/item?id=42797260\")\n", |
| 789 | + " inserted_docs = loader.load()\n", |
| 790 | + " docs = inserted_docs.copy()\n", |
| 791 | + " ids = [str(uuid.uuid4()) for _ in range(len(docs))]\n", |
| 792 | + " db.add_documents(documents=docs, ids=ids)\n", |
| 793 | + " print(\"n_docs\", len(docs))\n", |
| 794 | + "\n", |
| 795 | + " # 2. Use similarity_search.\n", |
| 796 | + " docs = db.similarity_search(\n", |
| 797 | + " \"Open source software\",\n", |
| 798 | + " k=2,\n", |
| 799 | + " )\n", |
| 800 | + " print(\"by similarity_search\", docs)\n", |
| 801 | + "\n", |
| 802 | + " # 3. Search by vector similarity.\n", |
| 803 | + " embeds = embeddings_service.embed_query(\n", |
| 804 | + " \"Open source software\",\n", |
| 805 | + " )\n", |
| 806 | + " docs = db.similarity_search_by_vector(\n", |
| 807 | + " embeds,\n", |
| 808 | + " k=3,\n", |
| 809 | + " )\n", |
| 810 | + " print(\"by direct vector_search\", docs)\n", |
| 811 | + "\n", |
| 812 | + " # 4. Search by max_marginal_relevance_search_with_score_by_vector.\n", |
| 813 | + " docs = db.max_marginal_relevance_search_with_score_by_vector(\n", |
| 814 | + " embeds,\n", |
| 815 | + " k=3,\n", |
| 816 | + " )\n", |
| 817 | + " print(\"by max_marginal_relevance_search\", docs)\n", |
| 818 | + "\n", |
| 819 | + " # 5. Delete the inserted docs.\n", |
| 820 | + " deleted = db.delete(documents=inserted_docs)\n", |
| 821 | + " print(\"deleted\", deleted)\n", |
| 822 | + "\n", |
| 823 | + "\n", |
| 824 | + "if __name__ == \"__main__\":\n", |
| 825 | + " main()" |
| 826 | + ] |
| 827 | + }, |
698 | 828 | { |
699 | 829 | "cell_type": "markdown", |
700 | 830 | "metadata": { |
|
1058 | 1188 | "toc_visible": true |
1059 | 1189 | }, |
1060 | 1190 | "kernelspec": { |
1061 | | - "display_name": "Python 3", |
| 1191 | + "display_name": "Python 3 (ipykernel)", |
| 1192 | + "language": "python", |
1062 | 1193 | "name": "python3" |
1063 | 1194 | }, |
1064 | 1195 | "language_info": { |
|
1071 | 1202 | "name": "python", |
1072 | 1203 | "nbconvert_exporter": "python", |
1073 | 1204 | "pygments_lexer": "ipython3", |
1074 | | - "version": "3.11.7" |
| 1205 | + "version": "3.10.11" |
1075 | 1206 | } |
1076 | 1207 | }, |
1077 | 1208 | "nbformat": 4, |
1078 | | - "nbformat_minor": 0 |
| 1209 | + "nbformat_minor": 4 |
1079 | 1210 | } |