Ranjan Dailata
Google Search with Structured Data Extraction

Introduction

In this blog post, you will learn how to perform a Google search and extract structured data from the results using the Google Gemini Pro large language model.

Hands-on

  1. Head over to Google Colab.
  2. Sign in to Google Cloud and note your project ID and location.
  3. Use the code below to initialize Vertex AI.
```python
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth
    auth.authenticate_user()

PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

# Initialize Vertex AI
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)
```

We are going to use the open-source packages requests, html2text, and beautifulsoup4 for web scraping.

```shell
!pip install requests html2text beautifulsoup4
```
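To see the idea behind the HTML-to-text step, here is a minimal sketch using only the standard library's `html.parser` (the HTML snippet is made up for illustration; the actual post uses html2text, which additionally produces Markdown-style output):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Rough illustration of what html2text does: strip tags, keep the text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


parser = TextExtractor()
parser.feed("<h1>Results</h1><p>Seafood near <b>Googleplex</b></p>")
print(" ".join(parser.parts))
```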

Let's define the search query.

```python
search_query = """Sea food near Googleplex 1600 Amphitheatre Parkway Mountain View, CA 94043 United States"""
```
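Note that the query contains spaces and commas, which should be percent-encoded before being placed in a URL. A quick sketch with the standard library's `urllib.parse.quote_plus`:

```python
from urllib.parse import quote_plus

search_query = """Sea food near Googleplex 1600 Amphitheatre Parkway Mountain View, CA 94043 United States"""

# Spaces become '+', commas and other reserved characters are percent-encoded
encoded = quote_plus(search_query)
print(encoded)
```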

Here's the code for simple web scraping.

```python
import requests
from bs4 import BeautifulSoup
import html2text


def scrape_website(url):
    try:
        # Send an HTTP request to the URL
        response = requests.get(url)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            return html2text.html2text(response.text)
        else:
            print(f"Failed to retrieve content. Status code: {response.status_code}")
    except Exception as e:
        print(f"An error occurred: {e}")
```

For demonstration purposes, let's do a programmatic Google search and extract the results.

```python
from urllib.parse import quote_plus

# URL-encode the multi-line query so spaces and newlines don't break the URL
url = f"https://www.google.com/search?q={quote_plus(search_query)}"
print(url)
google_search_content = scrape_website(url)
```
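Google may throttle or intermittently reject scraping requests, so in practice the fetch can fail on the first try. Here is a small retry helper, a sketch that is not part of the original post, which could wrap the `scrape_website` call:

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=0.1):
    """Call `fetch` up to `attempts` times, sleeping between tries;
    re-raise the last error if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as e:
            last_error = e
            time.sleep(delay)
    raise last_error
```

For example, `fetch_with_retry(lambda: scrape_website(url))` would retry a transient failure up to three times before giving up.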

Now let's focus on how to get a structured response that matches our own schema. Here's the schema definition:

```python
schema = """
{
  "places": [
    {
      "name": "",
      "rating": <<float>>,
      "price": "",
      "category": "",
      "address": "",
      "city": "",
      "state": "",
      "zip": "",
      "country": "",
      "phone": "",
      "website": ""
    }
  ]
}
"""
```
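Once the model returns JSON conforming to this schema, the standard library's `json` module can parse it into Python objects. A minimal sketch, using a hypothetical sample response (the restaurant data below is made up):

```python
import json

# Hypothetical sample of a schema-conforming model response
raw = '{"places": [{"name": "Sea Harbour", "rating": 4.5, "price": "$$", "city": "Mountain View"}]}'

data = json.loads(raw)
for place in data["places"]:
    print(place["name"], place["rating"])
```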

Time to deep dive into Google Gemini Pro usage. Here's the code snippet that queries the Gemini Pro model to get the highly structured response we expect. (Note that the schema must actually be included in the prompt for the model to format against it.)

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel


def google_search_formatted_response(content, max_output_tokens=7815):
    model = GenerativeModel("gemini-pro")
    schema = """
    {
      "places": [
        {
          "name": "",
          "rating": <<float>>,
          "price": "",
          "category": "",
          "address": "",
          "city": "",
          "state": "",
          "zip": "",
          "country": "",
          "phone": "",
          "website": ""
        }
      ]
    }
    """
    responses = model.generate_content(
        f"""Format the below content to the following JSON schema.

Schema: {schema}

Here's the content: {content}
""",
        generation_config={
            "max_output_tokens": max_output_tokens,
            "temperature": 0,
            "top_p": 1,
        },
        stream=True,
    )

    formatted_response = []
    for response in responses:
        text = response.candidates[0].content.parts[0].text
        print(text)
        formatted_response.append(text)
    return formatted_response


formatted_response = google_search_formatted_response(google_search_content)
```
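Because the model is called with `stream=True`, the result arrives as a list of text fragments that may split the JSON mid-token. Before parsing, the chunks must be joined back into one string. A minimal sketch with made-up fragments standing in for the streamed Gemini output:

```python
import json

# Made-up streamed fragments standing in for the chunks Gemini returns
chunks = ['{"places": [{"na', 'me": "Sea Harbour", ', '"rating": 4.5}]}']

# Join the fragments, then parse the complete JSON document
full_text = "".join(chunks)
data = json.loads(full_text)
print(data["places"][0]["name"])
```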

(Image: the structured Google response produced by the code above)

Top comments (2)

hil:

How long did it take for you?

Ranjan Dailata:

2 to 3 seconds. I believe that with local LLMs, things could be significantly faster.