Try BigQuery DataFrames
Use this quickstart to perform the following analysis and machine learning (ML) tasks by using the BigQuery DataFrames API in a BigQuery notebook:
- Create a DataFrame over the `bigquery-public-data.ml_datasets.penguins` public dataset.
- Calculate the average body mass of a penguin.
- Create a linear regression model.
- Create a DataFrame over a subset of the penguin data to use as training data.
- Clean up the training data.
- Set the model parameters.
- Fit the model.
- Score the model.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project. Roles required to select or create a project:
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
-  Verify that billing is enabled for your Google Cloud project. 
- Verify that the BigQuery API is enabled. If you created a new project, the BigQuery API is automatically enabled.
Required permissions
To create and run notebooks, you need the following Identity and Access Management (IAM) roles:
- BigQuery User (roles/bigquery.user)
- Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
- Code Creator (roles/dataform.codeCreator)
Create a notebook
Follow the instructions in Create a notebook from the BigQuery editor to create a new notebook.
Try BigQuery DataFrames
Try BigQuery DataFrames by following these steps:
- Create a new code cell in the notebook.
- Add the following code to the code cell:

```python
import bigframes.pandas as bpd

# Set BigQuery DataFrames options
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = your_gcp_project_id

# Use "partial" ordering mode to generate more efficient queries, but the
# order of the rows in DataFrames may not be deterministic if you have not
# explicitly sorted it. Some operations that depend on the order, such as
# head(), will not function until you explicitly order the DataFrame. Set the
# ordering mode to "strict" (default) for more pandas compatibility.
bpd.options.bigquery.ordering_mode = "partial"

# Create a DataFrame from a BigQuery table
query_or_table = "bigquery-public-data.ml_datasets.penguins"
df = bpd.read_gbq(query_or_table)

# Efficiently preview the results using the .peek() method.
df.peek()
```
- Modify the `bpd.options.bigquery.project = your_gcp_project_id` line to specify your Google Cloud project ID. For example, `bpd.options.bigquery.project = "myProjectID"`.
- Run the code cell. The code returns a `DataFrame` object with data about penguins.
- Create a new code cell in the notebook and add the following code:

```python
# Use the DataFrame just as you would a pandas DataFrame, but calculations
# happen in the BigQuery query engine instead of the local system.
average_body_mass = df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")
```
- Run the code cell. The code calculates the average body mass of the penguins and prints the result. For more pandas-style operations on the DataFrame, see the examples after these steps.
- Create a new code cell in the notebook and add the following code:

```python
# Create the Linear Regression model
from bigframes.ml.linear_model import LinearRegression

# Filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# Pick feature columns and label column
X = training_data[
    [
        "island",
        "culmen_length_mm",
        "culmen_depth_mm",
        "flipper_length_mm",
        "sex",
    ]
]
y = training_data[["body_mass_g"]]

model = LinearRegression(fit_intercept=False)
model.fit(X, y)
model.score(X, y)
```
- Run the code cell. The code returns the model's evaluation metrics. To get predictions from the fitted model, see the example after these steps.
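As a follow-on experiment, you can ask the fitted model for predictions. The following is a minimal sketch that reuses the `model` and `X` variables from the previous cell; the name of the generated prediction column is an assumption based on the `body_mass_g` label.

```python
# Get predictions from the fitted model. Like fit() and score(), the
# prediction runs in the BigQuery query engine and returns a BigQuery
# DataFrames DataFrame. Reuses `model` and `X` from the previous cell.
predictions = model.predict(X)

# Preview a few rows. The output includes a generated prediction column
# (assumed here to be named after the body_mass_g label, for example
# predicted_body_mass_g) alongside the input features.
predictions.peek()
```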
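You can also keep exploring the data with familiar pandas-style operations that run in BigQuery. The following is a minimal sketch, assuming the `df` DataFrame created in the earlier steps; `to_pandas()` downloads the aggregated result into local memory.

```python
# Group by species and compute the average body mass per species.
# The aggregation runs in BigQuery; only the small result is downloaded.
average_mass_by_species = (
    df[["species", "body_mass_g"]].groupby("species").mean()
)

# Download the aggregated result to local memory and print it.
print(average_mass_by_species.to_pandas())
```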
Clean up
The easiest way to avoid further billing is to delete the project that you created for this quickstart.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Continue learning how to use BigQuery DataFrames.
- Learn how to visualize graphs using BigQuery DataFrames.
- Learn how to use a BigQuery DataFrames notebook.