distinct - Delete duplicate rows from a BigQuery table

To delete duplicate rows from a BigQuery table, you need to use SQL queries to identify and remove duplicates. Here's a general approach to achieve this:

  1. Identify Duplicates: Use a query to identify duplicate rows based on specific columns.

  2. Create a New Table: Use the results of the query to create a new table that holds only unique rows.

  3. Replace the Original Table: Replace the original table with the new table that contains only unique rows.

Detailed Steps

1. Identify Duplicates

First, identify duplicates by writing a query that groups rows and counts occurrences. This will help you understand which rows are duplicated.

SELECT column1, column2, COUNT(*) AS cnt
FROM `your_project.your_dataset.your_table`
GROUP BY column1, column2
HAVING cnt > 1;

Replace column1 and column2 with the columns that define duplicates.
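
To try the grouping logic before pointing it at a real table, here is a minimal sketch that runs against inline sample rows. The sample CTE and its values are made up for illustration; only the GROUP BY ... HAVING pattern comes from the query above.

```sql
-- Hypothetical inline data: the pair ('a', 1) appears twice, ('b', 2) once.
WITH sample AS (
  SELECT 'a' AS column1, 1 AS column2 UNION ALL
  SELECT 'a', 1 UNION ALL
  SELECT 'b', 2
)
SELECT column1, column2, COUNT(*) AS cnt
FROM sample
GROUP BY column1, column2
HAVING COUNT(*) > 1;
-- Returns one row: column1 = 'a', column2 = 1, cnt = 2
```

Only groups that occur more than once are returned, so an empty result means no duplicates exist for the chosen columns.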

2. Create a New Table with Unique Rows

Create a new table with only the unique rows. This involves using a query to select distinct rows and then inserting those into a new table.

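CREATE OR REPLACE TABLE `your_project.your_dataset.your_table_unique` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column1) AS row_num
  FROM `your_project.your_dataset.your_table`
)
WHERE row_num = 1;

In this query:

  • ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column1) assigns a unique number to each row within each partition of duplicates. Because the ORDER BY column is also a partition column, which row gets number 1 is arbitrary; order by a timestamp or similar column if you care which copy survives.
  • WHERE row_num = 1 selects only the first row of each duplicate set.
  • SELECT * EXCEPT (row_num) keeps the helper column out of the new table.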
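
As an alternative, the same filter can be sketched with BigQuery's QUALIFY clause, which filters on a window function directly so that no helper column is produced. This is a sketch against the same placeholder table and columns, not a drop-in replacement for every case.

```sql
-- Keep one row per (column1, column2) group without a helper column.
SELECT *
FROM `your_project.your_dataset.your_table`
WHERE TRUE  -- added defensively: QUALIFY has historically required a WHERE, GROUP BY, or HAVING clause
QUALIFY ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column1) = 1;
```

Wrapping this SELECT in CREATE OR REPLACE TABLE ... AS would write the result to a new table in the same way as the query above.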

3. Replace the Original Table

After creating the table with unique rows, you can replace the original table with this new table.

Option A: Using BigQuery Console or UI

  1. Navigate to the BigQuery console.
  2. Go to the dataset containing the table.
  3. Delete the original table.
  4. Copy the new table to the original table's name, or rename it by running ALTER TABLE ... RENAME TO in the query editor.

Option B: Using SQL

-- Drop the original table
DROP TABLE `your_project.your_dataset.your_table`;

-- Rename the new table to the original table's name
ALTER TABLE `your_project.your_dataset.your_table_unique` RENAME TO `your_table`;

Considerations

  • Backup Your Data: Always back up your data before performing delete operations.
  • Test Queries: Test your queries on a smaller subset of your data to ensure they work as expected.
  • Permissions: Ensure you have the necessary permissions to create, delete, and replace tables in BigQuery.
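
The backup and testing advice above can be sketched as follows. The backup table name your_table_backup is a placeholder chosen for this example, and the check assumes the deduplicated table from step 2 is named your_table_unique. A plain copy like this does not carry over settings such as partitioning or clustering, so treat it as a safety net for the data rather than a full clone.

```sql
-- Keep a copy of the original rows before dropping or replacing anything.
CREATE TABLE `your_project.your_dataset.your_table_backup` AS
SELECT *
FROM `your_project.your_dataset.your_table`;

-- Sanity check on the deduplicated table: the two counts should be equal.
SELECT
  (SELECT COUNT(*)
   FROM `your_project.your_dataset.your_table_unique`) AS total_rows,
  (SELECT COUNT(*)
   FROM (
     SELECT DISTINCT column1, column2
     FROM `your_project.your_dataset.your_table_unique`
   )) AS distinct_keys;
```

If total_rows is larger than distinct_keys, some duplicate groups survived and the deduplication query should be revisited before the original table is replaced.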

By following these steps, you can effectively remove duplicate rows from a BigQuery table while preserving unique records.

Examples

  1. How to remove duplicate rows from a BigQuery table using a DELETE statement?

    • Description: Remove duplicates by using a subquery to identify and delete rows with duplicates based on specific columns.
    • Code:
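      DELETE FROM `project.dataset.table`
      WHERE id NOT IN (
        SELECT MIN(id)
        FROM `project.dataset.table`
        GROUP BY column1, column2
      );
      • Explanation: BigQuery has no built-in rowid pseudo-column, so this assumes the table has a unique, non-null key column, here called id. The statement deletes every row that is not the first occurrence (lowest id) within its column1, column2 group. Without such a key, DELETE cannot tell identical rows apart, and rebuilding the table as shown in the steps above is the usual route.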
  2. How to delete duplicate rows from a BigQuery table while keeping the latest record?

    • Description: Retain the most recent record by using ROW_NUMBER() to identify and delete older duplicates.
    • Code:
      DELETE FROM `project.dataset.table`
      WHERE id IN (
        SELECT id
        FROM (
          SELECT
            id,
            ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY timestamp_column DESC) AS rn
          FROM `project.dataset.table`
        )
        WHERE rn > 1
      );
      • Explanation: Keeps the latest row per group based on timestamp_column and deletes older duplicates. The ranking is written as a subquery because BigQuery does not accept a WITH clause placed before DELETE, and id stands in for a unique key column that the table is assumed to have, since there is no built-in rowid.
  3. How to find and remove duplicate rows based on multiple columns in BigQuery?

    • Description: Remove duplicates by grouping on multiple columns and keeping only one row per group.

    • Code:

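      CREATE OR REPLACE TABLE `project.dataset.new_table` AS
      SELECT column1, column2, MAX(timestamp_column) AS latest_timestamp
      FROM `project.dataset.table`
      GROUP BY column1, column2;

      DELETE FROM `project.dataset.table` t
      WHERE NOT EXISTS (
        SELECT 1
        FROM `project.dataset.new_table` n
        WHERE n.column1 = t.column1
          AND n.column2 = t.column2
          AND n.latest_timestamp = t.timestamp_column
      );
      • Explanation: Creates a helper table with one row per column1, column2 group holding the latest timestamp, then deletes every row from the original table that does not carry that timestamp. Rows tied on the latest timestamp are all kept, and rows with NULL in any of the compared columns never match and are deleted, so adjust the comparison if NULLs are possible.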
  4. How to use BigQuery Standard SQL to delete duplicate rows from a table?

    • Description: Utilize standard SQL to handle duplicate row removal in BigQuery.
    • Code:
      CREATE OR REPLACE TABLE `project.dataset.table` AS
      SELECT * EXCEPT (rn)
      FROM (
        SELECT
          *,
          ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY timestamp_column DESC) AS rn
        FROM `project.dataset.table`
      )
      WHERE rn = 1;
      • Explanation: Replaces the table in place with a deduplicated version that keeps the latest record per group. The ranking query sits inside the CREATE statement rather than in a WITH clause before it, and EXCEPT (rn) drops the helper column.
  5. How to delete duplicates while preserving the original table structure in BigQuery?

    • Description: Maintain the original table structure by creating a deduplicated version of the table and replacing it.

    • Code:

      CREATE OR REPLACE TABLE `project.dataset.new_table` AS
      SELECT DISTINCT column1, column2, column3
      FROM `project.dataset.table`;

      DROP TABLE `project.dataset.table`;
      ALTER TABLE `project.dataset.new_table` RENAME TO `table`;
      • Explanation: Creates a new table with unique rows and replaces the original table with it.
  6. How to delete duplicate rows based on a condition in BigQuery?

    • Description: Delete rows based on a condition applied to duplicate records.

    • Code:

      CREATE OR REPLACE TABLE `project.dataset.filtered_table` AS
      SELECT *
      FROM `project.dataset.table`
      WHERE column1 NOT IN (
        SELECT column1
        FROM `project.dataset.table`
        GROUP BY column1
        HAVING COUNT(*) > 1
      );

      DROP TABLE `project.dataset.table`;
      ALTER TABLE `project.dataset.filtered_table` RENAME TO `table`;
      • Explanation: Creates a table containing only rows whose column1 value occurs exactly once, then replaces the original table with it. Every copy of a duplicated value is removed, not all but one. If column1 can be NULL, NOT IN may filter out more rows than intended.
  7. How to delete duplicate rows while keeping the first occurrence in BigQuery?

    • Description: Use window functions to identify and delete all but the first occurrence of duplicates.

    • Code:

      CREATE OR REPLACE TABLE `project.dataset.deduped_table` AS
      SELECT * EXCEPT (rn)
      FROM (
        SELECT
          *,
          ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY timestamp_column ASC) AS rn
        FROM `project.dataset.table`
      )
      WHERE rn = 1;

      DROP TABLE `project.dataset.table`;
      ALTER TABLE `project.dataset.deduped_table` RENAME TO `table`;
      • Explanation: Creates a deduplicated table keeping the first occurrence of each duplicate group.
  8. How to remove duplicates by comparing against another BigQuery table?

    • Description: Remove rows from one table based on duplicates found in another table.

    • Code:

      CREATE OR REPLACE TABLE `project.dataset.table_without_duplicates` AS
      SELECT *
      FROM `project.dataset.table` AS t
      WHERE NOT EXISTS (
        SELECT 1
        FROM `project.dataset.other_table` AS o
        WHERE t.column1 = o.column1
          AND t.column2 = o.column2
      );

      DROP TABLE `project.dataset.table`;
      ALTER TABLE `project.dataset.table_without_duplicates` RENAME TO `table`;
      • Explanation: Creates a new table excluding rows that match duplicates in another table.
  9. How to handle duplicates in BigQuery with custom SQL logic?

    • Description: Use custom SQL logic to identify and handle duplicate rows based on specific requirements.

    • Code:

      CREATE OR REPLACE TABLE `project.dataset.new_table` AS
      WITH RankedRows AS (
        SELECT
          *,
          ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column3 DESC) AS rn
        FROM `project.dataset.table`
      )
      SELECT * EXCEPT (rn)
      FROM RankedRows
      WHERE rn = 1;

      DROP TABLE `project.dataset.table`;
      ALTER TABLE `project.dataset.new_table` RENAME TO `table`;
      • Explanation: Uses custom ranking logic to handle duplicates and replace the original table with the deduplicated version.
  10. How to optimize the deletion of duplicate rows in BigQuery?

    • Description: Optimize the deduplication process for better performance on large datasets.

    • Code:

      CREATE OR REPLACE TABLE `project.dataset.optimized_table` AS
      SELECT column1, column2, MAX(timestamp_column) AS latest_timestamp
      FROM `project.dataset.table`
      GROUP BY column1, column2;

      DELETE FROM `project.dataset.table` t
      WHERE EXISTS (
        SELECT 1
        FROM `project.dataset.optimized_table` o
        WHERE t.column1 = o.column1
          AND t.column2 = o.column2
          AND t.timestamp_column < o.latest_timestamp
      );
      • Explanation: Creates a small helper table holding the latest timestamp per column1, column2 group, then deletes every row in the original table that is older than its group's latest timestamp. Rows tied on the latest timestamp are all kept.
