@RahulDas-dev

This merge request adds a new duplicated() method to the DataFrame class that identifies duplicate rows within a DataFrame. Detecting duplicates is a common step in data cleaning and exploration workflows.

Resolves issue #667.

Features

  • Identifies duplicate rows in a DataFrame based on specified columns
  • Returns a Series of boolean values marking duplicate entries
  • Supports flexible options for handling duplicates:
    • keep: 'first' - Mark duplicates except for the first occurrence (default)
    • keep: 'last' - Mark duplicates except for the last occurrence
    • keep: false - Mark all duplicates
  • Allows focusing on specific columns with the subset option

Implementation Details

  • Optimized to handle large datasets efficiently with a hash-based approach
  • Comprehensive input validation for better error handling
  • Well-documented with JSDoc comments and examples

// Create a DataFrame with duplicate rows
const df = new DataFrame({
  'A': [1, 2, 2, 3, 3],
  'B': ['a', 'b', 'b', 'c', 'c']
});

// Find duplicates keeping first occurrence (default)
const dups = df.duplicated();
// Returns: [false, false, true, false, true]

// Find duplicates keeping last occurrence
const dupsLast = df.duplicated({ keep: 'last' });
// Returns: [false, true, false, true, false]

// Find duplicates based on specific columns
const dupsSubset = df.duplicated({ subset: ['B'] });
// Returns: [false, false, true, false, true]

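For reviewers who want a feel for the hash-based approach mentioned under Implementation Details, here is a minimal standalone sketch. It is not the code in this merge request: the function name `duplicatedRows`, the row-object input format, and the use of `JSON.stringify` to build row keys are all assumptions made for illustration. It validates the `keep` option, builds one key per row from the selected columns, and marks duplicates in a single pass over a `Set` (or a counting pass over a `Map` when `keep` is `false`).

```javascript
// Hypothetical helper for illustration only; NOT the method added in this MR.
// `rows` is assumed to be an array of plain objects, one object per row.
function duplicatedRows(rows, { subset = null, keep = 'first' } = {}) {
  // Basic input validation.
  if (!Array.isArray(rows)) {
    throw new TypeError('rows must be an array of objects');
  }
  if (keep !== 'first' && keep !== 'last' && keep !== false) {
    throw new Error("keep must be 'first', 'last', or false");
  }

  // Compare all columns of the first row unless a subset is given.
  const columns = subset ?? (rows.length > 0 ? Object.keys(rows[0]) : []);

  // Assumption: a JSON string of the selected values is a good enough key.
  // It keeps 1 and "1" distinct, but maps NaN and undefined to null.
  const keyOf = (row) => JSON.stringify(columns.map((c) => row[c]));

  const result = new Array(rows.length).fill(false);

  if (keep === false) {
    // Count every key, then mark each row whose key occurs more than once.
    const keys = rows.map(keyOf);
    const counts = new Map();
    for (const k of keys) counts.set(k, (counts.get(k) ?? 0) + 1);
    keys.forEach((k, i) => { result[i] = counts.get(k) > 1; });
    return result;
  }

  // Walk forward for 'first' and backward for 'last'; the first row seen
  // with a given key is kept, and every later row with that key is marked.
  const seen = new Set();
  const n = rows.length;
  for (let step = 0; step < n; step++) {
    const i = keep === 'first' ? step : n - 1 - step;
    const k = keyOf(rows[i]);
    if (seen.has(k)) result[i] = true;
    else seen.add(k);
  }
  return result;
}

// Same data as the example above, expressed as row objects.
const sampleRows = [
  { A: 1, B: 'a' },
  { A: 2, B: 'b' },
  { A: 2, B: 'b' },
  { A: 3, B: 'c' },
  { A: 3, B: 'c' },
];

const markAll = duplicatedRows(sampleRows, { keep: false });
// markAll is [false, true, true, true, true]
```

Each row is hashed once, so the work grows linearly with the number of rows rather than quadratically as it would with pairwise comparison. The `keep: false` case, which the example above does not show, marks every member of a duplicate group.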
Signed-off-by: rahuldas-dev <r.das699@gmail.com>