wr.catalog.sanitize_dataframe_columns_names does not sanitize enough #1119

@kailukowiak

Issue

wr.catalog.sanitize_dataframe_columns_names does not correctly convert CamelCase to snake_case. It also allows duplicate column names, which would be problematic in Athena.

Example

    import pandas as pd
    import awswrangler as wr

    print(wr.__version__)
    # 2.13.0

    df = pd.DataFrame({'SomeLongStringWithMultipleWords': [1, 2, 3],
                       'sinceEpoch': [4, 5, 6],
                       'SinceEpoch': [7, 8, 9],
                       'since_epoch': [7, 8, 9]})
    df_clean = wr.catalog.sanitize_dataframe_columns_names(df)

    print(df_clean)
    #    somelongstringwithmultiplewords  sinceepoch  sinceepoch  since_epoch
    # 0                                1           4           7            7
    # 1                                2           5           8            8
    # 2                                3           6           9            9

    print(df_clean["sinceepoch"])
    #    sinceepoch  sinceepoch
    # 0           4           7
    # 1           5           8
    # 2           6           9

First issue

Based on the documentation for this function, I would have expected each column name to be snake_case.
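
For illustration, here is a minimal sketch of the conversion I had in mind. It is not awswrangler's implementation; `camel_to_snake` is a hypothetical helper name, and the two regular expressions are just one common way to split CamelCase words.

```python
import re


def camel_to_snake(name: str) -> str:
    """Hypothetical helper (not part of awswrangler): CamelCase -> snake_case."""
    # Put an underscore between a lowercase letter or digit and the
    # uppercase letter that follows it: "sinceEpoch" -> "since_Epoch".
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalized word:
    # "HTTPServer" -> "HTTP_Server".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    return s.lower()


columns = ["SomeLongStringWithMultipleWords", "sinceEpoch", "SinceEpoch", "since_epoch"]
print([camel_to_snake(c) for c in columns])
```

With a conversion like this, `sinceEpoch`, `SinceEpoch` and `since_epoch` all map to the same name, so proper snake_case conversion makes the duplicate-column problem more likely, not less.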

Second issue

This is more of a judgement call, and I can see reasons for not adding it. However, to live up to the function name, duplicate columns should either raise an error or have a suffix added to them. I'd appreciate it if this were at least available as an option.
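
To make the request concrete, here is a rough sketch of the two behaviours on a plain list of column names. Everything in it is an assumption for illustration: `dedupe_names` and its `on_duplicate` parameter are hypothetical names, not an existing awswrangler API, and the `_1`, `_2` suffix scheme is just one possible choice.

```python
from typing import Dict, List, Set


def dedupe_names(names: List[str], on_duplicate: str = "suffix") -> List[str]:
    """Hypothetical helper: make column names unique or fail loudly."""
    if on_duplicate not in ("suffix", "raise"):
        raise ValueError("on_duplicate must be 'suffix' or 'raise'")
    used: Set[str] = set()       # names already emitted
    counts: Dict[str, int] = {}  # last suffix number tried per base name
    result: List[str] = []
    for name in names:
        if name not in used:
            used.add(name)
            result.append(name)
            continue
        if on_duplicate == "raise":
            raise ValueError(f"duplicate column name: {name!r}")
        # Try name_1, name_2, ... until an unused candidate is found.
        n = counts.get(name, 0) + 1
        candidate = f"{name}_{n}"
        while candidate in used:
            n += 1
            candidate = f"{name}_{n}"
        counts[name] = n
        used.add(candidate)
        result.append(candidate)
    return result


print(dedupe_names(["sinceepoch", "sinceepoch", "since_epoch"]))
# ['sinceepoch', 'sinceepoch_1', 'since_epoch']
```

The result could then be assigned back with `df.columns = dedupe_names(list(df.columns))`. Passing `on_duplicate="raise"` instead would surface the collision before the data reaches Athena.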
