- Notifications
You must be signed in to change notification settings - Fork 722
Closed
Labels
enhancementNew feature or requestNew feature or requestminor releaseWill be addressed in the next minor releaseWill be addressed in the next minor releaseready to release
Milestone
Description
Issue
wr.catalog.sanitize_dataframe_columns_names does not correctly sanitize CammelCase to snake_case. It also allows duplicate columns which would be problematic in Athena.
Example
import pandas as pd import awswrangler as wr print(wr.__version__) # 2.13.0 df = pd.DataFrame({'SomeLongStringWithMultipleWords': [1, 2, 3], 'sinceEpoch': [4, 5, 6], 'SinceEpoch': [7, 8, 9], 'since_epoch': [7, 8, 9]}) df_clean = wr.catalog.sanitize_dataframe_columns_names(df) print(df_clean) # somelongstringwithmultiplewords sinceepoch sinceepoch since_epoch # 0 1 4 7 7 # 1 2 5 8 8 # 2 3 6 9 9 print(df_clean["sinceepoch"]) # sinceepoch sinceepoch # 0 4 7 # 1 5 8 # 2 6 9 First issue
Based on the documentation for this function, I would have expected each column name to be snake_case.
Second issue
This is more of a judgement call and I could see reasons for not adding it, however, to live up to the function name, if duplicate columns would either throw an error or have a suffix added to them. I'd appreciate it if this was at least optional.
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or requestminor releaseWill be addressed in the next minor releaseWill be addressed in the next minor releaseready to release