Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Exploding multiple columns
#1
I'm importing an Excel file into Python using Pandas, this is a test dataset I'm using to get the code working (the actual data I'm trying to read in is very large):

[Image: o6ajuTu.png]

Basically, my input data will have certain columns (Items, Type and Price) which contain lists of items separated by pipes. What I want to do is explode this data to get a result like this:

[Image: x4w62nB.png]

I've read the data into a dataframe (df) and this is the code I'm using to reformat and explode the data:

explodecolumns = ['Items', 'Type', 'Price'] df[explodecolumns] = df[explodecolumns].apply(lambda x: x.str.replace(r'\s*\|\s*', '|', regex=True).str.split('|')) dfexploded = df.explode(explodecolumns).reset_index(drop=True)
This works fine if I input the first two lines, where the lists are perfectly aligned, but of course it throws an error if I include the bottom two which are the more general case. My source data won't necessarily have all the data populated (eg the third line only has three entries in the "Items" column but four elements in the "Type" column) and might have no data at all in some cells.

I've been trying to figure out if there's a simple way to do this, does anyone know?

P.S. I think this code should create my test dataset if that helps:

data = {'Cart_ID': [923328, 923549, 924028, 901983], 'Date': ['2023-07-04 00:00:00', '2023-06-22 00:00:00', '2024-11-13 00:00:00', '2021-03-09 00:00:00'], 'Items': ['Apples', 'Bran | Soup', '| | Peas', 'Carrots | | Sauce'], 'Type': ['Fruit', 'Cereal | Cans', 'Pots | | | Fruit', float("nan")], 'Bag': ['Plastic', 'Paper', 'Paper', 'Plastic'], 'Weight': [2.65, 8.4, 21.31, 10.2], 'Price': ['3.8', '4.1 | 2.15', '33.4 | 1.1 | 4.2 | 0.43', '0.67 | 9.4 | 3.21']}
Reply
#2
Using the data you supplied, if we split date['Items'] or data['Type'] on '|' we get a list of 6 elements.

But if we split data['Price'] we get a list of 10 elements. This will cause problems: 6 items with 10 prices!

And why you have float("nan") in there God only knows!

import pandas as pd data = {'Cart_ID': [923328, 923549, 924028, 901983], 'Date': ['2023-07-04 00:00:00', '2023-06-22 00:00:00', '2024-11-13 00:00:00', '2021-03-09 00:00:00'], 'Items': ['Apples', 'Bran | Soup', '| | Peas', 'Carrots | | Sauce'], 'Type': ['Fruit', 'Cereal | Cans', 'Pots | | | Fruit', float("nan")], 'Bag': ['Plastic', 'Paper', 'Paper', 'Plastic'], 'Weight': [2.65, 8.4, 21.31, 10.2], 'Price': ['3.8', '4.1 | 2.15', '33.4 | 1.1 | 4.2 | 0.43', '0.67 | 9.4 | 3.21']} # all the lists for data[key] are 4 elements long for key in data.keys(): print(len(data[key])) keys = ['Items', 'Type', 'Price'] # if we split e.g. 'Bran | Soup' we need to associate the correct which is messy # data['Cart_ID'], data['Date'], data['Bag'], data['Weight'] for each item in the sublist a new_data = {} keys = ['Items', 'Type', 'Price'] for key in keys: newlist = [] for i in range(4): print(f'key = {key}, i = {i} ... ') # float("nan") could be a spanner in these works! if type(data[key][i]) == float: newlist.append(data[key][i]) #data[key] = newlist elif not '|' in data[key][i]: newlist.append(data[key][i]) elif '|' in data[key][i]: a = data[key][i].split() for b in a: if not b == '|': print(f'b = {b}') newlist.append(b) new_data['new' + key] = newlist
Try this:

# this will not work because all the lists must be the same length partdf = pd.DataFrame.from_dict(new_data, orient='columns') # this will work, but there are 3 prices associated with None and Sauce has NaN as Type partdf = pd.DataFrame.from_dict(new_data, orient='index')
I get this:

Output:
0 1 2 3 4 5 6 7 8 9 newItems Apples Bran Soup Peas Carrots Sauce None None None None newType Fruit Cereal Cans Pots Fruit NaN None None None None newPrice 3.8 4.1 2.15 33.4 1.1 4.2 0.43 0.67 9.4 3.21
I think your data is faulty!

I amended the data then it works:

import pandas as pd # data2['Price'] has 10 elements so all lists must contain 10 items, even if the list is technically shorter data2 = {'Cart_ID': [923328, 923549, 924028, 901983, 901984, 901985, 901986, 901987, 901988, 901989], 'Date': ['2023-07-04 00:00:00', '2023-06-22 00:00:00', '2024-11-13 00:00:00', '2021-03-09 00:00:00', '2023-07-04 00:00:00', '2023-06-22 00:00:00', '2024-11-13 00:00:00', '2021-03-09 00:00:00', '2023-06-22 00:00:00', '2024-11-13 00:00:00'], 'Items': ['Apples', 'Bran | Soup', '| | Peas', 'Carrots | | Tomato-Ketchup', '包子 | 饺子 | 卷饼 | 油条'], 'Type': ['Fruit', 'Cereal | Cans', 'Pots | | | Fruit', 'Sauces', '早饭 | 午饭 | 晚饭 | 随便'], 'Bag': ['Plastic', 'Paper', 'Paper', 'Plastic', 'Plastic', 'Paper', 'Paper', 'Plastic', 'Silicon', 'Hemp'], 'Weight': [2.65, 8.4, 21.31, 10.2, 2.65, 8.4, 21.31, 10.2, 12, 13], 'Price': ['3.8', '4.1 | 2.15', '33.4 | 1.1 | 4.2 | 0.43', '0.67 | 9.4', '| 3.21']} keys = ['Items', 'Type', 'Price'] # the lists for data2[key] are not all the same length # but after splitting elements we will have 10 elements in each list for key in data.keys(): print(key, len(data[key])) keys = ['Items', 'Type', 'Price'] new_data = {} for key in keys: newlist = [] for i in range(5): print(f'key = {key}, i = {i} ... ') # float("nan") could be a spanner in these works! if type(data2[key][i]) == float: newlist.append(data2[key][i]) #data[key] = newlist elif not '|' in data2[key][i]: newlist.append(data2[key][i]) elif '|' in data2[key][i]: a = data2[key][i].split() for b in a: if not b == '|': print(f'b = {b}') newlist.append(b) new_data['new' + key] = newlist # now the lists are all 10 elements long for key in new_data.keys(): print(key, len(new_data[key]))
Now these both work:

# this will works because all the lists are the same length partdf = pd.DataFrame.from_dict(new_data, orient='columns')
Gives:

Output:
newItems newType newPrice 0 Apples Fruit 3.8 1 Bran Cereal 4.1 2 Soup Cans 2.15 3 Peas Pots 33.4 4 Carrots Fruit 1.1 5 Tomato-Ketchup Sauces 4.2 6 包子 早饭 0.43 7 饺子 午饭 0.67 8 卷饼 晚饭 9.4 9 油条 随便 3.21
And this:

# now each Price is associated with an item partdf = pd.DataFrame.from_dict(new_data, orient='index')
Gives:

Output:
0 1 2 3 ... 6 7 8 9 newItems Apples Bran Soup Peas ... 包子 饺子 卷饼 油条 newType Fruit Cereal Cans Pots ... 早饭 午饭 晚饭 随便 newPrice 3.8 4.1 2.15 33.4 ... 0.43 0.67 9.4 3.21
And all prices are associated with an item. Customers may dislike paying something for nothing!

You could also post your Excel files for reference.
Reply
#3
If it helps anyone I eventually designed code that works in the general case:

df = dfimport['Sheet6'].copy().fillna('') explodecolumns = ['Items', 'Type', 'Price'] df[explodecolumns] = df[explodecolumns].apply(lambda x: x.str.replace(r'\s*\|\s*', '|', regex=True).str.split('|')) # Padding lists in rows for the explode function df['Length'] = df[explodecolumns].map(len).max(axis=1) def pad_list(lst, pad_amount, filler): if len(lst) < pad_amount: return lst + [filler] * (pad_amount - len(lst)) return lst for col in explodecolumns: df[col] = df.apply(lambda row: pad_list(row[col], row['Length'], ''), axis=1) df.drop(['Length'], axis=1, inplace=True) dfexploded = df.copy().explode(explodecolumns) dfexploded['ItemNum'] = dfexploded.groupby(dfexploded.index).cumcount() + 1 dfexploded.reset_index(drop=True, inplace=True)
Might not be the most efficient way but it works.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How to check multiple columns value within range SamLiu 2 3,016 Mar-13-2023, 09:32 AM
Last Post: SamLiu
  How to move multiple columns to initial position SriRajesh 4 3,822 Jul-02-2022, 10:34 AM
Last Post: deanhystad
  df column aggregate and group by multiple columns SriRajesh 0 2,120 May-06-2022, 02:26 PM
Last Post: SriRajesh
  Split single column to multiple columns SriRajesh 1 2,422 Jan-07-2022, 06:43 PM
Last Post: jefsummers
  Apply fillna to multiple columns in dataframe rraillon 2 5,133 Aug-05-2021, 01:11 PM
Last Post: rraillon
  Pandas: how to split one row of data to multiple rows and columns in Python GerardMoussendo 4 11,715 Feb-22-2021, 06:51 PM
Last Post: eddywinch82
  How to fill parameter with data from multiple columns CSV file greenpine 1 3,288 Dec-21-2020, 06:50 PM
Last Post: Larz60+
  convert string into multiple columns in python3 VAN 2 3,981 Sep-26-2020, 11:14 PM
Last Post: scidam
  How to melt dataframe multiple columns to one column Mekala 1 4,618 Sep-24-2020, 08:32 PM
Last Post: scidam
  Dividing certain columns in multiple CSV files NatesM 4 7,242 Dec-20-2019, 07:59 PM
Last Post: ichabod801

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020
This forum uses Lukasz Tkacz MyBB addons.
Forum use Krzysztof "Supryk" Supryczynski addons.