conditional groupby and aggregation on the conditioned group values

harrshu · (This post was last modified: Oct-20-2019, 07:32 AM by harrshu.)

ID City State Code Revenue
12 Indore MP 420 1567290
12 Indore MP 420 2009761
17 Mumbai MH 630 1567290
15 Mumbai MH 514 2009761
13 Indore MP 420 5289016
12 Mumbai MH 630 8904379

I am a newbie to Python
My data is as above. My problem statement is to group this data by keys-ID,City,State,Code and for like values of the keys the Revenue should be REvenue[i]+Revenue[j] E.g.: for ID=12, City=Indore, State=MP, Code=420 Revenue varies in the first two rows. So I wish to make the revenue as addition of both revenues for both the rows. Similarly for other rows where other keys match the revenue should be addition of therespective keys. I am unable to do this with the regular groupby+sum() on pandas . Not gettingthe expected output. Can someone please guide me with a generic code. Thank you

If the revenues are different for the like values of keys then the revenue should be as it is

DeaD_EyE · (This post was last modified: Oct-20-2019, 10:24 AM by DeaD_EyE.)

If your input is in csv-format, you could use the csv module.
The main point by grouping is, that you have to sort it before.
The functions sorted and itertools.groupby have key functions, which are called.
A helper function is operator.itemgetter to get items by index or name.

from operator import itemgetter data = [(1,10), (2, 0), (3, -10), (4, -20)] print(sorted(data, key=itemgetter(1))) # sort by second item

Output:
[(4, -20), (3, -10), (2, 0), (1, 10)]

Now let's do it with data for grouping:

from operator import itemgetter from itertools import groupby data = [ (10,20,'foo'), (10,30,'bar'), (10,10,'fizz'), (1,3,'bar'), (1,3,'foo'), ] getter = itemgetter(2) # getter for third field sorted_list = sorted(data, key=getter) print(sorted_list)

Output:
[(10, 30, 'bar'), (1, 3, 'bar'), (10, 10, 'fizz'), (10, 20, 'foo'), (1, 3, 'foo')]

Next step is the grouping:

from operator import itemgetter from itertools import groupby data = [ (10,20,'foo'), (10,30,'bar'), (10,10,'fizz'), (1,3,'bar'), (1,3,'foo'), ] getter = itemgetter(2) # getter for third field sorted_list = sorted(data, key=getter) grouped = groupby(sorted_list, getter) for group, elements in grouped: print('Group ->', group) for element in elements: print(element)

Output:
Group -> bar (10, 30, 'bar') (1, 3, 'bar') Group -> fizz (10, 10, 'fizz') Group -> foo (10, 20, 'foo') (1, 3, 'foo')

If you don't sort, following happens:

from operator import itemgetter from itertools import groupby data = [ (10,20,'foo'), (10,30,'bar'), (10,10,'fizz'), (1,3,'bar'), (1,3,'foo'), ] getter = itemgetter(2) # getter for third field grouped = groupby(data, getter) for group, elements in grouped: print('Group ->', group) for element in elements: print(element)

Output:
Group -> foo (10, 20, 'foo') Group -> bar (10, 30, 'bar') Group -> fizz (10, 10, 'fizz') Group -> bar (1, 3, 'bar') Group -> foo (1, 3, 'foo')

Now the whole implementation with some tricks:

#!/usr/bin/env python3 """ Group a csv file by a given field. """ import sys from argparse import ArgumentParser from itertools import groupby from operator import itemgetter from pathlib import Path from csv import reader, writer def group_csv_reader(file, fields): with file.open() as fd: csv = reader(fd) header = next(csv) for field in fields: if field not in header: raise ValueError(f'Only following fields are valid to group: {header}') indicies = [header.index(field) for field in fields] getter = itemgetter(*indicies) # the itemgetter can get one or more items from a sequence or mapping yield header csv = sorted(csv, key=getter) # remove sorted, to see the effect # this will consome the whole csv file # and sort it. After the sorting # the whole content of the file is in memory # if the file is smaller than 1GiB, # it's ok not to be memory efficient # now the file is closed and # csv still contains the sorted rows by field csv = groupby(csv, key=getter) # groupby itself does not sort # it just yield the group and rows for group, rows in csv: for row in rows: yield row def main(): parser = ArgumentParser(description=__doc__) parser.add_argument('file', type=Path, help='CSV input file to process') # file is a Path object you can work with parser.add_argument('fields', nargs='+', help='Fieldname(s) to group by') parser.add_argument('-o', type=Path, help='Output to a file') # o is a boolean # one or more fieldname args = parser.parse_args() if args.o: fout = args.o.open('w') else: fout = sys.stdout # now make a csv_writer, wich uses currently sys.stdout csv_writer = writer(fout) # catching the error, if a field was # given which does not exist try: iterator = group_csv_reader(args.file, args.fields) except ValueError as e: print(e, file=sys.stderr) fout.close() return 1 for row in iterator: csv_writer.writerow(row) return 0 if __name__ == '__main__': sys.exit(main())

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Get an average of the unique values of a column with group by condition and assign it	klllmmm	0	2,866	Feb-17-2024, 05:53 PM Last Post: klllmmm
	Add a row per group and depending of values	Menthix	0	1,346	Mar-20-2023, 11:20 AM Last Post: Menthix
	Class-Aggregation and creating a list/dictionary	IoannisDem	1	3,273	Oct-03-2021, 05:16 PM Last Post: Yoriz
	What is the better way of avoiding duplicate records after aggregation in pandas ?	jagasrik	0	2,447	Aug-30-2020, 05:26 PM Last Post: jagasrik
	Using groupby on non-categorical values	namy77	1	2,567	Feb-04-2020, 11:58 PM Last Post: scidam
	Aggregation json by nested elements	Omri	1	3,509	Sep-05-2018, 04:45 PM Last Post: Larz60+
	Diffusion limited aggregation	pythonic	0	3,484	Jun-11-2018, 09:45 PM Last Post: pythonic

conditional groupby and aggregation on the conditioned group values

User Panel Messages

Announcements