How to remove rows whose combination of categorical column values occurs three or fewer times in an R data frame?



In data analysis, we sometimes reduce the data or sample size based on our own judgment, which means removing part of the data. One such case is removing rows whose combination of values in the categorical columns occurs three or fewer times. This can be done with the filter function of the dplyr package, after grouping on the categorical columns with the group_by function.
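As a minimal sketch of the idea, the snippet below applies the group-then-filter pattern to a small hand-made data frame. The names toy, g1, g2, value and kept are hypothetical and chosen only for illustration. The trailing ungroup() is optional and assumes you want a plain tibble for later steps.

```r
library(dplyr)

# Hypothetical toy data: the ("a", "x") combination appears 4 times,
# every other combination appears fewer than 4 times.
toy <- data.frame(
  g1 = c("a", "a", "a", "a", "b", "b", "c"),
  g2 = c("x", "x", "x", "x", "y", "y", "x"),
  value = 1:7
)

kept <- toy %>%
  group_by(g1, g2) %>%  # one group per distinct (g1, g2) combination
  filter(n() >= 4) %>%  # n() is the number of rows in the current group
  ungroup()             # drop the grouping so later steps see a plain tibble

kept
```

Only the four ("a", "x") rows should remain, because every other combination occurs three or fewer times.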

Example 1

Consider the following data frame:

set.seed(121)
x1 <- sample(LETTERS[1:6], 20, replace = TRUE)
x2 <- sample(c("Male", "Female"), 20, replace = TRUE)
x3 <- rpois(20, 5)
df1 <- data.frame(x1, x2, x3)
df1

Output

   x1     x2 x3
1   D Female  5
2   D Female  2
3   D   Male  7
4   D Female  8
5   A   Male  6
6   C Female  7
7   A Female  3
8   C Female  1
9   C Female  7
10  E   Male  2
11  D Female  3
12  E Female  6
13  F Female  3
14  D Female  4
15  A   Male  4
16  E   Male  4
17  B Female  8
18  B Female  7
19  C Female  5
20  A Female  9

Load the dplyr package and remove the rows whose combination of x1 and x2 occurs three or fewer times, that is, keep only the combinations that occur at least four times:

Example

library(dplyr)
df1 %>% group_by(x1, x2) %>% filter(n() >= 4)

Output

# A tibble: 9 x 3
# Groups:   x1, x2 [2]
  x1    x2        x3
  <chr> <chr>  <int>
1 D     Female     5
2 D     Female     2
3 D     Female     8
4 C     Female     7
5 C     Female     1
6 C     Female     7
7 D     Female     3
8 D     Female     4
9 C     Female     5
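To see why only these nine rows survive, it can help to count the rows per combination before filtering. The sketch below rebuilds df1 from the values printed above instead of calling sample(), since random draws can differ between R versions. The names combo_counts and frequent are hypothetical.

```r
library(dplyr)

# df1 rebuilt by hand from the printed output above
df1 <- data.frame(
  x1 = c("D", "D", "D", "D", "A", "C", "A", "C", "C", "E",
         "D", "E", "F", "D", "A", "E", "B", "B", "C", "A"),
  x2 = c("Female", "Female", "Male", "Female", "Male", "Female", "Female",
         "Female", "Female", "Male", "Female", "Female", "Female", "Female",
         "Male", "Male", "Female", "Female", "Female", "Female"),
  x3 = c(5, 2, 7, 8, 6, 7, 3, 1, 7, 2, 3, 6, 3, 4, 4, 4, 8, 7, 5, 9)
)

# count() returns one row per (x1, x2) combination, with its size in column n
combo_counts <- df1 %>% count(x1, x2)
combo_counts

# Combinations that pass the n() >= 4 filter
frequent <- combo_counts %>% filter(n >= 4)
frequent
```

In this data only C/Female (4 rows) and D/Female (5 rows) reach the threshold, which accounts for the 9 rows in the filtered result.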

Example 2

y1 <- sample(c("S1", "S2", "S3", "S4", "S5", "S6"), 20, replace = TRUE)
y2 <- sample(c("Winter", "Summer"), 20, replace = TRUE)
y3 <- rnorm(20, 3)
df2 <- data.frame(y1, y2, y3)
df2

Output

   y1     y2       y3
1  S1 Winter 2.683082
2  S4 Summer 1.141916
3  S6 Winter 3.371681
4  S2 Winter 3.191187
5  S3 Summer 2.195504
6  S5 Summer 2.631736
7  S3 Winter 3.303605
8  S6 Summer 3.074344
9  S5 Summer 2.663724
10 S5 Winter 2.281991
11 S6 Summer 4.174418
12 S4 Winter 6.081246
13 S4 Summer 3.202913
14 S2 Winter 5.557243
15 S2 Winter 3.747462
16 S2 Winter 2.621571
17 S2 Summer 3.909743
18 S5 Winter 2.325663
19 S5 Summer 3.749852
20 S5 Winter 2.331191

Example

df2 %>% group_by(y1, y2) %>% filter(n() >= 4)

Output

# A tibble: 4 x 3
# Groups:   y1, y2 [1]
  y1    y2        y3
  <chr> <chr>  <dbl>
1 S2    Winter  3.19
2 S2    Winter  5.56
3 S2    Winter  3.75
4 S2    Winter  2.62
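If dplyr is not available, the same kind of filter can be sketched in base R with ave(), which here computes the size of each row's group. This is an alternative approach, not the method used in the examples above. The data frame toy and the names group_size and kept are hypothetical.

```r
# Hypothetical toy data: ("S2", "Winter") occurs 4 times,
# ("S5", "Summer") 3 times and ("S1", "Winter") once.
toy <- data.frame(
  y1 = c("S2", "S2", "S2", "S2", "S5", "S5", "S5", "S1"),
  y2 = c("Winter", "Winter", "Winter", "Winter",
         "Summer", "Summer", "Summer", "Winter"),
  y3 = c(3.2, 5.6, 3.7, 2.6, 2.6, 2.7, 3.7, 2.7)
)

# For every row, ave() returns the number of rows sharing its (y1, y2) pair
group_size <- ave(seq_len(nrow(toy)), toy$y1, toy$y2, FUN = length)

# Keep only rows whose combination occurs at least 4 times
kept <- toy[group_size >= 4, ]
kept
```

Unlike the dplyr version, the result is an ordinary data frame with no grouping attached, so no ungroup() step is needed afterwards.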
Updated on: 2021-02-08T12:55:16+05:30
