 
How to find the mean of each variable by a factor variable using dplyr while ignoring NA values in R?
If a data set has NA values in several numerical variables along with a grouping variable, finding the mean (or any other statistic) of each variable with the mean function means passing na.rm = TRUE separately for every variable. The summarise_all function of the dplyr package avoids this repetition: it applies the same function to every non-grouping variable, so the group-wise means of all variables can be found in just two lines of code.
Example
Loading dplyr package −
> library(dplyr)
Consider the ToothGrowth data set in base R −
> str(ToothGrowth)
'data.frame':	60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
> grouping_by_supp <- ToothGrowth %>% group_by(supp)
> grouping_by_supp %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 2 x 3
  supp    len  dose
  <fct> <dbl> <dbl>
1 OJ     20.7  1.17
2 VC     17.0  1.17
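ToothGrowth itself contains no missing values, so the effect of na.rm is not visible in the output above. The following is a minimal sketch of what changes when NA values are present. The copy tg_na and the choice of which rows to blank out are illustrative assumptions, not part of the original example.

```r
library(dplyr)

# Hypothetical copy of ToothGrowth with two artificial NA values in len:
# row 1 belongs to supp "VC" and row 31 to supp "OJ".
tg_na <- ToothGrowth
tg_na$len[c(1, 31)] <- NA

by_supp <- tg_na %>% group_by(supp)

# Default behaviour: a single NA makes the whole group mean NA.
with_na <- by_supp %>% summarise_all(mean)

# Extra arguments to summarise_all() are passed on to mean(),
# so na.rm = TRUE is applied to every non-grouping variable at once.
without_na <- by_supp %>% summarise_all(mean, na.rm = TRUE)

print(with_na)
print(without_na)
```

In with_na the len column is NA for both groups, while without_na reports a mean computed from the remaining non-missing values. The dose column is unaffected because it has no missing values.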
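Consider the mtcars data set in base R, here with cyl already converted to a factor with levels "four", "six" and "eight" (as the str output shows) −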
> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "four","six","eight": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> grouping_by_cyl <- mtcars %>% group_by(cyl)
> grouping_by_cyl %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 3 x 11
  cyl     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 four   26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
2 six    19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
3 eight  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
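In recent dplyr releases the funs() helper is deprecated, and across() (available from dplyr 1.0.0) is the recommended way to apply one function to many columns. The sketch below shows one way the same group-wise means could be written with across(). The object name cars and the recoding of cyl are assumptions made to mirror the str output above, since the original conversion step is not shown.

```r
library(dplyr)

# Hypothetical reconstruction of the modified data: the built-in mtcars
# stores cyl as a number, so it is recoded into a labelled factor here.
cars <- mtcars
cars$cyl <- factor(cars$cyl,
                   levels = c(4, 6, 8),
                   labels = c("four", "six", "eight"))

# across(everything(), ...) applies the function to every
# non-grouping column; .x stands for the current column.
cyl_means <- cars %>%
  group_by(cyl) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

print(cyl_means)
```

The lambda form `~ mean(.x, na.rm = TRUE)` keeps the na.rm argument next to the function it belongs to, which tends to read more clearly once several summary functions are combined.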
Consider the CO2 data set in base R −
> str(CO2)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':	84 obs. of  5 variables:
 $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
 $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
 $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
 $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
 $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
 - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
 - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
 - attr(*, "labels")=List of 2
  ..$ x: chr "Ambient carbon dioxide concentration"
  ..$ y: chr "CO2 uptake rate"
 - attr(*, "units")=List of 2
  ..$ x: chr "(uL/L)"
  ..$ y: chr "(umol/m^2 s)"
> grouping_by_Type <- CO2 %>% group_by(Type)
> grouping_by_Type %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 2 x 5
  Type        Plant Treatment  conc uptake
  <fct>       <dbl>     <dbl> <dbl>  <dbl>
1 Quebec         NA        NA   435   33.5
2 Mississippi    NA        NA   435   20.9
Warning messages:
1: In mean.default(Plant, na.rm = TRUE) : argument is not numeric or logical: returning NA
2: In mean.default(Plant, na.rm = TRUE) : argument is not numeric or logical: returning NA
3: In mean.default(Treatment, na.rm = TRUE) : argument is not numeric or logical: returning NA
4: In mean.default(Treatment, na.rm = TRUE) : argument is not numeric or logical: returning NA
These warnings appear because the variables Plant and Treatment are factors rather than numerical variables, so mean returns NA for them in each of the two groups.
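One way to avoid these warnings is to summarise only the numerical columns. The sketch below assumes dplyr 1.0.0 or later and uses across() with where(is.numeric). The names co2_df and numeric_means are illustrative, and the as.data.frame() call is a precaution to work with a plain data frame rather than the groupedData classes.

```r
library(dplyr)

# Precaution (an assumption, not in the original example): convert CO2
# to a plain data frame before grouping.
co2_df <- as.data.frame(CO2)

# where(is.numeric) selects only numeric columns, so the factor columns
# Plant and Treatment are never passed to mean().
numeric_means <- co2_df %>%
  group_by(Type) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

print(numeric_means)
```

The result keeps the grouping column Type together with conc and uptake, and leaves out the factor columns that produced the NA results.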
 