Data analytics using R
Advantages of R over other Programming Languages
• Python relies on third-party packages for data visualization and
statistical computing, whereas R ships with much of this functionality
in its base distribution
• If you need to run statistical calculations in your application, learn and
deploy R. It easily integrates with programming languages such as
Java, C++, Python and Ruby
• If you need to use re-usable libraries to solve a complex problem,
leverage the thousands of free packages available for R (for example, from CRAN).
• R is free. It is available under the terms of the Free Software
Foundation’s GNU General Public License in source code form
• It is available for Windows, Mac and a wide variety of Unix platforms
(including FreeBSD, Linux, etc.).
• R has excellent tools for creating graphics such as bar charts, scatter
plots, multipanel lattice charts, etc.
• It has an object oriented and functional programming structure along
with support from a robust and vibrant community
• R has a flexible analysis tool kit, which makes it easy to access data in
various formats, manipulate it (transform, merge, aggregate, etc.), and
subject it to traditional and modern statistical models (such as
regression, ANOVA, tree models, etc.)
• R can easily import data from MS Excel, MS Access, MySQL, SQLite,
Oracle, etc. It can connect to databases through ODBC (Open
Database Connectivity) or through packages such as ROracle.
R Basics
R as a calculator
• R can be used as a calculator.
• You can just type the calculations on the prompt. After typing
these, you should press Return to execute the calculation.
• 2+1 # add
• 2-1 # subtract
• 2*1 # multiply
• 2/1 # divide
• 2^2 # exponentiation (power)
• 2+2*3
• (2+2)*3
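The last two calculations differ only in their parentheses. As a small sketch of how R's operator precedence plays out (the variable names here are illustrative and not part of the original slides):

```r
# Multiplication binds tighter than addition, so 2 * 3 is evaluated first.
no_parens <- 2 + 2 * 3

# Parentheses force the addition to happen first.
with_parens <- (2 + 2) * 3

# Exponentiation binds tighter than the unary minus, so this is -(2^2).
neg_square <- -2^2

print(c(no_parens, with_parens, neg_square))
```

When in doubt about the order of evaluation, adding parentheses is the safest choice.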
R contains many mathematical functions
• log(10) # natural logarithm, 2.3
• log2(8) # 3
• exp(2.3) # 9.97
• sin(10) # -0.54
• sqrt(9) # square root, 3
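Note that log() is the natural logarithm by default. A brief sketch of how other bases can be requested, with illustrative variable names that are not in the original:

```r
# log() takes an optional base argument; log10() and log2() are shortcuts.
log_base10 <- log(100, base = 10)
log_short  <- log10(100)

# exp() is the inverse of the natural logarithm.
round_trip <- log(exp(1))

# round() limits the number of printed decimals.
exp_rounded <- round(exp(2.3), 2)

print(c(log_base10, log_short, round_trip, exp_rounded))
```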
Comparison operators
== # is equal to
!= # is not equal to
> # is larger than
>= # is larger than or equal to
< # is smaller than
<= # is smaller than or equal to
Examples
• 3==3 # TRUE
• 2!=3 # TRUE
• 2<=3 # TRUE
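One caveat worth knowing: == compares values exactly, which can surprise with decimal numbers stored in floating point. A minimal sketch, assuming standard double-precision arithmetic (variable names are illustrative):

```r
# 0.1 + 0.2 is not stored exactly, so an exact comparison fails.
sum_val <- 0.1 + 0.2
exact_match <- sum_val == 0.3

# all.equal() compares with a small tolerance; wrap it in isTRUE()
# because it returns a description (not FALSE) when values differ.
near_match <- isTRUE(all.equal(sum_val, 0.3))

print(c(exact_match, near_match))
```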
Logical operators
Basic operators are
& # and
| # or (on some keyboard layouts, typed by pressing Alt Gr and < simultaneously)
Examples
2==3 | 3==3 # TRUE (TRUE if either statement is TRUE)
2==3 & 3==3 # FALSE (one of the statements is FALSE, so the result is FALSE)
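The operators & and | also work element by element on whole vectors. A short sketch with a made-up vector x (not from the original):

```r
x <- c(1, 5, 10)

# Elementwise AND: TRUE only where both conditions hold.
in_range <- x > 2 & x < 8

# Elementwise OR: TRUE where at least one condition holds.
outside <- x < 2 | x > 8

# ! negates each element.
negated <- !in_range

# any() and all() collapse a logical vector to a single value.
print(any(in_range))
print(all(in_range))
```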
Data Input
Data Types
R has a wide variety of data types including
Scalars
Vectors (numerical, character, logical)
Matrices
Data Frames
Lists
Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
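Besides positive positions, subscripts can be negative numbers, ranges, or logical conditions. A small sketch using the numeric vector a from above (the result names are illustrative):

```r
a <- c(1, 2, 5.3, 6, -2, 4)

picked    <- a[c(2, 4)]  # 2nd and 4th elements
no_first  <- a[-1]       # everything except the 1st element
slice     <- a[2:3]      # 2nd through 3rd elements
above_3   <- a[a > 3]    # elements for which the condition is TRUE

print(picked)
print(above_3)
```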
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,dimnames=list(char_vector_rownames,
char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the
columns and rows.
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames,
cnames))
# Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
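To see the effect of byrow, the same values can be filled both ways and compared. This is a minimal sketch; by_col and by_row are illustrative names, and mymatrix is re-created so the block runs on its own:

```r
# Default: fill column by column.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: fill row by row.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)

# With dimnames, elements can also be addressed by label.
mymatrix <- matrix(c(1, 26, 24, 68), nrow = 2, ncol = 2, byrow = TRUE,
                   dimnames = list(c("R1", "R2"), c("C1", "C2")))
print(mymatrix["R1", "C2"])
```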
Arrays
Arrays are similar to matrices but can have more than two
dimensions. See help(array) for details.
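As a brief illustration, a three-dimensional array can be built with array() and indexed with one subscript per dimension. The name arr and the chosen dimensions are assumptions for this sketch:

```r
# 24 values arranged as 2 rows x 3 columns x 4 layers.
arr <- array(1:24, dim = c(2, 3, 4))

print(dim(arr))

# One subscript per dimension: row, column, layer.
first_of_layer2 <- arr[1, 1, 2]
last_element    <- arr[2, 3, 4]

# Leaving a subscript empty selects everything along that dimension,
# so this is the whole first layer (a 2 x 3 matrix).
layer1 <- arr[, , 1]
print(layer1)
```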
Data frames
A data frame is more general than a matrix, in that different columns can have
different modes (numeric, character, factor, etc.).
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") #variable names
There are a variety of ways to identify the elements of a data frame. Here myframe stands for any data frame that has the referenced columns.
myframe[3:5] # columns 3,4,5 of data frame
myframe[c("ID","Age")] # columns ID and Age from data frame
myframe$X1 # variable X1 in the data frame
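The same access patterns can be tried on a concrete data frame. The sketch below rebuilds a three-column data frame like mydata so that it runs on its own; the result names are illustrative:

```r
mydata <- data.frame(
  ID     = c(1, 2, 3, 4),
  Color  = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE)
)

first_two <- mydata[1:2]                # columns 1 and 2, still a data frame
by_name   <- mydata[c("ID", "Passed")]  # columns selected by name
colors    <- mydata$Color               # a single column as a vector

# Rows can be filtered with a logical condition before the comma.
passed_rows <- mydata[mydata$Passed, ]

str(mydata)
print(passed_rows)
```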
Lists
An ordered collection of objects (components). A list allows you to gather a variety of (possibly
unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of combining the components of two lists into one list
# (use list(list1, list2) instead to get a list that contains two lists)
v <- c(list1,list2)
Identify elements of a list using the [[]] convention.
mylist[[2]] # 2nd component of the list
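The difference between [[ ]], [ ] and $ is easiest to see on a small list. This is a sketch with made-up contents; the names w2, nested and combined are illustrative:

```r
w2 <- list(name = "Fred", mynumbers = c(1, 2, 5.3), age = 5.3)

second   <- w2[[2]]      # the component itself (a numeric vector)
sub_list <- w2[2]        # a list of length 1 that holds the component
by_name  <- w2$name      # access by component name
by_key   <- w2[["age"]]  # same idea, with the name as a string

# list() nests its arguments; c() concatenates their components.
nested   <- list(w2, list(x = 1))
combined <- c(w2, list(x = 1))

print(length(nested))
print(length(combined))
```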
Factors
Tell R that a variable is nominal by making it a factor. The factor stores the nominal
values as a vector of integers in the range [ 1... k ] (where k is the number of
unique values in the nominal variable), and an internal vector of character strings
(the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
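The internal integer coding can be inspected directly. A short sketch that rebuilds the gender factor so it runs on its own (codes and counts are illustrative names):

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))

# Levels are sorted alphabetically by default.
print(levels(gender))

# as.integer() exposes the underlying integer codes.
codes <- as.integer(gender)
print(table(codes))

# summary() on a factor reports the count per level.
counts <- summary(gender)
print(counts)
```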
Types of Data in Statistics
Numerical (discrete and continuous)
Categorical
Ordinal
Numerical data is of two types:
• Discrete
• Continuous
Discrete data represents items that can be counted. Its possible
values can be listed out, and the list may be finite or may go on
indefinitely (countably infinite).
Continuous data represents measurements. Its possible values cannot
be counted and can only be described using intervals on the real
number line.
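In R, one way to reflect this distinction is to tabulate discrete values directly and to group continuous values into intervals first. The data below are hypothetical and serve only as a sketch:

```r
# Hypothetical discrete data: number of visits per customer.
visits <- c(0L, 2L, 1L, 2L, 5L)

# Hypothetical continuous data: heights in metres.
heights <- c(1.62, 1.75, 1.80, 1.68, 1.91)

# Discrete values can be counted one by one.
visit_counts <- table(visits)
print(visit_counts)

# Continuous values are described by intervals; cut() assigns each
# measurement to an interval of the form (lower, upper].
bins <- cut(heights, breaks = c(1.5, 1.7, 1.9, 2.1))
bin_counts <- table(bins)
print(bin_counts)
```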
Categorical Data
Categorical Data is used to represent characteristics that
are present in the data such as a person’s gender, marital
status, hometown.
For example, in a given group of males and females, males
can be represented as 0 and females can be represented as
1. Therefore, we have two classes of distinct
characteristics.
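A 0/1 coding like this can be turned into a factor so that R treats the numbers as class labels rather than quantities. A minimal sketch with made-up codes (the names codes and sex are illustrative):

```r
# Hypothetical coded data: 0 = male, 1 = female.
codes <- c(0, 1, 1, 0, 1)

# levels gives the codes, labels gives the class name for each code.
sex <- factor(codes, levels = c(0, 1), labels = c("male", "female"))

print(sex)
print(table(sex))
```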
Ordinal data is similar to categorical data, with the only
difference being that the categories are ordered.
For example, rating a restaurant on a scale of 0 to 4 gives
us ordinal data. Ordinal data is often treated as categorical, but
the groups must be kept in order whenever graphs and charts are
created.
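In R, ordinal data can be represented as an ordered factor, which keeps the groups in the stated order for tables and plots. The ratings below are hypothetical and the names are illustrative:

```r
# Hypothetical restaurant ratings on a 0 to 4 scale.
ratings <- c(3, 0, 4, 2, 4, 1)

# ordered = TRUE tells R that the levels have a meaningful order.
rating_f <- factor(ratings, levels = 0:4, ordered = TRUE)

# Ordered factors support comparisons between values.
first_beats_second <- rating_f[1] > rating_f[2]

# table() reports the counts in level order, ready for a bar chart.
rating_counts <- table(rating_f)
print(rating_counts)
```

The resulting table can be passed to barplot() to draw the bars in rating order.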