Data Analytics using R
Advantages of R over other Programming Languages
• Python relies on third-party packages for data visualization and
 statistical computing, whereas R provides much of this functionality
 out of the box.
• If you need to run statistical calculations in your application, learn and
 deploy R. It easily integrates with programming languages such as
 Java, C++, Python and Ruby
• If you need reusable libraries to solve a complex problem,
 leverage the thousands of free packages available for R on CRAN.
• R is free. It is available under the terms of the Free Software
 Foundation’s GNU General Public License in source code form
• It is available for Windows, Mac and a wide variety of Unix platforms
 (including FreeBSD, Linux, etc.).
• R has excellent tools for creating graphics such as bar charts, scatter
 plots, multipanel lattice charts, etc.
• It supports both object-oriented and functional programming and is
 backed by a robust and vibrant community.
• R has a flexible analysis tool kit, which makes it easy to access data in
 various formats, manipulate it (transform, merge, aggregate, etc.), and
 subject it to traditional and modern statistical models (such as
 regression, ANOVA, tree models, etc.)
• R can easily import data from MS Excel, MS Access, MySQL, SQLite,
 Oracle, etc. It can connect to databases using ODBC (Open
 Database Connectivity) and packages such as ROracle.
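As a minimal sketch of the analysis toolkit described above, the following uses the mtcars data set that ships with base R to aggregate, merge, and fit a regression. The names mean_mpg, cyl_labels and fit, and the size labels, are chosen here purely for illustration.

```r
# mtcars ships with base R, so no import step is needed for this sketch
data(mtcars)

# aggregate: mean miles-per-gallon for each number of cylinders
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# merge: attach an illustrative label to each cylinder count
cyl_labels <- data.frame(cyl = c(4, 6, 8),
                         size = c("small", "medium", "large"))
mean_mpg <- merge(mean_mpg, cyl_labels, by = "cyl")
print(mean_mpg)

# regression: model fuel consumption as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
```

The same three steps apply to data imported from a file or database once it has been read into a data frame.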
R Basics
 R as a calculator
• R can be used as a calculator. 
• You can type calculations directly at the prompt. After typing
 them, press Return to execute the calculation. 
• 2+1 # add 
• 2-1 # subtract 
• 2*1 # multiply 
• 2/1 # divide 
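• 2^2 # exponentiation (power)
• 2+2*3 # 8, multiplication is evaluated before addition
• (2+2)*3 # 12, parentheses are evaluated first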
 R contains many mathematical functions
• log(10) # natural logarithm, 2.3 
• log2(8) # 3 
• exp(2.3) # 9.97 
• sin(10) # -0.54 
• sqrt(9) # square root, 3
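A short sketch of how these functions combine: results can be stored in a variable and passed to other functions. The variable name x is an assumption made for this example.

```r
x <- log(10)                 # natural logarithm of 10
print(round(x, 1))           # 2.3

# exp() is the inverse of log(), so this recovers the original value
print(exp(x))

# a logarithm with an explicit base, and the base-10 shortcut
print(log(100, base = 10))
print(log10(100))

# sqrt() and ^2 undo each other for non-negative numbers
print(sqrt(9)^2)
```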
 Comparison operators
 == # is equal to
 >  # is larger than
 >= # is larger than or equal to
 <= # is smaller than or equal to
 != # is not equal to
Examples 
• 3==3 # TRUE 
• 2!=3 # TRUE 
• 2<=3 # TRUE
 Logical operators
Basic operators are
 & # and 
 |  # or (on some keyboard layouts, press Alt Gr and < simultaneously) 
Examples 
 2==3 | 3==3    # TRUE (TRUE if either statement is TRUE) 
 2==3 & 3==3    # FALSE (one of the statements is FALSE, so the result is FALSE)
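The comparison and logical operators above also work element by element on vectors. A minimal sketch, assuming a hypothetical vector of exam scores:

```r
scores <- c(45, 72, 88, 60)       # hypothetical exam scores

# element-wise comparison returns one logical value per element
passed <- scores >= 60
print(passed)                     # FALSE TRUE TRUE TRUE

# & combines two conditions element by element
mid_range <- scores >= 60 & scores < 80
print(mid_range)                  # FALSE TRUE FALSE TRUE

# a logical vector can be used to pick out the matching elements
print(scores[mid_range])          # 72 60
```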
Data Input
 Data Types
R has a wide variety of data types including
 Scalars
 Vectors (numerical, character, logical)
 Matrices
 Data Frames
 Lists
 Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
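Beyond picking elements by position, subscripts also accept ranges, negative indices and logical conditions. A brief sketch using the same numeric vector; the result names are chosen for illustration.

```r
a <- c(1, 2, 5.3, 6, -2, 4)    # numeric vector

first_three <- a[1:3]          # elements 1 to 3
without_first <- a[-1]         # a negative index drops that element
positives <- a[a > 0]          # a logical condition keeps matching elements

print(first_three)
print(without_first)
print(positives)
print(length(a))               # number of elements in the vector
```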
 Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
                   dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the
columns and rows.
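To illustrate the effect of byrow, the sketch below fills two matrices from the same values; the names by_col and by_row are assumptions made for this example.

```r
# default (byrow = FALSE): values fill the first column, then the next
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values fill the first row, then the next
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)

print(by_col[1, ])   # first row when filled by columns: 1 3 5
print(by_row[1, ])   # first row when filled by rows:    1 2 3
```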
# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                   dimnames=list(rnames, cnames))
# Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix y
y[3,] # 3rd row of matrix y
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
 Arrays
Arrays are similar to matrices but can have more than two
dimensions. See help(array) for details.
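A minimal sketch of a three-dimensional array; the name arr and the chosen dimensions are assumptions made for this example.

```r
# 24 values arranged as 2 rows x 3 columns x 4 layers
arr <- array(1:24, dim = c(2, 3, 4))

print(dim(arr))        # 2 3 4

# one subscript per dimension: row 2, column 3, layer 4
print(arr[2, 3, 4])    # 24

# leaving a subscript empty selects everything along that dimension
print(arr[, , 1])      # the first 2 x 3 layer
```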
 Data frames
A data frame is more general than a matrix, in that different columns can have
different modes (numeric, character, factor, etc.).
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") #variable names
There are a variety of ways to identify the elements of a data frame.
mydata[2:3] # columns 2 and 3 of the data frame
mydata[c("ID","Color")] # columns ID and Color from the data frame
mydata$Passed # variable Passed in the data frame
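Rows can be selected in a similar way, using a condition before the comma. The sketch below rebuilds a small data frame with the ID, Color and Passed columns so that it runs on its own; passed_rows and red_ids are names chosen for illustration.

```r
mydata <- data.frame(ID = c(1, 2, 3, 4),
                     Color = c("red", "white", "red", NA),
                     Passed = c(TRUE, TRUE, TRUE, FALSE))

str(mydata)                               # compact overview of columns and types

# rows where Passed is TRUE (all columns kept)
passed_rows <- mydata[mydata$Passed, ]
print(passed_rows)

# IDs of the red items; is.na() guards against the missing Color value
red_ids <- mydata$ID[mydata$Color == "red" & !is.na(mydata$Color)]
print(red_ids)                            # 1 3

# a single cell: row 2 of column Color
print(mydata[2, "Color"])
```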
 Lists
An ordered collection of objects (components). A list allows you to gather a variety of (possibly
unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of combining the components of two existing lists
# (list1 and list2) into a single list
v <- c(list1,list2)
Identify elements of a list using the [[]] convention.
w[[2]] # 2nd component of the list w
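The difference between double and single brackets can be sketched as follows. The list below is a shortened, self-contained variant of the earlier example, and the name sub is an assumption.

```r
w <- list(name = "Fred", mynumbers = c(1, 2, 5.3), age = 5.3)

# double brackets return the component itself
print(w[[2]])
print(w[["mynumbers"]])     # same component, selected by name
print(w$name)               # $ is a shorthand for named components

# single brackets return a list containing the selected component
sub <- w[2]
print(class(sub))           # "list"
print(class(w[[2]]))        # "numeric"
```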
 Factors
Tell R that a variable is nominal by making it a factor. The factor stores the nominal
values as a vector of integers in the range [1...k] (where k is the number of
unique values in the nominal variable), and an internal vector of character strings
(the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
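The internal integer codes can be inspected directly. A small sketch using the same gender variable; codes is a name chosen for illustration.

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))

# levels are sorted alphabetically, so "female" is level 1
print(levels(gender))          # "female" "male"

# as.integer() exposes the integer codes stored inside the factor
codes <- as.integer(gender)
print(table(codes))            # how often each code occurs

# summary() reports the count per level
print(summary(gender))
```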
Types of Data in Statistics
Numerical (discrete and continuous)
Categorical
Ordinal
Numerical Data is again of two types –
• Discrete
• Continuous.
Discrete data: represents items that can be
counted. Its possible values can be listed out.
The list of possible values may be finite or it
may go on without end (countably infinite).
Continuous data: represents measurements.
Its possible values cannot be counted; they
can only be described using intervals
on the real number line.
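In R, the distinction might be sketched as follows: discrete values can be tabulated directly, while continuous values are first grouped into intervals. The data, the variable names and the interval breaks are all hypothetical.

```r
# discrete: number of children per household (hypothetical counts)
children <- c(0L, 2L, 1L, 3L, 2L, 0L)
print(table(children))            # each possible value can be listed

# continuous: heights in cm (hypothetical measurements)
height_cm <- c(162.4, 175.1, 168.9, 181.3, 158.0, 170.2)

# cut() assigns each measurement to an interval on the number line
height_bins <- cut(height_cm, breaks = c(150, 160, 170, 180, 190))
print(table(height_bins))
```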
Categorical Data
Categorical Data is used to represent characteristics that
are present in the data such as a person’s gender, marital
status, hometown.
For example, in a given group of males and females, males
can be represented as 0 and females can be represented as
1. Therefore, we have two classes of distinct
characteristics.
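A minimal sketch of the 0/1 coding described above, using a factor to attach readable labels to the codes. The observations are hypothetical.

```r
# hypothetical coded observations: 0 = male, 1 = female
codes <- c(0, 1, 1, 0, 1)

# map each code to a label; the result is a categorical variable
gender <- factor(codes, levels = c(0, 1), labels = c("male", "female"))

print(gender)
print(table(gender))      # number of observations in each class
```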
Ordinal Data is similar to categorical data, with the only
difference that the categories have a natural order.
For example, rating a restaurant on a scale of 0 to 4 gives
us ordinal data. Ordinal data is often treated as categorical,
but the groups must be placed in order when creating
graphs and charts.
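In R, ordinal data can be represented with an ordered factor, which keeps the groups in order for tables and charts. A sketch with hypothetical restaurant ratings; rating_f is a name chosen for illustration.

```r
# hypothetical restaurant ratings on a scale of 0 to 4
ratings <- c(3, 0, 4, 2, 3, 1, 4)

# ordered = TRUE tells R that the levels have a natural order
rating_f <- factor(ratings, levels = 0:4, ordered = TRUE)
print(rating_f)

# ordered factors support comparisons between values
print(rating_f[1] > rating_f[2])   # TRUE, since 3 is above 0

# table() lists the groups in level order, ready for a bar chart
print(table(rating_f))
```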