Basics of R programming for analytics [Autosaved] (1).pdf

Basics of R programming for analytics Course code – PGP 207 PGP MCB 2023-25 Term: II

What is R
• R is a statistical programming environment: a place where you can both write code and do data analysis
• It is different from SPSS, SAS and other statistical packages
• You can use it for more than just data analysis
• R stores everything in the form of objects
• You can combine R with writing environments such as LaTeX and Markdown to write reports

Why Use R?
• It is a great resource for data analysis, data visualization, data science and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering and data reduction)
• It is easy to draw graphs in R, such as pie charts, histograms, box plots and scatter plots
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be used to solve different problems

Obtaining R
• The best way to obtain R is to visit the CRAN website: http://cran.r-project.org
• You will need Internet access to download the files
• Installation of R depends on the platform you have: select the appropriate binary version
• A binary version is the machine-coded (precompiled) version that installs R directly

Obtaining additional R Packages
• For working with R you will need additional packages
• Packages are combinations of data and functions
• Packages are kept in package repositories
• To use a package you have to install it and then load it
• Installing: install.packages("name of the package", repos = "", dependencies = TRUE)
• Loading: library(name of the package) or require(name of the package) [use either]
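A common pattern is to install a package only when it is missing and then load it. The sketch below assumes the package name is held in a variable called pkg (a name chosen for this example); "stats", which ships with R, is used as a stand-in so that nothing is downloaded.

```r
# Package to load; "stats" ships with R, so nothing is downloaded here
pkg <- "stats"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept a name stored in a variable
library(pkg, character.only = TRUE)
```

Replace "stats" with the package you actually need, for example "dplyr".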

Using R with an IDE
• It is always a good idea to use R with an integrated development environment (IDE)
• An IDE helps you write code and view the output at the same time
• You can also browse the objects, data and graphs in the IDE
• The IDE used in this set of exercises is RStudio
• RStudio is free and open, and you can download it from http://rstudio.com
• Download the RStudio Desktop version for use in these modules
• Install R first and then RStudio

Your Set up to get Started
• Source window: used to edit a script and run it.
• Console window: used to run particular commands or load particular packages.
• Workspace window: stores all the variables created while commands run, under the Environment tab.
• Plots and Files window: the Files tab is used to track the working directory; the Plots tab shows all graphical output.

What can we put in [>] and take out [<] from R?
• From Spreadsheets [ > ]
• Source Code Files [ > ]
• From other Software [ > ]
• Text Based Data [ > ] [ < ]
• Tables of Data [ > ] [ < ]
• Images [ < ]
• Dump Files [ < ]

Assignment 1 Find the answers to log2(2^5) and log(exp(1)*exp(1)).

Data frame in RStudio
ID <- c(1, 2, 3, 4, 5)
Name <- c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal")
English <- c(45, 65, 72, 80, 57)
Hindi <- c(65, 78, 56, 45, 48)
Science <- c(45, 55, 68, 74, 63)
So_Science <- c(58, 69, 63, 77, 52)
Math <- c(88, 63, 59, 70, 76)
Stu_marks <- data.frame(Name, English, Hindi, Science, So_Science, Math)
View(Stu_marks)
# extracting a single column from the given data frame
Stu_marks$Math
Stu_marks$Hindi

Create a new data frame with the columns: name, Computer_app, EVS
Then combine it with the existing data frame on their shared column. Enter the command:
New_df_name <- merge(df1, df2, by = "names")
View(New_df_name)
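A minimal sketch of how merge() joins two data frames on a shared key column. The data frames df1 and df2 and the Computer_app and EVS values are made up for illustration; only the student names and Math marks come from the earlier slide.

```r
# Two small data frames that share the key column "Name"
df1 <- data.frame(Name = c("Ramesh", "Kaushik", "Chaitali"),
                  Math = c(88, 63, 59))

# Illustrative (made-up) marks, deliberately in a different row order
df2 <- data.frame(Name = c("Chaitali", "Ramesh", "Kaushik"),
                  Computer_app = c(70, 81, 64),
                  EVS = c(55, 62, 73))

# merge() matches rows on the key column, regardless of row order
merged <- merge(df1, df2, by = "Name")
print(merged)
```

The value passed to by must be a column name that exists in both data frames.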

Packages in R
1. A package is a collection of R functions, compiled code and sample data.
2. Packages are stored under a directory called library in the R environment.
By default, R installs a set of packages. To see the packages installed in R, enter the following command in the console window:
> library()
> fraction (firstVar/secondVar)

Introduction to R script
An R script is a plain text file in which you can store your R code. A script lets you show your work to others and also reproduce and modify the results.
How to find the current working directory? In the console window write:
> getwd()
The current working directory is shown in the output.
How to set the current working directory? Pass the folder path to setwd() (the path below is a placeholder):
> setwd("path/to/your/folder")
How to read and store a "csv" file in R? Type the following command in the console window:
file_name <- read.csv("file_name.csv")
To view the file, enter the command:
View(file_name)
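To try read.csv() without a data file at hand, one option is to write a small data frame to a temporary file and read it back. The names scores, csv_path and scores_in, and the values themselves, are invented for this sketch.

```r
# Write a small data frame to a temporary csv file
# (tempfile() avoids touching your own folders)
scores <- data.frame(id = 1:3, score = c(45, 65, 72))
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Read it back; read.csv() treats the first line as the header by default
scores_in <- read.csv(csv_path)
str(scores_in)
```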

How to create a data frame in R?
names <- c("Rohit", "Dhoni", "Virat", "Hardik", "KL Rahul", "Bumrah")
played <- c(45, 49, 47, 47, 40, 25)
won <- c(22, 21, 14, 9, 9, 8)
lost <- c(12, 13, 14, 8, 19, 6)
y <- c(2008, 2004, 2007, 2009, 2010, 2010)
cricket_players <- data.frame(names, played, won, lost, y)
View(cricket_players)
You can access parts of the data frame with the following commands:
cricket_players$names
cricket_players$won

Suppose we want to find the ratio of games won to games played:
ratio <- cricket_players$won / cricket_players$played
The ratio is stored in a new column called "victory":
cricket_players$victory <- ratio
To reduce the number of digits shown in the victory column:
options(digits = 2)
View(cricket_players)
mean(cricket_players$played)
# names is a character column, so convert it to a factor before plotting
plot(factor(cricket_players$names), cricket_players$played)

Inputting a Source File
A source file contains all the code that you need to run your analyses. It is used to input data and commands to R. You ask R to run your code by typing:
source("file.R")
Remember to save the code with the extension ".R".

Code to read data from console to R
mylar <- scan("", what = numeric())
▪ Reads directly from the console
▪ Saves the numbers to a variable

Code to read data from text files
• Use read.csv() to read comma-separated value (csv) files
• You need to indicate whether the file has a header
• Here we have set the variable names manually
mydata <- read.csv("DOB.csv", header = TRUE, sep = ",")
names(mydata) <- c("Id", "Time", "DOB")

SUGGESTED TEXT BOOKS
• Grolemund, Garrett, Hands-On Programming with R: Write Your Own Functions and Simulations, Mumbai: Shroff Publishers & Distributors
• Chambers, John M., Software for Data Analysis: Programming with R, USA: Springer
E-Resources
• https://www.tutorialspoint.com/r/index.htm
• https://www.w3schools.com/r/r_intro.asp
• https://www.javatpoint.com/r-tutorial

Comments in R
Comments can be used to explain R code and to make it more readable. They can also be used to prevent execution when testing alternative code. Comments start with a #. When executing code, R ignores everything from the # to the end of the line.
Example: a comment before a line of code:
# This is a comment
"Hello World"
Example: a comment at the end of a line of code:
"Hello World" # This is a comment
Comments do not have to be text that explains the code; they can also be used to prevent R from executing code:
# "Good morning!"
"Good night!"

Reserved Words in R
Reserved words in R programming are a set of words that have special meaning and cannot be used as an identifier (variable name, function name, etc.).
Reserved words in R:
if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_, ...

Identifiers in R
Variables in R
Variables are used to store data whose value can be changed according to our need. The unique name given to a variable (or to a function or object) is its identifier.
Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. An identifier must start with a letter or a period. If it starts with a period, the period cannot be followed by a digit.
3. Reserved words in R cannot be used as identifiers.

Valid identifiers in R
Total, sum, fine.with.dot, Number5, this_is_acceptable
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered. The basic types of constant are numeric constants and character constants.
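One way to check these rules is with base R's make.names(), which rewrites a string into a syntactically valid name. The helper is_valid_name below is a name invented for this sketch, not a built-in function.

```r
# make.names() rewrites a string into a syntactically valid name,
# so a string that comes back unchanged was already valid
is_valid_name <- function(x) make.names(x) == x

valid   <- c("Total", "sum", "fine.with.dot", "Number5", "this_is_acceptable")
invalid <- c("tot@l", "5um", "_fine", "TRUE")

print(is_valid_name(valid))    # all TRUE
print(is_valid_name(invalid))  # all FALSE
```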

Data cleaning in R
Here we are using the Excel file "Data cleaning in R", imported into a data frame called marks (object names in R cannot contain spaces).
To view the first 5 observations, the command is:
head(marks, 5)
Handling missing values in R
mean(marks$Test1)
mean(marks$Test2)
mean(marks$Test3)
# mean() returns NA if the column has missing values; na.rm = TRUE ignores them
mean(marks$Test1, na.rm = TRUE)
summary(marks)
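The effect of missing values on mean() can be seen on a tiny example. The data frame marks below holds made-up values and stands in for the imported file.

```r
# Illustrative marks with one missing value (NA)
marks <- data.frame(Test1 = c(10, NA, 14),
                    Test2 = c(12, 15, 9))

mean(marks$Test1)                # NA, because one value is missing
mean(marks$Test1, na.rm = TRUE)  # 12, the mean of 10 and 14
colSums(is.na(marks))            # number of missing values per column
```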

Importing an Excel file
To install and load the "xlsx" package:
install.packages("xlsx")
library("xlsx")
Reading an Excel file
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)

class(file_name)
typeof(file_name)
To access the top two rows of a data frame:
head(dataframe, 2)
To access the bottom two rows:
tail(dataframe, 2)
To see the structure of the data frame:
str(dataframe)

Matrix in R
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
mat
mat[1, 2]
mat[, 2]
mat[1, ]
mat[2, ]
stringmatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- cbind(stringmatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix
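When reading the indexing results, it helps to know that matrix() fills column by column by default. A short sketch; mat_by_row is a name introduced here for comparison.

```r
# matrix() fills column by column unless byrow = TRUE is given
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
mat[1, 2]   # row 1, column 2 -> 3
dim(mat)    # 2 3

# The same values filled row by row
mat_by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
mat_by_row[1, 2]   # 2
```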

Data Visualization
A histogram is
• a visual representation of the distribution of a dataset
• used to plot the frequency of score occurrences in a continuous dataset
We work on the movies dataset, with file name: moviesData.csv
The script used here is myPlot.R
To plot a histogram, type the following command:
hist(movies$runtime)
To add labels and colour to the histogram, we pass more arguments:
hist(movies$runtime, main = "Distribution of movies' length", xlab = "Runtime of movies", xlim = c(0, 300), col = "blue", breaks = 4)

Pie chart
• It is a circular chart
• It is divided into wedge-like sectors, illustrating proportion
• The total value of the pie chart is always 100 percent
In the movies dataset, we make a pie chart of the column "genre". For that, we first make a frequency table of the column:
genreCount <- table(movies$genre)
View(genreCount)
pie(genreCount, main = "Proportion of movies' genre", border = "blue", col = "orange")

Bar Chart
A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts.
We plot a bar chart of the column imdb_rating from the movies dataset; for the sake of simplicity we take only the first 20 observations.
moviesSub <- movies[1:20, ]
barplot(moviesSub$imdb_rating, ylab = "IMDB Rating", xlab = "Movies", col = "blue", ylim = c(0, 10), main = "Movies' IMDB Rating")

Continuing from the previous slide, we add the movie names on the x-axis:
barplot(moviesSub$imdb_rating, ylab = "IMDB Rating", xlab = "Movies", col = "blue", ylim = c(0, 10), main = "Movies' IMDB Rating", names.arg = moviesSub$title)
In the output, not all names are visible. To fix that, we draw the names perpendicular to the x-axis with las = 2:

barplot(moviesSub$imdb_rating, ylab = "IMDB Rating", xlab = "Movies", col = "blue", ylim = c(0, 10), main = "Movies' IMDB Rating", names.arg = moviesSub$title, las = 2)

Let us analyse the relation between "imdb_rating" and "audience_score". For this we draw a scatter plot using the plot() function.
A scatter plot is a graph in which the values of two variables are plotted along two axes. The pattern of the resulting points reveals the correlation.
plot(x = movies$imdb_rating, y = movies$audience_score, main = "IMDB Ratings vs Audience Score", xlab = "IMDB Rating", ylab = "Audience Score", xlim = c(0, 10), ylim = c(0, 100), col = "blue")

Now, we will see the correlation between imdb_rating and audience_score:
cor(movies$imdb_rating, movies$audience_score)
Output: 0.8651485
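The same function can be tried on a dataset that ships with R. The sketch below uses the built-in mtcars data rather than the movies file, and the vectors x and y are made up to show how missing values can be handled.

```r
# Correlation between car weight and fuel efficiency in the built-in mtcars data
r <- cor(mtcars$wt, mtcars$mpg)
print(r)   # negative: heavier cars tend to have lower mpg

# With missing values, cor() returns NA unless told how to handle them
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 8)
r_complete <- cor(x, y, use = "complete.obs")   # uses only the complete pairs
print(r_complete)
```

A value close to +1 or -1 indicates a strong linear relation; a value near 0 indicates little or none.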

Box Plot
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is:
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used:
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set it to TRUE to draw a notch.
• varwidth is a logical value. Set it to TRUE to draw the width of each box proportionate to the sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.

boxplot(mtcars$mpg)
boxplot(mtcars$mpg, main = "Mileage Data Boxplot", ylab = "Miles Per Gallon(mpg)", xlab = "No. of Cylinders", col = "orange")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon", main = "Mileage Data")

Introduction to ggplot2
• Visualization is an important tool for insight generation
• It is used to understand the data structure, identify outliers and find patterns
There are two methods of data visualization in R:
• Basic Graphics
• Grammar of graphics (popularly known as ggplot2)
Basic Graphics
Following is the code for a "sin" curve:
# x and y are not defined on the slide; one possible choice is
x <- seq(-pi, pi, 0.1)
y <- sin(x)
plot(x, y, main = "Plotting sin curve", ylab = "sin(x)")
Now, we will learn how to change the type of the curve:
plot(x, y, main = "Plotting sin curve", ylab = "sin(x)", type = "l", col = "blue")

To plot the "cosine" and "sin" curves on the same plot:
plot(x, sin(x), main = "Two Graphs in one plot", ylab = "", type = "l", col = "blue")
lines(x, cos(x), col = "red")
Here, we use "legend" to differentiate between the two graphs:
plot(x, sin(x), main = "Two Graphs in one plot", ylab = "", type = "l", col = "blue")
lines(x, cos(x), col = "red")
legend("topleft", c("sin(x)", "cos(x)"), fill = c("blue", "red"))

ggplot2 graphics
• The ggplot2 package was created by Hadley Wickham in 2005
• It offers a powerful graphics language for creating elegant and complex plots
• We will use the "movies" dataset for exploring the "ggplot2" package
library(ggplot2)
View(movies)
Now, we want to draw a scatter plot of "critics_score" against "audience_score". A ggplot2 plot is built from three components:
1. Data
2. Aesthetics
3. Geometry

ggplot(data = movies, mapping = aes(x=critics_score, y=audience_score))+ geom_point()

There is a positive correlation between critics_score and audience_score.
How do we save the ggplot2 graph in our current working directory? Use the ggsave() function:
ggsave("scatter_plot.png")

Aesthetic mapping in ggplot2
We will learn:
1. What an aesthetic is
2. How to create plots using aesthetics
3. Tuning parameters in aesthetics

What is Aesthetic
• An aesthetic is a visual property of the objects in a plot
• It includes lines, points, symbols, colors and positions
• It is used to add customization to our plots
# Load ggplot2
library(ggplot2)
# Clear R workspace
rm(list = ls())
# Declare a variable to read and store moviesData
movies <- read.csv("moviesData.csv")
# View movies data frame
View(movies)
# Plot critics_score and audience_score
ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score)) + geom_point()

Now, we assign a unique color to each value of the "genre" column:
ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score, color = genre)) + geom_point()
How to draw a bar chart using the ggplot function
The following code shows the type of the column "mpaa_rating" and the distinct values in this column:
str(movies$mpaa_rating)
# factor() is needed because read.csv() may store the column as character
levels(factor(movies$mpaa_rating))
ggplot(data = movies, mapping = aes(x = mpaa_rating)) + geom_bar()
Next, we will learn how to add labels to this bar chart.

ggplot(data = movies, mapping = aes(x = mpaa_rating, fill = genre)) + geom_bar() + labs(y = "Rating counts", title = "Count of mpaa rating")
Now we draw a histogram for the variable "runtime":
# Histogram for "runtime"
ggplot(data = movies, mapping = aes(x = runtime)) + geom_histogram() + labs(x = "Runtime of Movies", title = "Distribution of Runtime")

Data manipulation using dplyr package
• "dplyr" is a package for data manipulation, written and maintained by Hadley Wickham
• It comprises many functions that perform the most commonly used data manipulation operations
# Clear R workspace
rm(list = ls())
# Declare a variable to read and store movies data
movies <- read.csv("moviesData.csv")
View(movies)

Now we install and load the "dplyr" package:
install.packages("dplyr")
library(dplyr)
Key functions in the "dplyr" package
• filter: to select cases based on their values
• arrange: to reorder the cases
• select: to select variables based on their names
• mutate: to add new variables that are functions of existing variables
• summarise: to condense multiple values to a single value
All these functions can be combined with the group_by function, which allows us to perform any operation by group.

# Clear R workspace
rm(list = ls())
# Declare a variable to read and store movies data
movies <- read.csv("moviesData.csv")
View(movies)
# using the "filter" function, keep only the rows whose "genre" is Comedy
moviesComedy <- filter(movies, genre == "Comedy")
View(moviesComedy)
# keep the rows whose "genre" is either Comedy or Drama
moviesComedyDr <- filter(movies, genre == "Comedy" | genre == "Drama")
View(moviesComedyDr)

# Species names in iris are lower case ("setosa"), and comparisons are case-sensitive
irisspecies <- filter(iris, Species == "setosa")
View(irisspecies)
irisspecies <- filter(iris, Species == "setosa" | Petal.Length >= 1.5)
View(irisspecies)
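Row conditions like these can also be tried without dplyr, using base R's subset() on the built-in iris data. This is a base-R stand-in for filter(), not the dplyr call itself; the object names setosa, either and both are chosen for this sketch.

```r
# Rows of the built-in iris data where Species is "setosa" (level names are lower case)
setosa <- subset(iris, Species == "setosa")
nrow(setosa)    # 50

# | keeps rows that satisfy either condition
either <- subset(iris, Species == "setosa" | Petal.Length >= 1.5)

# & keeps only rows that satisfy both conditions
both <- subset(iris, Species == "setosa" & Petal.Length >= 1.5)

c(either = nrow(either), both = nrow(both))

# A wrongly capitalised value matches nothing
nrow(subset(iris, Species == "Setosa"))   # 0
```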

# filter the movies data by genre "Comedy" having "imdb_rating" greater than or equal to 7.5
moviesComedyIm <- filter(movies, genre == "Comedy" & imdb_rating >= 7.5)
View(moviesComedyIm)
# using the "arrange" function, sort by imdb_rating in ascending order
moviesImA <- arrange(movies, imdb_rating)
View(moviesImA)

install.packages("dplyr")
library(dplyr)
data(iris)
View(iris)
iris_pet_arr <- arrange(iris, Petal.Length)
View(iris_pet_arr)

# using the "arrange" function, sort by imdb_rating in descending order
moviesImD <- arrange(movies, desc(imdb_rating))
View(moviesImD)
# sort by "genre" in alphabetical order and, within each genre, by "imdb_rating" in ascending order
moviesGeIm <- arrange(movies, genre, imdb_rating)
View(moviesGeIm)

More functions in the "dplyr" package
1. select
2. rename
3. mutate
Here, we use the myVis.R script, stored in the myVis folder that contains moviesData; set the myVis folder as the working directory.
Before using the above functions, install the package "dplyr".

# using the select function from the dplyr package
moviesTGI <- select(movies, title, genre, imdb_rating)
View(moviesTGI)
Let us select the three columns "thtr_rel_year", "thtr_rel_month" and "thtr_rel_day" along with the "title" column. For that, enter the following command in the console window:
moviesTHT <- select(movies, title, starts_with("thtr"))
View(moviesTHT)

Let us change the name of the column "thtr_rel_year" using the "rename" function:
moviesR <- rename(movies, rel_year = "thtr_rel_year")
View(moviesR)
Suppose we want to add a new variable (column) to the movies dataset. For that we use the "mutate" function:
moviesLess <- select(movies, title:audience_score)
View(moviesLess)
# use of the mutate function
moviesMu <- mutate(moviesLess, criAud = critics_score - audience_score)
View(moviesMu)

Pipe operator
We will learn about:
1. The summarise and group_by functions
2. Operations in the summarise function
3. The pipe operator
Make a folder named "pipeops" in the myproject folder and set "pipeops" as the working directory.

Summarise function
1. The summarise function reduces a data frame to a single row.
2. It gives summaries such as the mean or median of the variables available in the data frame.
3. We often use summarise along with the group_by function.
# use of the summarise function
summarise(movies, mean(imdb_rating))
When we use the group_by function, the data frame is divided into groups. Next, we group the movies by the "genre" variable using group_by.

# use of the group_by function
group_Movies <- group_by(movies, genre)
# using the summarise function on the grouped data gives one mean per genre
summarise(group_Movies, mean(imdb_rating))
Now, we use the filter, group_by and summarise functions to find the mean imdb_rating of drama movies for each mpaa_rating:
dramaMov <- filter(movies, genre == "Drama")
gr_dramaMov <- group_by(dramaMov, mpaa_rating)
summarise(gr_dramaMov, mean(imdb_rating))
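The idea of a grouped summary can also be tried in base R on the built-in mtcars data, without dplyr. This is a base-R analogue of group_by() followed by summarise(), shown under that assumption; mpg_by_cyl is a name chosen for the sketch.

```r
# Mean miles per gallon for each number of cylinders in the built-in mtcars data
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# tapply() gives the same group means as a named vector
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(group_means)
```

Each row of the result corresponds to one group, just as each row of a grouped summarise() result does.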

Pipe operator
• The pipe operator is written %>%
• It saves us from creating unnecessary intermediate data frames
• We can read a pipe as a series of imperative statements
If we want to find the cosine of the sine of pi, we can write:
pi %>% sin() %>% cos()
We will now repeat the analysis above using the pipe operator:
movies %>%
  filter(genre == "Drama") %>%
  group_by(mpaa_rating) %>%
  summarise(mean(imdb_rating))

Let us find the difference between "critics_score" and "audience_score" in the movies data frame. We use a box plot for this; with the pipe operator we combine functions from the "ggplot2" and "dplyr" packages:
movies %>%
  mutate(diff = audience_score - critics_score) %>%
  ggplot(mapping = aes(x = genre, y = diff)) +
  geom_boxplot()
Now, we count the number of movies in each combination of genre and mpaa_rating:
movies %>%
  group_by(genre, mpaa_rating) %>%
  summarise(num = n())

Conditional statements
We will learn:
1. Conditional statements
2. if, else and else if statements
Conditional statements are used to run different code depending on whether a logical condition is TRUE or FALSE. if, else and else if are the basic conditional statements.
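A minimal sketch of if, else if and else working together. The function name grade_mark and the cut-off values are invented for illustration.

```r
# Classify a mark with if, else if and else (the cut-offs are illustrative)
grade_mark <- function(mark) {
  if (mark >= 75) {
    "Distinction"
  } else if (mark >= 40) {
    "Pass"
  } else {
    "Fail"
  }
}

grade_mark(80)   # "Distinction"
grade_mark(57)   # "Pass"
grade_mark(20)   # "Fail"
```

The conditions are checked from top to bottom, and only the first branch whose condition is TRUE is run.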

Statistical function for data analysis Data Set A data set is a collection of data, often presented in a table. There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road Tests), which is retrieved from the 1974 Motor Trend US Magazine. In the examples below (and for the next chapters), we will use the mtcars data set, for statistical purposes:

To get the built-in data sets in R
data()
data(mtcars)
View(mtcars)
head(mtcars, 6)
head(mtcars)
nrow(mtcars)
ncol(mtcars)
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data set:
# Use the question mark to get information about the data set
?mtcars

Get Information
Use the dim() function to find the dimensions of the data set, and the names() function to view the names of the variables:
Example
# create a variable of the mtcars data set for better organization
Data_Cars <- mtcars
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)

Sort Variable Values
To sort the values, use the sort() function:
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
Analyzing the Data
Now that we have some information about the data set, we can start to analyze it with some statistical numbers. For example, we can use the summary() function to get a statistical summary of the data:
Data_Cars <- mtcars
summary(Data_Cars)
sd(mtcars$cyl)

Statistical functions in R
Mean, Median, and Mode
In statistics, there are often three values that interest us:
• Mean: the average value
• Median: the middle value
• Mode: the most common value
Data_Cars <- mtcars
mean(Data_Cars$wt)

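Median
The median value is the value in the middle, after you have sorted all the values. If we take a look at the values of the wt variable (from the mtcars data set), we will see that there are two numbers in the middle; median() returns their average:
Data_Cars <- mtcars
median(Data_Cars$wt)
When a column has missing values, as in the marks data, use na.rm = TRUE or drop the incomplete rows:
mean(marks$Test1)
mean(marks$Test1, na.rm = TRUE)
# na.omit() drops every row that contains a missing value
d1 <- na.omit(old_filename)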

Mode
The mode value is the value that appears the most times. R does not have a built-in function to calculate the mode. However, we can create our own function to find it.
If we take a look at the values of the wt variable (from the mtcars data set), we will see that the value 3.440 appears most often:
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
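One possible way to wrap this in a reusable function is sketched below. The name get_mode is invented for this example; it is not a base R function.

```r
# Return the most frequent value of a vector
# (if several values tie, the first in sorted order is returned)
get_mode <- function(v) {
  counts <- table(v)
  # names(counts) are character, so convert back for numeric input
  most_common <- names(counts)[which.max(counts)]
  if (is.numeric(v)) as.numeric(most_common) else most_common
}

get_mode(mtcars$wt)               # 3.44
get_mode(c("a", "b", "b", "c"))   # "b"
```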

http://www.sthda.com/english/wiki/ggplot2-essentials (this website gives the details of the ggplot2 package)
https://bookdown.org/jeffreytmonroe/business_analytics_with_r7/basics.html
https://www.geeksforgeeks.org/packages-in-r-programming/?ref=lbp
https://www.modernstatisticswithr.com/datachapter.html
https://www.w3schools.com/r/r_stat_data_set.asp
https://www.geeksforgeeks.org/r-keywords/?ref=lbp
