PROBLEMS:
I. Shipments of Household Appliances: Line Graphs.
The file ApplianceShipments.xls contains the series of quarterly shipments (in million $) of U.S. household
appliances between 1985 and 1989 (data courtesy of Ken Black).
a. Create a well-formatted time plot of the data using Excel.
Script:
“This is a well-formatted time plot of the data, made in Excel. The Excel file contains a dataset
covering 1985 to 1989 with the series of quarterly shipments, in millions of dollars, of U.S.
household appliances. The first quarter started at 4,009 in 1985, increased to
4,595 in 1988, and then decreased to 4,245 in 1989. The second quarter started
at 4,321, increased to 4,806 after 2 years, decreased by 7 points, and then
increased by another 11 over the next 2 years. The third quarter started at 4,224 in 1985,
increased to 4,657 in 1986, decreased over the following 2 years, and increased
in the last year. The fourth quarter started at 3,944, then increased to 4,485, decreased to
4,258, and increased again to 4,533.”
b. Does there appear to be a quarterly pattern? For a closer view of the patterns, zoom in to the
range of 3500–5000 on the y axis.
Script:
“If we look at the plot more closely, within the range of 3,500 to 5,000, we can see the number of
shipments for each quarter over time, and the number increases in each quarter. Just look
at Quarter 1, then Quarter 2. In Quarter 3, however, after increasing for a year,
it decreased for the next 2 years and then rose again in the last year. As for Quarter 4, it
increased from 3,944 to 4,485, then dropped to 4,258, and then rose again to 4,533.”
c. Create four separate lines for Q1, Q2, Q3, and Q4, using Excel. In each, plot a line graph. In
Excel, order the data by Q1, Q2, Q3, Q4 (alphabetical sorting will work), and plot them as separate
series on the line graph. Zoom in to the range of 3500–5000 on the y axis. Does there appear to be
a difference between quarters?
Script:
“Each quarter fluctuates differently from year to year, depending on the number of
shipments. Because the lines are separate, the differences in the number of
shipments across quarters are visible, so it is easy to determine what pattern each quarter
follows from year to year.”
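The step of ordering the data by quarter and plotting each quarter as its own series can be sketched outside Excel as well. The following is a minimal Python sketch, not the method used for the plots above. It uses only the shipment values quoted in the scripts for parts a and b, so most years are absent; the name `series_by_quarter` is hypothetical.

```python
from collections import defaultdict

# (year, quarter, shipments in million $). Only values quoted in the
# scripts above are listed; the full file has a row for every quarter.
rows = [
    (1985, "Q1", 4009),
    (1985, "Q2", 4321),
    (1985, "Q3", 4224),
    (1985, "Q4", 3944),
    (1988, "Q1", 4595),
    (1989, "Q1", 4245),
]


def series_by_quarter(rows):
    """Return {quarter: [(year, value), ...]}, each series sorted by year."""
    series = defaultdict(list)
    for year, quarter, value in rows:
        series[quarter].append((year, value))
    # Sorting the keys mirrors the "alphabetical sorting" hint in the problem.
    return {q: sorted(points) for q, points in sorted(series.items())}


quarter_series = series_by_quarter(rows)
for quarter, points in quarter_series.items():
    print(quarter, points)
```

Each entry of `quarter_series` corresponds to one line on the chart, which is what makes the quarters directly comparable.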
d. Using Excel, create a line graph of the series at a yearly aggregated level (i.e., the total
shipments in each year).
Script:
“In 1985, total shipments were 16,498, and the total continued to increase for the
next two years up to 1987, reaching 18,335. In 1988, the value dropped to 18,069, and then
it increased again in the final year to 18,263.”
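As a check on the yearly aggregation, the 1985 total can be reproduced from the four 1985 quarterly values quoted in the part a script. This is a small Python sketch under that assumption; the other years are left out because the scripts do not list all of their quarterly values, and `yearly_totals` is a hypothetical helper name.

```python
from collections import defaultdict

# 1985 quarterly shipments (million $), as quoted in the part a script.
quarterly = [
    (1985, "Q1", 4009),
    (1985, "Q2", 4321),
    (1985, "Q3", 4224),
    (1985, "Q4", 3944),
]


def yearly_totals(rows):
    """Sum the quarterly values within each year."""
    totals = defaultdict(int)
    for year, _quarter, value in rows:
        totals[year] += value
    return dict(totals)


totals = yearly_totals(quarterly)
print(totals)  # {1985: 16498}, matching the total stated in the script
```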
e. Re-create the above plots using different forms of visualization. Make sure to enter the quarter
information in a format that is recognized by the software as a date.
Script:
“We can see the difference between a scatter plot and a clustered column chart when each is
used as a time plot. In terms of purpose, the chart on the left shows more of
the relationship between two variables over time, while the chart on the right compares
multiple categories of data over time. In terms of representation, the data on the left is shown as
dots, while on the right it is shown as separate columns. The scatter plot is used to identify
trends and patterns in the relationship between two variables, while the clustered column chart
is used to compare the performance of different categories of data over time.”
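The instruction to enter the quarter in a format the software recognizes as a date can be illustrated with a short conversion. The sketch below assumes the quarter labels look like `Q1-85`; that label format, the `century` parameter, and the function name `quarter_to_date` are assumptions for illustration, not something stated in the problem. It maps each quarter to the first day of that quarter.

```python
import re
from datetime import date


def quarter_to_date(label, century=1900):
    """Convert a label such as 'Q1-85' (assumed format) to the quarter's first day."""
    match = re.fullmatch(r"Q([1-4])-(\d{2})", label.strip())
    if match is None:
        raise ValueError(f"unrecognized quarter label: {label!r}")
    quarter = int(match.group(1))
    two_digit_year = int(match.group(2))
    first_month = 3 * (quarter - 1) + 1  # Q1 -> Jan, Q2 -> Apr, Q3 -> Jul, Q4 -> Oct
    return date(century + two_digit_year, first_month, 1)


print(quarter_to_date("Q1-85"))  # 1985-01-01
print(quarter_to_date("Q4-89"))  # 1989-10-01
```

Once the labels are real dates, charting software can space the points evenly along a time axis instead of treating them as text categories.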
f. Compare the two processes of generating the line graphs in terms of the effort as well as the
quality of the resulting plots. What are the advantages of each?
A well-formatted time plot is a line graph that shows data over time. The x-axis of the
graph represents time, and the y-axis represents the data values. A separate series on the line
graph is a line that represents a different set of data values. There are a number of advantages
to using a separate series on a line graph. First, it allows you to compare two or more different
data sets on the same graph. This can be helpful for identifying trends and relationships
between the data sets. Second, using a separate series can make it easier to see the details of
each data set. If you were to plot all of the data sets on the same line, it would be difficult to
see the individual trends and relationships. Finally, using a separate series can make your
graph more visually appealing. By using different colors and line styles for each series, you can
make it easier for viewers to distinguish between the different data sets.
Script:
*that is already the explanation itself, hehe ^^*
II. Sales of Riding Mowers: Scatterplots. A company that manufactures riding mowers wants to identify the best
sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying
households as prospective owners or non-owners on the basis of Income (in $1000s) and Lot Size (in 1000 ft²).
The marketing expert looked at a random sample of 24 households, included in the file RidingMowers.xls.
a. Using Excel, create a scatterplot of Lot Size vs. Income, color coded by the outcome variable
owner/non-owner. Make sure to obtain a well-formatted plot (remove excessive background and
gridlines; create legible labels and a legend, etc.). The result should be similar to Figure 1.
Script:
“The good thing about using a scatterplot is that it shows the relationship between
two variables and makes it easy to visualize how they are related to each other. Another
purpose is that scatterplots can show both linear and non-linear relationships between variables, which is
important because many relationships in the real world are non-linear. They can also show
outliers, so a scatterplot can be used to identify outliers in your data. Outliers are data
points that are significantly different from the rest of the data; they can be caused by errors in data
collection or entry, or they may represent genuine phenomena. Lastly, scatterplots are easy to interpret,
even for people who are not familiar with statistics, which makes them a good choice for
communicating data to a general audience.”
b. Create the same plot, this time using any form of visualization.
Script:
“There are two graphs above, which show the sum of income and the sum of lot size, both
by ownership. In the first graph, the totals are not very far apart: 688.8 for
non-owners compared with 953.7 for owners. In the lower graph it is much the same:
the totals are not very far apart, and they only look close because the chart focuses on a smaller range.”
c. Compare the two processes of generating the plot in terms of the effort as well as the quality
In terms of effort, creating a clustered column plot is generally easier than creating a
scatter plot, because a clustered column plot does not require specifying the color and
size of the data points. A scatter plot takes more effort: the color and size of the data points
must be specified, and a trendline may also need to be added to the plot.
In terms of quality, both clustered column plots and scatter plots can be used to create
informative and visually appealing plots. However, scatter plots are generally considered to be
more informative, especially when trying to identify trends in the data. This is because scatter
plots let us see the relationship between two variables without having to group the data
into categories. A clustered column plot would not show this trend as clearly, because
the data would be grouped into categories.
Therefore, if the goal is to identify trends in the data, it is generally better to use a scatter plot.
However, if the goal is simply to compare two different groups of data, a clustered column plot
may be sufficient.
Script:
*that is already the explanation itself, hehe (2) ^^*
III. Use the data for the breakfast cereals example to explore and summarize the data as follows:
a) Calculate the following summary statistics: mean, median, min, and max for each of the continuous
variables, and the count for each categorical variable. Which, if any of the variables is missing values?
Script:
“This statistical table summarizes the mean, median, minimum, maximum,
and count of each continuous variable in the dataset. Sodium has the
highest mean, at 160.68, while Cups per Serving has the lowest,
at 0.63. In terms of median, Sodium is again the highest at 180, while
Cups per Serving is 0.75. Carbs, Sugars, Potassium, and Cups per Serving all have
the same minimum, which is -1. Potassium has the highest maximum,
at 330. According to the generated table, no variables are missing
values.”
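The summary statistics described above can be sketched in Python with the standard `statistics` module. The values in `sample` are hypothetical placeholders, not taken from the cereals file, and `summarize` is a hypothetical helper. The optional `missing_code` argument reflects an assumption worth checking rather than a finding of the script: a minimum of -1 for a quantity such as grams of sugar may be a placeholder for a missing value, in which case a table that counts only blank cells would not report it.

```python
from statistics import mean, median


def summarize(values, missing_code=None):
    """Mean, median, min, max, and count, optionally skipping a placeholder code."""
    present = [v for v in values if v != missing_code]
    return {
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
        "count": len(present),
        "missing": len(values) - len(present),
    }


# Hypothetical values for one variable; -1 stands in for a possible missing code.
sample = [5, -1, 3, 10, 6]

as_is = summarize(sample)
skipping_code = summarize(sample, missing_code=-1)
print(as_is)
print(skipping_code)
```

Comparing the two results shows how much a placeholder value can shift the minimum and the mean if it is treated as real data.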
b) Use any charts or graphs to compare the calories in hot vs. cold cereals. What do the charts/graphs
show?
Script:
“The clustered column chart shows the sum of calories by type (hot or cold) of food. Cold
food has the most calories, with a value of 7,510, while hot food has the least, with a
value of 300. The graph has two bars: one for cold food and one for hot food. Visually,
the cold food bar is taller than the hot food bar, which indicates that cold food has
more calories than hot food.”
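One caveat, offered as a general observation rather than a finding of the script: a sum of calories depends on how many cereals fall into each type, so a per-type mean may give a fairer comparison when the groups differ in size. The sketch below uses hypothetical rows, not the actual cereals data, to show the two summaries side by side; `calories_by_type` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (type, calories) rows for illustration only.
cereals = [
    ("cold", 110),
    ("cold", 70),
    ("cold", 120),
    ("hot", 100),
]


def calories_by_type(rows):
    """Return the sum, count, and mean of calories for each cereal type."""
    groups = defaultdict(list)
    for kind, calories in rows:
        groups[kind].append(calories)
    return {
        kind: {"sum": sum(vals), "count": len(vals), "mean": sum(vals) / len(vals)}
        for kind, vals in groups.items()
    }


summary = calories_by_type(cereals)
for kind, stats in summary.items():
    print(kind, stats)
```

In this made-up example the cold group has three times the total of the hot group, yet both have the same mean, because the difference in totals comes entirely from the group sizes.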
c) Use any charts or graphs to plot a consumer rating as a function of the shelf height (the variable shelf).
If we were to predict consumer rating from shelf height, does it appear that we need to keep all three
categories of shelf height?
Script:
“If we were to predict consumer rating from shelf height, it does appear that we would
need to keep all three categories of shelf height. Why? Because the rating distribution
differs for each shelf height. The highest shelf height has the
highest ratings, while the lowest shelf height shows the
lowest ratings. If only one or two shelf heights are considered, the consumer rating
cannot be predicted accurately, because consumers have different preferences.”