library("lessR")
lessR provides many versions of a scatter plot with its Plot()
function. To illustrate, first read the Employee data included as part of lessR.
d <- Read("Employee")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 15 ... 1 2 10
## 2 Gender character 37 0 2 M M M ... F F M
## 3 Dept character 36 1 5 ADMN SALE SALE ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low low ... high low high
## 6 Plan integer 37 0 3 1 1 3 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 96 ... 83 59 80
## 8 Post integer 37 0 22 92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
The typical scatterplot of two continuous variables.
Plot(Years, Salary)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
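The correlation summary printed with the scatterplot can be cross-checked in base R. The following is a minimal sketch with made-up data: the vectors years and salary are hypothetical, not the Employee data, and the sketch assumes the reported test is the usual t-test of a zero Pearson correlation, which is what base R cor.test() computes.

```r
# Hypothetical data, for illustration only (not the Employee data)
set.seed(1)
years  <- c(2, 5, 7, 10, 12, 15, 18, 21)
salary <- 45000 + 3000 * years + rnorm(length(years), sd = 8000)

ct <- cor.test(years, salary)   # Pearson correlation by default

r  <- unname(ct$estimate)       # sample correlation
df <- unname(ct$parameter)      # degrees of freedom, n - 2

# The t statistic follows directly from r and df
t_manual <- r * sqrt(df) / sqrt(1 - r^2)

round(c(r = r, t = unname(ct$statistic), t_manual = t_manual), 3)
ct$conf.int                     # 95% confidence interval by default
```

The two t values agree, which shows that the hypothesis test line carries no information beyond r and the number of paired values.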
Request the enhanced scatterplot by setting the enhance parameter to TRUE. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, the least-squares regression line with its 95% confidence interval, and the corresponding regression line with the outliers removed.
Plot(Years, Salary, enhance=TRUE)
## [Ellipse with Murdoch and Chow's function ellipse from the ellipse package]
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
## >>> Outlier analysis with Mahalanobis Distance
##
## MD ID
## ----- -----
## 8.14 Correll, Trevon
## 7.84 Capelle, Adam
##
## 5.63 Korhalkar, Jessica
## 5.58 James, Leslie
## 3.75 Hoang, Binh
## ... ...
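For readers who want to see what a Mahalanobis distance measures, here is a minimal base R sketch. The data are hypothetical, and the sketch makes no claim about how lessR computes or scales the MD column above. Note that base R mahalanobis() returns the squared distance.

```r
# Hypothetical bivariate data, for illustration only
set.seed(2)
xy <- cbind(x = rnorm(30, mean = 10, sd = 3),
            y = rnorm(30, mean = 70, sd = 15))
xy[30, ] <- c(25, 150)          # plant one extreme point in the last row

# Squared Mahalanobis distance of each row from the centroid
md2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# Row numbers ordered from most to least extreme
head(order(md2, decreasing = TRUE))
```

Unlike a simple distance from the means, this measure accounts for the spread of each variable and for the correlation between them, which is why it suits the elliptical shape of a scatterplot.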
Plot the scatterplot with the non-linear best-fit "loess" line. The three available values for the fit line are "loess" for non-linear, "lm" for linear, and "null" for the null model line, the flat line at the mean of \(y\). Also, setting fit to TRUE plots the "loess" line.
For emphasis, set plot_errors to TRUE to plot the residuals from the line.
Plot(Years, Salary, fit="loess", plot_errors=TRUE)
## >>> Suggestions
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
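The three kinds of fit line can be compared numerically with base R. This is a sketch under stated assumptions: the data are hypothetical, and lm() and loess() with their default settings stand in for whatever settings Plot() uses internally.

```r
# Hypothetical data with a curved trend, for illustration only
set.seed(3)
x <- seq(1, 20, length.out = 40)
y <- 50 + 12 * sqrt(x) + rnorm(40, sd = 2)

fit_null  <- lm(y ~ 1)        # null model: flat line at the mean of y
fit_lm    <- lm(y ~ x)        # least-squares straight line
fit_loess <- loess(y ~ x)     # local regression, default span

# Sum of squared residuals for each fit
sse <- c(null  = sum(residuals(fit_null)^2),
         lm    = sum(residuals(fit_lm)^2),
         loess = sum(residuals(fit_loess)^2))
round(sse, 1)
```

The residuals summed here are the same vertical distances that plot_errors draws from each point to the fitted line.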
Two categorical variables result in a bubble plot of their joint frequencies.
Plot(Dept, Gender)
## >>> Suggestions
## Plot(Dept, Gender, size_cut=FALSE)
## Plot(Dept, Gender, trans=.8, bg="off", grid="off")
## SummaryStats(Dept, Gender) # or ss
##
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## F 3 4 1 5 5 18
## M 2 2 3 1 10 18
## Sum 5 6 4 6 15 36
##
##
## Cramer's V: 0.415
##
## Chi-square Test: Chisq = 6.200, df = 4, p-value = 0.185
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
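The chi-square statistic and Cramer's V reported above can be reproduced in base R from the joint frequencies. The sketch below copies the counts from the output and assumes the standard definition of Cramer's V, the square root of chi-square divided by n times one less than the smaller table dimension.

```r
# Joint frequencies copied from the output above
tab <- matrix(c(3, 4, 1, 5, 5,
                2, 2, 3, 1, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("F", "M"),
                              Dept = c("ACCT", "ADMN", "FINC", "MKTG", "SALE")))

# chisq.test() also warns about low expected frequencies for this table
chi <- suppressWarnings(chisq.test(tab))

n <- sum(tab)                 # total count
k <- min(dim(tab)) - 1        # smaller table dimension minus one
cramers_v <- sqrt(unname(chi$statistic) / (n * k))

round(c(chisq = unname(chi$statistic), V = cramers_v, p = chi$p.value), 3)
```

The rounded results match the values shown: 6.200 for chi-square, 0.415 for Cramer's V, and 0.185 for the p-value.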
A categorical variable plotted with a continuous variable results in the mean of the continuous variable displayed at each level of the categorical variable.
Plot(Gender, Salary)
## >>> Suggestions
## Plot(Gender, Salary, means=FALSE) # do not plot means
## Plot(Gender, Salary, stat="mean") # only plot means
## ttest(Salary ~ Gender) # inferential analysis
##
##
## Salary
## - by levels of -
## Gender
##
## n miss mean sd min mdn max
## F 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
Map a continuous variable, such as Pre, to the size of the points with the size parameter, which yields a bubble plot.
Plot(Years, Salary, size=Pre)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
Plot the levels of the categorical variable Gender with the by parameter.
Plot(Years, Salary, by=Gender)
The categorical variable can also generate Trellis plots with the by1 parameter.
Plot(Years, Salary, by1=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]
Add a second categorical variable to the Trellis plot with the by2 parameter. Both can also be combined with the by parameter. Here also request the linear model fit line with fit set to "lm".
Plot(Years, Salary, by1=Dept, by2=Gender, by=Plan, fit="lm")
## [Trellis graphics from Deepayan Sarkar's lattice package]
Indicate multiple variables to plot along either axis with a vector defined according to the base R function c(). Plot the linear model for each variable according to the fit parameter set to "lm". Turn off the confidence interval by setting the standard errors to zero with fit_se set to 0.
Plot(c(Pre, Post), Salary, fit="lm", fit_se=0)
## >>> Suggestions
## Plot(c(Pre, Post), Salary, out_cut=.10) # label top 10% potential outliers
## Plot(c(Pre, Post), Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 37
##
##
## Sample Correlation of Pre and Salary: r = -0.007
##
##
## Hypothesis Test of 0 Correlation: t = -0.043, df = 35, p-value = 0.966
## 95% Confidence Interval for Correlation: -0.330 to 0.318
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 37
##
##
## Sample Correlation of Post and Salary: r = -0.070
##
##
## Hypothesis Test of 0 Correlation: t = -0.416, df = 35, p-value = 0.680
## 95% Confidence Interval for Correlation: -0.385 to 0.260
Three or more variables can plot as a scatterplot matrix. For the first parameter value, pass a single vector such as defined by c(). Request the non-linear fit line by specifying TRUE or "loess" for the fit parameter. Request a linear fit line with the value of "lm".
Plot(c(Salary, Years, Pre, Post), fit=TRUE)
For more than 2500 points, Plot() smooths the scatterplot by default. Generate random data with base R rnorm(), then plot. Plot() also plots these variables from the global environment (workspace) instead of from a data frame.
x <- rnorm(5000)
y <- rnorm(5000)
Plot(x, y)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(x, y, out_cut=.10) # label top 10% potential outliers
## Plot(x, y, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 5000
##
##
## Sample Correlation of x and y: r = -0.003
##
##
## Hypothesis Test of 0 Correlation: t = -0.211, df = 4998, p-value = 0.833
## 95% Confidence Interval for Correlation: -0.031 to 0.025
The individual points superimposed on the smoothed plot are potential outliers. The default number plotted is 100. Turn them off completely by setting the parameter smooth_points to 0.
Another option is to turn smoothing off, with the smooth parameter set to FALSE, and then turn on a high level of transparency, setting trans to 0.97. Here change the theme for this plot to slatered with the theme parameter.
Plot(x, y, smooth=FALSE, trans=.97, theme="slatered")
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(x, y, out_cut=.10) # label top 10% potential outliers
## Plot(x, y, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 5000
##
##
## Sample Correlation of x and y: r = -0.003
##
##
## Hypothesis Test of 0 Correlation: t = -0.211, df = 4998, p-value = 0.833
## 95% Confidence Interval for Correlation: -0.031 to 0.025
The default plot for a single continuous variable includes not only the scatterplot, but also the violin plot and box plot, with outliers identified. Call this plot the VBS plot.
Plot(Salary)
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
##
## --- Salary ---
## Present: 37
## Missing: 0
## Total : 37
##
## Mean : 73795.557
## Stnd Dev : 21799.533
## IQR : 31012.560
## Skew : 0.190 [medcouple, -1 to 1]
##
## Minimum : 46124.970
## Lower Whisker: 46124.970
## 1st Quartile : 56772.950
## Median : 69547.600
## 3rd Quartile : 87785.510
## Upper Whisker: 122563.380
## Maximum : 134419.230
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## Correll, Trevon 134419.23
##
##
## Number of duplicated values: 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.61 size of plotted points
## jitter_y: 0.45 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
If the data values of a numerical variable are ordered sequentially, such as by time, Plot() can also plot a run chart of the single variable by setting the parameter run to TRUE. Analogous to a time series visualization, the run chart plots the data values sequentially, but without dates or times.
Plot(Salary, run=TRUE)
## >>> Suggestions
## Plot(Salary, run=TRUE, size=0) # just line segments, no points
## Plot(Salary, run=TRUE, lwd=0) # just points, no line segments
## Plot(Salary, run=TRUE, fill="on") # default color fill
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
## ------------
## Run Analysis
## ------------
##
## Total number of runs: 19
## Total number of values that do not equal the median: 36
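The run analysis can be illustrated with a few lines of base R. This sketch assumes that a run is a maximal stretch of consecutive values on the same side of the median, with values equal to the median dropped, as the wording of the output suggests. The helper count_runs() is hypothetical and is not part of lessR.

```r
# Count runs above and below the median
# (hypothetical helper for illustration, not a lessR function)
count_runs <- function(x) {
  x <- x[!is.na(x)]
  med <- median(x)
  above <- x[x != med] > med          # drop values equal to the median
  list(n_used = length(above),        # values that do not equal the median
       n_runs = length(rle(above)$lengths))
}

res <- count_runs(c(1, 5, 2, 6, 7, 3, 4))
res$n_used   # 6
res$n_runs   # 5
```

In this small example the median is 4, so the remaining six values fall below, above, below, above, above, and below the median, which gives five runs.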
For a single categorical variable, get the corresponding bubble plot of frequencies.
Plot(Dept)
## >>> Suggestions
## Plot(Dept, color_low="lemonchiffon2", color_hi="maroon3")
## Plot(Dept, values="count") # scatter plot of counts
##
##
## --- Dept ---
##
##
## ACCT ADMN FINC MKTG SALE Total
## Frequencies: 5 6 4 6 15 36
## Proportions: 0.139 0.167 0.111 0.167 0.417 1.000
##
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 10.944, df = 4, p-value = 0.027
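The proportions and the test of equal probabilities can be reproduced with base R from the frequencies shown above. The sketch assumes the test is the standard chi-square goodness-of-fit test, which chisq.test() applies to a vector of counts by default.

```r
# Frequencies copied from the output above
freq <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

round(prop.table(freq), 3)   # proportions of the total

gof <- chisq.test(freq)      # null hypothesis: equal probabilities
round(c(chisq = unname(gof$statistic), p = gof$p.value), 3)
```

The rounded statistic and p-value match the reported 10.944 and 0.027.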
The Cleveland dot plot, here for a single variable, has row names on the y-axis. The default plot sorts the rows by the value plotted.
Plot(Salary, row_names)
## >>> Suggestions
## Plot(Salary, y=row_names, sort_yx=FALSE, segments_y=FALSE)
##
##
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.6 21799.5 46125.0 69547.6 134419.2
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## 134419.2
The standard scatterplot version of a Cleveland dot plot.
Plot(Salary, row_names, sort_yx="0", segments_y=FALSE)
## >>> Suggestions
##
##
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.6 21799.5 46125.0 69547.6 134419.2
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## 134419.2
This Cleveland dot plot has two x-variables, indicated as a standard R vector with the c() function. In this situation the two points on each row are connected with a line segment. By default the rows are sorted by the distance between the successive points.
Plot(c(Pre, Post), row_names)
## >>> Suggestions
## Plot(c(Pre, Post), y=row_names, sort_yx=FALSE, segments_y=FALSE)
##
##
##
## --- Pre ---
##
## n miss mean sd min mdn max
## 37 0 78.8 12.0 59.0 80.0 100.0
##
##
## --- Post ---
##
## n miss mean sd min mdn max
## 37 0 81.0 11.6 59.0 84.0 100.0
##
##
## No (Box plot) outliers
##
##
## n diff Row
## ---------------------------
## 1 -4.0 Gvakharia, Kimberly
## 2 -4.0 Downs, Deborah
## 3 -3.0 Anderson, David
## 4 -3.0 Correll, Trevon
## 5 -3.0 Kralik, Laura
## 6 -3.0 Jones, Alissa
## 7 -2.0 Capelle, Adam
## 8 -2.0 Stanley, Emma
## 9 -2.0 Adib, Hassan
## 10 -2.0 Skrotzki, Sara
## 27 5.0 Bellingar, Samantha
## 28 6.0 LaRoe, Maria
## 29 7.0 Cassinelli, Anastis
## 30 7.0 Hamide, Bita
## 31 7.0 Sheppard, Cory
## 32 8.0 Campagna, Justin
## 33 10.0 Ritchie, Darnell
## 34 12.0 Anastasiou, Crystal
## 35 12.0 Wu, James
## 36 13.0 Korhalkar, Jessica
## 37 13.0 Cooper, Lindsay
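The sorted table of differences can be sketched in base R. The scores and row names below are hypothetical, and the sketch assumes that diff is the second variable minus the first, here Post minus Pre.

```r
# Hypothetical scores, for illustration only
scores <- data.frame(Pre  = c(82, 62, 96, 70),
                     Post = c(92, 60, 97, 83),
                     row.names = c("A", "B", "C", "D"))

scores$diff <- scores$Post - scores$Pre

# Rows ordered from the smallest to the largest difference
sorted <- scores[order(scores$diff), ]
sorted
```

Ordering the rows this way places the largest declines at one end of the plot and the largest gains at the other.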
Plot() can plot from three different forms of the data: long-form, wide-form, and a time-series object.
Read time series data of stock Price for three companies: Apple, IBM, and Intel. The data table is in long form, part of lessR.
d <- Read("StockPrice")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## Date: Date with year, month and day
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 date Date 1374 0 458 1980-12-01 ... 2019-01-01
## 2 Company character 1374 0 3 Apple Apple ... Intel Intel
## 3 Price double 1374 0 1259 0.027 0.023 ... 46.634 46.823
## ------------------------------------------------------------------------------------------
d[1:5,]
## date Company Price
## 1 1980-12-01 Apple 0.027
## 2 1981-01-01 Apple 0.023
## 3 1981-02-01 Apple 0.021
## 4 1981-03-01 Apple 0.020
## 5 1981-04-01 Apple 0.023
Activate a time series plot by setting the \(x\)-variable to a variable of R type Date, which is true of the variable date in this data set. A time series can also be plotted by passing a time series object, created with the base R function ts(), as the variable to plot.
Here plot just for Apple, with the two variables date and Price, the stock price. The parameter rows specifies which rows of the input data frame to retain for the analysis.
Plot(date, Price, rows=(Company=="Apple"))
## >>> Suggestions
## Plot(date, Price, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(date, Price, out_cut=.10) # label top 10% potential outliers
## Plot(date, Price, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 458
##
##
## Sample Correlation of date and Price: r = 0.706
##
##
## Hypothesis Test of 0 Correlation: t = 21.280, df = 456, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.6570 to 0.7490
With the by parameter, plot all three companies on the same panel.
Plot(date, Price, by=Company)
Stack the plots.
Plot(date, Price, by=Company, stack=TRUE)
With the by1 parameter, plot the three companies on different panels, a Trellis plot.
Plot(date, Price, by1=Company)
## [Trellis graphics from Deepayan Sarkar's lattice package]
Do the Trellis plot with some color. Then return to the default style.
style(sub_theme="black", window_fill="gray10")
Plot(date, Price, by1=Company, n_col=1, fill="darkred", color="red", trans=.55)
## [Trellis graphics from Deepayan Sarkar's lattice package]
style()
## theme set to "colors"
Stack the three time series and fill under each curve with a version of the lessR sequential range "emeralds".
Plot(date, Price, by=Company, trans=0.4, stack=TRUE, area_fill="emeralds")
Plot() also reads wide-format data. First convert the long form as read to the wide form. In the wide form, the three companies each have their own column of data, repeated for each date.
dw <- reshape(d, direction = "wide",
              idvar = "date", timevar = "Company",
              varying = list(c("Apple", "IBM", "Intel")))
head(dw)
## date Apple IBM Intel
## 1 1980-12-01 0.027 2.051 0.212
## 2 1981-01-01 0.023 1.945 0.196
## 3 1981-02-01 0.021 1.941 0.185
## 4 1981-03-01 0.020 1.910 0.191
## 5 1981-04-01 0.023 1.795 0.199
## 6 1981-05-01 0.027 1.799 0.216
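To see the long-to-wide conversion on a scale small enough to inspect by eye, here is a minimal sketch that rebuilds the first two dates of the table above in long form and reshapes them. The data frame names long and wide are illustrative. Unlike the call above, this sketch lets reshape() generate its default column names and then strips the prefix.

```r
# Long form: one row per date and company (values from the table above)
long <- data.frame(
  date    = as.Date(rep(c("1980-12-01", "1981-01-01"), times = 3)),
  Company = rep(c("Apple", "IBM", "Intel"), each = 2),
  Price   = c(0.027, 0.023, 2.051, 1.945, 0.212, 0.196)
)

# Wide form: one row per date, one column per company
wide <- reshape(long, direction = "wide",
                idvar = "date", timevar = "Company")

# Default names are Price.Apple, Price.IBM, Price.Intel; drop the prefix
names(wide) <- sub("^Price\\.", "", names(wide))
wide
```

Six long-form rows become two wide-form rows, one per date, with a separate column of prices for each company.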
Now the analysis, which repeats a previous analysis, but with wide-form data. Because the data frame is not the default d, explicitly indicate the data frame with the data parameter.
Plot(date, c(Intel, Apple, IBM), area_fill="blues", stack=TRUE, trans=.4, data=dw)
## >>> Suggestions
## Plot(date, c(Intel, Apple, IBM), fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(date, c(Intel, Apple, IBM), out_cut=.10) # label top 10% potential outliers
## Plot(date, c(Intel, Apple, IBM), enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 458
##
##
## Sample Correlation of date and Intel: r = 0.837
##
##
## Hypothesis Test of 0 Correlation: t = 32.616, df = 456, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.8070 to 0.8620
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 458
##
##
## Sample Correlation of date and Apple: r = 0.706
##
##
## Hypothesis Test of 0 Correlation: t = 21.280, df = 456, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.6570 to 0.7490
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 458
##
##
## Sample Correlation of date and IBM: r = 0.930
##
##
## Hypothesis Test of 0 Correlation: t = 53.893, df = 456, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.9160 to 0.9410
Plot() can also plot directly from an R time series object, created with the base R ts() function.
a1.ts <- ts(dw$Apple, frequency=12, start=c(1980, 12))
Plot(a1.ts)
## >>> Note: a1.ts is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(a1.ts, a1.ts, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(a1.ts, a1.ts, out_cut=.10) # label top 10% potential outliers
## Plot(a1.ts, a1.ts, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 458
##
##
## Sample Correlation of a1.ts and a1.ts: r = 0.706
##
##
## Hypothesis Test of 0 Correlation: t = 21.280, df = 456, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.6570 to 0.7490
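The frequency and start arguments of ts() determine how the values map to dates. The following sketch uses only the first six Apple prices from the wide-form table shown earlier. The object name a is illustrative.

```r
# First six Apple prices from the wide-form table
apple <- c(0.027, 0.023, 0.021, 0.020, 0.023, 0.027)

# Monthly series (12 observations per year) that starts in December 1980
a <- ts(apple, frequency = 12, start = c(1980, 12))

start(a)       # 1980 12
end(a)         # 1981 5
frequency(a)   # 12
```

Because the dates are implied by the start and the frequency, a time series object needs no separate date column.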
With style() many themes can be selected, such as "lightbronze", "dodgerblue", "darkred", and "gray". When no theme is specified, style() returns to the default theme, "colors".
style()
## theme set to "colors"
d <- Read("Employee", quiet=TRUE)
Add three different text blocks at three different specified locations.
Plot(Years, Salary, add=c("Hi", "Bye", "Wow"), x1=c(12, 16, 18),
y1=c(80000, 100000, 60000))
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
A rectangle requires two points, four coordinates, <x1,y1> and <x2,y2>.
style(add_trans=.8, add_fill="gold", add_color="gold4", add_lwd=0.5)
Plot(Years, Salary, add="rect", x1=12, y1=80000, x2=16, y2=115000)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
Use the base R help() function to view the full manual for Plot(). As a shortcut, simply enter a question mark followed by the name of the function.
?Plot
More on Scatterplots and other visualizations from lessR and other packages such as ggplot2 at:
Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.