lessR provides many versions of a scatter plot with its Plot()
function, all accessible with the same simple syntax. Illustrate with the Employee data included as part of lessR.
d <- Read("Employee")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 15 ... 1 2 10
## 2 Gender character 37 0 2 M M M ... F F M
## 3 Dept character 36 1 5 ADMN SALE SALE ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low low ... high low high
## 6 Plan integer 37 0 3 1 1 3 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 96 ... 83 59 80
## 8 Post integer 37 0 22 92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
As an option, also read the table of variable labels. Format the table as two columns: the first column is the variable name and the second column is the corresponding variable label. Not every variable needs an entry in the table. The table can be a csv file or an Excel file.
Currently, read the label file into the l data frame. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label.
l <- rd("Employee_lbl")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 label character 8 0 8 Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
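The two-column layout of a label file can be sketched with base R alone. The following is a minimal illustration, not part of lessR: the temporary file path and the choice of three variables are arbitrary, and writing the file without a header row is an assumption, since the exact layout that Read() expects is not stated here.

```r
# Hypothetical label table: variable name in column 1, label in column 2
labels <- data.frame(
  name  = c("Years", "Salary", "Dept"),
  label = c("Time of Company Employment",
            "Annual Salary (USD)",
            "Department Employed"),
  stringsAsFactors = FALSE
)

# Write the table as a two-column csv file (no header row: an assumption)
path <- tempfile(fileext = ".csv")
write.table(labels, file = path, sep = ",",
            row.names = FALSE, col.names = FALSE, qmethod = "double")

# Read the file back with base R to confirm the two-column structure
lbl <- read.csv(path, header = FALSE, col.names = c("name", "label"),
                stringsAsFactors = FALSE)
print(lbl)
```

Variables without a row in the table are simply left unlabeled, consistent with the note that not all variables need to be entered.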
The typical scatterplot visualizes the relationship of two continuous variables, here Years worked at a company and annual Salary. Following is the function call to Plot() for the default visualization. Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified.
Plot(Years, Salary)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
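The reported test statistic and confidence interval can be cross-checked by hand from the values of r and n in the output above. The sketch below uses the standard t statistic for a correlation and the Fisher z transformation for the interval. That lessR computes the interval this way is an assumption, although base R's cor.test() uses the same approach.

```r
# Values reported in the text output above
r <- 0.852   # sample correlation, rounded to three decimals
n <- 36      # number of paired values with neither missing

# t statistic for the test of zero correlation, df = n - 2
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval from the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 2))
print(round(ci, 3))
```

The results agree closely with the reported t = 9.501 and interval of 0.727 to 0.923. The small differences arise because r is entered here rounded to three decimal digits.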
Enhance the default scatterplot with parameter enhance. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, least-squares regression line with 95% confidence interval, and the corresponding regression line with the outliers removed.
Plot(Years, Salary, enhance=TRUE)
## [Ellipse with Murdoch and Chow's function ellipse from their ellipse package]
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
## >>> Outlier analysis with Mahalanobis Distance
##
## MD ID
## ----- -----
## 8.14 Correll, Trevon
## 7.84 Capelle, Adam
##
## 5.63 Korhalkar, Jessica
## 5.58 James, Leslie
## 3.75 Hoang, Binh
## ... ...
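The idea behind the outlier analysis can be sketched with the base R function mahalanobis(). The data below are simulated stand-ins for Years and Salary, not the Employee data, and mahalanobis() returns the squared distance. Whether the MD column above reports the distance or its square is not stated here.

```r
# Simulated two-variable data (hypothetical stand-in for Years and Salary)
set.seed(123)
years  <- runif(36, 1, 25)
salary <- 40000 + 3000 * years + rnorm(36, sd = 10000)
xy <- cbind(years, salary)

# Squared Mahalanobis distance of each point from the bivariate mean
md2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# List the most extreme points first, as in the outlier table
md_sorted <- sort(md2, decreasing = TRUE)
print(head(round(md_sorted, 2), 5))
```

Points with the largest distances lie farthest from the center of the data cloud once the correlation between the two variables is taken into account.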
A variety of fit lines can be plotted. The available values are "loess" for general non-linear fit, "lm" for linear least squares, "null" for the null (flat line) model, "exp" for the exponential model, "sqrt" for the square root model, and "reciprocal" for the reciprocal model. Setting fit to TRUE plots the "loess" line.
Here, plot the general non-linear fit. For emphasis, set plot_errors to TRUE to plot the residuals from the line.
Plot(Years, Salary, fit="loess", plot_errors=TRUE)
## >>> Suggestions
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
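The residuals drawn by plot_errors are the vertical distances from each point to the fitted line. A minimal base R sketch of the same idea follows, using simulated data and the stats function loess() with its default settings. The settings lessR applies to its own loess fit are not stated here, so this is an analogy rather than a reproduction.

```r
# Simulated data (hypothetical), roughly linear with noise
set.seed(42)
x <- sort(runif(60, 0, 20))
y <- 50000 + 3000 * x + rnorm(60, sd = 8000)

# Non-linear smooth fit and its residuals (observed minus fitted)
fit <- loess(y ~ x)
y_hat <- as.numeric(fitted(fit))
res <- y - y_hat

# Draw the points, the fitted curve, and a segment for each residual
plot(x, y)
lines(x, y_hat)
segments(x, y, x, y_hat)
```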
Next, plot the exponential fit and show the residuals from the exponential curve. These data are approximately linear, so the exponential curve does not vary far from a straight line.
Plot(Years, Salary, fit="exp", plot_errors=TRUE)
## >>> Suggestions
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
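One common way to fit an exponential curve is to regress the logarithm of y on x and then back-transform the fitted values. The sketch below shows that approach in base R on simulated data. It is an illustration of the general technique only, and how lessR estimates its "exp" fit is not stated here.

```r
# Simulated data (hypothetical) that follow an exponential trend
set.seed(7)
x <- runif(50, 0, 20)
y <- 50000 * exp(0.03 * x) * exp(rnorm(50, sd = 0.1))

# Linear regression on the log scale
fit <- lm(log(y) ~ x)
b <- coef(fit)

# Back-transform to obtain the fitted exponential curve and its residuals
y_hat <- exp(b[1] + b[2] * x)
res <- y - y_hat

print(round(b, 4))
```

With a small growth rate over a limited range of x, as simulated here, the fitted curve bends only slightly, which matches the observation that the exponential curve stays close to a straight line for approximately linear data.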
Map a continuous variable, such as Pre, to the size of the plotted points with the size parameter, which creates a bubble plot.
Plot(Years, Salary, size=Pre)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(x=Years, y=Salary, size=Pre, radius=0.18) # larger bubbles
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## radius: 0.12 size of largest bubble
## power: 0.50 relative bubble sizes
Indicate multiple variables to plot along either axis with a vector defined according to the base R function c(). Plot the linear model for each variable according to the fit parameter set to "lm". Turn off the confidence interval by setting the standard errors to zero with fit_se set to 0.
Plot(c(Pre, Post), Salary, fit="lm", fit_se=0)
## >>> Suggestions
## Plot(c(Pre, Post), Salary, out_cut=.10) # label top 10% potential outliers
## Plot(c(Pre, Post), Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Pre: Test score on legal issues before instruction
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 37
##
##
## Sample Correlation of Pre and Salary: r = -0.007
##
##
## Hypothesis Test of 0 Correlation: t = -0.043, df = 35, p-value = 0.966
## 95% Confidence Interval for Correlation: -0.330 to 0.318
##
## >>> Pearson's product-moment correlation
##
## Post: Test score on legal issues after instruction
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 37
##
##
## Sample Correlation of Post and Salary: r = -0.070
##
##
## Hypothesis Test of 0 Correlation: t = -0.416, df = 35, p-value = 0.680
## 95% Confidence Interval for Correlation: -0.385 to 0.260
Three or more variables for the first parameter value plot as a scatterplot matrix. Pass a single vector, such as defined by c(). Request the non-linear fit line and corresponding confidence interval by specifying TRUE or "loess" for the fit parameter. Request a linear fit line with the value of "lm".
Plot(c(Salary, Years, Pre, Post), fit=TRUE)
Generate random data with base R rnorm(), then plot. Plot() first checks for the presence of the specified variables in the global environment (workspace). If they are not there, it looks for them in a data frame, of which the default name is d. Here, generate values for x and y in the workspace.
x <- rnorm(4000)
y <- rnorm(4000)
Plot(x, y)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(x, y, out_cut=.10) # label top 10% potential outliers
## Plot(x, y, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 4000
##
##
## Sample Correlation of x and y: r = 0.006
##
##
## Hypothesis Test of 0 Correlation: t = 0.390, df = 3998, p-value = 0.696
## 95% Confidence Interval for Correlation: -0.025000000000 to 0.037000000000
With large data sets, even for continuous variables there can be much over-plotting of points. One strategy to address this issue smooths the scatterplot. The individual points superimposed on the smoothed plot are potential outliers. The default number plotted is 100. Turn off the plotting of these points completely by setting parameter smooth_points to 0.
Plot(x, y, smooth=TRUE)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(x, y, out_cut=.10) # label top 10% potential outliers
## Plot(x, y, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 4000
##
##
## Sample Correlation of x and y: r = 0.006
##
##
## Hypothesis Test of 0 Correlation: t = 0.390, df = 3998, p-value = 0.696
## 95% Confidence Interval for Correlation: -0.025000000000 to 0.037000000000
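To display only the smoothed density without any superimposed points, combine the two parameters named above. The call below is a short sketch that assumes lessR is installed and regenerates x and y in the workspace. The seed is arbitrary, so the simulated values differ from those plotted above.

```r
library(lessR)

# Regenerate the random data in the workspace (arbitrary seed)
set.seed(1)
x <- rnorm(4000)
y <- rnorm(4000)

# Smoothed scatterplot with the superimposed outlier points turned off
Plot(x, y, smooth=TRUE, smooth_points=0)
```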
Another strategy for alleviating over-plotting makes the fill color mostly transparent with the trans parameter, or turns off the fill completely by setting fill to "off". The closer the value of trans is to 1, the more transparent the fill.
Plot(x, y, trans=0.95)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(x, y, out_cut=.10) # label top 10% potential outliers
## Plot(x, y, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 4000
##
##
## Sample Correlation of x and y: r = 0.006
##
##
## Hypothesis Test of 0 Correlation: t = 0.390, df = 3998, p-value = 0.696
## 95% Confidence Interval for Correlation: -0.025000000000 to 0.037000000000
The default plot for a single continuous variable includes not only the scatterplot, but also the superimposed violin plot and box plot, with outliers identified. Call this plot the VBS plot.
Plot(Salary)
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## --- Salary ---
## Present: 37
## Missing: 0
## Total : 37
##
## Mean : 73795.557
## Stnd Dev : 21799.533
## IQR : 31012.560
## Skew : 0.190 [medcouple, -1 to 1]
##
## Minimum : 46124.970
## Lower Whisker: 46124.970
## 1st Quartile : 56772.950
## Median : 69547.600
## 3rd Quartile : 87785.510
## Upper Whisker: 122563.380
## Maximum : 134419.230
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## Correll, Trevon 134419.23
##
##
## Number of duplicated values: 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.61 size of plotted points
## out_size: 0.82 size of plotted outlier points
## jitter_y: 0.45 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
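The single box plot outlier can be verified from the summary statistics above. The sketch below assumes the conventional rule that flags values more than 1.5 times the IQR beyond the quartiles. The rule is not stated in this output, but the reported whiskers are consistent with it.

```r
# Summary statistics for Salary reported above
q1 <- 56772.95
q3 <- 87785.51
min_salary <- 46124.97
max_salary <- 134419.23

# Interquartile range and the 1.5 x IQR fences (assumed rule)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))

# The maximum lies beyond the upper fence; the minimum is inside the lower fence
print(max_salary > upper_fence)
print(min_salary > lower_fence)
```

The maximum salary exceeds the upper fence, so it is flagged as the one large outlier, while the minimum lies well inside the lower fence, which is why the lower whisker equals the minimum.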
Control the choice of the three superimposed plots (violin, box, and scatter) with the vbs_plot parameter. The default setting is "vbs" for all three plots. Here, for example, obtain just the box plot. Or, use the alias BoxPlot() in place of Plot().
Plot(Salary, vbs_plot="b")
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## --- Salary ---
## Present: 37
## Missing: 0
## Total : 37
##
## Mean : 73795.557
## Stnd Dev : 21799.533
## IQR : 31012.560
## Skew : 0.190 [medcouple, -1 to 1]
##
## Minimum : 46124.970
## Lower Whisker: 46124.970
## 1st Quartile : 56772.950
## Median : 69547.600
## 3rd Quartile : 87785.510
## Upper Whisker: 122563.380
## Maximum : 134419.230
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## Correll, Trevon 134419.23
##
##
## Number of duplicated values: 0
Create a Cleveland dot plot when one of the variables has unique (ID) values. In this example, for a single variable, row names are on the y-axis. The default plot sorts by the value plotted.
Plot(Salary, row_names)
## >>> Suggestions
## Plot(Salary, y=row_names, sort_yx=FALSE, segments_y=FALSE)
##
##
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.6 21799.5 46125.0 69547.6 134419.2
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## 134419.2
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## size: 0.80 size of plotted points
## jitter_y: 0.60 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
The standard scatterplot version of a Cleveland dot plot.
Plot(Salary, row_names, sort_yx="0", segments_y=FALSE)
## >>> Suggestions
##
##
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.6 21799.5 46125.0 69547.6 134419.2
##
##
## (Box plot) Outliers: 1
##
## Small Large
## ----- -----
## 134419.2
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## size: 0.80 size of plotted points
## jitter_y: 0.60 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
This Cleveland dot plot has two x-variables, indicated as a standard R vector with the c() function. In this situation, the two points on each row are connected with a line segment. By default, the rows are sorted by the distance between the successive points.
Plot(c(Pre, Post), row_names)
## >>> Suggestions
## Plot(c(Pre, Post), y=row_names, sort_yx=FALSE, segments_y=FALSE)
##
##
##
## --- Pre ---
##
## n miss mean sd min mdn max
## 37 0 78.8 12.0 59.0 80.0 100.0
##
##
## --- Post ---
##
## n miss mean sd min mdn max
## 37 0 81.0 11.6 59.0 84.0 100.0
##
##
## No (Box plot) outliers
##
##
## n diff Row
## ---------------------------
## 1 -4.0 Gvakharia, Kimberly
## 2 -4.0 Downs, Deborah
## 3 -3.0 Anderson, David
## 4 -3.0 Correll, Trevon
## 5 -3.0 Kralik, Laura
## 6 -3.0 Jones, Alissa
## 7 -2.0 Capelle, Adam
## 8 -2.0 Stanley, Emma
## 9 -2.0 Adib, Hassan
## 10 -2.0 Skrotzki, Sara
## 27 5.0 Bellingar, Samantha
## 28 6.0 LaRoe, Maria
## 29 7.0 Cassinelli, Anastis
## 30 7.0 Hamide, Bita
## 31 7.0 Sheppard, Cory
## 32 8.0 Campagna, Justin
## 33 10.0 Ritchie, Darnell
## 34 12.0 Anastasiou, Crystal
## 35 12.0 Wu, James
## 36 13.0 Korhalkar, Jessica
## 37 13.0 Cooper, Lindsay
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #4398D0 filled color of the points
## color: #4398D0 edge color of the points
## size: 0.80 size of plotted points
## jitter_y: 0.60 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
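The default row ordering can be illustrated in base R. The sketch below takes the first and last three values of Pre and Post listed in the data summary at the top of this document, computes Post minus Pre as in the diff column above, and sorts the rows by that difference. The row labels are hypothetical placeholders for the employee names.

```r
# First and last three Pre and Post values from the data summary
scores <- data.frame(
  Pre  = c(82, 62, 96, 83, 59, 80),
  Post = c(92, 74, 97, 90, 71, 87),
  row.names = paste0("row", 1:6)   # hypothetical row labels
)

# Difference between the two plotted values on each row
scores$diff <- scores$Post - scores$Pre

# Sort the rows from smallest to largest difference
sorted <- scores[order(scores$diff), ]
print(sorted)
```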
A mixture of categorical and continuous variables can be plotted in a variety of ways, as illustrated below.
Plot a scatterplot of two continuous variables for each level of a categorical variable on the same panel with the by parameter. Here, plot Years and Salary each for the two levels of Gender in the data. Colors and geometric plot shapes can distinguish between the plots. For all variables except an ordered factor, the default plots according to the default qualitative color palette, "hues", with the geometric shape of a point.
Plot(Years, Salary, by=Gender)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
If the by variable is an ordered factor, the default color palette is sequential according to the underlying theme, such as "blues" for the default theme of "colors". Change the general theme with the style() function.
d$Gender.f <- factor(d$Gender, ordered=TRUE)
Plot(Years, Salary, by=Gender.f)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
Change the plot colors with the fill (interior) and color (exterior or edge) parameters. Because there are two levels of the by variable, specify two fill colors and two edge colors, each with an R vector defined by the c() function. Also include the regression line for each group and increase the size of the plotted points.
Plot(Years, Salary, by=Gender, size=2, fit="lm",
fill=c("olivedrab3", "gold1"),
color=c("darkgreen", "gold4")
)
## >>> Suggestions
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, enhance=TRUE) # many options
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
Change the plotted shapes with the shape parameter.
A Trellis plot, or facet plot, creates a separate panel for the plot of each level of the categorical variable. Generate Trellis plots with the by1 parameter. In this example, plot the best-fit linear model for the data in each panel according to the fit parameter. By default, the 95% confidence interval for each line is also displayed.
Plot(Years, Salary, by1=Gender, fit="lm")
## [Trellis graphics from Deepayan Sarkar's lattice package]
Turn off the confidence interval by setting the parameter fit_se to 0 for the value of the confidence level.
A categorical variable plotted with a continuous variable results in a traditional scatterplot though, of course, the scatter is confined to the straight lines that represent the levels of the categorical variable, its values. In addition, by default the mean of the continuous variable is displayed at each level of the categorical variable, as well as in the text output. To avoid point overlap, by default some horizontal jitter is added to each plotted point, which can be adjusted with either the jitter_x or jitter_y parameter.
The first two parameters of Plot() are x and y. In this example, the categorical variable, Dept, listed second, specifies the y variable, as in y=Dept. The function call makes no distinction between two continuous variables and one continuous and one categorical variable. The Plot() function evaluates each variable for continuity and responds appropriately.
Plot(Salary, Dept)
## >>> Suggestions
## Plot(Salary, Dept, means=FALSE) # do not plot means
## Plot(Salary, Dept, stat="mean") # only plot means
## ANOVA(Salary ~ Dept) # inferential analysis
##
##
## Salary: Annual Salary (USD)
## - by levels of -
## : Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.8 12774.6 46125.0 69547.6 72502.5
## ADMN 6 0 81277.1 27585.2 53788.3 71058.6 122563.4
## FINC 4 0 69010.7 17852.5 57139.9 61937.6 95027.6
## MKTG 6 0 70257.1 19869.8 51036.8 61659.0 99062.7
## SALE 15 0 78830.1 23476.8 49189.0 77714.9 134419.2
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## size: 0.80 size of plotted points
## jitter_y: 0.60 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
Adding some horizontal jitter for each plotted point is particularly useful for larger data sets. With enough jitter, the points that align with each level of the categorical variable plot in the shape of a bar, displaying the density of values along the continuous variable axis.
Another helpful technique for large data sets is to add some fill transparency with the trans parameter, with values such as 0.8 and 0.9. The combination of jitter and transparency allows for plotting many thousands of points.
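The effect of combining jitter with transparency can be previewed in base R graphics. The sketch below is an analogy using stripchart() and adjustcolor(), not lessR itself. The simulated data, the three department codes, and the mapping of a trans value of 0.9 to an alpha of 0.1 are all assumptions made for illustration.

```r
# Simulated large data set (hypothetical): 5000 salaries in three departments
set.seed(3)
dept <- factor(sample(c("ACCT", "ADMN", "FINC"), 5000, replace = TRUE))
salary <- rnorm(5000, mean = 70000, sd = 20000)

# Mostly transparent fill: alpha of 0.1 corresponds to 90% transparency
pt_col <- adjustcolor("steelblue", alpha.f = 0.1)

# Jittered points for each level; dense regions appear darker
stripchart(salary ~ dept, method = "jitter", jitter = 0.3,
           pch = 16, col = pt_col)
```

With enough jitter, each level fills out into a band, and the overlapping transparent points darken where the values are most concentrated.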
To move from a traditional scatterplot, plus means, to a superimposed violin, box, and scatter plot, a VBS plot, invoke the by parameter. Here, plot Salary across the levels of Dept on the same panel.
Plot(Salary, by=Dept)
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ANOVA(Salary ~ Dept) # add the data parameter if not d
## Salary: Annual Salary (USD)
## - by levels of -
## Dept
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## ACCT 0
## ADMN 0
## FINC 0
## MKTG 0
## SALE 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.52 size of plotted points
## out_size: 0.79 size of plotted outlier points
## jitter_y: 0.50 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
Alternatively, move from a traditional scatterplot, plus means, to a Trellis plot that consists of a VBS plot on a separate panel for each level. Accomplish the Trellis plot with the by1 parameter. Here, plot Salary across the levels of Dept. Again, specify one, two, or, by default, all three superimposed plots: violin, box, and scatter. In this example, the categorical variable, Dept, specifies the by1 variable.
Plot(Salary, by1=Dept)
## [Trellis graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ANOVA(Salary ~ Dept) # add the data parameter if not d
## Salary: Annual Salary (USD)
## - by levels of -
## Dept: Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## ACCT 0
## ADMN 0
## FINC 0
## MKTG 0
## SALE 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.52 size of plotted points
## out_size: 0.79 size of plotted outlier points
## jitter_y: 0.50 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
The default coloring of the boxes for variables other than an ordered factor follows the default qualitative palette, "hues". For an ordered factor, the fill color follows the default sequential palette of the corresponding theme, such as "blues". Customize colors with the box_fill parameter.
Show just the box plots according to the vbs_plot parameter, which has a default setting of "vbs" for the superimposed violin, box, and scatter plots. Set vbs_plot to "b". Or, use the alias BoxPlot(). Change the fill color of each box with the box_fill parameter. In addition to the traditional median for a box plot, show the mean as well with the vbs_mean parameter. If specifying just one fill color, then all boxes are filled with that color.
Or, drop the box plot and only plot the violins and the scatter plots. Without the boxes, the violins take on the default colors. Specify a value of "vs" for the vbs_plot parameter. If only plotting the violins, you can also use the alias ViolinPlot().
Plot(Salary, by1=Dept, vbs_plot="vs")
## [Trellis graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ANOVA(Salary ~ Dept) # add the data parameter if not d
## Salary: Annual Salary (USD)
## - by levels of -
## Dept: Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## ACCT 0
## ADMN 0
## FINC 0
## MKTG 0
## SALE 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.52 size of plotted points
## out_size: 0.79 size of plotted outlier points
## jitter_y: 0.50 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
BoxPlot(Salary, by1=Gender, vbs_mean=TRUE, box_fill="lightgoldenrod")
## [Trellis graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # add the data parameter if not d
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Female or Male
##
## n miss mean sd min mdn max
## F 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## F 0
## M 0
For the violin plot, change the fill color with parameter violin_fill.
Show the different distributions of the continuous variable across the levels of the categorical variable. Here, show the distribution of Salary for Males and Females across the various departments.
Plot(Salary, Dept, by=Gender)
## >>> Suggestions
## Plot(Salary, Dept, by=Gender, means=FALSE) # do not plot means
## Plot(Salary, Dept, by=Gender, stat="mean") # only plot means
## ANOVA(Salary ~ Dept) # inferential analysis
##
##
## Salary: Annual Salary (USD)
## - by levels of -
## : Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.8 12774.6 46125.0 69547.6 72502.5
## ADMN 6 0 81277.1 27585.2 53788.3 71058.6 122563.4
## FINC 4 0 69010.7 17852.5 57139.9 61937.6 95027.6
## MKTG 6 0 70257.1 19869.8 51036.8 61659.0 99062.7
## SALE 15 0 78830.1 23476.8 49189.0 77714.9 134419.2
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #4398D0 filled color of the points
## color: #4398D0 edge color of the points
## size: 0.80 size of plotted points
## jitter_y: 0.60 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
Plot a second categorical variable in the Trellis plot with the by2 parameter. The by1 and by2 parameters can also be combined with the by parameter. Here, also request the linear model fit line with fit set to "lm". Obtain the scatterplots of all combinations of the levels of Dept and Gender, showing the three levels of Plan for each scatterplot.
Plot(Years, Salary, by1=Dept, by2=Gender, by=Plan, fit="lm")
## [Trellis graphics from Deepayan Sarkar's lattice package]
To do a Trellis plot with two categorical variables, invoke the by2 parameter in addition to the by1 parameter. By default, the box fill colors are unique for each level of the by1 variable, and then the colors cycle through all the values of the by2 variable.
Plot(Salary, by1=Gender, by2=Dept)
## [Trellis graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # add the data parameter if not d
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Female or Male
##
## n miss mean sd min mdn max
## F 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## F 0
## M 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.56 size of plotted points
## out_size: 0.80 size of plotted outlier points
## jitter_y: 0.54 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
To specify custom colors with the box_fill parameter, provide one color for each level of the by1 variable. The by1 colors then cycle over the by2 values.
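As a minimal sketch of the custom colors just described, the following call supplies two fill colors because Gender, the by1 variable, has two levels. The two color names and the gender_fills object are arbitrary choices for illustration, and the guard simply skips the plot when lessR is not installed.

```r
# One fill color per level of the by1 variable (Gender: F, M);
# the specific colors are illustrative assumptions, not lessR defaults
gender_fills <- c("thistle3", "lightsteelblue3")

if (requireNamespace("lessR", quietly = TRUE)) {
  library(lessR)
  d <- Read("Employee")   # Employee data included with lessR

  # The two colors are reused across the panels for each level of Dept
  Plot(Salary, by1=Gender, by2=Dept, box_fill=gender_fills)
}
```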
Alternatively, invoke the by parameter and the by1 parameter. The values of the by1 variable plot as separate panels, the Trellis part, and the levels of the by variable plot within each panel.
Plot(Salary, by1=Dept, by=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # add the data parameter if not d
## ANOVA(Salary ~ Dept) # add the data parameter if not d
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## ACCT 0
## ADMN 0
## FINC 0
## MKTG 0
## SALE 0
##
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.52 size of plotted points
## out_size: 0.79 size of plotted outlier points
## jitter_y: 0.50 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
Plotting two categorical variables results in a bubble plot of their joint frequencies.
Plot(Dept, Gender)
## >>> Suggestions
## Plot(Dept, Gender, size_cut=FALSE)
## Plot(Dept, Gender, trans=.8, bg="off", grid="off")
## SummaryStats(Dept, Gender) # or ss
##
##
## Dept: Department Employed
## - by levels of -
## Gender: Female or Male
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## F 3 4 1 5 5 18
## M 2 2 3 1 10 18
## Sum 5 6 4 6 15 36
##
##
## Cramer's V: 0.415
##
## Chi-square Test: Chisq = 6.200, df = 4, p-value = 0.185
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## radius: 0.22 size of largest bubble
## power: 0.50 relative bubble sizes
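The statistics in this output can be checked against the displayed table. The following base R sketch rebuilds the joint frequencies by hand (the object names joint, chisq, and cramers_v are only for illustration) and recomputes the chi-square statistic and Cramer's V with the standard formulas. It is a cross-check of the printed numbers, not a description of how lessR computes them internally.

```r
# Joint frequencies copied from the output above: rows are Gender, columns are Dept
joint <- matrix(c(3, 4, 1, 5, 5,
                  2, 2, 3, 1, 10),
                nrow = 2, byrow = TRUE,
                dimnames = list(Gender = c("F", "M"),
                                Dept = c("ACCT", "ADMN", "FINC", "MKTG", "SALE")))

# Add the marginal sums, as in the printed table
print(addmargins(joint))

# Pearson chi-square test of independence; the warning about low expected
# cell frequencies is suppressed because the output above already notes it
test <- suppressWarnings(chisq.test(joint))
chisq <- unname(test$statistic)

# Cramer's V = sqrt(chisq / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chisq / (sum(joint) * (min(dim(joint)) - 1)))

print(round(c(Chisq = chisq, df = unname(test$parameter),
              p_value = test$p.value, V = cramers_v), 3))
```

The recomputed values should agree with the Cramer's V, Chisq, df, and p-value lines of the output.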
The radius parameter sets the size of the largest displayed bubble, in inches, which scales all the other bubbles accordingly. The power parameter sets the relative size of the bubbles. The default power value of 0.5 scales the bubbles so that the area of each bubble is proportional to the value of the corresponding sizing variable. A value of 1 scales the bubbles so that the radius of each bubble is proportional to that value, which increases the discrepancy in size between the bubbles.
In this example, increase the absolute size of the bubbles as well as the relative discrepancy in their sizes. If the bubbles become too large, so that the largest bubbles are truncated, increase the spacing of the respective axes with the pad_x and/or pad_y parameters.
Plot(Dept, Gender, radius=.6, power=0.8, pad_x=0.05, pad_y=0.05)
## >>> Suggestions
## Plot(Dept, Gender, radius=0.6, power=0.8, pad_x=0.05, pad_y=0.05, size_cut=FALSE)
## Plot(Dept, Gender, radius=0.6, power=0.8, pad_x=0.05, pad_y=0.05, trans=.8, bg="off", grid="off")
## SummaryStats(Dept, Gender) # or ss
##
##
## Dept: Department Employed
## - by levels of -
## Gender: Female or Male
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## F 3 4 1 5 5 18
## M 2 2 3 1 10 18
## Sum 5 6 4 6 15 36
##
##
## Cramer's V: 0.415
##
## Chi-square Test: Chisq = 6.200, df = 4, p-value = 0.185
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## radius: 0.60 size of largest bubble
## power: 0.80 relative bubble sizes
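To see what the two settings do to the bubble sizes, the following base R sketch applies an assumed scaling rule to the joint frequencies shown above: the largest bubble receives the full radius, and every other bubble shrinks by its count relative to the maximum, raised to power. The helper bubble_radius() is hypothetical and written only for this illustration; it is not a lessR function, and the actual internal computation may differ.

```r
# Joint frequencies of Dept by Gender, copied from the output above
counts <- c(3, 4, 1, 5, 5, 2, 2, 3, 1, 10)

# Assumed scaling rule (hypothetical helper, not part of lessR):
# largest bubble has radius `radius` inches, others scale by (count/max)^power
bubble_radius <- function(counts, radius = 0.22, power = 0.5) {
  radius * (counts / max(counts))^power
}

r_default <- bubble_radius(counts)                             # radius=0.22, power=0.5
r_custom  <- bubble_radius(counts, radius = 0.6, power = 0.8)  # values of the second call

# Ratio of the largest to the smallest bubble radius under each setting
range_ratio <- c(default = max(r_default) / min(r_default),
                 custom  = max(r_custom)  / min(r_custom))
print(round(range_ratio, 2))
```

Under this assumption, the largest bubble (count 10) has about 3.2 times the radius of the smallest (count 1) with the defaults, and about 6.3 times with power set to 0.8, which is the increased discrepancy described above.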
Alternatively, plot two categorical variables with a Trellis (facet) chart by invoking the by1 parameter. If the first listed variable in the function call, the x parameter, is categorical, the result is a dot chart for each level of the by1 variable.
Plot(Dept, by1=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]
Plotting a single categorical variable yields the corresponding bubble plot of frequencies.
Plot(Dept)
## >>> Suggestions
## Plot(Dept, color_low="lemonchiffon2", color_hi="maroon3")
## Plot(Dept, values="count") # scatter plot of counts
##
##
## --- Dept: Department Employed ---
##
##
## ACCT ADMN FINC MKTG SALE Total
## Frequencies: 5 6 4 6 15 36
## Proportions: 0.139 0.167 0.111 0.167 0.417 1.000
##
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 10.944, df = 4, p-value = 0.027
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## fill: #3C6A82 filled color of the points
## color: #3C6A82 edge color of the points
## radius: 0.22 size of largest bubble
## power: 0.50 relative bubble sizes
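The proportions and the test of equal probabilities in this output can be reproduced from the displayed frequencies. The following base R sketch enters the counts by hand (the object names dept and gof are only for illustration) and is meant as a cross-check of the printed numbers, not as a description of lessR internals.

```r
# Frequencies of Dept copied from the output above
dept <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# Proportions, as in the printed table
print(round(dept / sum(dept), 3))

# Goodness-of-fit test; for a vector, chisq.test() assumes equal
# probabilities across the categories by default
gof <- chisq.test(dept)
print(round(c(Chisq = unname(gof$statistic), df = unname(gof$parameter),
              p_value = gof$p.value), 3))
```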
Use the base R help() function to view the full manual for Plot(). As a shortcut, enter a question mark followed by the name of the function.
?Plot
For more on scatterplots, time series plots, and other visualizations from lessR and other packages such as ggplot2, see:
Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.