library("lessR")
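The analysis of proportions is of two primary types: the analysis of the proportion of occurrences of a single designated value of a categorical variable, from one sample or compared across multiple samples (homogeneity), and the analysis of the proportions of all the values of one or more categorical variables (goodness-of-fit or independence).
Building on standard base R functions, the lessR function Prop_test(), abbreviated prop(), provides either type of analysis. To use it, enter either the original data from which the sample proportions are computed, or directly enter the already computed counts that define the sample proportions.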
When analyzing the original data, an entered value for the parameter success for the categorical variable of interest, indicated by the parameter variable, triggers the test of homogeneity. If the proportions are entered directly, indicate the number of successes and the total number of trials with the n_succ and n_tot parameters, each as a single value for a single sample or as a vector of multiple values for multiple samples. Without a value for success or n_succ, the analysis is of goodness-of-fit or independence.
Consider a single proportion for a value of a variable of interest, analyzed from a single sample. What is the proportion of occurrences of a designated value of the variable? The tradition is to call that value a success. All other values are failures. Success or failure in this context does not necessarily mean good or bad, desired or undesired, but simply that a designated value either occurred or did not occur.
The example below is the same example given in the documentation for the base R function binom.test(). Prop_test() gives the same result because it relies upon that base R function for this analysis.
For a given categorical variable of interest, in this case a type of plant, consider two values, either “giant” or “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times and did not occur 243 times. The null hypothesis tested, set with the p0 parameter, is that the specified value occurs for 3/4 of the population.
Prop_test(n_succ=682, n_fail=243, p0=.75)
##
## >>> Exact binomial test of a proportion <<<
##
## ------ Description ------
##
## Number of successes: 682
## Number of failures: 243
## Number of trials: 925
## Sample proportion: 0.737
##
## ------ Inference ------
##
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
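Because Prop_test() relies on binom.test() for this analysis, the same numbers can be reproduced in base R alone. The following is a minimal sketch of that cross-check; the object name bt is chosen here only for illustration.

```r
# Exact binomial test in base R: 682 successes, 243 failures, null value 0.75
bt <- binom.test(c(682, 243), p = 0.75)

prop_hat <- round(unname(bt$estimate), 3)  # sample proportion
p_val    <- round(bt$p.value, 3)           # two-sided p-value
ci       <- round(as.numeric(bt$conf.int), 3)  # 95% confidence interval

print(prop_hat)
print(p_val)
print(ci)
```

The printed values should agree with the Description and Inference sections of the Prop_test() output above.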
To illustrate with data, read the Employee data included as part of lessR.
d <- Read("Employee")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##      Variable                    Missing  Unique
##          Name       Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1      Years    integer      36       1      16   7 NA 15 ... 1 2 10
##  2     Gender  character      37       0       2   M M M ... F F M
##  3       Dept  character      36       1       5   ADMN SALE SALE ... MKTG SALE FINC
##  4     Salary     double      37       0      37   53788.26 94494.58 ... 56508.32 57562.36
##  5     JobSat  character      35       2       3   med low low ... high low high
##  6       Plan    integer      37       0       3   1 1 3 ... 2 2 1
##  7        Pre    integer      37       0      27   82 62 96 ... 83 59 80
##  8       Post    integer      37       0      22   92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
For the variable Gender in the default d data frame, the parameter success in this example defines a success as the Gender value “F”. Analyze the proportion of successes, that is, those reporting a Gender of “F”. The default null hypothesis is a population value of 0.5.
The parameter names are included here but are not necessary in this example, because the arguments are listed in the order in which the parameters are defined in the Prop_test() function.
Prop_test(variable=Gender, success="F")
##
## >>> Exact binomial test of a proportion <<<
##
## Variable: Gender
## success: F
##
## ------ Description ------
##
## Number of missing values: 0
## Number of successes: 19
## Number of failures: 18
## Number of trials: 37
## Sample proportion: 0.514
##
## ------ Inference ------
##
## Hypothesis test for null of 0.5, p-value: 1.000
## 95% Confidence interval: 0.344 to 0.681
The null hypothesis is not rejected, with a \(p\)-value of 1.000 to three decimal digits. The sample proportion of \(0.514\) is close to the default hypothesized value of \(0.5\) for the proportion of “F” values for Gender.
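The confidence interval that binom.test() returns is the exact Clopper-Pearson interval, which can be written in terms of beta quantiles. The following minimal sketch, assuming base R only, recomputes the interval for 19 successes in 37 trials; the names lower and upper are illustrative.

```r
# Clopper-Pearson (exact) 95% interval for 19 successes in 37 trials
x <- 19
n <- 37
alpha <- 0.05

lower <- qbeta(alpha / 2, x, n - x + 1)      # lower limit (valid for x > 0)
upper <- qbeta(1 - alpha / 2, x + 1, n - x)  # upper limit (valid for x < n)
print(round(c(lower, upper), 3))

# Cross-check against the interval reported by binom.test()
ci <- as.numeric(binom.test(x, n)$conf.int)
print(all.equal(c(lower, upper), ci))
```

The interval does not depend on the hypothesized value p0, which is why the same limits appear for any null value tested on these data.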
In this next example, change the null hypothesis to 0.6 with the parameter p0. Use the abbreviation prop().
prop(Gender, "F", p0=0.6)
##
## >>> Exact binomial test of a proportion <<<
##
## Variable: Gender
## success: F
##
## ------ Description ------
##
## Number of missing values: 0
## Number of successes: 19
## Number of failures: 18
## Number of trials: 37
## Sample proportion: 0.514
##
## ------ Inference ------
##
## Hypothesis test for null of 0.6, p-value: 0.315
## 95% Confidence interval: 0.344 to 0.681
The null hypothesis of \(p_0=0.6\) is also not rejected, as the \(p\)-value of 0.315 is well above \(\alpha=0.05\).
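To see where the exact two-sided \(p\)-value comes from, sum the null probabilities of every outcome that is no more likely than the observed count. This is the rule binom.test() applies. The following is a sketch under that assumption, with illustrative variable names, not the lessR source code.

```r
# Two-sided exact p-value for 19 successes in 37 trials under p0 = 0.6
n  <- 37
k  <- 19
p0 <- 0.6

probs <- dbinom(0:n, n, p0)   # null probability of each possible count 0..n
d_obs <- dbinom(k, n, p0)     # null probability of the observed count

# Sum the probabilities of all outcomes no more likely than the observed one;
# the small relative tolerance guards against floating-point ties
p_value <- sum(probs[probs <= d_obs * (1 + 1e-7)])
print(round(p_value, 3))

# Cross-check against binom.test()
print(all.equal(p_value, binom.test(k, n, p = p0)$p.value))
```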
The next example is the same as the one in the documentation for the base R function prop.test(). Prop_test() gives the same result because it relies upon that base R function to compare proportions across different groups. To indicate multiple proportions across groups when entering counts directly, provide multiple values for the n_succ and n_tot parameters.
The null hypothesis is that the four populations of patients from which the samples were drawn have the same population proportion of smokers. The alternative is that at least one population proportion is different.
smokers <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
##
## >>> Description
##
## 1 2 3 4
## ----------- ------ ------ ------ ------
## n_ 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
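Since this analysis relies on prop.test(), the same result is available from base R directly. The following is a minimal sketch of that cross-check; the object name pt is illustrative.

```r
# Test of equal proportions across four groups with base R
smokers  <- c(83, 90, 129, 70)   # successes in each group
patients <- c(86, 93, 136, 82)   # trials in each group

pt <- prop.test(smokers, patients)

chi_sq <- round(unname(pt$statistic), 3)  # chi-square statistic
df     <- unname(pt$parameter)            # degrees of freedom
p_val  <- round(pt$p.value, 3)            # p-value
props  <- round(unname(pt$estimate), 3)   # sample proportions per group

print(chi_sq)
print(df)
print(p_val)
print(props)
```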
You can also label the groups in the output by providing a named vector for the successes.
smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
##
## >>> Description
##
## Group1 Group2 Group3 Group4
## ----------- ------- ------- ------- -------
## n_ 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
Here, reproduce these results from data. First create the data frame d according to the counts of smokers and non-smokers in each group.
sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)
Examine the first six rows and last six rows of the data frame d. The variable of interest is sm, with values “smoke” and “nosmoke”.
head(d)
## sm grp
## 1 smoke A
## 2 smoke A
## 3 smoke A
## 4 smoke A
## 5 smoke A
## 6 smoke A
tail(d)
## sm grp
## 392 nosmoke D
## 393 nosmoke D
## 394 nosmoke D
## 395 nosmoke D
## 396 nosmoke D
## 397 nosmoke D
To indicate a comparison across groups, retain the format for a single proportion, providing a categorical variable of interest. Define a success by the value “smoke”. What is added for this analysis is a grouping variable that contains a label identifying the corresponding group for each row, which indicates the comparison across the four groups. Specify the grouping variable with the by parameter. The grouping variable in this example is grp, with values the first four uppercase letters of the alphabet.
The relevant parameters variable, success, and by are listed in their given order in this example, so the parameter names are not necessary. They are listed here for completeness.
Prop_test(variable=sm, success="smoke", by=grp)
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
## Variable: sm
## success: smoke
## by: grp
##
## >>> Description
##
## A B C D
## ----------- ------ ------ ------ ------
## n_smoke 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
The analysis, of course, provides the same results as providing the counts directly.
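The step from raw data to the counts that the test needs can be sketched in base R as a cross-tabulation followed by prop.test(). This is only an illustration of the idea, not the internal lessR code; the names tab, counts, and pt are hypothetical.

```r
# Rebuild the raw data: one row per patient
sm <- c(rep("smoke", 83),  rep("nosmoke", 3),
        rep("smoke", 90),  rep("nosmoke", 3),
        rep("smoke", 129), rep("nosmoke", 7),
        rep("smoke", 70),  rep("nosmoke", 12))
grp <- c(rep("A", 86), rep("B", 93), rep("C", 136), rep("D", 82))

# Cross-tabulate group by smoking status
tab <- table(grp, sm)

# prop.test() expects successes in column 1 and failures in column 2,
# so put "smoke" first (table() orders the columns alphabetically)
counts <- tab[, c("smoke", "nosmoke")]
print(counts)

pt <- prop.test(counts)
print(round(unname(pt$statistic), 3))
print(round(pt$p.value, 3))
```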
For the goodness-of-fit test to a uniform distribution, provide the frequencies for the five cells as the value of the parameter n_tot. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.
x = c(5,6,4,6,15)
Prop_test(n_tot=x)
##
## >>> Chi-squared test for given probabilities <<<
##
##
## >>> Description
##
## 1 2 3 4 5
## --------- ------- ------- ------- ------- ------
## observed 5 6 4 6 15
## expected 7.200 7.200 7.200 7.200 7.200
## residual -0.820 -0.447 -1.193 -0.447 2.907
## stdn res -0.917 -0.500 -1.333 -0.500 3.250
##
## >>> Inference
##
## Chi-square statistic: 10.944
## Degrees of freedom: 4
## Hypothesis test of equal population proportions: p-value = 0.027
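The rows of the Description table and the chi-square statistic can be recomputed by hand from the observed frequencies. The following sketch assumes the usual Pearson formulas, with variable names chosen for illustration, and cross-checks the result against the base R function chisq.test().

```r
# Goodness-of-fit to a uniform distribution, computed step by step
x <- c(5, 6, 4, 6, 15)          # observed frequencies
k <- length(x)                  # number of cells

expected <- rep(sum(x) / k, k)  # equal expected frequency in each cell
residual <- (x - expected) / sqrt(expected)                # Pearson residuals
stdn_res <- (x - expected) / sqrt(expected * (1 - 1 / k))  # standardized residuals

chi_sq  <- sum(residual^2)      # chi-square statistic
df      <- k - 1                # degrees of freedom
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(residual, 3))
print(round(stdn_res, 3))
print(round(c(chi_sq, p_value), 3))

# Cross-check against chisq.test()
ct <- chisq.test(x)
print(all.equal(chi_sq, unname(ct$statistic)))
```

The large positive residual for the fifth cell shows that it contributes most of the chi-square statistic.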
Make the value of the n_tot parameter a named vector to label the output accordingly.
x = c(5,6,4,6,15)
names(x) = c("ACCT", "ADMN", "FINC","MKTG","SALE")
Prop_test(n_tot=x)
##
## >>> Chi-squared test for given probabilities <<<
##
##
## >>> Description
##
## ACCT ADMN FINC MKTG SALE
## --------- ------- ------- ------- ------- ------
## observed 5 6 4 6 15
## expected 7.200 7.200 7.200 7.200 7.200
## residual -0.820 -0.447 -1.193 -0.447 2.907
## stdn res -0.917 -0.500 -1.333 -0.500 3.250
##
## >>> Inference
##
## Chi-square statistic: 10.944
## Degrees of freedom: 4
## Hypothesis test of equal population proportions: p-value = 0.027
Next, run the same analysis from the data.
d <- Read("Employee", quiet=TRUE)
Prop_test(Dept)
##
## >>> Chi-squared test for given probabilities <<<
##
## Variable: Dept
##
## >>> Description
##
## ACCT ADMN FINC MKTG SALE
## --------- ------- ------- ------- ------- ------
## observed 5 6 4 6 15
## expected 7.200 7.200 7.200 7.200 7.200
## residual -0.820 -0.447 -1.193 -0.447 2.907
## stdn res -0.917 -0.500 -1.333 -0.500 3.250
##
## >>> Inference
##
## Chi-square statistic: 10.944
## Degrees of freedom: 4
## Hypothesis test of equal population proportions: p-value = 0.027
To do the \(\chi^2\) test of independence, specify two categorical variables. The first variable listed in this example is the value of the parameter variable, so it does not need the parameter name. The second variable listed must include the parameter name by.
Prop_test(Dept, by=Gender)
##
## >>> Pearson's Chi-squared test <<<
##
## Variable: Dept
## by: Gender
##
## >>> Description
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## F 3 4 1 5 5 18
## M 2 2 3 1 10 18
## Sum 5 6 4 6 15 36
##
## Cramer's V: 0.415
##
## >>> Inference
##
## Chi-square statistic: 6.200
## Degrees of freedom: 4
## Hypothesis test of independence: p-value = 0.185
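As a cross-check, the counts from the cross-tabulation above can be passed to the base R function chisq.test(). Cramer's V is computed here with the standard formula \(\sqrt{\chi^2 / (n(\min(r,c)-1))}\), which is assumed rather than taken from the lessR source. Because several expected counts are below 5, base R warns that the chi-square approximation may be inaccurate; the warning is suppressed in this sketch.

```r
# Gender by Dept counts copied from the output above
tab <- matrix(c(3, 4, 1, 5, 5,
                2, 2, 3, 1, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("F", "M"),
                              Dept = c("ACCT", "ADMN", "FINC", "MKTG", "SALE")))

# Pearson chi-square test of independence (no continuity correction
# is applied for tables larger than 2 x 2)
ct <- suppressWarnings(chisq.test(tab))

chi_sq <- unname(ct$statistic)
n      <- sum(tab)
# Cramer's V: effect size scaled by sample size and the smaller table dimension
cramers_v <- sqrt(chi_sq / (n * (min(dim(tab)) - 1)))

print(round(chi_sq, 3))
print(unname(ct$parameter))
print(round(ct$p.value, 3))
print(round(cramers_v, 3))
```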