To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
The SciencesPo package provides algorithms and functions for analyzing political behavior data, including measures of political fragmentation and seat allocation. It also offers built-in functions for descriptive statistics, tests, and tables, as well as preset publication-ready plots and themes for ggplot2 that require only minimal fiddling with sizes, labels, and the like. The package is intended to give students and postdocs an easy way of making the most of their data, usually small datasets. Although this package is available to the general public, it meets my personal needs and tastes. Yours may be different.
You can find the package source on GitHub, and you are welcome to contribute code via pull requests or to file feature requests and bug reports via GitHub issues.
This vignette builds on the examples scattered through the package’s manual available on https://cran.r-project.org. This first example walks through setting up SciencesPo for conducting basic data analysis.
library(SciencesPo)
## Do things
detach("package:SciencesPo", unload=TRUE)
#You can also use the unloadNamespace command,
unloadNamespace("SciencesPo")
Here are some examples of searching the documentation with help.search(); you can also use ?? to search for a string.
help.search('bar.plot')
help.search('twoway', package = 'SciencesPo')
## No vignettes or demos or help files found with alias or concept or
## title matching 'twoway' using fuzzy matching.
To see the existing vignettes, type:
vignette(package = "SciencesPo")
To see the collection of data included in the package, type:
data(package = "SciencesPo")
Whenever you load the package, it sets up its own environment, including the plotting theme. Thus, some objects may be printed or plotted differently from what you would see in a plain R session. If you do not want the package’s default settings, you can use getOption() to see what the defaults are and change them accordingly.
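As a minimal sketch of that inspect-and-restore pattern, the snippet below uses the base R option "digits" purely as a stand-in; the names of the options that SciencesPo itself sets are not listed here, so substitute them as appropriate.

```r
# Inspect a current default ("digits" is a base R option, used only as an illustration)
getOption("digits")

# options() returns the previous values, which makes restoring them easy
old <- options(digits = 4)
print(pi)

# Restore the setting that was active before the change
options(old)
```

Saving the value returned by options() is a convenient way to undo a change without having to remember the original setting.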
This function performs three types of skewness tests:
## [1] -0.228936
## [1] -1.171291
## [1] -0.2292801
## [1] -1.171145
## [1] -0.2285927
## [1] -1.174947
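The call that produced the output above is not shown. As a rough sketch only, and assuming the three types follow the common g1 / G1 / b1 convention for sample skewness (which this vignette does not confirm), the estimators can be written in base R as follows. The function name skewness_sketch is hypothetical and not part of the package.

```r
# Hypothetical helper: three conventional sample skewness estimators
skewness_sketch <- function(x, type = 1) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)            # second central moment
  m3 <- mean(d^3)            # third central moment
  g1 <- m3 / m2^1.5          # type 1: moment-based estimator
  switch(as.character(type),
         "1" = g1,
         "2" = g1 * sqrt(n * (n - 1)) / (n - 2),  # type 2: bias-adjusted version
         "3" = g1 * ((n - 1) / n)^1.5,            # type 3: uses the sample sd
         stop("type must be 1, 2, or 3"))
}

x <- c(1, 2, 3, 10)          # a small right-skewed example
sapply(1:3, function(t) skewness_sketch(x, type = t))
```

For large samples the three values are nearly identical, which is consistent with the small differences in the output above.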
To demonstrate the output of these functions, below are the ages of the Presidents of the United States at the time of their inaugurations.
pres =c(42,43,46,46,47,48,49,49,50,51,51,51,51,51,52,52,54,54,54,54,54,55,55,55,55,56,56,56,57,57,57,57,58,60,61,61,61,62,64,64,65,68,69)
ci(pres, level=.95) # confidence interval
## CI Lower Est. Mean CI Upper Std. Error
## 52.9236 54.8372 56.7508 0.9482
ci(pres, level=.95)@mean # extract the estimated mean only
## [1] 54.83721
se(pres) # std. error
## [1] 0.9482339
## [1] 4.771
## [1] 54.83721
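The interval and standard error reported by ci() and se() above are consistent with the usual t-based formulas. The following base R sketch reproduces them by hand; it is an illustration of the arithmetic, not the package's implementation.

```r
pres <- c(42,43,46,46,47,48,49,49,50,51,51,51,51,51,52,52,54,54,54,54,54,
          55,55,55,55,56,56,56,57,57,57,57,58,60,61,61,61,62,64,64,65,68,69)

n       <- length(pres)
m       <- mean(pres)
se_mean <- sd(pres) / sqrt(n)          # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)       # two-sided 95% critical value

ci_sketch <- c(lower = m - t_crit * se_mean,
               mean  = m,
               upper = m + t_crit * se_mean,
               se    = se_mean)
round(ci_sketch, 4)
```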
safe.chars
In R versions before 4.0.0, character columns are converted to factors by default when a data.frame is created or read. Instead of re-reading the data, the safe.chars function identifies which columns are currently factors and converts them all to character, using the levels as strings.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
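The same transformation can be sketched in base R. The helper name factors_to_chars below is hypothetical and only illustrates the idea; it is not the package's own code.

```r
# Hypothetical helper: convert every factor column of a data.frame to character
factors_to_chars <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))     # which columns are factors?
  df[is_fac] <- lapply(df[is_fac], as.character)  # levels become plain strings
  df
}

iris_chr <- factors_to_chars(iris)
str(iris_chr$Species)
```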
destring
This function converts factor variables to numeric, much like the way Stata does.
require(SciencesPo)
mylevels <- c('Strongly Disagree',
'Disagree',
'Neither',
'Agree',
'Strongly Agree')
myvar <- factor(sample(mylevels[1:5], 10, replace=TRUE))
As we can see, the levels of this vector are stored in alphabetical order, which does not reflect the meaning I have attributed to them:
unclass(myvar) # testing the order
## [1] 2 4 4 1 1 2 3 1 3 1
## attr(,"levels")
## [1] "Disagree" "Neither" "Strongly Agree"
## [4] "Strongly Disagree"
Applying destring to it, we should get a numeric result with the same (un)ordered codes:
destring(myvar)
## [1] 2 4 4 1 1 2 3 1 3 1
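For comparison, here is a base R sketch of the same idea. It assumes nothing about how destring is implemented; it only shows that the integer codes follow the level order, and that declaring the levels explicitly makes the codes reflect the intended scale. The seed is arbitrary.

```r
mylevels <- c("Strongly Disagree", "Disagree", "Neither",
              "Agree", "Strongly Agree")

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
myvar <- factor(sample(mylevels, 10, replace = TRUE))

# Codes follow the alphabetical level order chosen by factor()
as.integer(myvar)

# Declaring the levels explicitly makes the codes follow the intended scale:
# 1 = Strongly Disagree, ..., 5 = Strongly Agree
ordered_var <- factor(as.character(myvar), levels = mylevels)
as.integer(ordered_var)
```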
rounded
It’s rather common to report numbers without leading zeros. The rounded function does just that. Isn’t it fancy?
(x = seq(0, 1, by=.1))
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
rounded(x)
## [1] .00 .10 .20 .30 .40 .50 .60 .70 .80 .90 1.00
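A similar result can be sketched in base R by formatting to a fixed number of decimals and stripping the leading zero. The helper name drop_leading_zero is hypothetical, and it returns character strings, which may differ from what rounded returns.

```r
# Hypothetical helper: fixed decimals, leading zero removed
drop_leading_zero <- function(x, digits = 2) {
  s <- formatC(x, digits = digits, format = "f")  # e.g. "0.30", "1.00", "-0.50"
  sub("^(-?)0\\.", "\\1.", s)                     # "0.30" -> ".30", "-0.50" -> "-.50"
}

drop_leading_zero(seq(0, 1, by = .1))
```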
There is a specific document covering one-way, two-way, and multiway tabulations with accompanying tests of independence.
To tabulate one variable’s responses, simply:
CrossTabs(titanic$SURVIVED)
A more detailed descriptive output can be obtained with the Frequency command, which resembles the SPSS output.
Frequency(titanic, SURVIVED)
Frequency Valid Percent Cum Percent
no 1490 67.7 67.7
yes 711 32.3 100.0
--------------------------------------------------------
Total 2201
Warning: Statistics may not be meaningful for factors!
Mean Std dev
1.323035 0.4677422
Minimum Maximum
1 2
Valid cases Missing cases
2201 0
To add further variables for a cross-tabulation:
crosstable(titanic, SEX, CLASS, SURVIVED)
##
## =================================
## SURVIVED
## -------------
## SEX CLASS no yes Total
## ---------------------------------
## female 1st 4 118 122
## 3.28% 97% 100%
## 2nd 13 154 167
## 7.78% 92% 100%
## 3rd 106 422 528
## 20.08% 80% 100%
## crew 3 670 673
## 0.45% 100% 100%
## --------------------------
## Total 126 1364 1490
## 8.46% 92% 100%
## ---------------------------------
## male 1st 141 62 203
## 69.5% 31% 100%
## 2nd 93 25 118
## 78.8% 21% 100%
## 3rd 90 88 178
## 50.6% 49% 100%
## crew 20 192 212
## 9.4% 91% 100%
## --------------------------
## Total 344 367 711
## 48.4% 52% 100%
## ---------------------------------
## Total 1st 145 180 325
## 44.6% 55% 100%
## 2nd 106 179 285
## 37.2% 63% 100%
## 3rd 196 510 706
## 27.8% 72% 100%
## crew 23 862 885
## 2.6% 97% 100%
## --------------------------
## Total 470 1731 2201
## 21.4% 79% 100%
## =================================
##
## SEX : female
##
## Number of cases in table: 1490
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 152.19, df = 3, p-value = 8.86e-33
## X^2 df P(> X^2)
## Likelihood Ratio 168.99 3 0
## Pearson 152.19 3 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.304
## Cramer's V : 0.32
##
##
## SEX : male
##
## Number of cases in table: 711
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 208.97, df = 3, p-value = 4.851e-45
## X^2 df P(> X^2)
## Likelihood Ratio 233.97 3 0
## Pearson 208.97 3 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.477
## Cramer's V : 0.542
##
##
## Total
##
## Number of cases in table: 2201
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 349.9, df = 3, p-value = 1.557e-75
## X^2 df P(> X^2)
## Likelihood Ratio 412.60 3 0
## Pearson 349.91 3 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.37
## Cramer's V : 0.399
To drop table entries that are less relevant, set the corresponding argument to FALSE; set it to TRUE to add entries that are relevant. For instance, the chisq argument controls the chi-square test of independence: set it to TRUE to compute the test.
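The test statistics in the "Total" panel above can be checked with base R. The sketch below re-enters the counts from that panel by hand and uses chisq.test() together with the textbook formulas for Cramér's V and the contingency coefficient; it is not the package's internal code.

```r
# Counts copied from the "Total" panel of the cross-tabulation above
tab <- matrix(c(145, 180,
                106, 179,
                196, 510,
                 23, 862),
              ncol = 2, byrow = TRUE,
              dimnames = list(CLASS = c("1st", "2nd", "3rd", "crew"),
                              SURVIVED = c("no", "yes")))

res  <- chisq.test(tab)            # Pearson chi-square test of independence
chi2 <- unname(res$statistic)
n    <- sum(tab)

cramers_v   <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))
contingency <- sqrt(chi2 / (chi2 + n))

round(c(chisq = chi2, cramers_v = cramers_v, contingency = contingency), 3)
```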
uniform
library("SciencesPo")
# The 1980 presidential election in the US (vote share):
(US1980 <- c("Democratic"=0.410, "Republican"=0.507,
"Independent"=0.066, "Libertarian"=0.011,
"Citizens"=0.003, "Others"=0.003));
## Democratic Republican Independent Libertarian Citizens Others
## 0.410 0.507 0.066 0.011 0.003 0.003
politicalDiversity(US1980); # ENEP (laakso/taagepera) method
## [1] 2.328
politicalDiversity(US1980, index= "golosov");
## [1] 2.597
Consider the following data.frame with electoral results for the 1999 election in Helsinki, where the seats were allocated using both the Sainte-Laguë and the D’Hondt methods:
# Helsinki's 1999
Helsinki <- data.frame(votes = c(68885, 18343, 86448, 21982, 51587,
27227, 8482, 7250, 365, 2734, 1925,
475, 1693, 693, 308, 980, 560, 590, 185),
seats.SL=c(5, 1, 6, 1, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0),
seats.dH=c(5, 1, 7, 1, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0))
# politicalDiversity(Helsinki$votes); #ENEP Votes
politicalDiversity(Helsinki$seats.SL); #ENP for Saint-Lague
[1] 4.762
politicalDiversity(Helsinki$seats.dH); #ENP for D'Hondt
[1] 4.167
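The Laakso/Taagepera values above follow from the formula N = 1 / Σ pᵢ², where pᵢ is each party's share of votes or seats. A minimal base R sketch follows; the function name enp_sketch is hypothetical, and only the default Laakso/Taagepera index is covered.

```r
# Hypothetical helper: Laakso/Taagepera effective number of parties
enp_sketch <- function(x) {
  p <- x / sum(x)   # normalise counts (or shares) so that they sum to one
  1 / sum(p^2)
}

US1980 <- c(Democratic = 0.410, Republican = 0.507, Independent = 0.066,
            Libertarian = 0.011, Citizens = 0.003, Others = 0.003)
seats_SL <- c(5, 1, 6, 1, 4, 2, 1, rep(0, 12))   # Helsinki, Sainte-Laguë
seats_dH <- c(5, 1, 7, 1, 4, 2, rep(0, 13))      # Helsinki, D'Hondt

round(c(US1980 = enp_sketch(US1980),
        SL = enp_sketch(seats_SL),
        dH = enp_sketch(seats_dH)), 3)
```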
Now, using data from the 2014 Brazilian legislative elections, specifically from one district, let’s compare the results from the D’Hondt, Sainte-Laguë, Huntington-Hill, and Imperiali methods.
# Results for the state legislative house of Ceara (2014):
Ceara <- c("PCdoB"=187906, "PDT"=326841,"PEN"=132531, "PMDB"=981096,
"PRB"=2043217,"PSB"=15061, "PSC"=103679, "PSTU"=109830,
"PTdoB"=213988, "PTC"=67145, "PTN"=278267)
The basic inputs for this class of functions are: 1) a list of parties, 2) a list of positive votes, and 3) a constant value for the number of seats to be returned. A numeric value (0~1) for the threshold is optional.
highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh")
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 3.14
Gallagher Index: 47.01
Party Seats %Seats
1 PRB 21 0.500
2 PMDB 10 0.238
3 PDT 3 0.071
4 PTN 2 0.048
5 PTdoB 2 0.048
6 PCdoB 1 0.024
7 PEN 1 0.024
8 PSC 1 0.024
9 PSTU 1 0.024
10 PSB 0 0.000
11 PTC 0 0.000
The d’Hondt method is only one way of allocating seats in party list systems. Other methods include Sainte-Laguë, modified Sainte-Laguë, the Danish version, Imperiali (not to be confused with the Imperiali quota, which is a largest remainder method), and Huntington-Hill.
highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "sl")
Method: Sainte-Laguë
Divisors: 1 3 5 7 ...
ENP: 3.65 (After): 3.74
Gallagher Index: 43.99
Party Seats %Seats
1 PRB 19 0.452
2 PMDB 9 0.214
3 PDT 3 0.071
4 PTN 3 0.071
5 PCdoB 2 0.048
6 PTdoB 2 0.048
7 PEN 1 0.024
8 PSC 1 0.024
9 PSTU 1 0.024
10 PTC 1 0.024
11 PSB 0 0.000
highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "hh")
Method: Hungtinton-Hill
Divisors: 0 1.41 2.45 3.46 ...
ENP: 3.65 (After): 4.05
Gallagher Index: 42.76
Party Seats %Seats
1 PRB 18 0.429
2 PMDB 9 0.214
3 PDT 3 0.071
4 PTN 3 0.071
5 PCdoB 2 0.048
6 PTdoB 2 0.048
7 PEN 1 0.024
8 PSB 1 0.024
9 PSC 1 0.024
10 PSTU 1 0.024
11 PTC 1 0.024
highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "imperiali")
Method: Imperiali
Divisors: 1 1.5 2 2.5 ...
ENP: 3.65 (After): 2.48
Gallagher Index: 52.15
Party Seats %Seats
1 PRB 24 0.571
2 PMDB 11 0.262
3 PDT 3 0.071
4 PTN 2 0.048
5 PCdoB 1 0.024
6 PTdoB 1 0.024
7 PEN 0 0.000
8 PSB 0 0.000
9 PSC 0 0.000
10 PSTU 0 0.000
11 PTC 0 0.000
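To make the mechanics of these highest-averages methods concrete, here is a base R sketch: each party's votes are divided by a series of divisors, and the seats go to the largest quotients. The function name highest_averages_sketch is hypothetical, ties are not handled, and only the d'Hondt (1, 2, 3, ...) and Sainte-Laguë (1, 3, 5, ...) divisor series are shown; this is not the package's implementation.

```r
# Hypothetical helper: generic highest-averages allocation (no tie-breaking)
highest_averages_sketch <- function(votes, seats, divisors) {
  # Quotient matrix: one row per party, one column per successive divisor
  q <- outer(votes, divisors[seq_len(seats)], "/")
  # Linear (column-major) indices of the `seats` largest quotients
  winners <- order(q, decreasing = TRUE)[seq_len(seats)]
  # Map each linear index back to its row, i.e. the winning party
  party_idx <- (winners - 1) %% length(votes) + 1
  alloc <- tabulate(party_idx, nbins = length(votes))
  names(alloc) <- names(votes)
  alloc
}

Ceara <- c("PCdoB"=187906, "PDT"=326841, "PEN"=132531, "PMDB"=981096,
           "PRB"=2043217, "PSB"=15061, "PSC"=103679, "PSTU"=109830,
           "PTdoB"=213988, "PTC"=67145, "PTN"=278267)

dh <- highest_averages_sketch(Ceara, 42, divisors = 1:42)
sl <- highest_averages_sketch(Ceara, 42, divisors = seq(1, by = 2, length.out = 42))
rbind(dHondt = dh, SainteLague = sl)
```

Other divisor series can be passed in the same way, as long as every divisor is positive.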
Let’s assume the electoral system has a 5% vote threshold, meaning that parties must get at least 5% of the total unspoiled votes cast in order to participate in the distribution of seats. Parties PCdoB, PTdoB, PEN, PSC, PSTU, PSB, and PTC would then be eliminated from competition at the outset. If the d’Hondt method of seat allocation were employed, party PRB would get 3 more seats than otherwise, party PMDB 2 more, and party PTN 1 more.
highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh", threshold = 5/100)
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 2.39
Gallagher Index: 53.23
Party Seats %Seats
1 PRB 24 0.571
2 PMDB 12 0.286
3 PDT 3 0.071
4 PTN 3 0.071
5 PCdoB 0 0.000
6 PEN 0 0.000
7 PSB 0 0.000
8 PSC 0 0.000
9 PSTU 0 0.000
10 PTC 0 0.000
11 PTdoB 0 0.000
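Which parties clear the threshold can be checked directly from the vote shares. This is a base R sketch of that filtering step only; how the package applies the threshold internally is not shown here.

```r
Ceara <- c("PCdoB"=187906, "PDT"=326841, "PEN"=132531, "PMDB"=981096,
           "PRB"=2043217, "PSB"=15061, "PSC"=103679, "PSTU"=109830,
           "PTdoB"=213988, "PTC"=67145, "PTN"=278267)

shares   <- Ceara / sum(Ceara)        # vote share of each party
eligible <- Ceara[shares >= 5/100]    # parties at or above the 5% threshold

round(shares, 4)
names(eligible)
```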
Other methods divide the votes by a mathematically derived quota, such as the Droop quota, the Hare quota (or Hamilton/Vinton), or the Imperiali quota; see the calls below.
largestRemainders(parties=names(Ceara), votes=Ceara,
seats = 42, method = "hare")
largestRemainders(parties=names(Ceara), votes=Ceara,
seats = 42, method = "droop")
# The 1946 Italian Constituent Assembly election results: parties and unspoilt votes
Italy = data.frame(party=c("DC", "PSIUP", "PCI", "UDN", "UQ", "PRI",
"BNL", "PdA", "MIS", "PCd'I", "CDR",
"PSd'Az", "MUI", "PCS", "PDL", "FDPR"),
votes=c(8101004, 4758129, 4356686, 1560638, 1211956,
1003007, 637328, 334748, 171201, 102393,
97690, 78554, 71021, 51088, 40633, 21853))
with(Italy, largestRemainders(parties=party, votes=votes,
seats = 556, method = "imperiali.q") )
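As a sketch of how quota-based allocation works, the base R function below uses the textbook quota definitions (Hare = V/S, Droop = floor(V/(S+1)) + 1, Imperiali = V/(S+2)); whether the package uses exactly these variants is an assumption. The function name largest_remainders_sketch and the six-party example are hypothetical, and ties are not handled.

```r
# Hypothetical helper: largest-remainder allocation with textbook quotas
largest_remainders_sketch <- function(votes, seats,
                                      quota = c("hare", "droop", "imperiali")) {
  quota <- match.arg(quota)
  total <- sum(votes)
  q <- switch(quota,
              hare      = total / seats,
              droop     = floor(total / (seats + 1)) + 1,
              imperiali = total / (seats + 2))
  ratio <- votes / q
  alloc <- floor(ratio)              # seats won by full quotas
  left  <- seats - sum(alloc)        # seats still to be handed out
  if (left > 0) {
    # Remaining seats go to the parties with the largest fractional remainders
    top <- order(ratio - alloc, decreasing = TRUE)[seq_len(left)]
    alloc[top] <- alloc[top] + 1
  }
  alloc  # note: the Imperiali quota can hand out more full quotas than seats
}

# Hypothetical illustrative data: six parties, 100,000 votes, 10 seats
votes <- c(A = 47000, B = 16000, C = 15800, D = 12000, E = 6100, F = 3100)
hare  <- largest_remainders_sketch(votes, 10, "hare")
droop <- largest_remainders_sketch(votes, 10, "droop")
rbind(hare, droop)
```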
The output produced by the highestAverages() and largestRemainders() functions is always a data.frame; therefore, it is straightforward to use with other applications. For instance, I like the idea of using the output with the knitr package to produce publication-quality tables, or with ggplot2 to produce graphs.
mytable = highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh")
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 3.14
Gallagher Index: 47.01
library(knitr)
kable(mytable, align=c("l","c","c"))
Party | Seats | %Seats |
---|---|---|
PRB | 21 | 0.500 |
PMDB | 10 | 0.238 |
PDT | 3 | 0.071 |
PTN | 2 | 0.048 |
PTdoB | 2 | 0.048 |
PCdoB | 1 | 0.024 |
PEN | 1 | 0.024 |
PSC | 1 | 0.024 |
PSTU | 1 | 0.024 |
PSB | 0 | 0.000 |
PTC | 0 | 0.000 |
mytable = highestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh")
## Method: d'Hondt
## Divisors: 1 2 3 4 ...
## ENP: 3.65 (After): 3.14
## Gallagher Index: 47.01
##
p <- ggplot(mytable, aes(x=reorder(Party, Seats), y=Seats)) +
geom_bar(position="dodge", stat = "identity") +
coord_flip() + labs(x="", y="# Seats")
p + theme_grey()
The default ggplot2 design has its charm, in my opinion. But very often I don’t like the gray background grid, particularly when I’m preparing academic papers, because I feel it distracts from the data. For example, see the following plot drawn with the ggplot2 defaults:
detach("package:SciencesPo")
ggplot(mtcars, aes(mpg, disp, color = factor(carb), size = hp)) +
  geom_point(alpha = 0.7) +
  labs(title = "Bubble Plot") +
  scale_size_continuous(range = c(3, 10))
qplot(1:3, 1:3)
I prefer a clean layout for publication. Starting with a simple layout is better because we can add things as we need them rather than taking them away. Thus, by default, theme_pub prints a plot without background and minor grid lines. Also, if a legend is needed, it will appear underneath the plot rather than on the right side.
require(SciencesPo)
## Loading required package: SciencesPo
## initializing ... done
qplot(1:3, 1:3)
There’s a complete discussion of plot design in the outstanding reference Cookbook for R that may be of interest to you.
If you want to change the theme for an entire session, you can use theme_set(), as in theme_set(theme_gray()), to switch to the default ggplot2 theme for all subsequent plots. Otherwise, you may apply a theme to a single plot without changing the default setup, as in plot + theme_gray().
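The two approaches can be sketched with ggplot2 alone. The plot object p is a placeholder built from mtcars, and theme_bw() is used only as an example of a per-plot theme.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, disp)) + geom_point()

# Session-wide: theme_set() returns the previous theme, so it can be restored later
old_theme <- theme_set(theme_gray())
# ... every plot printed here uses theme_gray() ...
theme_set(old_theme)

# Per plot: add a theme to one plot and leave the session default untouched
p_bw <- p + theme_bw()
```

Keeping the value returned by theme_set() makes it easy to switch back to whichever theme was active before, including the one the package installs on load.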
To modify general aspects of theme_pub(), such as font size and font family:
The default font family used in the theme is “Helvetica”, but it is easy to change to another style:
Modify it with theme()
You can put the legend inside the plot area using:
line_plot + theme(legend.justification=c(1,0), legend.position=c(1,0))
This positions the legend inside the plot area, at the bottom-right. You can also put the legend at the top or bottom of the plot using, e.g.:
line_plot + theme(legend.position="bottom")
## List of 10
## $ family : chr ""
## $ face : chr "plain"
## $ colour : chr "red"
## $ size : num 11
## $ hjust : num 0.5
## $ vjust : num 0.5
## $ angle : num 0
## $ lineheight: num 0.9
## $ margin :Classes 'margin', 'unit' atomic [1:4] 0 0 0 0
## .. ..- attr(*, "unit")= chr "pt"
## .. ..- attr(*, "valid.unit")= int 8
## $ debug : logi FALSE
## - attr(*, "class")= chr [1:2] "element_text" "element"
## List of 10
## $ family : NULL
## $ face : NULL
## $ colour : chr "red"
## $ size : NULL
## $ hjust : NULL
## $ vjust : NULL
## $ angle : NULL
## $ lineheight: NULL
## $ margin : NULL
## $ debug : NULL
## - attr(*, "class")= chr [1:2] "element_text" "element"
By default, SciencesPo disables grid lines on the plot. In many cases, this is the cleanest and most elegant way to display the data. However, grid lines are sometimes useful, so SciencesPo provides a simple way of adding them via the function background_grid():
plot.mpg + background_grid(major = "xy", minor = "none")
While the same result could be obtained using the function theme(), the function background_grid() makes the most commonly used options easily accessible. See the reference documentation for details.
Finally, the draw_plot() function also allows us to place graphs at arbitrary locations and at arbitrary sizes on the canvas. This is useful for combining subplots into a layout that is not a simple grid, e.g. with one subplot spanning the entire width of the figure and two other subplots each using half of the figure width:
plot.iris <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point() + facet_grid(. ~ Species) + stat_smooth(method = "lm") +
background_grid(major = 'y', minor = "none") + # add thin horizontal lines
panel_border() # and a border around each panel
# plot.mpg and plot.diamonds were defined earlier
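ggdraw() +
  draw_plot(plot.iris, 0, .5, 1, .5) +        # top row, full width
  draw_plot(plot.mpg, 0, 0, .5, .5) +         # bottom left, half width
  draw_plot(plot.diamonds, .5, 0, .5, .5) +   # bottom right, half width
  draw_plot_label(c("A", "B", "C"), c(0, 0, 0.5), c(1, 0.5, 0.5), size = 15)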
The functions background_grid() and panel_border() are convenience functions defined by SciencesPo to save some typing when manipulating the background grid and panel border.
The function geom_foot() can be used to add text as a plot footnote. To demonstrate its use, we first draw a plot, then add the footnote:
theme_set(theme_pub())
# Generating a ratio winner/opponent measure
Presidents = transform(Presidents,
height_ratio = winner.height/opponent.height)
# Avoid missing data
Presidents <- subset(Presidents, !is.na(height_ratio))
fit=lm(winner.vote~height_ratio,data=Presidents)
mylabel=lm2eqn("Presidents","height_ratio","winner.vote")
p1 <- ggplot(Presidents, aes(x=height_ratio, y=winner.vote)) +
geom_smooth(method=lm, colour="red", fill="gold")+
geom_point(size = 5, alpha = .7) +
annotate(geom = 'text', x = 1.1, y = 70, size = 5, label = mylabel, fontface = 'italic') +
xlim(0.85,1.2) + ylim(25, 70) +
xlab("Winner/Opponent Height Ratio") +
ylab("Relative Support for the Winner")
p1
geom_foot("Draft Analysis, 2015", color = fade("brown1"))