Describe the object, how the goal is to store unchanged, original texts, whose only processing has been to convert their encoding to a common format.
A quanteda corpus can store settings, metadata, document variables, and be indexed. It can be linked to dictionaries, collocation lists, and custom stop words.
quanteda has tools for getting texts into a corpus from a variety of sources:
character
) already loaded into R;a VCorpus
corpus object from the tm package.
url ???
The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexbility with his or her choice of text inputs, as there are almost endless ways to get a vector of texts into R.
If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character vector of 57 US president inaugural speeches called inaugTexts
.
str(inaugTexts) # this gives us some information about the object
#> Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
#> - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
myCorpus <- corpus(inaugTexts) # build the corpus
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#>
#> Text Types Tokens Sentences
#> text1 594 1431 24
#> text2 90 135 4
#> text3 794 2321 37
#> text4 680 1730 42
#> text5 776 2166 45
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpZwPAV8/Rbuild4ba32d50c1b3/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Mon Jul 13 18:59:18 2015.
#> Notes: .
If we wanted, we could add some document-level variables – what quanteda calls docvars
– to this corpus. We can do this using the R’s substring()
function to extract characters from a name – in this case, the name of the character vector inaugTexts
. This works using our fixed starting and ending positions with substring()
because these names are a very regular format of YYYY-PresidentName
.
docvars(myCorpus, "President") <- substring(names(inaugTexts), 6)
docvars(myCorpus, "Year") <- as.integer(substring(names(inaugTexts), 1, 4))
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#>
#> Text Types Tokens Sentences President Year
#> text1 594 1431 24 Washington 1789
#> text2 90 135 4 Washington 1793
#> text3 794 2321 37 Adams 1797
#> text4 680 1730 42 Jefferson 1801
#> text5 776 2166 45 Jefferson 1805
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpZwPAV8/Rbuild4ba32d50c1b3/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Mon Jul 13 18:59:18 2015.
#> Notes: .
If we wanted to tag each document with additional meta-data not considered a document variable of interest for analysis, but rather something that we need to know as an attribute of the document, we could also add those to our corpus.
metadoc(myCorpus, "language") <- "english"
metadoc(myCorpus, "docsource") <- paste("inaugTexts", 1:ndoc(myCorpus), sep="_")
summary(myCorpus, n=5, showmeta=TRUE)
#> Corpus consisting of 57 documents, showing 5 documents.
#>
#> Text Types Tokens Sentences President Year _language _docsource
#> text1 594 1431 24 Washington 1789 english inaugTexts_1
#> text2 90 135 4 Washington 1793 english inaugTexts_2
#> text3 794 2321 37 Adams 1797 english inaugTexts_3
#> text4 680 1730 42 Jefferson 1801 english inaugTexts_4
#> text5 776 2166 45 Jefferson 1805 english inaugTexts_5
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpZwPAV8/Rbuild4ba32d50c1b3/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Mon Jul 13 18:59:18 2015.
#> Notes: .
The last command, metadoc
, allows you to define your own document meta-data fields. Note that in assiging just the single value of "english"
, R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource
, we used the quanteda function ndoc()
to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow()
and ncol()
.
The +
operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level medata data is also concatenated.
library(quanteda)
mycorpus1 <- corpus(inaugTexts[1:5], note="First five inaug speeches")
mycorpus2 <- corpus(inaugTexts[6:10], note="Next five inaug speeches")
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
#> Corpus consisting of 10 documents.
#>
#> Text Types Tokens Sentences
#> text1 594 1431 24
#> text2 90 135 4
#> text3 794 2321 37
#> text4 680 1730 42
#> text5 776 2166 45
#> text11 521 1177 21
#> text21 518 1211 33
#> text31 980 3378 121
#> text41 1202 4476 131
#> text51 963 2916 74
#>
#> Source: Combination of corpuses mycorpus1 and mycorpus2.
#> Created: Mon Jul 13 18:59:18 2015.
#> Notes: First five inaug speeches Next five inaug speeches.
subset
Coming soon
Coming soon
segment
changeunits
print
summary
ndoc
and nfeature
texts
docvars
metacorpus
metadoc
kwic
Dispersion plots – coming soon.
collocations
tokenize
dfm
Often, texts aren’t available as pre-made R character vectors, and we need to load them from an external source. To do this, we first create a source for the documents, which defines how they are loaded from the source into the corpus. The source may be a character vector, a directory of text files, a zip file, a twitter search, or several external package formats such as tm
’s VCorpus
.
Once a source has been defined, we make a new corpus by calling the corpus
constructor with the source as the first argument. The corpus constructor also accepts arguments which can set some corpus metadata, and define how the document variables are set.
A very common source of files for creating a corpus will be a set of text files found on a local (or remote) directory. To load texts in this way, we first define a source for the directory, and pass this source as an argument to the corpus constructor. We create a directory source by calling the directory
function.
# Basic file import from directory
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')
myCorpus <- corpus(d)
If the document variables are specified in the filenames of the texts, we can read them by setting the docvarsfrom
argument (docvarsfrom = "filenames"
) and specifiying how the filenames are formatted with the sep
argument. For example, if the inaugural address texts were stored on disk in the format Year-President.txt
(e.g. 1973-Nixon.txt
), then we can load them and automatically populate the document variables. The docvarnames
argument sets the names of the document variables — it must be the same length as the parts of the filenames.
# File import reading document variables from filenames
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')
# In this example the format of the filenames is `Year-President.txt`.
# Because there are two variables in the filename, docvarnames must contain two names
myCorpus <- corpus(d, docvarsfrom="filenames", sep="-", docvarnames=c("Year", "President") )
quanteda
provides an interface to retrieve and store data from a twitter search as a corpus object. The REST API query uses the twitteR package, and an API authorization from twitter is required. The process of obtaining this authorization is described in detail here: https://openhatch.org/wiki/Community_Data_Science_Workshops/Twitter_authentication_setup, correct as of October 2014. The twitter API is a commercial service, and rate limits and the data returned are determined by twitter.
Four keys are required, to be passed to quanteda
’s getTweets
source function, in addition to the search query term and the number of results required. The maximum number of results that can be obtained is not exactly identified in the API documentation, but experimentation indicates an upper bound of around 1500 results from a single query, with a frequency limit of one query per minute.
The code below performs authentication and runs a search for the string ‘quantitative’. Many other functions for working with the API are available from the twitteR package. An R interface to the streaming API is also available link.
# These keys are examples and may not work! Get your own key at dev.twitter.com
consumer_key="vRLy03ef6OFAZB7oCL4jA"
consumer_secret="wWF35Lr1raBrPerVHSDyRftv8qB1H7ltV0T3Srb3s"
access_token="1577780816-wVbOZEED8KZs70PwJ2q5ld2w9CcvcZ2kC6gPnAo"
token_secret="IeC6iYlgUK9csWiP524Jb4UNM8RtQmHyetLi9NZrkJA"
tw <- getTweets('quantitative', numResults=20, consumer_key, consumer_secret, access_token, token_secret)
The return value from the above query is a source object which can be passed to quanteda’s corpus constructor, and the document variables are set to correspond with tweet metadata returned by the API.
twCorpus <- corpus(tw)
names(docvars(twCorpus))