The corpus object is designed to store unchanged, original texts, whose only processing has been to convert their encoding to a common format. A quanteda corpus can store settings, metadata, and document variables, can be indexed, and can be linked to dictionaries, collocation lists, and custom stop words.
quanteda has tools for getting texts into a corpus from a variety of sources: a character vector of texts already loaded into R; a VCorpus corpus object from the tm package; text files in a local or remote directory; or a Twitter search.
The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility with his or her choice of text inputs, as there are almost endless ways to get a vector of texts into R.
If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character vector of 57 US president inaugural speeches called inaugTexts.
str(inaugTexts) # this gives us some information about the object
#> Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
#> - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
myCorpus <- corpus(inaugTexts) # build the corpus
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#>
#> Text Types Tokens Sentences
#> 1789-Washington 594 1429 23
#> 1793-Washington 90 135 4
#> 1797-Adams 794 2318 37
#> 1801-Jefferson 681 1726 41
#> 1805-Jefferson 775 2166 45
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/Rtmpq1RuqS/Rbuild176882407477f/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Tue Apr 7 13:02:37 2015.
#> Notes: .
If we wanted, we could add some document-level variables – what quanteda calls docvars – to this corpus. We can do this using R's substring() function to extract characters from a name – in this case, the name of the character vector inaugTexts. This works with fixed starting and ending positions for substring() because these names follow the very regular format YYYY-PresidentName.
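To illustrate what these calls extract, applied to a single document name:
substring("1789-Washington", 1, 4)  # characters 1-4: the year
#> [1] "1789"
substring("1789-Washington", 6)     # characters 6 onwards: the president's name
#> [1] "Washington"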
docvars(myCorpus, "President") <- substring(names(inaugTexts), 6)
docvars(myCorpus, "Year") <- as.integer(substring(names(inaugTexts), 1, 4))
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#>
#> Text Types Tokens Sentences President Year
#> 1789-Washington 594 1429 23 Washington 1789
#> 1793-Washington 90 135 4 Washington 1793
#> 1797-Adams 794 2318 37 Adams 1797
#> 1801-Jefferson 681 1726 41 Jefferson 1801
#> 1805-Jefferson 775 2166 45 Jefferson 1805
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/Rtmpq1RuqS/Rbuild176882407477f/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Tue Apr 7 13:02:37 2015.
#> Notes: .
If we wanted to tag each document with additional meta-data – not a document variable of interest for analysis, but rather an attribute of the document that we need to record – we could also add that to our corpus.
language(myCorpus) <- "english"
metadoc(myCorpus, "docsource") <- paste("inaugTexts", 1:ndoc(myCorpus), sep="_")
summary(myCorpus, n=5, showmeta=TRUE)
#> Corpus consisting of 57 documents, showing 5 documents.
#>
#> Text Types Tokens Sentences President Year _language
#> 1789-Washington 594 1429 23 Washington 1789 english
#> 1793-Washington 90 135 4 Washington 1793 english
#> 1797-Adams 794 2318 37 Adams 1797 english
#> 1801-Jefferson 681 1726 41 Jefferson 1801 english
#> 1805-Jefferson 775 2166 45 Jefferson 1805 english
#> _docsource
#> inaugTexts_1
#> inaugTexts_2
#> inaugTexts_3
#> inaugTexts_4
#> inaugTexts_5
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/Rtmpq1RuqS/Rbuild176882407477f/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Tue Apr 7 13:02:37 2015.
#> Notes: .
The last command, metadoc, allows you to define your own document meta-data fields. The two metadoc fields language and encoding are so common that quanteda has shortened accessor and replacement functions for manipulating them: encoding() and language(). Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
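As a quick check – a minimal sketch, since the exact print format of the language() accessor may vary by quanteda version – we can retrieve the values just assigned and confirm the document count:
language(myCorpus)[1:3]  # per-document language tags assigned above (output omitted)
ndoc(myCorpus)           # number of documents, analogous to nrow() and ncol()
#> [1] 57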
The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level metadata is also concatenated.
library(quanteda)
mycorpus1 <- corpus(inaugTexts[1:5], note="First five inaug speeches")
mycorpus2 <- corpus(inaugTexts[6:10], note="Next five inaug speeches")
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
#> Corpus consisting of 10 documents.
#>
#> Text Types Tokens Sentences
#> 1789-Washington 594 1429 23
#> 1793-Washington 90 135 4
#> 1797-Adams 794 2318 37
#> 1801-Jefferson 681 1726 41
#> 1805-Jefferson 775 2166 45
#> 1809-Madison 520 1175 21
#> 1813-Madison 518 1210 33
#> 1817-Monroe 980 3370 122
#> 1821-Monroe 1190 4457 131
#> 1825-Adams 962 2915 74
#>
#> Source: Combination of corpuses mycorpus1 and mycorpus2.
#> Created: Tue Apr 7 13:02:38 2015.
#> Notes: First five inaug speeches Next five inaug speeches.
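To see the "stitching" of document variables more directly, here is a minimal sketch; the docvar names Half and Group are invented for illustration, and the expectation – given the no-information-lost guarantee above – is that both appear in the combined corpus, with missing values for the documents that lacked them:
docvars(mycorpus1, "Half")  <- rep("first", ndoc(mycorpus1))   # a docvar present only in mycorpus1
docvars(mycorpus2, "Group") <- rep("second", ndoc(mycorpus2))  # a different docvar, only in mycorpus2
mycombined <- mycorpus1 + mycorpus2
names(docvars(mycombined))  # both "Half" and "Group" should be listed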
The remaining corpus tools – subset, segment, changeunits, print, summary, ndoc and nfeature, texts, docvars, metacorpus, metadoc, kwic, dispersion plots, collocations, tokenize, and dfm – are covered only briefly here; fuller documentation for several of them (including subset, segment and changeunits, and dispersion plots) is coming soon.
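As a small illustration of a few of these – output omitted, and exact argument defaults may vary by quanteda version – applied to the inaugural corpus built earlier:
texts(myCorpus)[1]         # the full text of the first document
head(docvars(myCorpus))    # the data.frame of document variables (President, Year)
metacorpus(myCorpus)       # corpus-level metadata (source, creation date, notes)
kwic(myCorpus, "liberty")  # keywords-in-context for a term across all documents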
Often, texts aren’t available as pre-made R character vectors, and we need to load them from an external source. To do this, we first create a source for the documents, which defines how they are loaded from the source into the corpus. The source may be a character vector, a directory of text files, a zip file, a twitter search, or several external package formats such as tm’s VCorpus.
Once a source has been defined, we make a new corpus by calling the corpus constructor with the source as the first argument. The corpus constructor also accepts arguments which can set some corpus metadata, and define how the document variables are set.
A very common source of files for creating a corpus will be a set of text files found on a local (or remote) directory. To load texts in this way, we first define a source for the directory, and pass this source as an argument to the corpus constructor. We create a directory source by calling the directory() function.
# Basic file import from directory
d <- directory('~/Dropbox/QUANTESS/corpora/inaugural')
myCorpus <- corpus(d)
If the document variables are specified in the filenames of the texts, we can read them by setting the docvarsfrom argument (docvarsfrom = "filenames") and specifying how the filenames are formatted with the sep argument. For example, if the inaugural address texts were stored on disk in the format Year-President.txt (e.g. 1973-Nixon.txt), then we can load them and automatically populate the document variables. The docvarnames argument sets the names of the document variables; it must be the same length as the parts of the filenames.
# File import reading document variables from filenames
d <- directory('~/Dropbox/QUANTESS/corpora/inaugural')
# In this example the format of the filenames is `Year-President.txt`.
# Because there are two variables in the filename, docvarnames must contain two names
myCorpus <- corpus(d, docvarsfrom="filenames", sep="-", docvarnames=c("Year", "President"))
quanteda provides an interface to retrieve and store data from a Twitter search as a corpus object. The REST API query uses the twitteR package, and an API authorization from Twitter is required. The process of obtaining this authorization is described in detail here: https://openhatch.org/wiki/Community_Data_Science_Workshops/Twitter_authentication_setup, correct as of October 2014. The Twitter API is a commercial service, and rate limits and the data returned are determined by Twitter.
Four keys are required, to be passed to quanteda’s getTweets source function, in addition to the search query term and the number of results required. The maximum number of results that can be obtained is not exactly identified in the API documentation, but experimentation indicates an upper bound of around 1500 results from a single query, with a frequency limit of one query per minute.
The code below performs authentication and runs a search for the string ‘quantitative’. Many other functions for working with the API are available from the twitteR package, and an R interface to the streaming API is also available.
# These keys are examples and may not work! Get your own key at dev.twitter.com
consumer_key="vRLy03ef6OFAZB7oCL4jA"
consumer_secret="wWF35Lr1raBrPerVHSDyRftv8qB1H7ltV0T3Srb3s"
access_token="1577780816-wVbOZEED8KZs70PwJ2q5ld2w9CcvcZ2kC6gPnAo"
token_secret="IeC6iYlgUK9csWiP524Jb4UNM8RtQmHyetLi9NZrkJA"
tw <- getTweets('quantitative', numResults=20, consumer_key, consumer_secret, access_token, token_secret)
The return value from the above query is a source object which can be passed to quanteda’s corpus constructor, and the document variables are set to correspond with tweet metadata returned by the API.
twCorpus <- corpus(tw)
names(docvars(twCorpus))
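From here the Twitter corpus behaves like any other quanteda corpus, so – assuming the search above succeeded – the usual tools apply, for example:
summary(twCorpus, n=5)  # summarise the first five tweets and their document variables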