Corpus texts should remain unchanged during subsequent analysis and processing. In other words, after loading and encoding, users should be discouraged from modifying the corpus texts as a form of processing, so that the corpus serves as a library and record of the original texts, prior to any downstream processing. This not only aids replication, but also means that a corpus always presents the unmodified texts to which any processing, feature selection, transformation, or sampling may be applied or reapplied, without hard-coding changes made during analysis. The only exception is reshaping the units of text in a corpus, and we will record the details of this reshaping to make it relatively easy to reverse unit changes. Because defining what constitutes a "document" is part of loading texts into a corpus, rather than part of processing, we take a less stringent line on this aspect of changing a corpus.
A corpus should be capable of holding additional objects associated with it, such as dictionaries, stopword lists, and phrase lists. These will be named objects that can be invoked when using (for instance) dfm(). This allows a corpus to contain all of the additional objects that would normally be associated with it, rather than requiring a set of separate, extra-corpus objects.
A tokenized text object, and a dfm object, should have settings that record the processing options applied to the texts or corpus from which they were created. These provide a record of what was done to the text, and where it came from. Examples are tolower, stem, removeTwitter, etc. They also include any objects used in feature selection, such as dictionaries or stopword lists.
A dfm should consist mainly of a (sparse) matrix that can be used for any sort of quantitative analysis. The basic structure will always be documents (or document groups) in rows by features in columns.
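For illustration, a minimal sketch of what this documents-by-features structure looks like in practice, using the inaugCorpus example object described later in this document (exact behaviour may differ slightly across versions):

library(quanteda)
mydfm <- dfm(inaugCorpus)   # documents in rows, features in columns
ndoc(mydfm)                 # number of document rows
nfeature(mydfm)             # number of feature columns
topfeatures(mydfm, 10)      # the ten most frequent features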
Encoding of texts should be done in the corpus, and recorded as meta-data in the corpus. We should be able to detect encodings and suggest (and perform) a conversion when storing texts in a corpus. This encoding should be UTF-8 by default. We will use the tools available in the stringi package to detect and set character encodings, namely stri_enc_detect() and stri_conv(), with reports and suggestions made at the time of corpus creation.
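As an illustration (not part of the package API), encoding detection and conversion with stringi might look like the following; the file name and source encoding are hypothetical:

library(stringi)
rawLines <- readLines("manifesto_latin1.txt", warn = FALSE)
stri_enc_detect(paste(rawLines, collapse = "\n"))[[1]]   # candidate encodings with confidence scores
utf8Lines <- stri_conv(rawLines, from = "ISO-8859-1", to = "UTF-8")   # convert to UTF-8 for storage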
Corpus construction and management. These functions operate on a corpus and either return a corpus or report on one (a short sketch follows this list).
changeunits
corpus
docnames, docnames<-
docvars, docvars<-
encoding, encoding<-
language, language<-
metacorpus, metacorpus<-, metadoc, metadoc<-
ndoc
ntoken
segment # also works on character vectors
settings, settings<-
subset
summary
textfile
texts
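As promised above, a short sketch of these functions in use, based on the ukimmigTexts example object listed later in this document (argument names are indicative and may differ slightly by version):

ukCorpus <- corpus(ukimmigTexts)                  # create a corpus from a character vector
ndoc(ukCorpus)                                    # how many documents?
docnames(ukCorpus)                                # document names taken from the vector's names
docvars(ukCorpus, "party") <- docnames(ukCorpus)  # attach a document variable
summary(ukCorpus, n = 5)                          # describe the first few documents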
Text manipulation. These operate on character vectors, and return character vectors.
Operations on character vectors or the character vector of texts from a corpus.
Returns a list of character vectors:
tokenize
Returns a character vector:
phrasetotoken
Returns a collocations object:
collocations
Returns screen output and a data.frame:
kwic
Counts the number of tokens:
ntoken
Operations on character vectors of tokens only, returning a character vector of tokens:
syllables
wordstem
Operations on character vectors of tokens, but also dfm objects and collocations:
removeFeatures
Operations that currently work on character vectors of any size, but that will be folded into the Workflow Step 2 functions (see below) as part of tokenize:
bigrams
ngrams
clean
dfm construction and manipulation. (A short sketch follows this list.)
dfm # also works directly on (the texts of) a corpus
convert
docfreq
docnames
features
lexdiv
ndoc
ntoken
plot
print, show
removeFeatures
similarity
sort
textmodel, textmodel_*
topfeatures
trim
weight
settings
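A brief, hedged sketch of a few of these manipulations; argument names such as minCount, minDoc, and the weight type are indicative and should be checked against the current documentation:

mydfm <- dfm(inaugCorpus)
docfreq(mydfm)[1:5]                                  # document frequencies of the first few features
topfeatures(mydfm, 10)                               # ten most frequent features
smallDfm <- trim(mydfm, minCount = 5, minDoc = 3)    # drop features that are very rare
relDfm   <- weight(smallDfm, "relFreq")              # reweight counts as relative frequencies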
Auxiliary functions.
dictionary
stopwords
textfile
Example datasets and objects.
Example data objects:
exampleString # character, length 1
ukimmigTexts # character, length 14
inaugTexts # character, length 57
ie2010Corpus # corpus
inaugCorpus # corpus
LBGexample # dfm
and some built-in objects used by functions:
englishSyllables # named character vector, length 133245
stopwords # named list .stopwords, length 16
Creating the corpus
Reading files, probably using textfile(), then creating a corpus using corpus(), making sure the texts have a common encoding, and adding document variables (docvars) and metadata (metadoc and metacorpus).
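A hedged sketch of this step; the directory, file pattern, and document-variable values are hypothetical, and textfile()'s exact arguments may differ:

mycorpus <- corpus(textfile("manifestos/*.txt"))          # read files and build the corpus
encoding(mycorpus)                                        # check the recorded encoding
docvars(mycorpus, "party") <- c("Con", "Lab", "LD")       # illustrative document variables
metadoc(mycorpus, "language") <- "english"                # document-level metadata
metacorpus(mycorpus, "source") <- "hypothetical manifesto files"   # corpus-level metadata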
Defining and delimiting documents
Defining what the "texts" are, for instance using changeunits or grouping.
Suggestion: add a groups= option to texts(), to extract texts from a corpus concatenated by groups of document variables. (This functionality is currently only available through dfm.)
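For example (a sketch only; the groups= option for texts() is the proposal above, not necessarily implemented yet, and changeunits arguments should be checked against the documentation):

sentCorpus <- changeunits(ie2010Corpus, to = "sentences")    # redefine documents as sentences
ndoc(sentCorpus)                                             # far more "documents" than before
partyTexts <- texts(ie2010Corpus, groups = "party")          # proposed: one concatenated text per party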
Defining and delimiting textual features
This step involves defining and extracting the relevant features from each document. tokenize, the main function for this step, identifies instances of defined features ("tokens") and extracts them as vectors. Usually these will consist of words, but they may also consist of bigrams or ngrams: adjacent sequences of words, not separated by punctuation marks or sentence boundaries; including phrasetotoken, for selected word ngrams identified in supplied lists rather than simply all adjacent word pairs or n-sequences.
tokenize returns a new object class of tokenized texts: essentially a list of character vectors, with each element in the list corresponding to a document, and each character vector consisting of the tokens in that document.
By defining the broad class of tokens we wish to extract, in this step we also apply rules that keep or ignore elements such as punctuation and digits, or special aggregations of word and other characters that make up URLs, Twitter tags, or currency-prefixed digits. This will involve adding the following options to tokenize:
removeDigits
removePunct
removeAdditional
removeTwitter
removeURL
By default, tokenize() extracts word tokens, and only removeSeparators is TRUE, meaning that tokenize() will return a list that includes punctuation as tokens. This follows a philosophy of minimal intervention, requiring that additional decisions be made explicit by the user when invoking tokenize(). Note, however, that in the dfm() method described below, we do turn on all of these options except removeTwitter, which is FALSE by default.
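A minimal sketch of tokenization with these options made explicit; the option names follow the list above (the defaults table later in this document uses removeNumbers rather than removeDigits):

toks <- tokenize(ukimmigTexts,
                 removePunct   = TRUE,    # drop punctuation tokens
                 removeDigits  = TRUE,    # drop number tokens
                 removeTwitter = FALSE)   # keep # and @ prefixes
toks[[1]][1:10]                           # first ten tokens of the first document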
For converting to lowercase, it is actually faster to perform this step before tokenization, even though logically it falls under the next workflow step. For efficiency, therefore, toLower() works directly on character vectors (such as the texts of a corpus), so it can be applied before tokenize().
Since the tokenizer we will use may not distinguish the punctuation characters used in constructs such as URLs, email addresses, Twitter handles, or digits prefixed by currency symbols, we will mostly need to use a substitution strategy: replace these with alternative characters prior to tokenization, and then replace the substitutions with the original characters afterwards. This will slow down processing, but will only be active when the user explicitly requests this type of handling. We could offer three possible options here, such as for URLs, consisting of c("ignore", "keep", "remove"): to pretend they do not exist and tokenize come what may, to preserve URLs in their entirety as "tokens", or to remove them completely, respectively.
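An illustrative sketch of this substitution strategy (not the package's implementation): mask URLs with a placeholder token before a naive whitespace tokenization, then restore them afterwards.

library(stringi)
txt    <- "See http://example.com/page for details"
urls   <- unlist(stri_extract_all_regex(txt, "https?://\\S+"))      # pull out the URLs
masked <- stri_replace_all_regex(txt, "https?://\\S+", "ZZURLZZ")   # substitute a placeholder
toks   <- unlist(stri_split_regex(masked, "\\s+"))                  # simple whitespace tokenization
toks[toks == "ZZURLZZ"] <- urls                                     # restore URLs as whole tokens
toks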
Note that defining and delimiting features may also include their parts of speech, meaning we will need to add functionality for POS tagging and extraction in this step.
Further feature selection
Once features have been identified and separated from the texts in the tokenization step, they may be removed from the token lists, or handled as part of dfm construction. Features may be:
ignored: removeFeatures, or the ignoredFeatures option to dfm
kept: removeFeatures, or the keptFeatures option to dfm
stemmed: wordstem, or the stem option to dfm
collapsed: into dictionary keys (dfm option dictionary), or as a supplement to uncollapsed features (dfm option thesaurus)
case-folded: toLower, to treat as equivalent the same word features despite having different cases, by converting all features to lower case
It will sometimes be possible to perform these steps separately from the dfm creation stage, but in most cases they will be performed as options to the dfm function.
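A sketch of this step performed through dfm() options rather than separate calls; the dictionary entries are invented for illustration:

mydfm <- dfm(ie2010Corpus,
             toLower = TRUE,                            # case-fold features
             stem = TRUE,                               # collapse morphological variants
             ignoredFeatures = stopwords("english"))    # drop stopwords

myDict  <- dictionary(list(economy = c("econom*", "budget*"),
                           health  = c("hospital*", "doctor*")))
dictDfm <- dfm(ie2010Corpus, dictionary = myDict)       # collapse features into dictionary keys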
Analysis of the documents and features
From a corpus.
These steps don't necessarily require the processing steps above (a brief sketch follows the list below).
kwic
lexdiv
summary
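For instance (the keyword and window width are illustrative):

kwic(inaugTexts, "terror", window = 3)   # keywords in context, straight from the texts
summary(ie2010Corpus, n = 5)             # document-level summary of a corpus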
From a dfm – after applying dfm to the processed documents and features.
dfm, the Swiss Army knife
Most common use case
In most cases, users will use the default settings to create a dfm straight from a corpus. dfm will combine steps 3–4, even though basic functions will be available to perform these separately. All options shown in steps 3–4 will be available in dfm.
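In other words, the typical call is just the following (mycorpus stands for any corpus object):

mydfm <- dfm(mycorpus)   # defaults: tokenize, lower-case, remove punctuation and numbers, etc.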
If separate steps are desired
We will do our best to ensure that all functions allow piping using the magrittr package, e.g.
mydfm <- texts(mycorpus, groups = "party") %>% toLower %>% tokenize %>% wordstem %>%
    removeFeatures(stopwords("english")) %>% dfm
We recognize, however, that not all sequences will make sense; for instance, wordstem will only work after tokenization. We will try to catch these errors and make the proper sequence clear to users.
The current processing options, their defaults, other allowed values, and the function their value is finally passed to, in order of increasing generality:
| Option | Default | Other values | Function |
|---|---|---|---|
| keepAcronyms | FALSE | TRUE | toLower |
| what | word | sentence, character, fastestword, fasterword | tokenize |
| cleanFirst | TRUE | FALSE | tokenize |
| verbose | FALSE | TRUE | tokenize, dfm |
| toLower | TRUE | FALSE | tokenize, dfm |
| removeNumbers | TRUE | FALSE | tokenize, dfm |
| removePunct | TRUE | FALSE | tokenize, dfm |
| removeSeparators | TRUE | FALSE | tokenize, dfm |
| removeTwitter | TRUE | FALSE | tokenize |
| simplify | FALSE | TRUE | tokenize |
| cores | detect | numeric | tokenize |
| stem | FALSE | TRUE | dfm |
| ignoredFeatures | NULL | stopwords(), character | dfm |
| keptFeatures | NULL | regex | dfm |
| matrixType | sparse | dense | dfm |
| language | english | character | dfm |
| fromCorpus | FALSE | TRUE | dfm |
| bigrams | FALSE | TRUE | dfm |
| include.unigrams | TRUE | FALSE | dfm |
| thesaurus | NULL | list | dfm |
| dictionary | NULL | list | dfm |
| dictionary_regex | FALSE | TRUE | dfm |
| addto | NULL | dfm | dfm |
A dfm object can be created using piping, or in one step:
mydfm <- texts(ie2010Corpus, groups = "party") %>% toLower %>% tokenize %>%
removeFeatures(stopwords("english")) %>% wordstem %>% dfm
# same as:
mydfm2 <- dfm(ie2010Corpus, groups = "party", ignoredFeatures = stopwords("english"), stem = TRUE)
quanteda is in development and will remain so until we declare a 1.0 version, at which time we will only add new functions, not change the names of existing ones. In the meantime, we suggest:
using named formals, e.g. tokenize(mytexts, what = "sentence") instead of tokenize(mytexts, "sentence"), since the argument order is not stable; and
naming formals explicitly rather than relying on current defaults, e.g. tokenize(mytexts, removePunct = FALSE), since the default values are not stable.
All testing should be in tests/testthat/, in files whose names begin with test_.
For performance comparisons, we write up the results and document them in the vignette performance_comparisons.Rmd.
Development and branches: we add new features through workingDev. Before merging this into dev, we make sure the build passes a full CRAN check.
Please use the issue page on the GitHub repository, or contact kbenoit@lse.ac.uk directly.
We always welcome hearing about your experiences (and problems!) in using quanteda, as additional use cases and problems you may encounter help us to make the package more functional and robust.
encoding() to detect encoding, and replace iconv() calls with stringi::stri_encode() in corpus()
tokenize() (based on that package), including:
ntoken()
ntype() and nfeature()
phrasetotoken()
ie2010Corpus (and see if CRAN lets us get away with it)
language()
tokenizedTexts objects:
dfm.tokenizedTexts
removeFeatures.tokenizedTexts
syllables.tokenizedTexts
removeFeatures now much faster, based on fixed binary matches and stringi character classes
readability()
nsentence()
ngrams added as an option to tokenize()
collocations() behaviour changed since the new tokenize()
weight() breaks for dfm objects with all zero counts, for instance those that have been trim()med
dfm objects need a subset method for selecting rows, not just columns
lexdiv(), to make the API similar to readability() and to use data.table
segment(), to make use of the new tokenizer that segments on sentences
collocations code for bigrams and trigrams, and reduce the internal memory usage
corpus.VCorpus() is fully working
dfm documentation needs to group arguments into sections and describe how these correspond to the logical workflow
kwic, to use the new tokenizer and to allow searches for multi-word regular expressions
settings(), and figure out how to add additional objects to a corpus, namely one or more:
similarity()
wordstem(), stopwords(), and syllables()
textmodel: devise and document a consistent, logical, and easy-to-use-and-remember scheme for textmodels
convert() needs substantial work
+ is defined
resample functionality to enable resampling from different text units
index (?) for pre-tokenizing and indexing a corpus