The Design of quanteda

Basic Principles

  1. Corpus texts should remain unchanged during subsequent analysis and processing. In other words, after loading and encoding, we should discourage users from modifying a corpus of texts as a form of processing, so that the corpus can act as a library and record of the original texts, prior to any downstream processing. This not only aids in replication, but also means that a corpus presents the unmodified texts to which any processing, feature selection, transformations, or sampling may be applied or reapplied, without hard-coding any changes made as part of the process of analyzing the texts. The only exception is to reshape the units of text in a corpus, but we will record the details of this reshaping to make it relatively easy to reverse unit changes. Since the definition of a “document” is part of the process of loading texts into a corpus, however, rather than processing, we will take a less stringent line on this aspect of changing a corpus.

  2. A corpus should be capable of holding additional objects that will be associated with the corpus, such as dictionaries, stopword lists, and phrase lists. These will be named objects that can be invoked when using (for instance) dfm(). This allows a corpus to contain all of the additional objects that would normally be associated with it, rather than requiring a set of separate, extra-corpus objects.

  3. A tokenized text object, and a dfm object, should have settings that record the processing options applied to the texts or corpus from which they were created. These provide a record of what was done to the text, and where it came from. Examples are toLower, stem, removeTwitter, etc. They also include any objects used in feature selection, such as dictionaries or stopword lists.

  4. A dfm should consist mainly of a (sparse) matrix that can be used for any sort of quantitative analysis. The basic structure will always be documents (or document groups) in rows by features in columns.

  5. Encoding of texts should be done in the corpus, and recorded as meta-data in the corpus. We should be able to detect encodings and suggest (and perform) a conversion when storing texts in a corpus. This encoding should be UTF-8 by default. We will use the tools available in the stringi package to detect and set character encodings, namely stri_enc_detect() and stri_conv(), with reports and suggestions made at the time of corpus creation.
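
    As an illustration, a minimal sketch of this detect-and-convert step using stringi (the file name and object names are hypothetical, not part of quanteda):

    library(stringi)

    # read raw text whose encoding is unknown (hypothetical file)
    rawtxt <- readLines("speech_unknown_encoding.txt", warn = FALSE)

    # detect candidate encodings, with confidence scores
    detected <- stri_enc_detect(paste(rawtxt, collapse = "\n"))[[1]]
    head(detected)

    # convert from the most likely encoding to UTF-8 before corpus creation
    utf8txt <- stri_conv(rawtxt, from = detected$Encoding[1], to = "UTF-8")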

Major categories of functions

  1. Corpus construction and management. These operate on a corpus, and return a corpus, or report on a corpus.

    changeunits
    corpus
    docnames, <-
    docvars, <-
    encoding, <-
    language, <-
    metacorpus, metadoc, <-
    ndoc
    ntoken
    segment      # also works on character vectors
    settings, <-
    subset
    summary
    textfile
    texts
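
    A brief sketch of how some of these might be used together, assuming the interfaces listed above (object names are illustrative):

    mycorpus <- corpus(inaugTexts)    # construct a corpus from a character vector
    ndoc(mycorpus)                    # number of documents
    head(docnames(mycorpus))          # document names
    summary(mycorpus)                 # summary description of each document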
    
  2. Text manipulation. These operate on character vectors, and return character vectors.

    1. Operations on character vectors or the character vector of texts from a corpus.

      Returns a list of character vectors:

      tokenize
      

      Returns a character vector:

      phrasetotoken
      

      Returns a collocations object:

      collocations
      

      Returns screen output and a data.frame:

      kwic
      

      Counts the number of tokens:

      ntoken
      
    2. Operations on character vectors of tokens only, returning a character vector of tokens:

      syllables
      wordstem
      
    3. Operations on character vectors of tokens, but also dfm objects and collocations:

      removeFeatures
      
    4. Operations that currently work on character vectors of any size, but that will be folded into the Workflow Step 2 functions (see below) as part of tokenize:

      bigrams
      ngrams
      clean
      
  3. dfm construction and manipulation.

    dfm         # also works directly on (the texts of) a corpus
    convert
    docfreq
    docnames
    features
    lexdiv
    ndoc
    ntoken
    plot
    print, show
    removeFeatures
    similarity
    sort
    textmodel, textmodel_*
    topfeatures
    trim
    weight
    settings
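
    A short sketch combining several of these, assuming the interfaces listed above (object names are illustrative):

    mydfm <- dfm(inaugCorpus)                              # create a dfm directly from a corpus
    topfeatures(mydfm, 10)                                 # ten most frequent features
    mydfm <- removeFeatures(mydfm, stopwords("english"))   # drop stopword features
    docfreq(mydfm)[1:5]                                    # document frequencies of the first five features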
    
  4. Auxiliary functions.

    dictionary
    stopwords
    textfile
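
    For example, a stopword list can be retrieved and a dictionary created as follows (the dictionary keys and patterns are purely illustrative):

    head(stopwords("english"))                  # first few English stopwords
    mydict <- dictionary(list(economy = c("tax*", "budget*", "spend*"),
                              europe  = c("EU", "euro*")))
    mydfm <- dfm(ie2010Corpus, dictionary = mydict)   # count dictionary keys as features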
    
  5. Example datasets and objects.

    Example data objects:

    exampleString       # character, length 1
    ukimmigTexts        # character, length 14
    inaugTexts          # character, length 57
    ie2010Corpus        # corpus
    inaugCorpus         # corpus
    LBGexample          # dfm
    

    and some built-in objects used by functions:

    englishSyllables    # named character vector, length 133245
    stopwords           # named list .stopwords, length 16
    

Basic text analysis workflow

Working with a corpus, documents, and features

  1. Creating the corpus

    Reading files, probably using textfile(), then creating a corpus using corpus(), making sure the texts have a common encoding, and adding document variables (docvars) and metadata (metadoc and metacorpus).
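
    A sketch of this step (the file path, variable names, and metadata values are hypothetical):

    mytexts <- textfile("speeches/*.txt")                  # read in a set of text files
    mycorpus <- corpus(mytexts)                            # build the corpus
    docvars(mycorpus, "year") <- 2010                      # attach a document variable
    metadoc(mycorpus, "language") <- "english"             # document-level metadata
    metacorpus(mycorpus, "source") <- "budget speeches"    # corpus-level metadata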

  2. Defining and delimiting documents

    Defining what the “texts” are, for instance using changeunits or grouping.

    Suggestion: add a groups= option to texts(), to extract texts from a corpus concatenated by groups of document variables. (This functionality is currently only available through dfm.)
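
    For instance, redefining the documents as sentences might look like this (a sketch, assuming the changeunits() interface listed earlier):

    sentCorpus <- changeunits(mycorpus, to = "sentences")
    ndoc(sentCorpus)    # now one "document" per sentence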

  3. Defining and delimiting textual features

    This step involves defining and extracting the relevant features from each document. tokenize, the main function for this step, identifies instances of defined features (“tokens”) and extracts them as vectors. Usually these will consist of words, but they may also be other feature types.

    tokenize returns a new object class of tokenized texts, which is essentially a list of character vectors, with each element in the list corresponding to a document and each character vector consisting of the tokens in that document.

    By defining the broad class of tokens we wish to extract, in this step we also apply rules that will keep or ignore elements such as punctuation or digits, or special aggregations of word and other characters that make up URLs, Twitter tags, or currency-prefixed digits. This will involve adding options to tokenize, summarized in the table of processing options below.

    By default, tokenize() extracts word tokens, and only removeSeparators is TRUE, meaning that tokenize() will return a list that includes punctuation as tokens. This follows a philosophy of minimal intervention, one that requires additional decisions to be made explicit by the user when invoking tokenize(). Note, however, that the dfm() method described below turns on all of these options except removeTwitter, which is FALSE by default.
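
    For example, a sketch using the option names above (the input string is arbitrary):

    toks <- tokenize("A $100 fine?  See http://example.com, @user!",
                     removePunct = TRUE, removeNumbers = TRUE)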

    For converting to lowercase, it is actually faster to perform this step before tokenization, although logically it falls under the next workflow step. However, for efficiency, toLower() can be applied directly to the texts (a character vector) before they are tokenized.

    Since the tokenizer we will use may not distinguish the punctuation characters used in constructs such as URLs, email addresses, Twitter handles, or digits prefixed by currency symbols, we will mostly need to use a substitution strategy: replace these with alternative characters prior to tokenization, and then restore the original characters afterwards. This will slow down processing, and will therefore be active only by explicit user request for this type of handling. We could offer three possible options here, for instance for URLs, c("ignore", "keep", "remove"): to pretend they do not exist and tokenize come what may, to preserve URLs in their entirety as “tokens”, or to remove them completely, respectively.

    Note that defining and delimiting features may also include their parts of speech, meaning we will need to add functionality for POS tagging and extraction in this step.

  4. Further feature selection

    Once features have been identified and separated from the texts in the tokenization step, they may be removed from the token lists, or handled as part of dfm construction, for example through stemming or stopword removal.

    It will sometimes be possible to perform these steps separately from the dfm creation stage, but in most cases they will be performed as options to the dfm() function.
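
    A sketch of both approaches (object names are illustrative; mycorpus and toks are assumed to come from the earlier steps):

    toks  <- removeFeatures(toks, stopwords("english"))     # selection on the token lists
    mydfm <- dfm(mycorpus, ignoredFeatures = stopwords("english"), stem = TRUE)   # or as dfm() options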

  5. Analysis of the documents and features

    1. From a corpus.

      These steps don't necessarily require the processing steps above.

      • kwic
      • lexdiv
      • summary
    2. From a dfm, after calling dfm() on the processed documents and features.
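
    For example (a sketch; mycorpus and mydfm as constructed in the earlier steps):

    kwic(mycorpus, "tax")       # keywords-in-context, directly from the corpus
    summary(mycorpus)           # document-level summary, directly from the corpus
    lexdiv(mydfm)               # lexical diversity, from the dfm
    topfeatures(mydfm, 10)      # most frequent features, from the dfm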

dfm, the Swiss Army knife

Overview

  1. Most common use case

    In most cases, users will use the default settings to create a dfm straight from a corpus. dfm will combine steps 3–4, even though basic functions will be available to perform these separately. All options shown in steps 3–4 will be available in dfm.

  2. If separate steps are desired

    We will do our best to ensure that all functions allow piping using the magrittr package, e.g.

    mydfm <- texts(mycorpus, groups = "party") %>% toLower %>% tokenize %>% wordstem %>%
                 removeFeatures(stopwords("english")) %>% dfm
    

    We recognize, however, that not all sequences will make sense (for instance, wordstem will only work after tokenization), and we will try to catch these errors and make the proper sequence clear to users.

Options for processing from corpus to dfm

The current processing options, their defaults, and the function to which their value is finally passed, in order of increasing generality:


Option             Default   Other                                           Function

keepAcronyms       FALSE     TRUE                                            toLower
what               word      sentence, character, fastestword, fasterword    tokenize
cleanFirst         TRUE      FALSE                                           tokenize
verbose            FALSE     TRUE                                            tokenize, dfm
toLower            TRUE      FALSE                                           tokenize, dfm
removeNumbers      TRUE      FALSE                                           tokenize, dfm
removePunct        TRUE      FALSE                                           tokenize, dfm
removeSeparators   TRUE      FALSE                                           tokenize, dfm
removeTwitter      TRUE      FALSE                                           tokenize
simplify           FALSE     TRUE                                            tokenize
cores              detect    numeric                                         tokenize
stem               FALSE     TRUE                                            dfm
ignoredFeatures    NULL      stopwords(), character                          dfm
keptFeatures       NULL      regex                                           dfm
matrixType         sparse    dense                                           dfm
language           english   character                                       dfm
fromCorpus         FALSE     TRUE                                            dfm
bigrams            FALSE     TRUE                                            dfm
include.unigrams   TRUE      FALSE                                           dfm
thesaurus          NULL      list                                            dfm
dictionary         NULL      list                                            dfm
dictionary_regex   FALSE     TRUE                                            dfm
addto              NULL      dfm                                             dfm


dfm creation with ie2010Corpus

A dfm object can be created using piping, or in one step:

mydfm <- texts(ie2010Corpus, groups = "party") %>% toLower %>% tokenize %>% 
             removeFeatures(stopwords("english")) %>% wordstem %>% dfm

# same as:
mydfm2 <- dfm(ie2010Corpus, groups = "party", ignoredFeatures = stopwords("english"), stem = TRUE)

Development Guidance

Suggestions for using quanteda during development

quanteda is in development and will remain so until we declare a 1.0 version, at which time we will only add new functions, not change the names of existing ones. In the meantime, we suggest:

Notes to the quanteda team

  1. All testing should be in tests/testthat/test_.R. No more haphazard tests in other locations.

  2. For performance comparisons, we write up the results and document them in the vignette performance_comparisons.Rmd.

  3. Development and branches: We add new features through workingDev. Before merging this with dev, we make sure the build passes a full CRAN check.

For bug reports and feature requests

Please use the issue page on the GitHub repository, or contact kbenoit@lse.ac.uk directly.

We always welcome hearing about your experiences (and problems!) in using quanteda, as additional use cases and problems you may encounter help us to make the package more functional and robust.

Outstanding Tasks and Priorities

Completed

To Do Remaining