Introducing the ‘polmineR’-package

Andreas Blätte (andreas.blaette@uni-due.de)

2018-05-10

Purpose

The purpose of the package polmineR is to facilitate the interactive analysis of corpora using R. Apart from performance and usability, the key considerations for developing the package are outlined below.

The ‘polmineR’ package supplements R packages that are already widely used for text mining. The CRAN task view for natural language processing is a good place to learn about relevant packages. The polmineR package is intended to be an interface between the Corpus Workbench (CWB), an efficient system for storing and querying large corpora, and existing packages for mining text with advanced statistical methods.

Apart from fast text processing, the CWB comes with the Corpus Query Processor (CQP), whose syntax is a powerful and widely used language for querying corpora. Combining R and the CWB is not a unique idea: it implies a software architecture you will also find in the TXM project or in CQPweb. The ‘polmineR’ package offers a library with the grammar of corpus analysis below the level of a graphical user interface (GUI). It is a toolset to perform simple tasks efficiently as well as to implement complex workflows.

Advanced users will benefit from acquiring a good understanding of the Corpus Workbench. The Corpus Encoding Tutorial is an authoritative text for that. The vignette of the rcqp package includes an excellent explanation of the CWB data model.

The most important thing users need to know is the difference between “s”- and “p”-attributes. The CWB distinguishes structural attributes (s-attributes), which contain the metainformation that can be used to generate subcorpora, from positional attributes (p-attributes), which annotate each token. Typically, the p-attributes will be ‘word’, ‘pos’ (for part-of-speech) and ‘lemma’ (for the lemmatized word form).

Getting started

Check that the CORPUS_REGISTRY environment variable is set

The annex of the vignette includes a detailed explanation of how to install polmineR on Windows, macOS, and Linux. Once you have installed polmineR, check that the environment variable CORPUS_REGISTRY is set.

The CORPUS_REGISTRY environment variable defines the directory with the registry files. These files describe where the CWB will find the files of an indexed corpus, and which s- and p-attributes the corpus has. See the annex for an explanation of how to set the CORPUS_REGISTRY environment variable for the current R session, or permanently.
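As a minimal sketch in base R, the check described above could look as follows. The helper name registry_is_set is hypothetical and not part of polmineR.

```r
# Hypothetical helper (not part of polmineR): TRUE if CORPUS_REGISTRY is set
# and points to an existing directory.
registry_is_set <- function() {
  registry <- Sys.getenv("CORPUS_REGISTRY")
  nzchar(registry) && dir.exists(registry)
}

Sys.getenv("CORPUS_REGISTRY")  # an empty string means the variable is not set
registry_is_set()
```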

Loading polmineR

If the CORPUS_REGISTRY variable is set correctly, i.e. pointing to the directory with the registry files describing the corpora, load the polmineR package.
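Loading the package is a single call. The sketch assumes that polmineR is installed; the guard is only there so that the snippet does not fail on a machine where it is missing.

```r
# Assumes polmineR is installed; nothing happens if it is not.
if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)
}
```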

Using and installing packaged corpora

If you want to use a CWB corpus packaged in an R data package, you can call ‘use’ with the name of the R package. To access the corpus in the data package, the CORPUS_REGISTRY environment variable will be reset. In the following examples, the REUTERS corpus will be used for demonstration purposes. It is a sample of Reuters articles that is included in the tm package (cf. https://www.scss.tcd.ie/~luzs/t/cs4ll4/sw/reuters21578-xml/) and may already be known to many R users.
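A minimal sketch of this step, assuming polmineR is installed. As the REUTERS corpus is included in the polmineR package itself, the package name passed to use is "polmineR".

```r
if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)
  # Make the corpora shipped with the polmineR package available;
  # this resets CORPUS_REGISTRY for the session.
  use("polmineR")
}
```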

Note that the use-function calls the resetRegistry-function, which can also be used to restore the original path to the directory with the registry files. If you want to use other corpora, you can download the EuroParl and the GermaParl data packages from a CRAN-like repository hosted by the PolMine Project.

Data packages with corpora have a version number, which may be important for reproducing results. They can also include a vignette documenting the data, and functions to perform specialized tasks.

Checking that corpora are available

Use the corpus-method to check which corpora are accessible. It should be the REUTERS corpus in our case (the names of CWB corpora are always written in upper case). In addition to the English REUTERS corpus, a small subset of the GermaParl corpus (“GERMAPARLMINI”) is included in the polmineR package.
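A minimal sketch of the check, assuming polmineR is installed and the corpora of the polmineR package have been activated with use:

```r
if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)
  use("polmineR")
  # One row per accessible corpus, with its name and its size in tokens.
  print(corpus())
}
```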

##          corpus   size template
## 1 GERMAPARLMINI 222201     TRUE
## 2       REUTERS   4050     TRUE

Session settings

Many functions in the polmineR package use settings that are stored in R's general options. You can see these settings as follows:

Several methods (such as kwic, or cooccurrences) will use these settings if no other value is provided explicitly. Here are a few examples of how to change settings.
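The following sketch uses base R only. It lists the options whose names contain "polmineR" and then changes two of them. The option names polmineR.left and polmineR.right (size of the left and right context) are assumptions here and should be checked against the listing on your system.

```r
# List all options whose name contains "polmineR"
# (the list is empty if polmineR has not been loaded).
polmineR_options <- options()[grep("polmineR", names(options()))]
names(polmineR_options)

# ASSUMPTION: these option names should match entries in the listing above.
options("polmineR.left" = 15)   # tokens shown to the left of a match
options("polmineR.right" = 15)  # tokens shown to the right of a match
getOption("polmineR.left")
```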

Working with corpora: Core methods

Core analytical tasks are implemented as methods (S4 class system), i.e. the behaviour of the methods changes depending on the object that is supplied. Almost all methods can be applied to corpora as well as partitions (subcorpora). As an easy entry, methods applied to corpora are explained first.

Keyword-in-context (kwic)

The kwic method applied to the name of a corpus will return a KWIC object. Output will be shown in the viewer pane of RStudio. You can include metadata from the corpus using the ‘meta’ parameter.

You can also use the CQP query syntax for formulating queries. That way, you can find multi-word expressions, or match in a manner you may know from using regular expressions.
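The kwic calls described above can be sketched as follows, assuming polmineR is installed. The results are assigned to variables rather than printed, because printing a KWIC object opens the viewer. The ‘meta’ argument is named as in the text; its name may differ in other releases of the package, so that call is left as a comment.

```r
if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)
  use("polmineR")

  # Plain query: concordances of the token "oil".
  oil_kwic <- kwic("REUTERS", "oil")

  # CQP query: "oil" followed by a token starting with "price".
  # Single quotes delimit the R string, double quotes the CQP tokens.
  oil_price_kwic <- kwic("REUTERS", '"oil" "price.*"')

  # Including metadata via the 'meta' parameter mentioned in the text:
  # kwic("REUTERS", "oil", meta = "places")
}
```

Typing the name of the object (e.g. oil_kwic) in an interactive session displays the concordances.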

Explaining the CQP syntax goes beyond the scope of this vignette. Consult the CQP tutorial to learn more about the CQP syntax.

Getting counts and frequencies

You can count the hits for one or several queries in a corpus. The result reports the absolute count and the relative frequency for each query.
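A sketch of such calls, assuming polmineR is installed. The first lines are plain arithmetic: the relative frequency is the count divided by the corpus size, which is 4050 tokens for REUTERS according to the corpus listing.

```r
# Relative frequency = count / corpus size.
reuters_size <- 4050
kuwait_freq <- 15 / reuters_size
round(kuwait_freq, 9)

if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)
  use("polmineR")
  print(count("REUTERS", "Kuwait"))
  print(count("REUTERS", c("Kuwait", "USA", "Bahrain")))
  # Multi-word expressions require the CQP syntax.
  print(count("REUTERS", c('"United" "States"', '"Saudi" "Arabia.*"'), cqp = TRUE))
}
```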

##     query count        freq
## 1: Kuwait    15 0.003703704

##      query count         freq
## 1:  Kuwait    15 0.0037037037
## 2:     USA     0 0.0000000000
## 3: Bahrain     1 0.0002469136

##                 query count         freq
## 1:  "United" "States"     2 0.0004938272
## 2: "Saudi" "Arabia.*"    12 0.0029629630

Dispersions

The dispersion-method returns the dispersion of counts across one (or two) dimensions, i.e. across the values of one (or two) s-attributes.

Note that a data.table is returned. The result can be visualised as a barplot.
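To illustrate the barplot in base R without requiring an installed corpus, the following sketch does not call dispersion. Instead, it uses the counts of "oil" per article id as printed in the term-document matrix in the section "Getting a tm TermDocumentMatrix". In practice, the counts would be taken from the data.table that is returned.

```r
# Counts of "oil" per article id, copied from the term-document matrix
# printed later in this vignette.
ids <- c("127", "144", "191", "194", "211", "236", "237", "242", "246", "248",
         "273", "349", "352", "353", "368", "489", "502", "543", "704", "708")
oil_counts <- c(5, 12, 2, 1, 0, 6, 4, 2, 5, 6, 5, 4, 4, 4, 3, 4, 5, 2, 3, 1)

barplot(
  height = oil_counts, names.arg = ids, las = 2,
  xlab = "article id", ylab = "count",
  main = "Occurrences of 'oil' by article"
)
```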

Cooccurrences

The cooccurrences-method returns cooccurrence statistics for the tokens in the neighbourhood of a query.
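A minimal sketch for the query "oil", assuming polmineR is installed. The table with the statistics is extracted with as.data.frame.

```r
if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)
  use("polmineR")
  oil_cooc <- cooccurrences("REUTERS", query = "oil")
  # Show the five top-ranked cooccurring tokens.
  print(head(as.data.frame(oil_cooc), 5))
}
```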

##       word count_partition count_window exp_window exp_partition       ll rank_ll
## 1   prices              47           48   28.06074      18.93926 51.53564       1
## 2      oil              78           30   46.56889      31.43111 26.38412       2
## 3     said              73           55   43.58370      29.41630 25.59149       3
## 4 industry              10           14    5.97037       4.02963 23.86295       4
## 5       is              25           24   14.92593      10.07407 22.79780       5

Working with subcorpora - partitions

Easily creating partitions (i.e. subcorpora) based on s-attributes is an important feature of the ‘polmineR’ package. For example, to work with the articles in the REUTERS corpus related to Kuwait:

kuwait <- partition("REUTERS", places = "kuwait", regex = TRUE)

To get some basic information about the partition that has been set up, the ‘show’-method can be used. It is also called when you simply type the name of the partition object.

kuwait
## ** partition object **
## corpus:              REUTERS 
## name:                 
## sAttributes:         places = kuwait 
## cpos:                3 pairs of corpus positions
## size:                660 tokens
## count:               not available

To evaluate s-attributes, regular expressions can be used.

saudi_arabia <- partition("REUTERS", places = "saudi-arabia", regex = TRUE)
sAttributes(saudi_arabia, "id")
## [1] "242" "248" "273" "349" "352"

If you work with a flat XML structure, the order of the provided s-attributes may be relevant for speeding up the setup of the partition. For a nested XML structure, it is important that the order moves from ancestors to children. For further information, see the documentation of the partition-function.

Cooccurrences

The cooccurrences-method can be applied to partition-objects.

saudi_arabia <- partition("REUTERS", places = "saudi-arabia", regex = TRUE)
oil <- cooccurrences(saudi_arabia, "oil", pAttribute = "word", left = 10, right = 10)

Note that it is possible to provide a query that uses the full CQP syntax. The statistical analysis of collocations to the query can be accessed as the slot “stat” of the context object. Alternatively, you can get the table with the statistics using ‘as.data.frame’.

df <- as.data.frame(oil)
df[1:5, c("word", "ll", "rank_ll")]
##       word        ll rank_ll
## 1 adhering 12.769590       1
## 2   prices 12.629860       2
## 3       by  9.353565       3
## 4     will  9.030516       4
## 5      GCC  7.851100       7

Distribution of queries

To understand the occurrence of a phenomenon, the distribution of query results across one or two dimensions will often be interesting. This is done via the ‘dispersion’ method. The query may use the CQP syntax.

q1 <- dispersion(saudi_arabia, query = 'oil', "id", progress = FALSE)
q2 <- dispersion(saudi_arabia, query = c("oil", "barrel"), "id", progress = FALSE)

Getting features

To identify the specific vocabulary of a corpus of interest, a statistical test (chi-square or log-likelihood) can be performed.

saudi_arabia <- partition("REUTERS", places = "saudi-arabia", regex = TRUE)
saudi_arabia <- enrich(saudi_arabia, pAttribute = "word") # add word counts

saudi_arabia_features <- features(saudi_arabia, "REUTERS", included = TRUE)
# rank_chisquare is a rank: keep the ten top-ranked terms occurring at least five times
y <- subset(saudi_arabia_features, rank_chisquare <= 10.83 & count_coi >= 5)
as.data.frame(y)[,c("word", "count_coi", "count_ref", "chisquare")]
##      word count_coi count_ref chisquare
## 1   Saudi        18         0  49.22244
## 2   Nazer         7         0  19.08998
## 3  accord         7         0  19.08998
## 4  market        14         6  19.03495
## 5       8         6         0  16.35879
## 6  Arabia         6         0  16.35879
## 7 sources         7         2  11.90069
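The chi-square values in the table are consistent with a Pearson chi-square statistic for a 2x2 table without continuity correction. The following base R sketch illustrates the computation; it is not the package's own implementation. The function name chisquare_2x2 is hypothetical. The size of the partition (1088 tokens) is not stated in this vignette: it is inferred here from the printed statistics and should be treated as an assumption, as should the reading that included = TRUE means the partition is subtracted from the reference corpus.

```r
# Pearson chi-square statistic (no continuity correction) for a term that
# occurs count_coi times in a corpus of interest with size_coi tokens and
# count_ref times in a reference corpus with size_ref tokens.
chisquare_2x2 <- function(count_coi, count_ref, size_coi, size_ref) {
  n <- size_coi + size_ref
  term_total <- count_coi + count_ref
  observed <- c(count_coi, count_ref, size_coi - count_coi, size_ref - count_ref)
  expected <- c(
    term_total * size_coi / n,
    term_total * size_ref / n,
    (n - term_total) * size_coi / n,
    (n - term_total) * size_ref / n
  )
  sum((observed - expected)^2 / expected)
}

size_coi <- 1088             # ASSUMPTION: inferred from the printed statistics
size_ref <- 4050 - size_coi  # rest of REUTERS (4050 tokens in total)

chisquare_2x2(18, 0, size_coi, size_ref)  # "Saudi", approx. 49.22
chisquare_2x2(14, 6, size_coi, size_ref)  # "market", approx. 19.03
```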

Getting a tm TermDocumentMatrix

For many applications, term-document matrices are the point of departure. The tm class TermDocumentMatrix serves as an input to several R packages implementing advanced text mining techniques. Obtaining this input from a corpus imported to the CWB will usually involve setting up a partitionBundle and then applying a method to get the matrix.

articles <- partitionBundle("REUTERS", sAttribute = "id", progress = FALSE)
## ... getting matrix with regions for s-attribute:  id
## ... generating the partitions
articles <- enrich(articles, pAttribute = "word", verbose = FALSE)
tdm <- as.TermDocumentMatrix(articles, col = "count", verbose = FALSE)
class(tdm) # to see what it is
## [1] "TermDocumentMatrix"    "simple_triplet_matrix"
show(tdm)
## <<TermDocumentMatrix (terms: 1192, documents: 20)>>
## Non-/sparse entries: 2409/21431
## Sparsity           : 90%
## Maximal term length: 15
## Weighting          : term frequency (tf)
m <- as.matrix(tdm) # turn it into an ordinary matrix
m[c("oil", "barrel"),]
##         Docs
## Terms    127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489
##   oil      5  12   2   1   0   6   4   2   5   6   5   4   4   4   3   4
##   barrel   2   0   1   1   0   3   0   0   0   2   2   0   1   0   0   1
##         Docs
## Terms    502 543 704 708
##   oil      5   2   3   1
##   barrel   1   1   0   0

Moving on

The package includes many features that go beyond this vignette. A key aim of the project is to further develop the documentation, both in the vignette and in the man pages for the individual functions. Feedback is very welcome!

Annex I: Installing polmineR

Windows

The following instructions assume that you have installed R. If not, install it from CRAN. An installation of RStudio is highly recommended.

Windows (32 and 64 bit)

For 64-bit Windows, an interface to the CWB is provided by the package RcppCWB. It is installed automatically when you install polmineR. If that does not happen, you can install it manually from the R package repository on the web server of the PolMine Project as follows.

Then install polmineR.

To install the most recent development version that is hosted in a GitHub repository, use the convenient installation mechanism offered by the devtools package.

Finally, as a basic test whether the REUTERS corpus included in the polmineR package (for testing and demonstration purposes) is available, run:

macOS

The following instructions for Mac users assume that R is installed on your system. Binaries are available from the Homepage of the R Project. An installation of RStudio is highly recommended. Get the Open Source License version of RStudio Desktop.

Installing RcppCWB

To install polmineR on macOS, you first need to install RcppCWB.

First, you will need an installation of Xcode, which you can get via the Mac App Store. You will also need the Command Line Tools for Xcode. They can be installed from a terminal with:

Please make sure that you agree to the license.

Second, an installation of XQuartz is required. It can be obtained from www.xquartz.org.

Third, to compile the C code in the RcppCWB package, there are system requirements that need to be fulfilled. Using a package manager makes things considerably easier. We recommend using ‘Homebrew’. To install Homebrew, follow the instructions on the Homebrew Homepage. The following commands then need to be executed from a terminal window. They will install the dependencies that the RcppCWB package relies on:

Fourth, install RcppCWB.

A quick check that polmineR is installed correctly is to load the library, and to check which corpora are available.

You should see a message that “CQI.Rcpp” is the interface used, and that the REUTERS corpus is on your system.

Installing ‘polmineR’

The latest release of polmineR can be installed from CRAN using the usual install.packages-function.

The development version of polmineR can be installed using devtools:

Linux (Ubuntu)

Installing R

If you have not yet installed R on your Ubuntu machine, there are good instructions at ubuntuuser. To install base R, enter the following in the terminal.

Make sure that you have installed the latest version of R. The following commands will add the R repository to the package sources and run an update. The second line assumes that you are using Ubuntu 16.04.

Installing RStudio

It is highly recommended to install RStudio, a powerful IDE for R. Output of polmineR methods is generally optimized to be displayed using RStudio facilities. If you are working on a remote server, running RStudio Server may be an interesting option to consider.

Base Installation of polmineR

The Corpus Workbench requires the pcre, glib and pkg-config libraries. They can be installed as follows. In addition, libxml2 is installed, a dependency of the R package xml2 that is used for manipulating HTML output.

The system requirements will now be fulfilled. From R, install dependencies for rcqp/polmineR first, and then rcqp and polmineR.

Use devtools to install the development version of polmineR from GitHub.

Finally, check the installation.

Annex I: Setting the CORPUS_REGISTRY environment variable

The environment variable “CORPUS_REGISTRY” can be set as follows in R:
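A minimal sketch using base R. The path is hypothetical and created in a temporary directory only to keep the example self-contained; replace it with the path to your own registry directory.

```r
# Hypothetical path for illustration; use your own registry directory instead.
registry_dir <- file.path(tempdir(), "registry")
dir.create(registry_dir, showWarnings = FALSE)

Sys.setenv(CORPUS_REGISTRY = registry_dir)  # valid for the current R session only
Sys.getenv("CORPUS_REGISTRY")
```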

To set the environment variable CORPUS_REGISTRY permanently, see the instructions R offers on how to find the file ‘.Renviron’ or ‘.Renviron.site’ when calling the help for the startup process (?Startup).

Annex II: Setting the interface to the Corpus Workbench

The standard interface used by the polmineR package to extract information from CWB indexed corpora is the package ‘RcppCWB’. The interface is defined in a class called ‘CQI’. To check which interface is used:

## [1] "CQI.RcppCWB" "CQI.super"   "R6"

If you see “CQI.perl” leading the character vector that is returned, something went wrong. Accessing the corpora using Perl scripts incurs a severe performance loss. Reset the interface as follows:

An alternative interface is provided by the package ‘rcqp’, available at GitHub (https://github.com/PolMine/polmineR.Rcpp) so far. To install and use ‘rcqp’:

To switch to the interface offered by polmineR.Rcpp, proceed as follows:

Annex III: polmineR - Full installation

To have access to all package functions and to run all package tests, the installation of further system requirements and packages is required. The xlsx dependency requires that rJava is installed and configured for R. That is done on the shell:

To run package tests including (re-)building the manual and vignettes, a working installation of LaTeX is required, too. Be aware that installing it may be a time-consuming operation.

Now install the remaining packages from within R.

Annex IV: CWB corpora and the CORPUS_REGISTRY environment variable

Indexed corpora can be stored in two different locations. The conventional way is to keep CWB corpora in a directory with two subdirectories, a ‘registry’ directory, and an ‘indexed_corpora’ directory. The files in the registry directory (‘registry’ in short) describe the main features of a corpus, and where it is stored. It is necessary to inform rcqp, the package used by polmineR to access corpora, about the registry directory. That is done using the CORPUS_REGISTRY environment variable. It needs to be defined before loading rcqp and polmineR. Note that you need to set the environment variable to the ‘registry’ folder, not to the files that are located in this directory.

The CORPUS_REGISTRY environment variable can be set manually from the R console:

Sys.setenv(CORPUS_REGISTRY = "/PATH/TO/YOUR/REGISTRY/DIRECTORY")

To check whether and how the environment variable is set:

Sys.getenv("CORPUS_REGISTRY")

You can set the environment variable permanently to avoid having to set it each time before you want to use polmineR. A good way is to include the following line in the file .Renviron in your home directory:

CORPUS_REGISTRY="/PATH/TO/YOUR/REGISTRY/DIRECTORY"

There are a few other options to have environment variables set every time you launch R. To learn about these, use the help for the R startup procedure.

?Startup