quanteda is in development and will remain so until we declare a 1.0 version, at which time we will only add new functions, not change the names of existing ones. In the meantime, we suggest:
- `tokenize(mytexts, what = "sentence")` instead of `tokenize(mytexts, "sentence")` -- since the argument order is not stable; and
- using named formals rather than relying on current defaults, e.g. `tokenize(mytexts, removePunct = FALSE)` -- since the default values are not stable.

All testing should be in `tests/testthat/test_`
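As a sketch of the suggestions above (assuming the development-version `tokenize()` interface; `mytexts` is a placeholder character vector):

```r
library(quanteda)

mytexts <- c(d1 = "First sentence. Second sentence!",
             d2 = "Another, shorter text.")

# Positional matching is fragile: if the argument order changes in a
# later release, this call silently changes meaning.
# tokenize(mytexts, "sentence")

# Naming the formal keeps the call stable across releases:
tokenize(mytexts, what = "sentence")

# Likewise, spell out any default you rely on rather than assuming it:
tokenize(mytexts, what = "word", removePunct = FALSE)
```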
For performance comparisons, we write up the results and document them in the vignette `performance_comparisons.Rmd`.
Development and branches: We add new features through `workingDev`. Before merging this with `dev`, we make sure the build passes a full CRAN check.
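One way to run that check from within R is via devtools (a sketch, not a prescribed workflow; the `args` flag shown is one option among several):

```r
# From the package root on the workingDev branch, before merging into dev:
library(devtools)

# Builds the package and runs R CMD check with CRAN-level strictness
check(args = "--as-cran")
```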
- `encoding()` to detect encoding, and replace `iconv()` calls with `stringi::stri_encode()` in `corpus()`
- `tokenize()` (based on that package)
- `ntoken()`, `ntype()`, and `nfeature()`
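For instance (a sketch, assuming these count tokens, types, and dfm features respectively; `txt` is a placeholder):

```r
library(quanteda)

txt <- c(doc1 = "one two two three", doc2 = "four")
toks <- tokenize(txt)

ntoken(toks)        # token counts per document
ntype(toks)         # distinct-type counts per document
nfeature(dfm(txt))  # number of features in a document-feature matrix
```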
- `phrasetotoken()`
- `ie2010Corpus` (and see if CRAN lets us get away with it)
- `language()`
- methods for `tokenizedTexts` objects:
  - `dfm.tokenizedTexts`
  - `removeFeatures.tokenizedTexts`
  - `syllables.tokenizedTexts`
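These methods mean a tokenized object can be passed along the workflow directly, e.g. (a sketch; the texts are placeholders):

```r
library(quanteda)

toks <- tokenize(c(d1 = "Syllables and features.", d2 = "More text here."),
                 removePunct = TRUE)

dfm(toks)                                   # dfm.tokenizedTexts
removeFeatures(toks, stopwords("english"))  # removeFeatures.tokenizedTexts
syllables(toks)                             # syllables.tokenizedTexts
```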
- `removeFeatures` now much faster, based on fixed binary matches and stringi character classes
- `readability()`
- `nsentence()`
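Usage might look like the following (a sketch; the `"Flesch"` measure name is an assumption about the `readability()` interface):

```r
library(quanteda)

txt <- "Readability is measured here. This is the second sentence."

nsentence(txt)              # number of sentences per text
readability(txt, "Flesch")  # a readability score, e.g. Flesch Reading Ease
```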
- `ngrams` added as an option to `tokenize()`
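A sketch of the new option (the `concatenator` argument name is assumed from the development API):

```r
library(quanteda)

txt <- "a b c d"

tokenize(txt, ngrams = 2)                      # bigrams only
tokenize(txt, ngrams = 1:2)                    # unigrams and bigrams together
tokenize(txt, ngrams = 2, concatenator = "-")  # custom token joiner
```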
- `lexdiv()` to make the API similar to `readability()` and to use data.table
- `segment()` to make use of the new tokenizer that segments on sentences
- make `bigrams`, `ngrams` punctuation sensitive in the same way that `collocations` is currently
- `collocations` code for bigrams and trigrams, and reduce the internal memory usage
- `corpus.VCorpus()` is fully working
- `dfm` documentation needs to group arguments into sections and describe how these correspond to the logical workflow
- `kwic` to use the new tokenizer, and to allow searches for multi-word regular expressions
- `settings()` and figure out how to add additional objects to a corpus, namely one or more:
  - `similarity()`
  - `wordstem()`, `stopwords()`, and `syllables()`
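At present these are called directly rather than attached to a corpus, e.g. (a sketch; the character-vector methods are assumptions):

```r
library(quanteda)

wordstem(c("running", "runner"))  # stem a character vector of words
stopwords("english")              # built-in stopword list
syllables("quickly")              # estimated syllable count
```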
- `textmodel`: devise and document a consistent, logical, and easy-to-use-and-remember scheme for textmodels
- `convert()` needs substantial work
- `+` is defined
- `resample` functionality to enable resampling from different text units
- `index` (?) for pre-tokenizing and indexing a corpus

Please use the issue page on the GitHub repository, or contact kbenoit@lse.ac.uk directly.