match-utils            package:Biostrings            R Documentation

_U_t_i_l_i_t_y _f_u_n_c_t_i_o_n_s _r_e_l_a_t_e_d _t_o _p_a_t_t_e_r_n _m_a_t_c_h_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     This man page defines precisely, and illustrates, what a "match"
     of a pattern P in a subject S is in the context of the Biostrings
     package. This definition is central to most pattern matching
     functions available in this package: unless specified otherwise,
     they adhere to the definition provided here.

     'neditStartingAt', 'neditEndingAt', 'isMatchingStartingAt' and
     'isMatchingEndingAt' are low-level functions that implement some
     basic concepts. Once these concepts are understood, we can use
     them to provide a simple and concise definition of a "match".

     Other utility functions related to pattern matching are described
     here: the 'mismatch' function for getting the positions of the
     mismatching letters of a given pattern relative to its matches in
     a given subject, the 'nmatch' and 'nmismatch' functions for
     getting the number of matching and mismatching letters produced by
     the 'mismatch' function, and the 'coverage' function that can be
     used to get the "coverage" of a subject by a given pattern or set
     of patterns.

_U_s_a_g_e:

       neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE)
       neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE)
       neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE)

       isMatchingStartingAt(pattern, subject, starting.at=1,
                       max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       isMatchingEndingAt(pattern, subject, ending.at=1,
                       max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       isMatchingAt(pattern, subject, at=1,
                       max.mismatch=0, with.indels=FALSE, fixed=TRUE)

       mismatch(pattern, x, fixed=TRUE)
       nmatch(pattern, x, fixed=TRUE)
       nmismatch(pattern, x, fixed=TRUE)
       ## S4 method for signature 'MIndex':
       coverage(x, start=NA, end=NA)
       ## S4 method for signature 'XStringViews':
       coverage(x, start=NA, end=NA, weight=1L)
       ## S4 method for signature 'MaskedXString':
       coverage(x, start=NA, end=NA, weight=1L)

_A_r_g_u_m_e_n_t_s:

 pattern: The pattern string. 

 subject: An XString object (or character vector) containing the
          subject sequence. 

starting.at, ending.at, at: An integer vector specifying the starting
          (for 'starting.at' and 'at') or ending (for 'ending.at')
          positions of the pattern relative to the subject. 

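max.mismatch: The largest distance between the pattern and a substring
          of the subject for which the substring is still reported as
          a match. See the Details and Value sections below. 

with.indels: If 'FALSE' (the default), the distance is the "number of
          mismatching letters". If 'TRUE', the "edit distance" is used
          instead. See the Details section below. 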

   fixed: Only with a DNAString or RNAString subject can a 'fixed'
          value other than the default ('TRUE') be used.

          With 'fixed=FALSE', ambiguities (i.e. letters from the IUPAC
          Extended Genetic Alphabet (see 'IUPAC_CODE_MAP') that are not
          from the base alphabet) in the pattern _and_ in the subject
          are interpreted as wildcards i.e. they match any letter that
          they stand for.

          'fixed' can also be a character vector, a subset of
          'c("pattern", "subject")'. 'fixed=c("pattern", "subject")' is
          equivalent to 'fixed=TRUE' (the default). An empty vector is
          equivalent to 'fixed=FALSE'. With 'fixed="subject"',
          ambiguities in the pattern only are interpreted as wildcards.
          With 'fixed="pattern"', ambiguities in the subject only are
          interpreted as wildcards. 

       x: An XStringViews object for 'mismatch' (typically, one
          returned by 'matchPattern(pattern, subject)').

          Typically an XStringViews or MIndex object for 'coverage' but
          IRanges, MaskCollection and MaskedXString objects are
          accepted too. 

start, end: Two single integers specifying where to start and end the
          extraction of the coverage in 'x'. 

  weight: An integer vector specifying how much each element in 'x'
          counts. 
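
     To make the effect of the 'fixed' argument concrete, the base R
     sketch below counts mismatching letters with and without
     treating ambiguity letters in the pattern as wildcards. It is
     only an illustration of the rule stated above, not the Biostrings
     implementation: 'iupac_subset', 'letter_matches' and
     'n_mismatches_at' are hypothetical names, only a few IUPAC codes
     are listed, the subject is assumed to contain base letters only,
     and the pattern is assumed to fit entirely inside the subject.

```r
# A few IUPAC codes and the bases they stand for (illustrative subset only).
iupac_subset <- c(A = "A", C = "C", G = "G", T = "T",
                  R = "AG", Y = "CT", N = "ACGT")

# Hypothetical helper: does pattern letter 'p' match subject base 's'?
# With pattern_fixed = TRUE letters must be identical; otherwise 'p' matches
# any base it stands for.
letter_matches <- function(p, s, pattern_fixed = TRUE) {
  if (pattern_fixed) return(p == s)
  grepl(s, iupac_subset[[p]], fixed = TRUE)
}

# Hypothetical helper: number of mismatching letters when the pattern is
# placed at position 'at' of the subject (plain character strings here).
n_mismatches_at <- function(pattern, subject, at, pattern_fixed = TRUE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(substring(subject, at, at + length(p) - 1L), "")[[1]]
  stopifnot(length(s) == length(p))  # pattern must lie within the subject
  sum(!mapply(letter_matches, p, s,
              MoreArgs = list(pattern_fixed = pattern_fixed)))
}

# "N" is an ordinary letter here, so no position has zero mismatches
sapply(1:4, function(i) n_mismatches_at("NT", "GTATA", i))

# "N" stands for any base here, so positions 1 and 3 have zero mismatches
sapply(1:4, function(i) n_mismatches_at("NT", "GTATA", i, pattern_fixed = FALSE))
```

     The sketch covers only ambiguities in the pattern; ambiguities in
     the subject are left out to keep it short.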

_D_e_t_a_i_l_s:

     A "match" of pattern P in subject S is a substring S' of S that is
     considered similar enough to P according to some distance (or
     metric) specified by the user. Two distances are supported by most
     pattern matching functions in the Biostrings package. The first
     (and simplest) one is the "number of mismatching letters". It is
     defined only when the two strings to compare have the same length,
     so when this distance is used, only matches that have the same
     number of letters as P are considered. The second one is the "edit
     distance" (aka Levenshtein distance): it is the minimum number of
     operations needed to transform P into S', where an operation is an
     insertion, deletion, or substitution of a single letter. When this
     metric is used, matches can have a different number of letters
     than P.
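
     The two distances can be sketched in base R as follows. This is
     an illustration on plain character strings, not the code used by
     Biostrings: 'n_mismatches' and 'edit_dist' are hypothetical helper
     names, and the edit distance is delegated to 'utils::adist', which
     computes a Levenshtein distance.

```r
# Number of mismatching letters: only defined for strings of equal length.
# 'n_mismatches' is a hypothetical helper, not a Biostrings function.
n_mismatches <- function(p, s) {
  stopifnot(nchar(p) == nchar(s))
  sum(strsplit(p, "")[[1]] != strsplit(s, "")[[1]])
}

# Edit (Levenshtein) distance via base R's utils::adist().
# The two strings may have different lengths.
edit_dist <- function(p, s) as.integer(utils::adist(p, s)[1, 1])

n_mismatches("CAT", "CGT")  # one substituted letter
edit_dist("CAT", "CT")      # one deleted letter
edit_dist("CAT", "CAAT")    # one inserted letter
```

     The last two calls compare strings of different lengths, which
     the first distance cannot handle.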

     The 'neditStartingAt' and 'neditEndingAt' functions implement
     these two distances. If 'with.indels' is 'FALSE' (the default),
     the first distance is used, i.e. 'neditStartingAt' returns the
     "number of mismatching letters" between the pattern P and the
     substring S' of S (of the same length as P) starting at each of
     the positions specified in 'starting.at'. Both functions are
     vectorized, so long vectors of integers can be passed through the
     'starting.at' or 'ending.at' arguments. If 'with.indels' is
     'TRUE', the "edit distance" is used: for each position specified
     in 'starting.at', P is compared to all the substrings S' of S
     starting at this position and the smallest distance is returned.
     This minimum is guaranteed to be reached for a substring of
     length < 2*length(P), so in practice P only needs to be compared
     to a small number of substrings for every starting position.
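
     The 'with.indels=TRUE' rule described above can be written out
     naively in base R. The sketch below is an assumption-laden
     illustration, not the Biostrings algorithm:
     'nedit_starting_at_sketch' is a hypothetical name, it works on
     plain character strings, it accepts only starting positions that
     lie inside the subject, and it relies on 'utils::adist' for the
     edit distance.

```r
# Hypothetical helper illustrating the definition; not the Biostrings code.
nedit_starting_at_sketch <- function(pattern, subject, starting.at = 1L) {
  np <- nchar(pattern)
  ns <- nchar(subject)
  # Only in-bounds starting positions are handled in this sketch.
  stopifnot(all(starting.at >= 1 & starting.at <= ns))
  vapply(starting.at, function(start) {
    # Candidate substrings S' start at 'start' and have length
    # 0 .. 2*np - 1, truncated at the end of the subject.
    max_len <- min(2L * np - 1L, ns - start + 1L)
    ends <- start - 1L + seq(0L, max_len)
    candidates <- substring(subject, start, ends)
    # Smallest edit distance between the pattern and any candidate.
    as.integer(min(utils::adist(pattern, candidates)))
  }, integer(1))
}

nedit_starting_at_sketch("AT", "GTATA", 1:5)
```

     At position 2 the substring "TA" has 2 mismatching letters with
     "AT", but the edit distance reported by the sketch is 1 because
     the shorter substring "T" is one deletion away from "AT".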

_V_a_l_u_e:

     'neditStartingAt' and 'neditEndingAt': an integer vector of the
     same length as 'starting.at' (or 'ending.at').

     'isMatchingStartingAt(...)' and 'isMatchingEndingAt(...)': the
     logical vector defined by 'neditStartingAt(...) <= max.mismatch'
     or 'neditEndingAt(...) <= max.mismatch', respectively.

     'neditAt' and 'isMatchingAt' are convenience wrappers for
     'neditStartingAt' and 'isMatchingStartingAt', respectively.

     'mismatch': a list of integer vectors.

     'nmismatch': an integer vector containing the length of the
     vectors produced by 'mismatch'.

     'coverage': an XRleInteger object indicating the coverage of 'x'
     in the interval specified by the 'start' and 'end' arguments. An
     integer value called the "coverage" can be associated with each
     position in 'x', indicating how many times this position is
     covered by the views or matches stored in 'x'. For example, if 'x'
     is an XStringViews object, the coverage of a given position in 'x'
     is the number of views it belongs to. If 'x' is an MIndex object,
     the coverage of a given position in 'x' is the number of matches
     (or hits) it belongs to. Note that the positions in the returned
     XRleInteger object are to be interpreted as relative to the
     interval specified by the 'start' and 'end' arguments.
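
     The notion of coverage described above can be sketched in base R
     on plain start/end vectors. 'coverage_sketch' is a hypothetical
     helper that returns an ordinary integer vector, not an
     XRleInteger object, and it makes no attempt to reproduce the
     'weight' argument.

```r
# Hypothetical helper: for every position in [start, end], count how many
# of the ranges [starts[i], ends[i]] contain that position.
coverage_sketch <- function(starts, ends, start, end) {
  positions <- seq(start, end)
  vapply(positions,
         function(pos) sum(starts <= pos & pos <= ends),
         integer(1))
}

# Three views on a subject of length 7: [1,3], [2,5] and [5,7]
cvg <- coverage_sketch(starts = c(1, 2, 5), ends = c(3, 5, 7),
                       start = 1, end = 7)
cvg

# Base R run-length encoding of the same vector
rle(cvg)

# Restricting the interval: element 1 of the result is now position 2
coverage_sketch(starts = c(1, 2, 5), ends = c(3, 5, 7), start = 2, end = 4)
```

     The last call shows why positions in the result must be read
     relative to the requested interval rather than to the subject.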

_S_e_e _A_l_s_o:

     'matchPattern', 'matchPDict', 'IUPAC_CODE_MAP', XString-class,
     XStringViews-class, MIndex-class, coverage, IRanges-class,
     MaskCollection-class, MaskedXString-class, align-utils

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## neditAt() / isMatchingAt()
       ## ---------------------------------------------------------------------
       subject <- DNAString("GTATA")

       ## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
       neditAt("AT", subject, at=3)
       isMatchingAt("AT", subject, at=3)

       ## ... but not at position 1
       neditAt("AT", subject)
       isMatchingAt("AT", subject)

       ## ... unless we allow 1 mismatching letter (inexact match)
       isMatchingAt("AT", subject, max.mismatch=1)

       ## Here we look at 6 different starting positions and find 3 matches if
       ## we allow 1 mismatching letter
       isMatchingAt("AT", subject, at=0:5, max.mismatch=1)

       ## No match
       neditAt("NT", subject, at=1:4)
       isMatchingAt("NT", subject, at=1:4)

       ## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
       neditAt("NT", subject, at=1:4, fixed=FALSE)
       isMatchingAt("NT", subject, at=1:4, fixed=FALSE)

       ## max.mismatch != 0 and fixed=FALSE can be used together
       neditAt("NCA", subject, at=0:5, fixed=FALSE)
       isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)

       some_starts <- c(10:-10, NA, 6)
       subject <- DNAString("ACGTGCA")
       is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
       some_starts[is_matching]

       ## ---------------------------------------------------------------------
       ## mismatch() / nmismatch()
       ## ---------------------------------------------------------------------
       m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
       mismatch("NCA", m)
       nmismatch("NCA", m)

       ## ---------------------------------------------------------------------
       ## coverage()
       ## ---------------------------------------------------------------------
       coverage(m)

       ## See ?matchPDict for examples of using coverage() on an MIndex object...

