alphabetFrequency         package:Biostrings         R Documentation

_F_u_n_c_t_i_o_n _t_o _c_a_l_c_u_l_a_t_e _t_h_e _f_r_e_q_u_e_n_c_y _o_f _l_e_t_t_e_r_s _i_n _a _b_i_o_l_o_g_i_c_a_l
_s_e_q_u_e_n_c_e _a_n_d _r_e_l_a_t_e_d _f_u_n_c_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Given a biological sequence, the 'alphabetFrequency' function
     calculates the frequency of each letter in the (base) alphabet,
     the 'dinucleotideFrequency' function calculates the frequency of
     all possible dinucleotides, and the 'trinucleotideFrequency'
     function calculates the frequency of all possible trinucleotides.

     More generally, the 'oligonucleotideFrequency' function will
     calculate the frequency of all possible oligonucleotides of a
     given length (called the "width" in this particular context).
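
     The definition above can be illustrated without the package. The
     following is a minimal base-R sketch of counting every overlapping
     oligonucleotide of a given width in a plain character string. The
     helper name 'count_kmers' is hypothetical and this is not the
     Biostrings implementation; it only mirrors the documented result
     shape (one count per possible oligonucleotide, zeros included).

```r
# Hypothetical helper (not part of Biostrings): count overlapping k-mers
# in a plain character string using base R only.
count_kmers <- function(s, width, alphabet = c("A", "C", "G", "T")) {
  n <- nchar(s)
  # Every window of the requested width, or none if the string is too short
  kmers <- if (n >= width) {
    substring(s, 1:(n - width + 1), width:n)
  } else {
    character(0)
  }
  # All possible strings of that width; expand.grid() varies its first
  # column fastest, so reversing the columns makes the rightmost letter
  # move fastest ("AA", "AC", "AG", ...)
  grid <- expand.grid(rep(list(alphabet), width), stringsAsFactors = FALSE)
  all_kmers <- do.call(paste0, rev(unname(as.list(grid))))
  # Tabulate against the full set so unseen k-mers get a count of 0;
  # windows containing letters outside 'alphabet' are dropped
  counts <- table(factor(kmers, levels = all_kmers))
  setNames(as.integer(counts), all_kmers)
}

dinuc <- count_kmers("ACGTACGT", 2)
dinuc[dinuc > 0]   # AC = 2, CG = 2, GT = 2, TA = 1
```

     A string of length n contributes n - width + 1 windows, so the
     counts above sum to 7.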

     In this man page we call "DNA input" a DNAString object, or a
     DNAStringSet object, or an XStringViews object with a DNAString
     subject, or a MaskedDNAString object. Similarly we call "RNA
     input" an RNAString object, or an RNAStringSet object, or an
     XStringViews object with an RNAString subject, or a
     MaskedRNAString object.

_U_s_a_g_e:

       alphabetFrequency(x, baseOnly=FALSE, freq=FALSE, ...)
       hasOnlyBaseLetters(x)
       uniqueLetters(x)

       dinucleotideFrequency(x, freq=FALSE, fast.moving.side="right",
                             as.matrix=FALSE, with.labels=TRUE, ...)
       trinucleotideFrequency(x, freq=FALSE, fast.moving.side="right",
                              as.array=FALSE, with.labels=TRUE, ...)
       oligonucleotideFrequency(x, width, freq=FALSE, fast.moving.side="right",
                                as.array=FALSE, with.labels=TRUE, ...)
       oligonucleotideTransitions(x, left=1, right=1, freq=FALSE)

       ## Some related utility functions
       strrev(x)
       mkAllStrings(alphabet, width, fast.moving.side="right")

_A_r_g_u_m_e_n_t_s:

       x: An XString, XStringSet, XStringViews or MaskedXString object
          for the '*Frequency' and 'uniqueLetters' functions.

          "DNA or RNA input" for 'hasOnlyBaseLetters'.

          A character vector for 'strrev'. 

baseOnly: 'TRUE' or 'FALSE'. If 'TRUE', the returned vector only
          contains frequencies for the letters in the "base" alphabet
          i.e. "A", "C", "G", "T" if 'x' is a "DNA input", and "A",
          "C", "G", "U" if 'x' is "RNA input". When 'x' is a BString
          object (or an XStringViews object with a BString subject, or
          a BStringSet object), then the 'baseOnly' argument is
          ignored. 

    freq: If 'TRUE' then relative frequencies (a double vector) are
          reported, otherwise integer counts.

     ...: Further arguments to be passed to or from other methods. For
          the XStringViews and XStringSet methods, the 'collapse'
          argument is accepted. 

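fast.moving.side: Which side of the strings should move fastest, either
          '"right"' (the default) or '"left"'. With '"right"' the last
          letter varies fastest (e.g. "AA", "AC", "AG", "AT", "CA", ...).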

as.matrix: If 'TRUE' then return a numeric matrix, otherwise a numeric
          vector with no dim attribute. 

as.array: If 'TRUE' then return a numeric array, otherwise a numeric
          vector with no dim attribute. 

with.labels: If 'TRUE' then return a named vector (or array). 

   width: The number of nucleotides per oligonucleotide for
          'oligonucleotideFrequency'. The number of letters per string
          for 'mkAllStrings'. 

left, right: The number of nucleotides per oligonucleotide for the rows
          and columns respectively in the transition matrix created by
          'oligonucleotideTransitions'. 

alphabet: The alphabet to use to make the strings. 
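
     To make the 'fast.moving.side' ordering concrete, here is a small
     base-R sketch. The helpers 'all_strings' and 'rev_string' are
     hypothetical stand-ins written for illustration; they are not the
     package's 'mkAllStrings' or 'strrev', and only reproduce the
     ordering relationship that the Examples section reports.

```r
# Hypothetical base-R stand-in for the ordering documented for
# mkAllStrings(); not the package's implementation.
all_strings <- function(alphabet, width,
                        fast.moving.side = c("right", "left")) {
  fast.moving.side <- match.arg(fast.moving.side)
  # expand.grid() varies its first column fastest
  grid <- expand.grid(rep(list(alphabet), width), stringsAsFactors = FALSE)
  cols <- unname(as.list(grid))
  # Put the fastest-moving column last for "right", first for "left"
  if (fast.moving.side == "right") cols <- rev(cols)
  do.call(paste0, cols)
}

# Hypothetical helper: reverse the letters of each string
rev_string <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

right <- all_strings(c("A", "C"), 2)           # "AA" "AC" "CA" "CC"
left  <- all_strings(c("A", "C"), 2, "left")   # "AA" "CA" "AC" "CC"
identical(rev_string(right), left)             # TRUE
```

     Reversing every string of the "right" ordering yields the "left"
     ordering, which is the relationship checked with 'strrev' at the
     end of the Examples section.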

_D_e_t_a_i_l_s:

     'alphabetFrequency' and 'oligonucleotideFrequency' are generic
     functions defined in the Biostrings package with methods defined
     for BString, DNAString, RNAString, XStringViews and XStringSet
     objects.

_V_a_l_u_e:

     All the '*Frequency' functions return an integer vector if 'freq'
     is 'FALSE' (default), otherwise a double vector. If 'as.matrix' or
     'as.array' is 'TRUE', this vector is formatted as a matrix or an
     array.
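
     The difference between the two return types, and the reshaping
     into a matrix, can be sketched in base R with made-up counts. The
     numbers below are invented for illustration. The matrix layout
     (first letter on rows) is an assumption, chosen to match the array
     example in this page where 'tri["A", "A", "C"]' corresponds to
     "AAC".

```r
bases <- c("A", "C", "G", "T")

# Dinucleotide labels in "right" order: AA, AC, AG, AT, CA, ...
labels <- as.vector(t(outer(bases, bases, paste0)))

# Made-up integer counts, one per dinucleotide
counts <- setNames(seq_along(labels), labels)
typeof(counts)                      # "integer"

# Relative frequencies are counts divided by their total
freqs <- counts / sum(counts)
typeof(freqs)                       # "double"

# Matrix form: rows index the first letter, columns the second
m <- matrix(counts, nrow = length(bases), byrow = TRUE,
            dimnames = list(bases, bases))
m["A", "C"] == counts[["AC"]]       # TRUE
```

     Dropping the dimensions with 'as.vector(m)' reads the matrix
     column by column, which gives the "left" ordering (AA, CA, GA,
     TA, AC, ...).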

     For 'alphabetFrequency': if 'x' is a "DNA or RNA input", then the
     returned vector is named with the letters in the alphabet (unless
     'with.labels' is 'FALSE'). If the 'baseOnly' argument is 'TRUE',
     then the returned vector has only 5 elements: 4 elements
     corresponding to the 4 nucleotides + the 'other' element.

     'dinucleotideFrequency' (resp. 'trinucleotideFrequency' and
     'oligonucleotideFrequency') only works on "DNA or RNA input" and
     returns a vector named with all the possible dinucleotides (resp.
     trinucleotides or oligonucleotides).

     If 'x' is a multiple sequence input (i.e. an XStringViews or
     XStringSet object), then the returned object is a matrix (or a
     list) with the same number of rows (or elements) as 'x' unless
     'collapse=TRUE' is specified. In that case the returned vector (or
     array) contains the frequencies cumulated across all sequences in
     'x'.
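
     As an illustration of the two result shapes for a multiple
     sequence input, the following base-R sketch builds a matrix of
     letter counts with one row per sequence and then sums it across
     sequences. The helper 'count_letters' and the three toy sequences
     are hypothetical; the "other" column mimics the 5-element layout
     described for 'baseOnly=TRUE' and is not the package's code.

```r
seqs  <- c("ACGT", "AAAA", "GGCN")   # toy input, invented for illustration
bases <- c("A", "C", "G", "T")

# Hypothetical helper: counts for the 4 bases plus an "other" element
count_letters <- function(s) {
  ch <- strsplit(s, "")[[1]]
  base_counts <- table(factor(ch, levels = bases))
  c(setNames(as.integer(base_counts), bases),
    other = sum(!ch %in% bases))
}

# One row per sequence, one column per letter
per_seq <- t(sapply(seqs, count_letters, USE.NAMES = FALSE))
per_seq

# Counts summed across all sequences, as a single vector
collapsed <- colSums(per_seq)
collapsed   # A = 5, C = 2, G = 3, T = 1, other = 1
```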

     'hasOnlyBaseLetters' returns 'TRUE' or 'FALSE' indicating whether
     or not 'x' contains only base letters (i.e. As, Cs, Gs and Ts for
     "DNA input" and As, Cs, Gs and Us for "RNA input").

     'uniqueLetters' returns a vector of 1-letter or empty strings. The
     empty string is used to represent the NUL character if 'x' happens
     to contain any. Note that this can only happen if the XString base
     subtype of 'x' is BString.
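
     For plain character strings, the two checks described above can be
     approximated in base R as follows. This is only a sketch of the
     idea on an invented string. The order in which the package's
     'uniqueLetters' reports letters is not stated in this page, so the
     order shown here (first appearance) is an assumption.

```r
s <- "ACGTNACGT"   # toy DNA-like string containing one non-base letter

# Distinct letters, in order of first appearance
letters_seen <- unique(strsplit(s, "")[[1]])
letters_seen                         # "A" "C" "G" "T" "N"

# TRUE only when no letter falls outside the DNA base alphabet
only_bases <- !grepl("[^ACGT]", s)
only_bases                           # FALSE
!grepl("[^ACGT]", "ACGTACGT")        # TRUE
```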

_A_u_t_h_o_r(_s):

     H. Pages

_S_e_e _A_l_s_o:

     'countPDict', XString-class, XStringSet-class, XStringViews-class,
     MaskedXString-class, 'reverse,XString-method', 'rev', 'strsplit',
     'GENETIC_CODE', 'AMINO_ACID_CODE'

_E_x_a_m_p_l_e_s:

       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)

       alphabetFrequency(yeast1)
       alphabetFrequency(yeast1, baseOnly=TRUE)
       hasOnlyBaseLetters(yeast1)
       uniqueLetters(yeast1)

       dinucleotideFrequency(yeast1)
       trinucleotideFrequency(yeast1)
       oligonucleotideFrequency(yeast1, 4)

       ## With a multiple sequence input
       library(drosophila2probe)
       x <- DNAStringSet(drosophila2probe$sequence)
       alphabetFrequency(x[1:50], baseOnly=TRUE)
       alphabetFrequency(x, baseOnly=TRUE, collapse=TRUE)

       ## Get the least and most represented 6-mers
       f6 <- oligonucleotideFrequency(yeast1, 6)
       f6[f6 == min(f6)]
       f6[f6 == max(f6)]

       ## Get the result as an array
       tri <- trinucleotideFrequency(yeast1, as.array=TRUE)
       tri["A", "A", "C"] # == trinucleotideFrequency(yeast1)["AAC"]
       tri["T", , ] # frequencies of trinucleotides starting with a "T"

       ## Get nucleotide transition matrices for yeast1
       oligonucleotideTransitions(yeast1)
       oligonucleotideTransitions(yeast1, 2, freq=TRUE)

       ## Note that when dropping the dimensions of the 'tri' array, elements
       ## in the resulting vector are ordered as if they were obtained with
       ## 'fast.moving.side="left"':
       triL <- trinucleotideFrequency(yeast1, fast.moving.side="left")
       all(as.vector(tri) == triL) # TRUE

       ## Convert the trinucleotide frequency into the amino acid frequency based on
       ## translation
       tri1 <- trinucleotideFrequency(yeast1)
       names(tri1) <- GENETIC_CODE[names(tri1)]
       sapply(split(tri1, names(tri1)), sum) # 12512 occurrences of the stop codon

       ## When the returned vector is very long (e.g. width >= 10), using
       ## 'with.labels=FALSE' will improve the performance considerably (100x, 1000x
       ## or more):
       f12 <- oligonucleotideFrequency(yeast1, 12, with.labels=FALSE) # very fast!

       ## Some related utility functions
       dict1 <- mkAllStrings(LETTERS[1:3], 4)
       dict2 <- mkAllStrings(LETTERS[1:3], 4, fast.moving.side="left")
       identical(strrev(dict1), dict2) # TRUE 
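
     The transition matrix produced with 'left=1' and 'right=1' can be
     pictured with a base-R sketch that counts how often each letter is
     immediately followed by each other letter. The helper
     'transition_counts' is hypothetical and not the package's
     implementation. Dividing each row by its total is one natural way
     to turn the counts into frequencies; whether 'freq=TRUE'
     normalises in exactly this way is an assumption.

```r
# Hypothetical helper, base R only: count transitions between
# adjacent letters of a plain character string.
transition_counts <- function(s, alphabet = c("A", "C", "G", "T")) {
  ch <- strsplit(s, "")[[1]]
  from <- factor(head(ch, -1), levels = alphabet)  # letter at position i
  to   <- factor(tail(ch, -1), levels = alphabet)  # letter at position i + 1
  # Rows index the left letter, columns the right letter
  unclass(table(from, to))
}

tc <- transition_counts("AACGTA")
tc                      # A->A, A->C, C->G, G->T and T->A each occur once

# Row-normalised version (assumed reading of frequencies)
tp <- tc / rowSums(tc)
tp["A", ]               # A = 0.5, C = 0.5, G = 0, T = 0
```

     A row whose letter never appears as a left neighbour would have a
     zero total and produce NaN in the normalised matrix; the toy input
     avoids that case.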

