Contents

Please report comments or bugs to Pierre Lefeuvre - pierre.lefeuvre@cirad.fr

BoSSA CRAN page

1 Summary

Phylogenetic placements represents mapping of query sequences to a reference phylogenetic tree. Put it another way, it corresponds to the position in a tree where a query sequence fits. Different tools exits to infer phylogenetic placements with the EPA algorithm and pplacer. Importantly, both EPA and pplacer produces placements under a common file format. Placements can later be analysed using the guppy software from the pplacer suite to obtain statistically based taxonomic classification of sequences. The BoSSA package implements functions to reads, plots and summarizes phylogentic placements. This vignette is intended to provide functional examples of placements analyses using BoSSA.

2 Important notes

3 How to obtain phylogentic placement file suitable for analysis with BoSSA ?

Assuming you dispose of some query sequences (either full sequences,contigs or short reads) you want to assign to a phylogeny/taxonomy, the process to obtain phylogenetic placement is to (1) build a reference package that contains an align set of reference sequences and a reference phylogenetic tree, (2) align query sequences to the reference alignment, (3) use EPA or pplacer to infer placements (jplace file output, format describe here) and optionally (4) infer the classification of each sequences using guppy (sqlite file output).

Let’s say you now dispose of a reference package (refpkg), a placement file (jplace file) and a guppy classification output (sqlite file). Beside the files available with the package, you can find examples of reference packages and jplaces files from the Matsen group pplacer tutorials. The sqlite file was obtained using the following command:

guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c example.refpkg --sqlite example.sqlite example.jplace

4 Exploration of a reference package

Let’s start by loading the `BoSSA-package:

library("BoSSA")

A good practice would be to inspect the refpkg content.

refpkg_path <- paste(find.package("BoSSA"),"/extdata/example.refpkg",sep="")
refpkg(refpkg_path)
## ### Reference package summary
## 
## Path:/tmp/RtmpBgTh5a/Rinst69912bf019b/BoSSA/extdata/example.refpkg
## 
## Tree with 652 tips 650 nodes
## 
## Classification:
## root 1 
## below_root 1 
## superkingdom 1 
## below_superkingdom 1 
## below_below_superkingdom 1 
## superphylum 1 
## phylum 6 
## subphylum 1 
## class 11 
## subclass 2 
## order 15 
## below_order 3 
## below_below_order 1 
## suborder 3 
## family 28 
## below_family 5 
## genus 45 
## species_group 6 
## species_subgroup 1 
## species 138

It is possible to extract the taxonomy of the sequences included in the refpkg.

taxo <- refpkg(refpkg_path,type="taxonomy")
head(taxo)
##   tax_id parent_id                     rank                     tax_name
## 1      1         1                     root                         root
## 2 131567         1               below_root           cellular organisms
## 3      2    131567             superkingdom                     Bacteria
## 4   2323         2       below_superkingdom        unclassified Bacteria
## 5  95818      2323 below_below_superkingdom       candidate division TM7
## 6  68336         2              superphylum Bacteroidetes/Chlorobi group
##   root below_root superkingdom below_superkingdom below_below_superkingdom
## 1    1                                                                    
## 2    1     131567                                                         
## 3    1     131567            2                                            
## 4    1     131567            2               2323                         
## 5    1     131567            2               2323                    95818
## 6    1     131567            2                                            
##   superphylum phylum subphylum class subclass order below_order
## 1                                                              
## 2                                                              
## 3                                                              
## 4                                                              
## 5                                                              
## 6       68336                                                  
##   below_below_order suborder family below_family genus species_group
## 1                                                                   
## 2                                                                   
## 3                                                                   
## 4                                                                   
## 5                                                                   
## 6                                                                   
##   species_subgroup species below_species
## 1                                       
## 2                                       
## 3                                       
## 4                                       
## 5                                       
## 6

or display a pie chart that summarize the taxonomy…

refpkg(refpkg_path,type="pie",cex.text=0.5)

… or a subset of the taxonomy levels. Here, an example with the “class”, “order” and “family” levels. Note there is a slight decay between the text labels and slices… this will need a fix for next package update.

refpkg(refpkg_path,type="pie",rank_pie=c("class","order","family"),cex.text=0.6)

Finally, a tree display with branch colored according to a given taxonomic level is available. Here tips are colored according to the “order” classification.

refpkg(refpkg_path,type="tree",rank_tree="class",cex.text=0.5)

5 Loading the example data

The BoSSA package come along with examples of phylogenetic placements from the Mastens group.

sqlite_file <- system.file("extdata", "example.sqlite", package = "BoSSA")
jplace_file <- system.file("extdata", "example.jplace", package = "BoSSA")

To read the data, use the read_sqlite function.

pplace <- read_sqlite(sqlite_file,jplace_file)
pplace
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c vaginal_16s.refpkg --sqlite example.sqlite example.jplace
## sequence nb: 100
## placement nb: 30

On screen a summary of the object is printed with the number of runs, the command line, the number of placements and the number of sequences being placed. Pplace objects are list with five components :

str(pplace)
## List of 5
##  $ run       :'data.frame':  1 obs. of  2 variables:
##   ..$ run_id: int 1
##   ..$ params: chr "/home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --c"| __truncated__
##  $ taxo      :'data.frame':  281 obs. of  3 variables:
##   ..$ tax_id  : chr [1:281] "1" "103621" "109790" "113286" ...
##   ..$ tax_name: chr [1:281] "root" "Actinomyces urogenitalis" "Lactobacillus jensenii" "Pseudoramibacter" ...
##   ..$ rank    : chr [1:281] "root" "species" "species" "genus" ...
##  $ multiclass:'data.frame':  100 obs. of  6 variables:
##   ..$ placement_id: int [1:100] 28 22 1 28 5 3 9 28 17 28 ...
##   ..$ name        : chr [1:100] "GLKT0ZE01A0Y5X" "GLKT0ZE01A1B65" "GLKT0ZE01A608I" "GLKT0ZE01A6F1M" ...
##   ..$ want_rank   : chr [1:100] "species" "species" "species" "species" ...
##   ..$ rank        : chr [1:100] "species" "species" "species" "species" ...
##   ..$ tax_id      : chr [1:100] "40543" "187101" "906_1" "40543" ...
##   ..$ likelihood  : num [1:100] 1 0.857 1 1 1 ...
##  $ placement :'data.frame':  193 obs. of  9 variables:
##   ..$ placement_id      : int [1:193] 1 1 1 1 1 1 1 2 2 2 ...
##   ..$ location          : num [1:193] 555 556 553 552 557 554 550 111 115 113 ...
##   ..$ ml_ratio          : num [1:193] 0.143 0.143 0.143 0.143 0.143 ...
##   ..$ log_like          : num [1:193] -11608 -11608 -11608 -11608 -11608 ...
##   ..$ distal_bl         : num [1:193] 5.15e-07 5.15e-07 5.15e-07 7.80e-06 5.78e-06 ...
##   ..$ pendant_bl        : num [1:193] 0.0584 0.0584 0.0584 0.0584 0.0584 ...
##   ..$ tax_id            : chr [1:193] "906_1" "906_1" "906_1" "906_1" ...
##   ..$ map_identity_ratio: num [1:193] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ map_identity_denom: int [1:193] NA NA NA NA NA NA NA NA NA NA ...
##  $ arbre     :List of 5
##   ..$ edge       : int [1:1301, 1:2] 653 654 655 656 657 658 658 657 656 659 ...
##   ..$ Nnode      : int 650
##   ..$ tip.label  : chr [1:652] "S000010758" "S002034883" "S000006177" "S001188095" ...
##   ..$ edge.length: num [1:1301] 0.0014 0.2102 0.0728 0.1295 0.1131 ...
##   ..$ root.edge  : num 0
##   ..- attr(*, "class")= chr "phylo"
##   ..- attr(*, "order")= chr "cladewise"
##  - attr(*, "class")= chr "pplace"

6 Some plots

Three different tree plots are available to display placements:

plot(pplace,type="number",main="number",cex.number=1.5)

plot(pplace,type="color",main="color",edge.width=2)

plot(pplace,type="precise",main="precise")

Note that it is possible to apply a function to modify the dot size using the transfo option. In the following example, the dot size is multiplied by 2. In some other cases log or log10 transformations could be usefull. Beware that when using the transfo option, the legend does not anymore correspond to the placement size but to the transform dot size (i.e. the transform function applied to the dot size).

plot(pplace,type="precise",main="precise",transfo=function(X){X*2})

7 Subsetting the pplace object

Placement object can be subseted. This could be done using placements ids…

sub1 <- sub_pplace(pplace,placement_id=1:100)
sub1
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c vaginal_16s.refpkg --sqlite example.sqlite example.jplace
## sequence nb: 100
## placement nb: 30

…or using placements names.

ids <- sample(pplace$multiclass$name,50)
sub2 <- sub_pplace(pplace,ech_id=ids)
sub2
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c vaginal_16s.refpkg --sqlite example.sqlite example.jplace
## sequence nb: 50
## placement nb: 18

8 Conversion

8.1 To a table

Using the pplace_to_table function produces a table taht contains the placement information along with the classification for each sequence. The output can be limited to the “best” placement (as in the example, i.e. the placements with the highest likelihood for each sequence).

pplace_table <- pplace_to_table(pplace,type="best")
head(pplace_table,n=3)
##   placement_id           name want_rank    rank tax_id likelihood location
## 1           28 GLKT0ZE01A0Y5X   species species  40543  1.0000000      161
## 2           22 GLKT0ZE01A1B65   species species 187101  0.8571653      152
## 3            1 GLKT0ZE01A608I   species species  906_1  1.0000000      555
##    ml_ratio  log_like    distal_bl   pendant_bl tax_id.1
## 1 0.1428572 -11098.20 5.149085e-07 6.113515e-06    40543
## 2 0.1428611 -11098.20 5.149085e-07 6.113515e-06   187101
## 3 0.1428572 -11607.93 5.149085e-07 5.842901e-02    906_1
##   map_identity_ratio map_identity_denom
## 1                 NA                 NA
## 2                 NA                 NA
## 3                 NA                 NA

8.2 To a contingency matrix

The pplace_to_matrix produces a contingency table. Let say the first 50 sequences in the multiclass table correspond to sequence from “sample 1” and the following 50 correspond to “sample 2”, the function output a ocntingency table for these two samples. You can either have the taxonomic names (tax_name=TRUE, in the example) or keep the taxonomic ids (tax_name=FALSE).

example_contingency <- pplace_to_matrix(pplace,c(rep("sample1",50),rep("sample2",50)),tax_name=TRUE)
example_contingency
##         Sneathia sanguinegens Leptotrichia amnionii Megasphaera sp. type 1
## sample1                    18                    10                      6
## sample2                    14                    14                      5
##         Atopobium vaginae Gardnerella vaginalis Dialister sp. type 2
## sample1                 1                     2                    1
## sample2                 1                     3                    0
##         Prevotella genogroup 1 Lactobacillus iners BVAB1
## sample1                      3                   4     1
## sample2                      5                   0     2
##         Prevotella genogroup 2 Sneathia Prevotella genogroup 3
## sample1                      1        2                      1
## sample2                      0        0                      1
##         Aerococcus christensenii Parvimonas micra Eggerthella
## sample1                        0                0           0
## sample2                        1                2           2

8.3 To a taxonomy

Using the pplace_to_taxonomy function, a taxonomy table is compiled for each sequences with the taxonomy levels defined in the reference package. The taxonomy levels can be limited to a set of levels using the rank option.

example_taxo <- pplace_to_taxonomy(pplace,taxo,tax_name=TRUE,rank=c("order","family","genus","species"))
head(example_taxo)
##                order               family               genus        
## GLKT0ZE01A0Y5X "Fusobacteriales"   "Fusobacteriaceae"   "Sneathia"   
## GLKT0ZE01A1B65 "Fusobacteriales"   "Fusobacteriaceae"   "Sneathia"   
## GLKT0ZE01A608I "Clostridiales"     "Veillonellaceae"    "Megasphaera"
## GLKT0ZE01A6F1M "Fusobacteriales"   "Fusobacteriaceae"   "Sneathia"   
## GLKT0ZE01AH4NR "Coriobacteriales"  "Coriobacteriaceae"  "Atopobium"  
## GLKT0ZE01AHXIQ "Bifidobacteriales" "Bifidobacteriaceae" "Gardnerella"
##                species                 
## GLKT0ZE01A0Y5X "Sneathia sanguinegens" 
## GLKT0ZE01A1B65 "Leptotrichia amnionii" 
## GLKT0ZE01A608I "Megasphaera sp. type 1"
## GLKT0ZE01A6F1M "Sneathia sanguinegens" 
## GLKT0ZE01AH4NR "Atopobium vaginae"     
## GLKT0ZE01AHXIQ "Gardnerella vaginalis"

8.4 Make a phyloseq object

Assuming the sequences in the pplace object represent centroids of sequence cluster you obtained from multiple samples, using the taxonomy table and an appropriate OTU file, you can create a phyloseq object.

example_OTU <- matrix(sample(1:100, 500, replace = TRUE), nrow = 100, ncol = 5,dimnames=list(pplace$multiclass$name,paste("sample",1:5,sep="_")))
head(example_OTU)
##                sample_1 sample_2 sample_3 sample_4 sample_5
## GLKT0ZE01A0Y5X       36       28       59       34       44
## GLKT0ZE01A1B65       16       31       30       12       39
## GLKT0ZE01A608I       69       77       66       77        8
## GLKT0ZE01A6F1M       96       83       16       98       84
## GLKT0ZE01AH4NR       34       66       88       73       86
## GLKT0ZE01AHXIQ       11       64       48       87       51
library(phyloseq)
example_phyloseq <- phyloseq(otu_table(example_OTU,taxa_are_rows=TRUE),tax_table(example_taxo))
example_phyloseq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 100 taxa and 5 samples ]
## tax_table()   Taxonomy Table:    [ 100 taxa by 4 taxonomic ranks ]

9 Citation

If you find phyloseq and/or its tutorials useful, you may cite BoSSA:

citation("BoSSA")
## 
## To cite package 'BoSSA' in publications use:
## 
##   Pierre Lefeuvre (2017). BoSSA: A Bunch of Structure and Sequence
##   Analysis. R package version 2.2.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {BoSSA: A Bunch of Structure and Sequence Analysis},
##     author = {Pierre Lefeuvre},
##     year = {2017},
##     note = {R package version 2.2},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

10 Other resources

10.1 On phylogenetic placements

pplacer website and documentation

taxtastic

10.3 R packages used by BoSSA

RSQLite and jsonlite to read files, ape and phangorn to manipulate phylogenetic trees and plotrix for pie charts.