The biomartr package allows users to retrieve biological sequences in a very simple and intuitive way.

Using biomartr, users can retrieve genomes, proteomes, or CDS data using the specialized functions getGenome(), getProteome(), and getCDS().

Getting Started with Sequence Retrieval

First, users can check whether the genome, proteome, or CDS of the organism of interest is available for download.

Using the scientific name of the organism of interest, users can check whether the corresponding genome is available via the is.genome.available() function.

# checking whether or not the Arabidopsis thaliana
# genome is available for download
is.genome.available("Arabidopsis thaliana")
[1] TRUE

By specifying the details = TRUE argument, the genome file size as well as additional information can be printed to the console.

# printing details to the console
is.genome.available("Arabidopsis thaliana", details = TRUE)
           organism_name  kingdoms  group    subgroup file_size_MB chrs organelles plasmids bio_projects
682 Arabidopsis thaliana Eukaryota Plants Land Plants      119.668    6          2       NA            6

Users will observe that the Arabidopsis thaliana genome file has a size of 119.668 MB.

Note: The availability of genomes has been taken from NCBI.

Users can determine the total number of available genomes using the listGenomes() function.

length(listGenomes())
[1] 13074

Hence, currently 13074 genomes (including all kingdoms of life) are stored on NCBI servers.

Optionally, users can also specify the database for which the availability of organisms shall be checked.

# checking whether A. thaliana is available in the refseq database
is.genome.available("Arabidopsis thaliana", database = "refseq")
[1] TRUE

Users can also determine the total number of genomes stored in refseq.

length(listGenomes(database = "refseq"))
[1] 4896

This result shows that so far (year 2015) 4896 genomes are stored in refseq.

The simplest way to work with listGenomes() is to print available genomes to the console.

# the simplest way to retrieve names of available genomes stored within NCBI databases
head(listGenomes() , 5)
[1] "'Chrysanthemum coronarium' phytoplasma"      
[2] "'Deinococcus soli' Cha et al. 2014"          
[3] "Abaca bunchy top virus"                      
[4] "Abalone herpesvirus Victoria/AUS/2009"       
[5] "Abalone shriveling syndrome-associated virus"

In case users are interested in a detailed output of the corresponding organism file stored on NCBI, again they can specify the details = TRUE argument.

# show all details
head(listGenomes(details = TRUE) , 5)
                                 organism_name kingdoms                       group
1       'Chrysanthemum coronarium' phytoplasma Bacteria                 Tenericutes
2           'Deinococcus soli' Cha et al. 2014 Bacteria         Deinococcus-Thermus
3                       Abaca bunchy top virus  Viruses               ssDNA viruses
4        Abalone herpesvirus Victoria/AUS/2009  Viruses dsDNA viruses, no RNA stage
5 Abalone shriveling syndrome-associated virus  Viruses dsDNA viruses, no RNA stage
      subgroup file_size_MB chrs organelles plasmids bio_projects
1   Mollicutes     0.739592   NA         NA       NA            1
2   Deinococci     3.236980    1         NA       NA            1
3  Nanoviridae     0.006422    6         NA       NA            1
4 unclassified     0.211518    1         NA       NA            1
5 unclassified     0.034952    1         NA       NA            1

Users will observe that the detailed output includes the columns organism_name, kingdoms, group, subgroup, file_size_MB, chrs, organelles, plasmids, and bio_projects.

In case users are interested in organisms classified into a specific kingdom of life, they can use the kingdom argument to filter for organisms that are classified into the corresponding kingdom.

# show all details only for Bacteria
head(listGenomes(kingdom = "Bacteria", details = TRUE) , 5)
                           organism_name kingdoms               group              subgroup
1 'Chrysanthemum coronarium' phytoplasma Bacteria         Tenericutes            Mollicutes
2     'Deinococcus soli' Cha et al. 2014 Bacteria Deinococcus-Thermus            Deinococci
3                  Abiotrophia defectiva Bacteria          Firmicutes               Bacilli
4                 Acaricomes phytoseiuli Bacteria      Actinobacteria        Actinobacteria
5                          Acaryochloris Bacteria       Cyanobacteria Oscillatoriophycideae
  file_size_MB chrs organelles plasmids bio_projects
1     0.739592   NA         NA       NA            1
2     3.236980    1         NA       NA            1
3     2.043440   NA         NA       NA            1
4     2.419520   NA         NA       NA            1
5     7.875480   NA         NA       NA            1

The following filters can be specified for the kingdom argument: all, Archaea, Bacteria, Eukaryota, Viroids, and Viruses.

Furthermore, users can count the number of available genomes per kingdom as follows:

# the number of genomes available for each kingdom
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "kingdoms"])
  Archaea  Bacteria Eukaryota   Viroids   Viruses 
      386      6473      1454        45      4716
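The same table can be turned into proportions with base R. The following minimal sketch does not contact NCBI: it rebuilds the five rows printed by head(listGenomes(details = TRUE), 5) above as a small stand-in data frame (the object name genomes_demo is an assumption for illustration), so the numbers describe this excerpt only, not the full Genome Reports table.

```r
# Stand-in for listGenomes(details = TRUE): only the five rows printed above
genomes_demo <- data.frame(
  organism_name = c("'Chrysanthemum coronarium' phytoplasma",
                    "'Deinococcus soli' Cha et al. 2014",
                    "Abaca bunchy top virus",
                    "Abalone herpesvirus Victoria/AUS/2009",
                    "Abalone shriveling syndrome-associated virus"),
  kingdoms      = c("Bacteria", "Bacteria", "Viruses", "Viruses", "Viruses"),
  file_size_MB  = c(0.739592, 3.236980, 0.006422, 0.211518, 0.034952),
  stringsAsFactors = FALSE
)

# absolute counts per kingdom
kingdom_counts <- table(genomes_demo[ , "kingdoms"])

# relative share of each kingdom (fractions that sum to 1)
kingdom_share <- prop.table(kingdom_counts)

print(kingdom_counts)
print(round(kingdom_share, 2))
```

Replacing genomes_demo with the real result of listGenomes(details = TRUE) should give the shares for the complete table.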

Analogous computations can be performed for group, subgroup, etc.

# the number of genomes available for each group
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "group"])
                   Actinobacteria                   Aenigmarchaeota 
                              934                                 3 
                          Animals                         Aquificae 
                              562                                16 
                  Armatimonadetes                     Avsunviroidae 
                                3                                 4 
     Bacteroidetes/Chlorobi group                       Caldiserica 
                              481                                 2 
 Chlamydiae/Verrucomicrobia group                       Chloroflexi 
                               60                                32 
                   Chrysiogenetes                     Crenarchaeota 
                                2                                60 
                    Cyanobacteria                   Deferribacteres 
                              103                                 6 
              Deinococcus-Thermus                        Deltavirus 
                               43                                 1 
                   Diapherotrites                       Dictyoglomi 
                                3                                 2 
      dsDNA viruses, no RNA stage                     dsRNA viruses 
                             2008                               213 
                    Elusimicrobia                     Euryarchaeota 
                                3                               236 
Fibrobacteres/Acidobacteria group                        Firmicutes 
                               47                              1151 
                            Fungi                      Fusobacteria 
                              550                                25 
                 Gemmatimonadetes                      Korarchaeota 
                                5                                 1 
                    Lokiarchaeota                     Nanoarchaeota 
                                1                                10 
                Nanohaloarchaeota                       Nitrospinae 
                                4                                 2 
                      Nitrospirae                             Other 
                               10                                15 
                    Parvarchaeota                    Planctomycetes 
                                2                                22 
                           Plants                     Pospiviroidae 
                              162                                36 
                   Proteobacteria                          Protists 
                             2256                               167 
       Retro-transcribing viruses                        Satellites 
                              130                               214 
                     Spirochaetes                     ssDNA viruses 
                               81                               791 
                    ssRNA viruses                     Synergistetes 
                             1252                                18 
                      Tenericutes                    Thaumarchaeota 
                              132                                49 
            Thermodesulfobacteria                       Thermotogae 
                                7                                26 
               unassigned viruses              unclassified Archaea 
                               10                                17 
    unclassified archaeal viruses             unclassified Bacteria 
                                2                              1004 
              unclassified phages              unclassified viroids 
                               35                                 5 
          unclassified virophages              unclassified viruses 
                                5                                53
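A long table like this is easier to read when it is sorted by count. The sketch below is only an illustration: it copies six of the counts printed above by hand into a named vector (group_counts is a made-up name) and sorts them in decreasing order. The same sort() call can be applied to the output of table().

```r
# Six of the group counts printed above, copied by hand into a named vector
group_counts <- c(Actinobacteria = 934, Animals = 562, Firmicutes = 1151,
                  Fungi = 550, Plants = 162, Proteobacteria = 2256)

# largest groups first
top_groups <- sort(group_counts, decreasing = TRUE)

# show the three best represented groups of this excerpt
print(head(top_groups, 3))
```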

Users can also order organisms by their file size.

# order by file size
library(dplyr)

ncbi_genomes <- listGenomes(details = TRUE)
head(arrange(ncbi_genomes, desc(file_size_MB)) , 10)
               organism_name  kingdoms   group      subgroup file_size_MB chrs organelles plasmids bio_projects
1               Picea glauca Eukaryota  Plants   Land Plants     26936.20   NA         NA       NA            2
2  Acanthoscurria geniculata Eukaryota Animals Other Animals      7178.40   NA         NA       NA            1
3         Locusta migratoria Eukaryota Animals       Insects      5759.80   NA          1       NA            1
4           Orycteropus afer Eukaryota Animals       Mammals      4444.08   NA          1       NA            1
5     Chrysochloris asiatica Eukaryota Animals       Mammals      4210.11   NA          1       NA            1
6      Elephantulus edwardii Eukaryota Animals       Mammals      3843.98   NA         NA       NA            1
7          Triticum aestivum Eukaryota  Plants   Land Plants      3800.33    1          1       NA            6
8            Triticum urartu Eukaryota  Plants   Land Plants      3747.05   NA          1       NA            1
9       Dasypus novemcinctus Eukaryota Animals       Mammals      3631.52   NA          1       NA            1
10         Nicotiana tabacum Eukaryota  Plants   Land Plants      3613.16   NA          1       NA            3

This analysis shows that Picea glauca has the largest genome file (about 26,936 MB) among the genomes listed on the NCBI server.
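If dplyr is not available, the same ordering can be done with base R. This is a minimal sketch on stand-in data: three rows copied from the output above and deliberately shuffled (genomes_demo is an illustrative name, not an object created by biomartr).

```r
# Stand-in data: three rows copied from the output above, in shuffled order
genomes_demo <- data.frame(
  organism_name = c("Locusta migratoria", "Picea glauca",
                    "Acanthoscurria geniculata"),
  file_size_MB  = c(5759.80, 26936.20, 7178.40),
  stringsAsFactors = FALSE
)

# order() returns row indices; decreasing = TRUE puts the largest file first
by_size <- genomes_demo[order(genomes_demo$file_size_MB, decreasing = TRUE), ]

# reset the row names so that they reflect the new order
rownames(by_size) <- NULL
print(by_size)
```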

Internally, the listGenomes() function downloads the Genome Reports file from NCBI and stores it in a tempfile() folder as _ncbi_downloads/overview.txt. The file is downloaded only once and is then read from your hard drive. To update the Genome Reports file, users can specify the update = TRUE argument, which reloads the file from the NCBI server.

# users can also update the organism table using the 'update' argument
head(listGenomes(details = TRUE, update = TRUE) , 5)
                                 organism_name kingdoms                       group     subgroup file_size_MB
1       'Chrysanthemum coronarium' phytoplasma Bacteria                 Tenericutes   Mollicutes     0.739592
2           'Deinococcus soli' Cha et al. 2014 Bacteria         Deinococcus-Thermus   Deinococci     3.236980
3                       Abaca bunchy top virus  Viruses               ssDNA viruses  Nanoviridae     0.006422
4        Abalone herpesvirus Victoria/AUS/2009  Viruses dsDNA viruses, no RNA stage unclassified     0.211518
5 Abalone shriveling syndrome-associated virus  Viruses dsDNA viruses, no RNA stage unclassified     0.034952
  chrs organelles plasmids bio_projects
1   NA         NA       NA            1
2    1         NA       NA            1
3    6         NA       NA            1
4    1         NA       NA            1
5    1         NA       NA            1

Again, the listGenomes() function can be used to filter for available genome information in refseq.

# list all Eukaryota that are stored in refseq
head(listGenomes(kingdom = "Eukaryota", database = "refseq") , 20)
                    organism_name
1               Agaricus bisporus
2          Ajellomyces capsulatus
3        Ajellomyces dermatitidis
4           Arthroderma benhamiae
5                Arthroderma otae
6            Aspergillus clavatus
7              Aspergillus flavus
8           Aspergillus fumigatus
9            Aspergillus nidulans
10              Aspergillus niger
11             Aspergillus oryzae
12            Aspergillus terreus
13           Auricularia delicata
14 Batrachochytrium dendrobatidis
15        Baudoinia compniacensis
16               Bipolaris oryzae
17          Bipolaris sorokiniana
18              Bipolaris zeicola
19               Botrytis cinerea
20               Candida albicans

Or, analogously:

# the number of genomes available for each kingdom stored in refseq
ncbi_genomes <- listGenomes(details = TRUE, database = "refseq")
table(ncbi_genomes[ , "kingdoms"])
  Archaea  Bacteria Eukaryota 
      220      4167       509

Note that when listGenomes() is run for the first time, it might take a while until the function returns any results, because the necessary information needs to be downloaded from NCBI databases. All subsequent executions of listGenomes() respond much faster, because they access the corresponding files stored on your hard drive.
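The download-once behaviour described here can be pictured with a small caching pattern. The sketch below is not biomartr's internal code: cached_file() and fake_fetch() are hypothetical helpers, and the download step is replaced by a local function that writes a one-line file, so the example runs without network access.

```r
# Hypothetical helper (not part of biomartr): fetch a file once, then reuse it.
# 'fetch' is any function that writes the file to the path it is given.
cached_file <- function(path, fetch, update = FALSE) {
  if (update || !file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    fetch(path)
  }
  path
}

# Local stand-in for the download step; it counts how often it is called
n_fetches <- 0
fake_fetch <- function(path) {
  n_fetches <<- n_fetches + 1
  writeLines("organism_name\tkingdoms", path)
}

overview <- file.path(tempdir(), "_ncbi_downloads", "overview.txt")
unlink(overview)                                   # start from a clean state
cached_file(overview, fake_fetch)                  # first call: fetches
cached_file(overview, fake_fetch)                  # second call: reuses the file
cached_file(overview, fake_fetch, update = TRUE)   # forced refresh: fetches again
print(n_fetches)
```

Only the first call and the call with update = TRUE trigger the fetch function; the call in between reads the existing file.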

Downloading Biological Sequences

After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, or CDS file in fasta format. The following functions allow users to download proteomes, genomes, and CDS files from database resources such as refseq. Once a proteome, genome, or CDS file has been downloaded to your hard drive, a documentation *.txt file is generated that stores the File Name, Organism, Database, URL, and DATE information. This improves the reproducibility of the proteome, genome, and CDS versions used in subsequent data analyses.
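To illustrate what such a documentation file might contain, here is a minimal sketch. It is an assumption about the layout, not the code biomartr uses: write_doc_file() is a hypothetical helper, the doc_*.txt file name is invented for this example, and the URL is left as a placeholder.

```r
# Hypothetical helper (not biomartr's own code): write a small provenance file
# that records where a downloaded sequence file came from.
write_doc_file <- function(file_name, organism, database, url, dir = tempdir()) {
  doc_path <- file.path(dir, paste0("doc_", gsub(" ", "_", organism), ".txt"))
  writeLines(c(
    paste("File Name:", file_name),
    paste("Organism:", organism),
    paste("Database:", database),
    paste("URL:", url),
    paste("DATE:", format(Sys.Date()))
  ), doc_path)
  doc_path
}

doc <- write_doc_file(
  file_name = "Arabidopsis_thaliana_genome.fna.gz",
  organism  = "Arabidopsis thaliana",
  database  = "refseq",
  url       = "<download URL goes here>"   # placeholder, not a real address
)

# print the five recorded fields
cat(readLines(doc), sep = "\n")
```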

Genome Retrieval

The easiest way to download a genome is to use the getGenome() function.

In this example we will download the genome of A. thaliana.

The getGenome() function is an interface function to the NCBI refseq database from which corresponding genomes are downloaded.

For this purpose, users need to specify the kingdom into which their organism of interest is classified, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other", and then the scientific name of the organism of interest.
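Because the kingdom value has to be one of a fixed set of strings, it can be worth validating it before starting a download. The following sketch uses base R's match.arg() in a hypothetical helper, check_kingdom(), which is not part of biomartr; the list of values is the one given above.

```r
# Hypothetical helper: validate the 'kingdom' value before starting a download
check_kingdom <- function(kingdom) {
  match.arg(kingdom, c("archaea", "bacteria", "fungi", "invertebrate", "plant",
                       "protozoa", "vertebrate_mammalian", "vertebrate_other"))
}

# a supported value is returned unchanged
valid <- check_kingdom("plant")
print(valid)

# an unsupported value raises an error, which is caught here for demonstration
invalid <- tryCatch(check_kingdom("plants_and_algae"),
                    error = function(e) "invalid kingdom")
print(invalid)
```

Note that match.arg() also accepts unambiguous prefixes, so a stricter check would compare with %in% instead.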

# download the genome of Arabidopsis thaliana from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
getGenome( db       = "refseq", 
           kingdom  = "plant",
           organism = "Arabidopsis thaliana",
           path     = file.path("_ncbi_downloads","genomes") )

The getGenome() function creates a directory named '_ncbi_downloads/genomes' into which the corresponding genome file, named Arabidopsis_thaliana_genome.fna.gz, is downloaded. The read_genome() function enables users to work with the genome as a data.table object.

# path to genome: '_ncbi_downloads/genomes/Arabidopsis_thaliana_genome.fna.gz'
file_path <- file.path("_ncbi_downloads","genomes","Arabidopsis_thaliana_genome.fna.gz")
# read genome as data.table object
Ath_genome <- read_genome(file_path, format = "fasta")

In case users would like to store the genome file at a different location, they can specify the path = file.path("put","your","path","here") argument.
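When the storage location changes, the path passed to read_genome() changes with it. The sketch below rebuilds the file path from the organism name, assuming the naming pattern seen in the example above (spaces replaced by underscores, followed by _genome.fna.gz). genome_file_path() is a hypothetical helper, and the pattern is inferred from this one example rather than taken from the package documentation.

```r
# Hypothetical helper: build the expected genome file path from the organism
# name, assuming the '<Genus>_<species>_genome.fna.gz' naming seen above
genome_file_path <- function(organism,
                             path = file.path("_ncbi_downloads", "genomes")) {
  file.path(path, paste0(gsub(" ", "_", organism), "_genome.fna.gz"))
}

# default location used in the example above
default_path <- genome_file_path("Arabidopsis thaliana")
print(default_path)

# custom location
custom_path <- genome_file_path("Arabidopsis thaliana",
                                path = file.path("put", "your", "path", "here"))
print(custom_path)
```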

Proteome Retrieval

The getProteome() function is also an interface function to the NCBI refseq database from which the corresponding proteomes are downloaded. It works analogously to the getGenome() function.

# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db       = "refseq", 
             kingdom  = "plant",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","proteomes") )

The getProteome() function creates a directory named _ncbi_downloads/proteomes into which the corresponding proteome file, named Arabidopsis_thaliana_protein.faa.gz, is downloaded. The read_proteome() function enables users to work with the proteome as a data.table object.

# path to proteome: '_ncbi_downloads/proteomes/Arabidopsis_thaliana_protein.faa.gz'
file_path <- file.path("_ncbi_downloads","proteomes","Arabidopsis_thaliana_protein.faa.gz")
# read proteome as data.table object
Ath_proteome <- read_proteome(file_path, format = "fasta")

In case users would like to store the proteome file at a different location, they can specify the path = file.path("put","your","path","here") argument.

CDS Retrieval

The getCDS() function is also an interface function to the NCBI refseq database from which the corresponding CDS files are downloaded. It works analogously to the getGenome() and getProteome() functions, but for CDS files.

# download the CDS of Arabidopsis thaliana from refseq
# and store the corresponding CDS file in '_ncbi_downloads/CDS'
getCDS( db       = "refseq", 
        kingdom  = "plant",
        organism = "Arabidopsis thaliana",
        path     = file.path("_ncbi_downloads","CDS") )

The getCDS() function creates a directory named _ncbi_downloads/CDS into which the corresponding CDS file is downloaded. The read_cds() function allows you to read the corresponding CDS as a data.table object.

# path to CDS file: '_ncbi_downloads/CDS/Arabidopsis_thaliana_rna.fna.gz'
file_path <- file.path("_ncbi_downloads","CDS","Arabidopsis_thaliana_rna.fna.gz")
# read CDS as data.table object
Ath_cds <- read_cds(file_path, format = "fasta")

In case users would like to store the CDS file at a different location, they can specify the path = file.path("put","your","path","here") argument.

Furthermore, the getCDS() function checks whether the length of each CDS sequence is divisible by 3 (complete codons). If the sequences of particular genes are not divisible by 3, a warning message reports the number of affected genes. To remove these sequences from their data, users can specify the delete_corrupt = TRUE argument, which deletes all corrupt CDS sequences.
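The divisibility check itself is easy to reproduce. This minimal sketch uses made-up sequences in a stand-in data frame (cds_demo, with geneids and seqs columns chosen for illustration). It shows the idea only and is not the implementation used by the package.

```r
# Made-up example sequences; 'cds_demo' is a stand-in, not real CDS data
cds_demo <- data.frame(
  geneids = c("gene_a", "gene_b", "gene_c"),
  seqs    = c("atggcctaa", "atggcctaag", "atgtaa"),
  stringsAsFactors = FALSE
)

# a CDS is flagged as corrupt when its length is not a multiple of 3
is_corrupt <- nchar(cds_demo$seqs) %% 3 != 0

# report how many sequences are affected
if (any(is_corrupt)) {
  message(sum(is_corrupt), " sequence(s) cannot be divided by 3.")
}

# keep only sequences that consist of complete codons
cds_clean <- cds_demo[!is_corrupt, ]
print(cds_clean$geneids)
```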

Retrieving sequences for a set of genes

For most analyses, only subsets of sequences (taken from the entire genome) are needed. This section introduces several approaches to select a set of sequences for further analyses.

Using the output from getProteome()

As seen before, the getProteome() function allows users to download the entire proteome of a specific organism of interest that is stored in refseq.

# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db       = "refseq", 
             kingdom  = "plant",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","proteomes") )

Again, we first download the A. thaliana proteome. Suppose we are interested in the following two genes for subsequent analyses: AT1G06090 and AT1G06100. Both genes are members of the fatty acid desaturase family and are involved in oxidoreductase activity. To access the protein sequences of these genes in a proteome downloaded from refseq, users first need to map the tair_locus ids to the corresponding refseq ids.

For this purpose users can use the biomart() function [see Functional Annotation for details].

Here we can use the organismFilters() function to retrieve the filter argument for tair gene ids.

# search for filters related to tair
organismFilters("Arabidopsis thaliana", topic = "tair")
Source: local data frame [6 x 4]

                   name                 description           mart           dataset
1            tair_locus            TAIR locus ID(s) plants_mart_26 athaliana_eg_gene
2      tair_locus_model      TAIR locus model ID(s) plants_mart_26 athaliana_eg_gene
3           tair_symbol           TAIR symbol ID(s) plants_mart_26 athaliana_eg_gene
4       with_tair_locus       with TAIR locus ID(s) plants_mart_26 athaliana_eg_gene
5 with_tair_locus_model with TAIR locus model ID(s) plants_mart_26 athaliana_eg_gene
6      with_tair_symbol      with TAIR symbol ID(s) plants_mart_26 athaliana_eg_gene

Users will observe that we need the filter argument tair_locus from mart plants_mart_26 and dataset athaliana_eg_gene.

To find the attribute argument for refseq ids we can use the organismAttributes() function.

# search for attributes related to refseq
organismAttributes("Arabidopsis thaliana", topic = "refseq")
Source: local data frame [3 x 4]

            name       description           mart           dataset
1    refseq_mrna    RefSeq mRNA ID plants_mart_26 athaliana_eg_gene
2   refseq_ncrna   RefSeq ncRNA ID plants_mart_26 athaliana_eg_gene
3 refseq_peptide RefSeq protein ID plants_mart_26 athaliana_eg_gene

Now it is clear that we need the attribute argument refseq_peptide to get the refseq ids that correspond to the tair_locus filter ids. Please note that BioMart frequently updates its mart names, so a newer mart version than plants_mart_26 might have to be specified to obtain a proper result from biomart().

bm <- biomart(genes      = c("AT3G01360", "AT3G01540"),
              mart       = "plants_mart_26", 
              dataset    = "athaliana_eg_gene",
              attributes = "refseq_peptide",
              filters    = "tair_locus")

bm
  tair_locus refseq_peptide
1  AT3G01360   NP_001030617
2  AT3G01360      NP_186785
3  AT3G01540      NP_850492
4  AT3G01540      NP_974206
5  AT3G01540      NP_566141
6  AT3G01540   NP_001030619
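To pick these proteins out of the proteome table, the refseq ids need the version suffix used in the proteome file (".1" in this example) before they are compared with the geneids column. A minimal sketch of an exact-match lookup with base R's %in% follows. It uses a small stand-in data frame (proteome_demo, with shortened placeholder sequences and one made-up id, NP_000000.1) instead of the real Ath_proteome object, so it illustrates the matching step only.

```r
# Stand-in for the proteome table: two ids from the biomart() result above plus
# one made-up id; the sequences are short placeholders, not real proteins
proteome_demo <- data.frame(
  geneids = c("NP_001030617.1", "NP_186785.1", "NP_000000.1"),
  seqs    = c("mgils", "mkkkk", "mllll"),
  stringsAsFactors = FALSE
)

# refseq ids as returned by biomart() (without version suffix)
refseq_ids <- c("NP_001030617", "NP_186785", "NP_850492")

# append the version suffix used in the proteome file, then match exactly
wanted <- paste0(refseq_ids, ".1")
hits   <- proteome_demo[proteome_demo$geneids %in% wanted, ]
print(hits$geneids)

# ids that were requested but are absent from the stand-in table
missing_ids <- setdiff(wanted, proteome_demo$geneids)
print(missing_ids)
```

Exact matching avoids treating the dot in an id as a regular expression wildcard, and setdiff() makes it explicit which requested ids were not found.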
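
# append the version suffix ".1" to each refseq id and look it up
# in the geneids column of the proteome
genes_of_interest <- which(sapply(paste0(bm[ , "refseq_peptide"],".1"),
                                  function(gene) stringr::str_detect(gene,Ath_proteome[ , geneids])))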

na.omit(Ath_proteome[genes_of_interest , list(geneids, seqs)])
          geneids
1: NP_001030617.1

seqs
1: mgilsyavagggfavigawesldssnldpnsssgtadssspmaqirasppksvgsssipvallsslfiansffsffs
sigsrdrvgsmiqlqivavavlflyyailtylvnsknavfvalpssittllllfgfieeflmfylqkkdvsgienryydl
mlvpiaicvfstvfelksdssthrfaklgrgiglilqgtwflqmgvsfftglitnnctfheksrgnftikckghgdyhra
kaiatlqfnchlalmvvvatglfsviankngylrqdhskyrplgaelenlstftldsdeedevreesnvakevglngnsshd