The biomartr package allows users to retrieve biological sequences in a simple and intuitive way.
Using biomartr, users can retrieve genomes, proteomes, or CDS data using the specialized functions:
getGenome()
getProteome()
getCDS()
geneSequence()
First, users can check whether the genome, proteome, or CDS of interest is available for download.
Using the scientific name of the organism of interest, users can check whether the corresponding genome is available via the is.genome.available() function.
# checking whether or not the Arabidopsis thaliana
# genome is available for download
is.genome.available("Arabidopsis thaliana")
[1] TRUE
By specifying the details = TRUE
argument, the genome file size as well as additional information can be printed to the console.
# printing details to the console
is.genome.available("Arabidopsis thaliana", details = TRUE)
organism_name kingdoms group subgroup file_size_MB chrs organelles plasmids bio_projects
682 Arabidopsis thaliana Eukaryota Plants Land Plants 119.668 6 2 NA 6
Users will observe that the Arabidopsis thaliana genome file has a size of 119.668 MB.
Note: The availability of genomes has been taken from NCBI.
Users can determine the total number of available genomes using the listGenomes()
function.
length(listGenomes())
[1] 13074
Hence, currently 13074 genomes (including all kingdoms of life) are stored on NCBI servers.
Optionally, users can also specify the database for which the availability of organisms shall be checked.
# checking whether A. thaliana is available in the refseq database
is.genome.available("Arabidopsis thaliana", database = "refseq")
[1] TRUE
Users can also determine the total number of genomes stored in refseq.
length(listGenomes(database = "refseq"))
[1] 4896
This result shows that so far (year 2015) 4896 genomes are stored in refseq.
The simplest way to work with listGenomes()
is to print available genomes to the console.
# the simplest way to retrieve names of available genomes stored within NCBI databases
head(listGenomes() , 5)
[1] "'Chrysanthemum coronarium' phytoplasma"
[2] "'Deinococcus soli' Cha et al. 2014"
[3] "Abaca bunchy top virus"
[4] "Abalone herpesvirus Victoria/AUS/2009"
[5] "Abalone shriveling syndrome-associated virus"
In case users are interested in a detailed output of the corresponding organism file stored on NCBI, again they can specify the details = TRUE
argument.
# show all details
head(listGenomes(details = TRUE) , 5)
organism_name kingdoms group
1 'Chrysanthemum coronarium' phytoplasma Bacteria Tenericutes
2 'Deinococcus soli' Cha et al. 2014 Bacteria Deinococcus-Thermus
3 Abaca bunchy top virus Viruses ssDNA viruses
4 Abalone herpesvirus Victoria/AUS/2009 Viruses dsDNA viruses, no RNA stage
5 Abalone shriveling syndrome-associated virus Viruses dsDNA viruses, no RNA stage
subgroup file_size_MB chrs organelles plasmids bio_projects
1 Mollicutes 0.739592 NA NA NA 1
2 Deinococci 3.236980 1 NA NA 1
3 Nanoviridae 0.006422 6 NA NA 1
4 unclassified 0.211518 1 NA NA 1
5 unclassified 0.034952 1 NA NA 1
Users will observe that the detailed output includes the columns organism_name, kingdoms, group, subgroup, file_size_MB, chrs, organelles, plasmids, and bio_projects.
In case users are interested in organisms classified into a specific kingdom of life, they can use the kingdom
argument to filter for organisms that are classified into the corresponding kingdom.
# show all details only for Bacteria
head(listGenomes(kingdom = "Bacteria", details = TRUE) , 5)
organism_name kingdoms group subgroup
1 'Chrysanthemum coronarium' phytoplasma Bacteria Tenericutes Mollicutes
2 'Deinococcus soli' Cha et al. 2014 Bacteria Deinococcus-Thermus Deinococci
3 Abiotrophia defectiva Bacteria Firmicutes Bacilli
4 Acaricomes phytoseiuli Bacteria Actinobacteria Actinobacteria
5 Acaryochloris Bacteria Cyanobacteria Oscillatoriophycideae
file_size_MB chrs organelles plasmids bio_projects
1 0.739592 NA NA NA 1
2 3.236980 1 NA NA 1
3 2.043440 NA NA NA 1
4 2.419520 NA NA NA 1
5 7.875480 NA NA NA 1
The following filters can be specified for the kingdom argument: all, Archaea, Bacteria, Eukaryota, Viroids, and Viruses.
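Since the detailed output is used like a data frame in this section, the same kind of kingdom filter can also be applied after the table has been loaded. The following is a minimal sketch in base R on a small stand-in table; the three rows are copied from the output shown earlier, and the object name genomes is only illustrative.

```r
# Toy stand-in for listGenomes(details = TRUE); values copied from the output above
genomes <- data.frame(
  organism_name = c("'Chrysanthemum coronarium' phytoplasma",
                    "'Deinococcus soli' Cha et al. 2014",
                    "Abaca bunchy top virus"),
  kingdoms      = c("Bacteria", "Bacteria", "Viruses"),
  file_size_MB  = c(0.739592, 3.236980, 0.006422),
  stringsAsFactors = FALSE
)

# Keep only the rows classified as Bacteria
bacteria <- genomes[genomes$kingdoms == "Bacteria", ]
bacteria
```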
Furthermore, users can simply count the kingdom-specific availability of genomes as follows:
# the number of genomes available for each kingdom
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "kingdoms"])
Archaea Bacteria Eukaryota Viroids Viruses
386 6473 1454 45 4716
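As a quick consistency check, the kingdom counts above can be summed and turned into percentages. The sketch below hard-codes the counts from the printed table instead of calling listGenomes(), so it runs without a download; the variable names are illustrative.

```r
# Kingdom counts copied from the table() output above
kingdom_counts <- c(Archaea = 386, Bacteria = 6473, Eukaryota = 1454,
                    Viroids = 45, Viruses = 4716)

# The counts add up to the total number of listed genomes
total_genomes <- sum(kingdom_counts)
total_genomes

# Share of each kingdom in percent, largest first
kingdom_share <- sort(round(100 * kingdom_counts / total_genomes, 1),
                      decreasing = TRUE)
kingdom_share
```

The total agrees with the value returned by length(listGenomes()) earlier in this section.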
Analogous computations can be performed for group
, subgroup
, etc.
# the number of genomes available for each group
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "group"])
Actinobacteria Aenigmarchaeota
934 3
Animals Aquificae
562 16
Armatimonadetes Avsunviroidae
3 4
Bacteroidetes/Chlorobi group Caldiserica
481 2
Chlamydiae/Verrucomicrobia group Chloroflexi
60 32
Chrysiogenetes Crenarchaeota
2 60
Cyanobacteria Deferribacteres
103 6
Deinococcus-Thermus Deltavirus
43 1
Diapherotrites Dictyoglomi
3 2
dsDNA viruses, no RNA stage dsRNA viruses
2008 213
Elusimicrobia Euryarchaeota
3 236
Fibrobacteres/Acidobacteria group Firmicutes
47 1151
Fungi Fusobacteria
550 25
Gemmatimonadetes Korarchaeota
5 1
Lokiarchaeota Nanoarchaeota
1 10
Nanohaloarchaeota Nitrospinae
4 2
Nitrospirae Other
10 15
Parvarchaeota Planctomycetes
2 22
Plants Pospiviroidae
162 36
Proteobacteria Protists
2256 167
Retro-transcribing viruses Satellites
130 214
Spirochaetes ssDNA viruses
81 791
ssRNA viruses Synergistetes
1252 18
Tenericutes Thaumarchaeota
132 49
Thermodesulfobacteria Thermotogae
7 26
unassigned viruses unclassified Archaea
10 17
unclassified archaeal viruses unclassified Bacteria
2 1004
unclassified phages unclassified viroids
35 5
unclassified virophages unclassified viruses
5 53
Users can also order organisms by their file size.
# order by file size
library(dplyr)
ncbi_genomes <- listGenomes(details = TRUE)
head(arrange(ncbi_genomes, desc(file_size_MB)) , 10)
organism_name kingdoms group subgroup file_size_MB chrs organelles plasmids bio_projects
1 Picea glauca Eukaryota Plants Land Plants 26936.20 NA NA NA 2
2 Acanthoscurria geniculata Eukaryota Animals Other Animals 7178.40 NA NA NA 1
3 Locusta migratoria Eukaryota Animals Insects 5759.80 NA 1 NA 1
4 Orycteropus afer Eukaryota Animals Mammals 4444.08 NA 1 NA 1
5 Chrysochloris asiatica Eukaryota Animals Mammals 4210.11 NA 1 NA 1
6 Elephantulus edwardii Eukaryota Animals Mammals 3843.98 NA NA NA 1
7 Triticum aestivum Eukaryota Plants Land Plants 3800.33 1 1 NA 6
8 Triticum urartu Eukaryota Plants Land Plants 3747.05 NA 1 NA 1
9 Dasypus novemcinctus Eukaryota Animals Mammals 3631.52 NA 1 NA 1
10 Nicotiana tabacum Eukaryota Plants Land Plants 3613.16 NA 1 NA 3
This analysis shows that Picea glauca
has the largest genome available on the NCBI server.
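The same ordering can be done without dplyr. This is a small base R sketch on a stand-in table whose four rows are copied from the output above; it is meant as an alternative illustration, not as the way biomartr itself sorts the data.

```r
# Toy stand-in with four rows copied from the output above
genomes <- data.frame(
  organism_name = c("Locusta migratoria", "Picea glauca",
                    "Triticum aestivum", "Acanthoscurria geniculata"),
  file_size_MB  = c(5759.80, 26936.20, 3800.33, 7178.40),
  stringsAsFactors = FALSE
)

# order() returns the row permutation; decreasing = TRUE puts the largest first
by_size <- genomes[order(genomes$file_size_MB, decreasing = TRUE), ]
head(by_size, 2)
```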
Internally, the listGenomes() function downloads the Genome Reports file from NCBI and stores it in a temporary folder (see tempfile()) as _ncbi_downloads/overview.txt. The file is downloaded only once and is then read from your hard drive. To refresh it, users can specify the update = TRUE argument, which downloads the Genome Reports file from the NCBI server again.
# users can also update the organism table using the 'update' argument
head(listGenomes(details = TRUE, update = TRUE) , 5)
organism_name kingdoms group subgroup file_size_MB
1 'Chrysanthemum coronarium' phytoplasma Bacteria Tenericutes Mollicutes 0.739592
2 'Deinococcus soli' Cha et al. 2014 Bacteria Deinococcus-Thermus Deinococci 3.236980
3 Abaca bunchy top virus Viruses ssDNA viruses Nanoviridae 0.006422
4 Abalone herpesvirus Victoria/AUS/2009 Viruses dsDNA viruses, no RNA stage unclassified 0.211518
5 Abalone shriveling syndrome-associated virus Viruses dsDNA viruses, no RNA stage unclassified 0.034952
chrs organelles plasmids bio_projects
1 NA NA NA 1
2 1 NA NA 1
3 6 NA NA 1
4 1 NA NA 1
5 1 NA NA 1
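The download-once behaviour described above can be pictured with a small caching helper. The sketch below is an assumption about the general pattern, not biomartr's internal code: cached_lines() and fake_fetch() are hypothetical names, and the fake fetch function stands in for the real download so the example runs offline.

```r
# Hypothetical helper (not part of biomartr): fetch once, then reuse the cached copy
cached_lines <- function(cache_file, fetch, update = FALSE) {
  if (update || !file.exists(cache_file)) {
    dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
    writeLines(fetch(), cache_file)
  }
  readLines(cache_file)
}

# Stand-in for the download; counts how often it is called
n_fetches <- 0
fake_fetch <- function() {
  n_fetches <<- n_fetches + 1
  c("organism_name\tkingdoms", "Arabidopsis thaliana\tEukaryota")
}

cache_file <- file.path(tempfile("cache_demo_"), "_ncbi_downloads", "overview.txt")
first  <- cached_lines(cache_file, fake_fetch)                 # triggers the fetch
second <- cached_lines(cache_file, fake_fetch)                 # served from disk
third  <- cached_lines(cache_file, fake_fetch, update = TRUE)  # forces a refresh
n_fetches
```

Only the first call and the call with update = TRUE invoke the fetch function; the second call reads the file that is already on disk.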
Again, the listGenomes() function can be used to filter for available genome information in refseq.
# list all Eukaryota that are stored in refseq
head(listGenomes(kingdom = "Eukaryota", database = "refseq") , 20)
organism_name
1 Agaricus bisporus
2 Ajellomyces capsulatus
3 Ajellomyces dermatitidis
4 Arthroderma benhamiae
5 Arthroderma otae
6 Aspergillus clavatus
7 Aspergillus flavus
8 Aspergillus fumigatus
9 Aspergillus nidulans
10 Aspergillus niger
11 Aspergillus oryzae
12 Aspergillus terreus
13 Auricularia delicata
14 Batrachochytrium dendrobatidis
15 Baudoinia compniacensis
16 Bipolaris oryzae
17 Bipolaris sorokiniana
18 Bipolaris zeicola
19 Botrytis cinerea
20 Candida albicans
Or analogously:
# the number of genomes available for each kingdom stored in refseq
ncbi_genomes <- listGenomes(details = TRUE, database = "refseq")
table(ncbi_genomes[ , "kingdoms"])
Archaea Bacteria Eukaryota
220 4167 509
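To see how much of each kingdom is covered by refseq, the two sets of counts printed in this section can be compared directly. The sketch hard-codes both tables; only the three kingdoms that appear in the refseq output are included.

```r
# Counts copied from the two table() outputs in this section
all_counts    <- c(Archaea = 386, Bacteria = 6473, Eukaryota = 1454)
refseq_counts <- c(Archaea = 220, Bacteria = 4167, Eukaryota = 509)

# Total number of refseq genomes
sum(refseq_counts)

# Percentage of each kingdom's listed genomes that are also in refseq
refseq_share <- round(100 * refseq_counts / all_counts[names(refseq_counts)], 1)
refseq_share
```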
Note that when running the listGenomes() function for the first time, it might take a while until the function returns any results, because the necessary information needs to be downloaded from NCBI databases. All subsequent executions of listGenomes() will then respond very quickly, because they access the corresponding files stored on your hard drive.
After checking whether sequence information is available for an organism of interest, the next step is to download the corresponding genome, proteome, or CDS file in fasta format. The following functions allow users to download proteomes, genomes, and CDS files from database resources such as refseq. Once a proteome, genome, or CDS file has been downloaded to your hard drive, a documentation *.txt file is generated that stores the File Name, Organism, Database, URL, and DATE of the download. This makes it easier to reproduce which proteome, genome, and CDS versions were used in subsequent data analyses.
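To illustrate what such a documentation file might contain, here is a minimal base R sketch. The helper write_download_doc(), the file name of the documentation file, the line layout, and the URL are all assumptions made for this example; the files written by biomartr may look different.

```r
# Hypothetical helper (not part of biomartr): write a small provenance file
write_download_doc <- function(doc_file, file_name, organism, database, url) {
  doc <- c(
    paste("File Name:", file_name),
    paste("Organism:", organism),
    paste("Database:", database),
    paste("URL:", url),
    paste("DATE:", format(Sys.Date(), "%Y-%m-%d"))
  )
  writeLines(doc, doc_file)
  invisible(doc)
}

doc_file <- file.path(tempdir(), "doc_Arabidopsis_thaliana_genome.txt")
doc <- write_download_doc(
  doc_file,
  file_name = "Arabidopsis_thaliana_genome.fna.gz",
  organism  = "Arabidopsis thaliana",
  database  = "refseq",
  url       = "ftp://example.org/placeholder"  # placeholder, not a real download URL
)
readLines(doc_file)
```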
The easiest way to download a genome is to use the getGenome()
function.
In this example we will download the genome of A. thaliana
.
The getGenome()
function is an interface function to the NCBI refseq database from which corresponding genomes are downloaded.
For this purpose, users need to specify the kingdom into which their organism of interest is classified, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other", and then the scientific name of the organism of interest.
# download the genome of Arabidopsis thaliana from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
getGenome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","genomes") )
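Because the kingdom has to be one of a fixed set of strings, it can help to validate it before starting a download. The following sketch uses a hypothetical helper, check_kingdom(), that is not part of biomartr; the list of allowed values is taken from the text above.

```r
# Kingdom values listed in the text above
refseq_kingdoms <- c("archaea", "bacteria", "fungi", "invertebrate", "plant",
                     "protozoa", "vertebrate_mammalian", "vertebrate_other")

# Hypothetical helper (not part of biomartr): fail early on an unknown kingdom
check_kingdom <- function(kingdom) {
  if (length(kingdom) != 1 || !kingdom %in% refseq_kingdoms) {
    stop("'kingdom' must be one of: ", paste(refseq_kingdoms, collapse = ", "))
  }
  kingdom
}

check_kingdom("plant")

# A misspelled value is rejected before any download is attempted
bad <- tryCatch(check_kingdom("plants"), error = function(e) "invalid kingdom")
bad
```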
The getGenome() function creates a directory named '_ncbi_downloads/genomes' into which the corresponding genome file, named Arabidopsis_thaliana_genome.fna.gz, is downloaded. The read_genome() function enables users to work with the genome as a data.table object.
# path to genome: '_ncbi_downloads/genomes/Arabidopsis_thaliana_genome.fna.gz'
file_path <- file.path("_ncbi_downloads","genomes","Arabidopsis_thaliana_genome.fna.gz")
# read genome as data.table object
Ath_genome <- read_genome(file_path, format = "fasta")
In case users would like to store the genome file at a different location, they can specify the path = file.path("put","your","path","here")
argument.
The getProteome() function is also an interface function to the NCBI refseq database, from which the corresponding proteomes are downloaded. It works analogously to the getGenome() function.
# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","proteomes") )
The getProteome() function creates a directory named _ncbi_downloads/proteomes into which the corresponding proteome file, named Arabidopsis_thaliana_protein.faa.gz, is downloaded. The read_proteome() function enables users to work with the proteome as a data.table object.
# path to proteome: '_ncbi_downloads/proteomes/Arabidopsis_thaliana_protein.faa.gz'
file_path <- file.path("_ncbi_downloads","proteomes","Arabidopsis_thaliana_protein.faa.gz")
# read proteome as data.table object
Ath_proteome <- read_proteome(file_path, format = "fasta")
In case users would like to store the proteome file at a different location, they can specify the path = file.path("put","your","path","here")
argument.
The getCDS() function is also an interface function to the NCBI refseq database, from which the corresponding CDS files are downloaded. It works analogously to the getGenome() and getProteome() functions, but for CDS files.
# download the CDS of Arabidopsis thaliana from refseq
# and store the corresponding CDS file in '_ncbi_downloads/CDS'
getCDS( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","CDS") )
The getCDS() function creates a directory named _ncbi_downloads/CDS into which the corresponding CDS file is downloaded. The read_cds() function allows you to read the corresponding CDS as a data.table object.
# path to CDS file: '_ncbi_downloads/CDS/Arabidopsis_thaliana_rna.fna.gz'
file_path <- file.path("_ncbi_downloads","CDS","Arabidopsis_thaliana_rna.fna.gz")
# read CDS as data.table object
Ath_cds <- read_cds(file_path, format = "fasta")
In case users would like to store the CDS file at a different location, they can specify the path = file.path("put","your","path","here")
argument.
Furthermore, the getCDS() function checks whether the length of each CDS sequence is a multiple of 3 (i.e. whether it consists of complete codons). If the sequences of particular genes are not, a warning message reports the number of affected genes. To remove these sequences from their data, users can specify the delete_corrupt = TRUE argument, which deletes all corrupt CDS sequences.
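The codon check itself is simple to reproduce. The sketch below applies it to three made-up toy sequences in base R; it only illustrates the idea of a length test modulo 3 and is not the code used inside biomartr.

```r
# Toy CDS sequences (made up for illustration); gene_b has 10 bases
cds <- c(gene_a = "atggcctaa", gene_b = "atggcctaag", gene_c = "atgtga")

# A CDS is flagged as corrupt if its length is not a multiple of 3
is_corrupt <- nchar(cds) %% 3 != 0

# Report how many sequences are affected
if (any(is_corrupt)) {
  message(sum(is_corrupt), " sequence(s) have a length that is not a multiple of 3")
}

# Keep only the sequences that consist of complete codons
clean_cds <- cds[!is_corrupt]
names(clean_cds)
```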
For most analyses, only subsets of sequences (taken from the entire genome) are needed. This section introduces several approaches to select a set of sequences for further analyses.
getProteome()
As seen before, the getProteome()
function allows users to download the entire proteome of a specific organism of interest that is stored in refseq.
# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","proteomes") )
Again, we first download the A. thaliana proteome. Suppose we are interested in the following two genes for subsequent analyses: AT1G06090 and AT1G06100. Both genes are members of the fatty acid desaturase family and are involved in oxidoreductase activity. To access the protein sequences of these genes in a proteome downloaded from refseq, users first need to map the tair_locus ids to the corresponding refseq ids.
For this purpose, users can use the biomart() function [see Functional Annotation for details].
Here we can use the organismFilters() function to find the filter argument for tair gene ids.
# search for filters related to tair
organismFilters("Arabidopsis thaliana", topic = "tair")
Source: local data frame [6 x 4]
name description mart dataset
1 tair_locus TAIR locus ID(s) plants_mart_26 athaliana_eg_gene
2 tair_locus_model TAIR locus model ID(s) plants_mart_26 athaliana_eg_gene
3 tair_symbol TAIR symbol ID(s) plants_mart_26 athaliana_eg_gene
4 with_tair_locus with TAIR locus ID(s) plants_mart_26 athaliana_eg_gene
5 with_tair_locus_model with TAIR locus model ID(s) plants_mart_26 athaliana_eg_gene
6 with_tair_symbol with TAIR symbol ID(s) plants_mart_26 athaliana_eg_gene
Users will observe that we need the filter argument tair_locus from mart plants_mart_26 and dataset athaliana_eg_gene.
To find the attribute argument for refseq ids we can use the organismAttributes()
function.
# search for attributes related to refseq
organismAttributes("Arabidopsis thaliana", topic = "refseq")
Source: local data frame [3 x 4]
name description mart dataset
1 refseq_mrna RefSeq mRNA ID plants_mart_26 athaliana_eg_gene
2 refseq_ncrna RefSeq ncRNA ID plants_mart_26 athaliana_eg_gene
3 refseq_peptide RefSeq protein ID plants_mart_26 athaliana_eg_gene
Now it is clear that we need the attribute argument refseq_peptide to get the refseq ids that correspond to the tair_locus filter ids. Please note that BioMart frequently updates its mart names, so instead of plants_mart_26 a newer mart version might have to be specified to obtain a proper result from biomart().
bm <- biomart(genes = c("AT3G01360", "AT3G01540"),
mart = "plants_mart_26",
dataset = "athaliana_eg_gene",
attributes = "refseq_peptide",
filters = "tair_locus")
bm
tair_locus refseq_peptide
1 AT3G01360 NP_001030617
2 AT3G01360 NP_186785
3 AT3G01540 NP_850492
4 AT3G01540 NP_974206
5 AT3G01540 NP_566141
6 AT3G01540 NP_001030619
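The ids returned by biomart() carry no version suffix, whereas sequence identifiers in a downloaded proteome typically do (for example ".1"). One way to match the two, sketched here in base R on toy data, is to strip the suffix and compare exactly. The vector proteome_ids is a made-up stand-in for the geneids column, and two of its entries are invented to show non-matches.

```r
# RefSeq protein ids copied from the biomart() output above
refseq_ids <- c("NP_001030617", "NP_186785", "NP_850492",
                "NP_974206", "NP_566141", "NP_001030619")

# Toy stand-in for the geneids column of a proteome table;
# the second and fourth entries are made up for illustration
proteome_ids <- c("NP_001030617.1", "NP_000000.1", "NP_566141.1", "NP_999999.2")

# Strip the version suffix (".1", ".2", ...) and match exactly
unversioned <- sub("\\.[0-9]+$", "", proteome_ids)
hits <- which(unversioned %in% refseq_ids)
proteome_ids[hits]
```

Exact matching with %in% returns one row index per matching sequence, which can then be used to subset the proteome table.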
genes_of_interest <- which(sapply(paste0(bm[ , "refseq_peptide"],".1"),
function(gene) stringr::str_detect(gene,Ath_proteome[ , geneids])))
na.omit(Ath_proteome[genes_of_interest , list(geneids, seqs)])
geneids
1: NP_001030617.1
seqs
1: mgilsyavagggfavigawesldssnldpnsssgtadssspmaqirasppksvgsssipvallsslfiansffsffs
sigsrdrvgsmiqlqivavavlflyyailtylvnsknavfvalpssittllllfgfieeflmfylqkkdvsgienryydl
mlvpiaicvfstvfelksdssthrfaklgrgiglilqgtwflqmgvsfftglitnnctfheksrgnftikckghgdyhra
kaiatlqfnchlalmvvvatglfsviankngylrqdhskyrplgaelenlstftldsdeedevreesnvakevglngnsshd