The BioMart project enables users to retrieve a vast diversity of annotation data for specific organisms. Steffen Durinck and Wolfgang Huber provide an powerful interface between the R language and BioMart by providing the R package biomaRt. The following sections will introduce users to the functionality and data retrieval precedures using the biomaRt
package and will then introduce them to the interface functions biomart()
and biomart_organisms()
implemented in biomartr
that are based on the biomaRt
methodology but aim to introduce an more intuitive way of interacting with BioMart.
The best way to get started with the methodology presented by biomaRt is to understand the workflow of data retrieval. The database provided by BioMart is organized in so called: marts
, datasets
, and attributes
. So when users want to retrieve information for a specific organism of interest, first they need to specify the marts
and datasets
in which the information of the corresponding organism can be found and subsequently they can specify the attributes
that shall be returned for that particular organism.
The availability of marts
, datasets
, and attributes
can be checked by the following functions:
# install the biomaRt package
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
# load biomaRt
library(biomaRt)
# look at top 10 databases
head(listMarts(), 10)
biomart version
1 ensembl ENSEMBL GENES 80 (SANGER UK)
2 snp ENSEMBL VARIATION 80 (SANGER UK)
3 regulation ENSEMBL REGULATION 80 (SANGER UK)
4 vega VEGA 60 (SANGER UK)
5 fungi_mart_26 ENSEMBL FUNGI 26 (EBI UK)
6 fungi_variations_26 ENSEMBL FUNGI VARIATION 26 (EBI UK)
7 metazoa_mart_26 ENSEMBL METAZOA 26 (EBI UK)
8 metazoa_variations_26 ENSEMBL METAZOA VARIATION 26 (EBI UK)
9 plants_mart_26 ENSEMBL PLANTS 26 (EBI UK)
10 plants_variations_26 ENSEMBL PLANTS VARIATION 26 (EBI UK)
Users will observe that several marts
providing annotation for specific classes of organisms or groups of organisms are available.
For our example, we will choose the plants_variations_26
mart
and list all available datasets that are element of this mart
.
tail(listDatasets(useMart("plants_mart_26")), 10)
dataset description
30 mtruncatula_eg_gene Medicago truncatula str. A17 genes (MedtrA17_4.0 (2014-06-EnsemblPlants))
31 atrichopoda_eg_gene Amborella trichopoda genes (AMTR1.0 (2014-01-AGD))
32 creinhardtii_eg_gene Chlamydomonas reinhardtii genes (v3.1 (2007-11-ENA))
33 olongistaminata_eg_gene Oryza longistaminata genes (2013-09-Maker)
34 cmerolae_eg_gene Cyanidioschyzon merolae genes (ASM9120v1 (2008-11-ENA))
35 oglaberrima_eg_gene Oryza glaberrima genes (AGI1.1 (2011-05-AGI))
36 tcacao_eg_gene Theobroma cacao genes (Theobroma_cacao_20110822 (2014-05-EnsemblPlants))
37 macuminata_eg_gene Musa acuminata genes (MA1 (2012-08-Cirad))
38 turartu_eg_gene Triticum urartu genes (ASM34745v1 (2013-04-BGI))
39 athaliana_eg_gene Arabidopsis thaliana genes (TAIR10 (2010-09-TAIR10))
version
30 MedtrA17_4.0 (2014-06-EnsemblPlants)
31 AMTR1.0 (2014-01-AGD)
32 v3.1 (2007-11-ENA)
33 2013-09-Maker
34 ASM9120v1 (2008-11-ENA)
35 AGI1.1 (2011-05-AGI)
36 Theobroma_cacao_20110822 (2014-05-EnsemblPlants)
37 MA1 (2012-08-Cirad)
38 ASM34745v1 (2013-04-BGI)
39 TAIR10 (2010-09-TAIR10)
Please notice that BioMart frequently updates their mart
names, so in case of plants_mart_26
a new mart version might have to be specified to receive a proper result from biomart()
.
The useMart()
function is a wrapper function provided by biomaRt
to connect a selected BioMart database (mart
) with a corresponding dataset stored within this mart
.
We select dataset athaliana_eg_gene
and now check for available attributes (annotation data) that can be accessed for Arabidopsis thaliana
genes.
head(listAttributes(useDataset(dataset = "athaliana_eg_gene", mart = useMart("plants_mart_26"))), 10)
name description
1 ensembl_gene_id Gene stable ID
2 ensembl_transcript_id Transcript stable ID
3 ensembl_peptide_id Protein stable ID
4 chromosome_name Chromosome/scaffold name
5 start_position Gene start (bp)
6 end_position Gene end (bp)
7 strand Strand
8 band Band
9 transcript_start Transcript start (bp)
10 transcript_end Transcript end (bp)
Please note the nested structure of this attribute query. For an attribute query procedure an additional wrapper function named useDataset()
is needed in which useMart()
and a corresponding dataset needs to be specified. The result is a table storing the name of available attributes for A. thaliana as well as a short description.
Furthermore, users can retrieve all filters for A. thaliana that can be specified by the actual BioMart query process.
head(listFilters(useDataset(dataset = "athaliana_eg_gene", mart = useMart("plants_mart_26"))), 10)
name description
1 chromosome_name Chromosome name
2 start Start
3 end End
4 strand Strand
5 chromosomal_region e.g. 1:100:10000:-1, 1:100000:2000000:1
6 with_affy_arabidopsis_ath1_121501 with Affymetrix array Arabidopsis ATH1 121501 ID(s)
7 with_arrayexpress with ArrayExpress ID(s)
8 with_CATMA_CATMA with CATMA A. thaliana probe ID(s)
9 with_chembl with ChEMBL ID(s)
10 with_embl with ENA/GenBank ID(s)
After accumulating all this information, it is now possible to perform an actual BioMart query by using the getBM()
function.
In this example we will retrieve attributes: start_position
,end_position
and description
for the A. thaliana genes "AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120", "AT1G06130", and "AT1G06200"
. Since the input genes are tair locus ids
, we need to specify the filters
argument filters = "tair_locus"
.
# 1) select a mart and data set
mart <- useDataset("athaliana_eg_gene", mart = useMart("plants_mart_26"))
# 2) run a biomart query using the getBM() function
# and specify the attributes and filter arguments
geneSet <- c("AT1G06090", "AT1G06100", "AT1G06110", "AT1G06120", "AT1G06130", "AT1G06200")
resultTable <- getBM(attributes = c("start_position","end_position","description"),
filters = "tair_locus", values = geneSet, mart = mart)
resultTable
start_position end_position description
1 1847883 1849693 Fatty acid desaturase family protein [Source:TAIR;Acc:AT1G06090]
2 1851530 1853019 Fatty acid desaturase family protein [Source:TAIR;Acc:AT1G06100]
3 1853077 1855031 SKP1/ASK-interacting protein 16 [Source:TAIR;Acc:AT1G06110]
4 1855943 1857718 Fatty acid desaturase family protein [Source:TAIR;Acc:AT1G06120]
5 1857767 1860730 glyoxalase 2-4 [Source:TAIR;Acc:AT1G06130]
6 1894336 1896935 Peptidase S24/S26A/S26B/S26C family protein [Source:TAIR;Acc:AT1G06200]
When using getBM()
users can pass all attributes retrieved by listAttributes()
to the attributes
argument of the getBM()
function.
This query methodology provided by BioMart
and the biomaRt
package is a very well defined approach for accurate annotation retrieval. Nevertheless, when learning this query methodology it (subjectively) seems non-intuitive from the user perspective. Therefore, the biomartr
package provides another query methodology that aims to be more organism centric.
Taken together, the following workflow allows users to perform fast BioMart queries for attributes using the biomart()
function implemented in this biomartr
package:
get attributes, datasets, and marts via : organismAttributes()
choose available filters via: organismFilters()
specify a set of query genes
specify all arguments of the biomart()
function using steps 1) - 3) and perform a BioMart query
Note that dataset names change very frequently due to the update of dataset versions. So in case some query functions do not work properly, users should check with organismAttributes(update = TRUE)
whether or not their dataset name has been changed. For example, organismAttributes("Arabidopsis thaliana", topic = "id", update = TRUE)
might reveal that the dataset plants_mart_26
has changed to plants_mart_27
.
The getMarts()
function allows users to list all available databases that can be accessed through BioMart interfaces.
# load the biomartr package
library(biomartr)
# list all available databases (top 10)
head(getMarts(), 10)
mart version
1 ensembl ENSEMBL GENES 80 (SANGER UK)
2 snp ENSEMBL VARIATION 80 (SANGER UK)
3 regulation ENSEMBL REGULATION 80 (SANGER UK)
4 vega VEGA 60 (SANGER UK)
5 fungi_mart_26 ENSEMBL FUNGI 26 (EBI UK)
6 fungi_variations_26 ENSEMBL FUNGI VARIATION 26 (EBI UK)
7 metazoa_mart_26 ENSEMBL METAZOA 26 (EBI UK)
8 metazoa_variations_26 ENSEMBL METAZOA VARIATION 26 (EBI UK)
9 plants_mart_26 ENSEMBL PLANTS 26 (EBI UK)
10 plants_variations_26 ENSEMBL PLANTS VARIATION 26 (EBI UK)
Now users can select a specific database to list all available datasets that can be accessed through this database. In this example we choose the plants_mart_26
database.
head(getDatasets(mart = "plants_mart_26") , 5)
dataset description
1 atauschii_eg_gene Aegilops tauschii genes (ASM34733v1 (2013-12-BGI))
2 obrachyantha_eg_gene Oryza brachyantha genes (Oryza_brachyantha.v1.4b (OGEv1.4))
3 ppersica_eg_gene Prunus persica genes (Prupe1_0 (2013-03))
4 ptrichocarpa_eg_gene Populus trichocarpa genes (JGI2.0 (2010-01-JGI))
5 stuberosum_eg_gene Solanum tuberosum genes (SolTub_3.0 (SolTub_3.0))
version
1 ASM34733v1 (2013-12-BGI)
2 Oryza_brachyantha.v1.4b (OGEv1.4)
3 Prupe1_0 (2013-03)
4 JGI2.0 (2010-01-JGI)
5 SolTub_3.0 (SolTub_3.0)
Now you can select the dataset athaliana_eg_gene
and list all available attributes that can be retrieved from this dataset.
tail(getDatasets(mart = "plants_mart_26") , 3)
dataset description version
37 macuminata_eg_gene Musa acuminata genes (MA1 (2012-08-Cirad)) MA1 (2012-08-Cirad)
38 turartu_eg_gene Triticum urartu genes (ASM34745v1 (2013-04-BGI)) ASM34745v1 (2013-04-BGI)
39 athaliana_eg_gene Arabidopsis thaliana genes (TAIR10 (2010-09-TAIR10)) TAIR10 (2010-09-TAIR10)
Now that you have selected a database (plants_mart_26
) and a dataset (athaliana_eg_gene
), users can list all available attributes for this dataset using the getAttributes()
function.
# list all available attributes for dataset: athaliana_eg_gene
head( getAttributes(mart = "plants_mart_26", dataset = "athaliana_eg_gene"), 10 )
name description
1 ensembl_gene_id Gene stable ID
2 ensembl_transcript_id Transcript stable ID
3 ensembl_peptide_id Protein stable ID
4 chromosome_name Chromosome/scaffold name
5 start_position Gene start (bp)
6 end_position Gene end (bp)
7 strand Strand
8 band Band
9 transcript_start Transcript start (bp)
10 transcript_end Transcript end (bp)
Finally, the getFilters()
function allows users to list available filters for a specific dataset that can be used for a biomart()
query.
# list all available filters for dataset: athaliana_eg_gene
head( getFilters(mart = "plants_mart_26", dataset = "athaliana_eg_gene"), 10 )
name description
1 chromosome_name Chromosome name
2 start Start
3 end End
4 strand Strand
5 chromosomal_region e.g. 1:100:10000:-1, 1:100000:2000000:1
6 with_affy_arabidopsis_ath1_121501 with Affymetrix array Arabidopsis ATH1 121501 ID(s)
7 with_arrayexpress with ArrayExpress ID(s)
8 with_CATMA_CATMA with CATMA A. thaliana probe ID(s)
9 with_chembl with ChEMBL ID(s)
10 with_embl with ENA/GenBank ID(s)
In most use cases, users will work with a single or a set of model organisms. In this process they will mostly be interested in specific annotations for this particular model organism. The organismBM()
function addresses this issue and provides users with an organism centric query to marts
and datasets
which are available for a particular organism of interest.
Note that when running the following functions for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a *.txt file named _biomart/listDatasets.txt
within the tempdir()
folder, allowing subsequent queries to perform much faster. The tempdir()
folder however, will be deleted after a new R session was established, so in this case the inital call of the subsequent functions, again will take time to retrieve all organism specific data from the BioMart API.
# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
organismBM(organism = "Arabidopsis thaliana")
organism_name description
1 Arabidopsis thaliana Arabidopsis thaliana genes (TAIR10 (2010-09-TAIR10))
2 Arabidopsis thaliana Arabidopsis thaliana variations (TAIR10 (2010-09-TAIR10))
3 Arabidopsis thaliana Arabidopsis thaliana structural Variations (TAIR10 (2010-09-TAIR10))
4 Arabidopsis thaliana Arabidopsis thaliana genes (TAIR10 (2010-09-TAIR10))
5 Arabidopsis thaliana Arabidopsis thaliana variations (TAIR10 (2010-09-TAIR10))
6 Arabidopsis thaliana Arabidopsis thaliana structural Variations (TAIR10 (2010-09-TAIR10))
mart dataset version
1 plants_mart_26 athaliana_eg_gene TAIR10 (2010-09-TAIR10)
2 plants_variations_26 athaliana_eg_snp TAIR10 (2010-09-TAIR10)
3 plants_variations_26 athaliana_eg_structvar TAIR10 (2010-09-TAIR10)
4 ENSEMBL_MART_PLANT athaliana_eg_gene TAIR10 (2010-09-TAIR10)
5 ENSEMBL_MART_PLANT_SNP athaliana_eg_snp TAIR10 (2010-09-TAIR10)
6 ENSEMBL_MART_PLANT_SNP athaliana_eg_structvar TAIR10 (2010-09-TAIR10)
The result is a table storing all marts
and datasets
from which annotations can be retrieved for Arabidopsis thaliana. Furthermore, a short description as well as the version of the dataset being accessed (very useful for publications) is returned.
Users will observe that 3 different marts
provide 6 different datasets
storing annotation information for A. thaliana.
Another example using Homo sapiens
.
# retrieving all available datasets and biomart connections for "Homo sapiens"
organismBM(organism = "Homo sapiens")
organism_name description mart
1 Homo sapiens Homo sapiens genes (GRCh38.p2) ensembl
2 Homo sapiens Homo sapiens Short Variation (SNPs and indels) (GRCh38.p2) snp
3 Homo sapiens Homo sapiens Structural Variation (GRCh38.p2) snp
4 Homo sapiens Homo sapiens Somatic Short Variation (SNPs and indels) (GRCh38.p2) snp
5 Homo sapiens Homo sapiens Somatic Structural Variation (GRCh38.p2) snp
6 Homo sapiens Homo sapiens Regulatory Evidence (GRCh38.p2) regulation
7 Homo sapiens Homo sapiens Binding Motifs (GRCh38.p2) regulation
8 Homo sapiens Homo sapiens Regulatory Features (GRCh38.p2) regulation
9 Homo sapiens Homo sapiens miRNA Target Regions (GRCh38.p2) regulation
10 Homo sapiens Homo sapiens Regulatory Segments (GRCh38.p2) regulation
11 Homo sapiens Homo sapiens Other Regulatory Regions (GRCh38.p2) regulation
12 Homo sapiens Homo sapiens genes (GRCh38.p2) vega
dataset version
1 hsapiens_gene_ensembl GRCh38.p2
2 hsapiens_snp GRCh38.p2
3 hsapiens_structvar GRCh38.p2
4 hsapiens_snp_som GRCh38.p2
5 hsapiens_structvar_som GRCh38.p2
6 hsapiens_annotated_feature GRCh38.p2
7 hsapiens_motif_feature GRCh38.p2
8 hsapiens_regulatory_feature GRCh38.p2
9 hsapiens_mirna_target_feature GRCh38.p2
10 hsapiens_segmentation_feature GRCh38.p2
11 hsapiens_external_feature GRCh38.p2
12 hsapiens_gene_vega GRCh38.p2
Similar to the biomaRt
package query methodology, users need to specify attributes
and filters
to be able to perform accurate BioMart queries. Here the functions organismAttributes()
and organismFilters()
provide useful and intuitive concepts to obtain this information.
# return available attributes for "Arabidopsis thaliana"
head(organismAttributes("Arabidopsis thaliana"), 20)
Source: local data frame [20 x 4]
name description mart
1 3_utr_end 3' UTR end plants_mart_26
2 3_utr_start 3' UTR start plants_mart_26
3 3utr 3' UTR plants_mart_26
4 5_utr_end 5' UTR end plants_mart_26
5 5_utr_start 5' UTR start plants_mart_26
6 5utr 5' UTR plants_mart_26
7 affy_ath1_121501 Affymetrix array Arabidopsis ATH1 121501 ID plants_mart_26
8 allele Variant allele plants_mart_26
9 allele_string_2076 Consequence specific allele plants_mart_26
10 alyrata_eg_chrom_end Arabidopsis lyrata end (bp) plants_mart_26
11 alyrata_eg_chrom_start Arabidopsis lyrata start (bp) plants_mart_26
12 alyrata_eg_chromosome Arabidopsis lyrata chromosome/scaffold plants_mart_26
13 alyrata_eg_gene Arabidopsis lyrata gene stable ID plants_mart_26
14 alyrata_eg_homolog_ancestor Ancestor plants_mart_26
15 alyrata_eg_homolog_dn dS plants_mart_26
16 alyrata_eg_homolog_ds dN plants_mart_26
17 alyrata_eg_homolog_ensembl_peptide Arabidopsis lyrata protein stable ID plants_mart_26
18 alyrata_eg_homolog_is_tree_compliant Orthology confidence [0 low, 1 high] plants_mart_26
19 alyrata_eg_homolog_perc_id % identity plants_mart_26
20 alyrata_eg_homolog_perc_id_r1 Arabidopsis lyrata % identity plants_mart_26
Users will observe that the organismAttributes()
function returns a data.frame storing attribute names, datasets, and marts which are available for Arabidopsis thaliana
.
An additional feature provided by organismAttributes()
is the topic
argument. The topic
argument allows users to to search for specific attributes, topics, or categories for faster filtering.
# search for attribute topic "id"
head(organismAttributes("Arabidopsis thaliana", topic = "id"), 20)
Source: local data frame [20 x 4]
name description
1 alyrata_eg_homolog_ensembl_peptide Arabidopsis lyrata protein stable ID
2 alyrata_eg_homolog_perc_id % identity
3 alyrata_eg_homolog_perc_id_r1 Arabidopsis lyrata % identity
4 atauschii_eg_homolog_ensembl_peptide Aegilops tauschii protein stable ID
5 atauschii_eg_homolog_perc_id % identity
6 atauschii_eg_homolog_perc_id_r1 Aegilops tauschii % identity
7 athaliana_eg_paralog_ensembl_peptide Representative protein stable ID
8 athaliana_eg_paralog_paralog_ensembl_peptide Paralog protein stable ID
9 atrichopoda_eg_homolog_ensembl_peptide Amborella trichopoda protein stable ID
10 atrichopoda_eg_homolog_perc_id % identity
11 atrichopoda_eg_homolog_perc_id_r1 Amborella trichopoda % identity
12 bdistachyon_eg_homolog_ensembl_peptide Brachypodium distachyon protein stable ID
13 bdistachyon_eg_homolog_perc_id % identity
14 bdistachyon_eg_homolog_perc_id_r1 Brachypodium distachyon % identity
15 boleracea_eg_homolog_ensembl_peptide Brassica oleracea protein stable ID
16 boleracea_eg_homolog_perc_id % identity
17 boleracea_eg_homolog_perc_id_r1 Brassica oleracea % identity
18 brapa_eg_homolog_ensembl_peptide Brassica rapa protein stable ID
19 brapa_eg_homolog_perc_id % identity
20 brapa_eg_homolog_perc_id_r1 Brassica rapa % identity
Variables not shown: mart (chr), dataset (chr)
Now, all attribute names
having id
as part of their name
are being returned.
Another example is topic = "homolog"
.
# search for attribute topic "homolog"
head(organismAttributes("Arabidopsis thaliana", topic = "homolog"), 20)
Source: local data frame [20 x 4]
name description mart
1 alyrata_eg_homolog_ancestor Ancestor plants_mart_26
2 alyrata_eg_homolog_dn dS plants_mart_26
3 alyrata_eg_homolog_ds dN plants_mart_26
4 alyrata_eg_homolog_ensembl_peptide Arabidopsis lyrata protein stable ID plants_mart_26
5 alyrata_eg_homolog_is_tree_compliant Orthology confidence [0 low, 1 high] plants_mart_26
6 alyrata_eg_homolog_perc_id % identity plants_mart_26
7 alyrata_eg_homolog_perc_id_r1 Arabidopsis lyrata % identity plants_mart_26
8 atauschii_eg_homolog_ancestor Ancestor plants_mart_26
9 atauschii_eg_homolog_dn dS plants_mart_26
10 atauschii_eg_homolog_ds dN plants_mart_26
11 atauschii_eg_homolog_ensembl_peptide Aegilops tauschii protein stable ID plants_mart_26
12 atauschii_eg_homolog_is_tree_compliant Orthology confidence [0 low, 1 high] plants_mart_26
13 atauschii_eg_homolog_perc_id % identity plants_mart_26
14 atauschii_eg_homolog_perc_id_r1 Aegilops tauschii % identity plants_mart_26
15 atrichopoda_eg_homolog_ancestor Ancestor plants_mart_26
16 atrichopoda_eg_homolog_dn dS plants_mart_26
17 atrichopoda_eg_homolog_ds dN plants_mart_26
18 atrichopoda_eg_homolog_ensembl_peptide Amborella trichopoda protein stable ID plants_mart_26
19 atrichopoda_eg_homolog_is_tree_compliant Orthology confidence [0 low, 1 high] plants_mart_26
20 atrichopoda_eg_homolog_perc_id % identity plants_mart_26
Variables not shown: dataset (chr)
Or topic = "dn"
and topic = "ds"
for dn
and ds
value retrieval.
# search for attribute topic "dn"
organismAttributes("Arabidopsis thaliana", topic = "dn")
Source: local data frame [44 x 4]
name description mart dataset
1 alyrata_eg_homolog_dn dS plants_mart_26 athaliana_eg_gene
2 atauschii_eg_homolog_dn dS plants_mart_26 athaliana_eg_gene
3 atrichopoda_eg_homolog_dn dS plants_mart_26 athaliana_eg_gene
4 bdistachyon_eg_homolog_dn dS plants_mart_26 athaliana_eg_gene
5 boleracea_eg_homolog_dn dS plants_mart_26 athaliana_eg_gene
6 brapa_eg_homolog_dn dS plants_mart_26 athaliana_eg_gene
7 cdna cDNA sequences plants_mart_26 athaliana_eg_gene
8 cdna_coding_end cDNA coding end plants_mart_26 athaliana_eg_gene
9 cdna_coding_start cDNA coding start plants_mart_26 athaliana_eg_gene
10 cdna_end Variation end in cDNA (bp) plants_mart_26 athaliana_eg_snp
.. ... ... ... ...
# search for attribute topic "ds"
organismAttributes("Arabidopsis thaliana", topic = "ds")
Source: local data frame [43 x 4]
name description mart dataset
1 alyrata_eg_homolog_ds dN plants_mart_26 athaliana_eg_gene
2 atauschii_eg_homolog_ds dN plants_mart_26 athaliana_eg_gene
3 atrichopoda_eg_homolog_ds dN plants_mart_26 athaliana_eg_gene
4 bdistachyon_eg_homolog_ds dN plants_mart_26 athaliana_eg_gene
5 boleracea_eg_homolog_ds dN plants_mart_26 athaliana_eg_gene
6 brapa_eg_homolog_ds dN plants_mart_26 athaliana_eg_gene
7 cds_end CDS end (within cDNA) plants_mart_26 athaliana_eg_gene
8 cds_end_2076 CDS end plants_mart_26 athaliana_eg_gene
9 cds_start CDS start (within cDNA) plants_mart_26 athaliana_eg_gene
10 cds_start_2076 CDS start plants_mart_26 athaliana_eg_gene
..
Analogous to the organismAttributes()
function, the organismFilters()
function returns all filters that are available for a query organism of interest.
# return available filters for "Arabidopsis thaliana"
head(organismFilters("Arabidopsis thaliana"), 20)
Source: local data frame [20 x 4]
name description mart dataset
1 affy_ath1_121501 Affymetrix array Arabidopsis ATH1 121501 ID(s) plants_mart_26 athaliana_eg_gene
2 arrayexpress ArrayExpress ID(s) plants_mart_26 athaliana_eg_gene
3 biotype Type plants_mart_26 athaliana_eg_gene
4 blastprodom_pf ProDom ID(s) plants_mart_26 athaliana_eg_gene
5 catma_catma CATMA A. thaliana probe ID(s) plants_mart_26 athaliana_eg_gene
6 chembl ChEMBL ID(s) plants_mart_26 athaliana_eg_gene
7 chr_name Chromosome name plants_mart_26 athaliana_eg_snp
8 chrom_start Start plants_mart_26 athaliana_eg_snp
9 chromosomal_region e.g. 1:100:10000:-1, 1:100000:2000000:1 plants_mart_26 athaliana_eg_gene
10 chromosome_name Chromosome name plants_mart_26 athaliana_eg_gene
11 consequence_type Consequence type plants_mart_26 athaliana_eg_snp
12 dgva_study_accession Study description plants_mart_26 athaliana_eg_structvar
13 distance_to_transcript Distance to transcript <= plants_mart_26 athaliana_eg_snp
14 embl ENA/GenBank ID(s) plants_mart_26 athaliana_eg_gene
15 end End plants_mart_26 athaliana_eg_gene
16 ensembl_gene Gene stable ID(s): plants_mart_26 athaliana_eg_snp
17 ensembl_gene_id Gene stable ID(s) plants_mart_26 athaliana_eg_gene
18 ensembl_peptide_id Protein stable ID(s) plants_mart_26 athaliana_eg_gene
19 ensembl_transcript_id Transcript stable ID(s) plants_mart_26 athaliana_eg_gene
20 entrezgene EntrezGene ID(s) plants_mart_26 athaliana_eg_gene
The organismFilters()
function also allows users to search for filters that correspond to a specific topic or category.
# search for filter topic "id"
head(organismFilters("Arabidopsis thaliana", topic = "id"), 20)
Source: local data frame [16 x 4]
name description mart
1 ensembl_gene_id Gene stable ID(s) plants_mart_26
2 ensembl_peptide_id Protein stable ID(s) plants_mart_26
3 ensembl_transcript_id Transcript stable ID(s) plants_mart_26
4 external_gene_id Gene name(s) plants_mart_26
5 go_evidence_code GO evidence code plants_mart_26
6 interpro_id InterPro ID(s) plants_mart_26
7 nasc_gene_id NASC gene ID(s) plants_mart_26
8 po_evidence_code PO evidence code plants_mart_26
9 protein_id ENA/GenBank protein ID(s) plants_mart_26
10 refseq_peptide Refseq protein ID(s) plants_mart_26
11 wikigene_id WikiGene ID(s) plants_mart_26
12 with_affy_arabidopsis_ath1_121501 with Affymetrix array Arabidopsis ATH1 121501 ID(s) plants_mart_26
13 with_nasc_gene_id with NASC gene ID(s) plants_mart_26
14 with_omeridionalis_eg_homolog Orthologous Oryza meridionalis Genes plants_mart_26
15 with_protein_id with ENA/GenBank protein ID(s) plants_mart_26
16 with_refseq_peptide with RefSeq protein ID(s) plants_mart_26
Variables not shown: dataset (chr)
The short introduction to the functionality of organismBM()
, organismAttributes()
, and organismFilters()
will allow users to perform BioMart queries in a very intuitive organism centric way. The main function to perform BioMart queries is biomart()
.
For the following examples we will assume that we are interested in the annotation of specific genes from the A. thaliana proteome. We want to map the corresponding refseq gene id to a set of other gene ids used in other databases. For this purpose, first we need consult the organismAttributes()
function.
organismAttributes("Arabidopsis thaliana", topic = "id")
Source: local data frame [186 x 4]
name description mart dataset
1 alyrata_eg_homolog_ensembl_peptide Arabidopsis lyrata protein stable ID plants_mart_26 athaliana_eg_gene
2 alyrata_eg_homolog_perc_id % identity plants_mart_26 athaliana_eg_gene
3 alyrata_eg_homolog_perc_id_r1 Arabidopsis lyrata % identity plants_mart_26 athaliana_eg_gene
4 atauschii_eg_homolog_ensembl_peptide Aegilops tauschii protein stable ID plants_mart_26 athaliana_eg_gene
5 atauschii_eg_homolog_perc_id % identity plants_mart_26 athaliana_eg_gene
6 atauschii_eg_homolog_perc_id_r1 Aegilops tauschii % identity plants_mart_26 athaliana_eg_gene
7 athaliana_eg_paralog_ensembl_peptide Representative protein stable ID plants_mart_26 athaliana_eg_gene
8 athaliana_eg_paralog_paralog_ensembl_peptide Paralog protein stable ID plants_mart_26 athaliana_eg_gene
9 atrichopoda_eg_homolog_ensembl_peptide Amborella trichopoda protein stable ID plants_mart_26 athaliana_eg_gene
10 atrichopoda_eg_homolog_perc_id % identity plants_mart_26 athaliana_eg_gene
.. ... ... ... ...
# retrieve the proteome of Arabidopsis thaliana from refseq
getProteome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","proteomes") )
file_path <- file.path("_ncbi_downloads","proteomes","Arabidopsis_thaliana_protein.faa.gz")
Ath_proteome <- read_proteome(file_path, format = "fasta")
# remove splice variants from id
gene_set <- unlist(sapply(strsplit(Ath_proteome[1:5 , geneids], ".",fixed = TRUE),function(x) x[1]))
result_BM <- biomart( genes = gene_set,
mart = "plants_mart_26",
dataset = "athaliana_eg_gene",
attributes = c("tair_locus","ensembl_peptide_id",
"external_transcript_id","interpro_id",
"nasc_gene_id"),
filters = "refseq_peptide")
result_BM
refseq_peptide tair_locus ensembl_peptide_id external_transcript_id interpro_id nasc_gene_id
1 NP_001030613 AT3G01100 AT3G01100.2 HYP1 IPR003864 AT3G01100
2 NP_001030613 AT3G01100 AT3G01100.2 HYP1 IPR027815 AT3G01100
3 NP_001030614 AT3G01310 AT3G01310.2 T22N4.6 IPR000560 AT3G01310
4 NP_001030614 AT3G01310 AT3G01310.2 T22N4.6 IPR013651 AT3G01310
5 NP_001030614 AT3G01310 AT3G01310.2 T22N4.6 IPR029033 AT3G01310
6 NP_001030615 AT3G01323 AT3G01323.1 AT3G01323
7 NP_001030616 AT3G01340 AT3G01340.2 T22N4.3 IPR001680 AT3G01340
8 NP_001030616 AT3G01340 AT3G01340.2 T22N4.3 IPR015943 AT3G01340
9 NP_001030616 AT3G01340 AT3G01340.2 T22N4.3 IPR017986 AT3G01340
10 NP_001030616 AT3G01340 AT3G01340.2 T22N4.3 IPR020472 AT3G01340
11 NP_001030617 AT3G01360 AT3G01360.2 T22N4.1 IPR006904 AT3G01360
The biomart()
function takes as arguments a set of genes (gene ids specified in the filter
argument), the corresponding mart
and dataset
, as well as the attributes
which shall be returned.
The biomartr
package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO()
function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO()
function allows GO information retrieval from the BioMart database.
In this example we will retrieve GO information for a set of A. thaliana genes stored as tair locus id
.
The getGO()
function takes several arguments as input to retrieve GO information from BioMart. First, the scientific name of the organism
of interest needs to be specified. Furthermore, a set of gene ids
as well as their corresponding filter
notation (AT1G06090
gene ids have filter
notation tair_locus
; see organismFilters()
for details) need to be specified. The database
argument then defines the database from which GO information shall be retrieved.
# search for GO terms of 6 Arabidopsis thaliana genes in tair locus id notation
GO_tbl <- getGO(organism = "Arabidopsis thaliana",
genes = c("AT1G06100", "AT1G06110", "AT1G06120", "AT1G06130", "AT1G06200"),
filters = "tair_locus",
database = "BioMart")
# look at the GO information
GO_tbl[ , c("tair_locus","go_accession","go_name_1006")]
tair_locus go_accession
1 AT1G06110 GO:0019243
2 AT1G06110 GO:0019005
3 AT1G06110 GO:0005515
4 AT1G06110 GO:0016567
5 AT1G06110 GO:0008150
6 AT1G06120 GO:0016717
7 AT1G06120 GO:0006631
8 AT1G06120 GO:0006636
9 AT1G06120 GO:0005789
10 AT1G06120 GO:0005783
11 AT1G06120 GO:0006629
12 AT1G06120 GO:0006633
13 AT1G06120 GO:0055114
14 AT1G06120 GO:0016491
15 AT1G06120 GO:0016021
16 AT1G06120 GO:0016020
17 AT1G06130 GO:0008270
18 AT1G06130 GO:0006750
19 AT1G06130 GO:0004416
20 AT1G06130 GO:0019243
21 AT1G06130 GO:0019344
22 AT1G06130 GO:0009536
23 AT1G06130 GO:0009507
24 AT1G06130 GO:0046872
25 AT1G06130 GO:0016787
26 AT1G06200 GO:0008236
27 AT1G06200 GO:0006508
28 AT1G06200 GO:0016020
Hence, for each gene id the resulting table stores all annotated GO terms found in BioMart.