The BioMart project enables users to retrieve a wide variety of annotation data for specific organisms. Steffen Durinck and Wolfgang Huber provide a powerful interface between the R language and BioMart through the R package biomaRt. The following sections will introduce users to the functionality and data retrieval procedures using the biomaRt
package and will then introduce them to the interface functions biomart()
and biomart_organisms()
implemented in biomartr
that are based on the biomaRt
methodology but aim to introduce a more intuitive way of interacting with BioMart.
The best way to get started with the methodology presented by the established biomaRt package is to understand the workflow of data retrieval. The database provided by BioMart is organized into so-called marts
, datasets
, and attributes
. So when users want to retrieve information for a specific organism of interest, first they need to specify the marts
and datasets
in which the information of the corresponding organism can be found. Subsequently they can specify the attributes
that should be returned for the corresponding organism.
The availability of marts
, datasets
, and attributes
can be checked by the following functions:
# install the biomaRt package from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
# load biomaRt
library(biomaRt)
# look at top 10 databases
head(listMarts(host = "www.ensembl.org"), 10)
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 83
2 ENSEMBL_MART_SNP Ensembl Variation 83
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 83
4 ENSEMBL_MART_VEGA Vega 63
5 pride PRIDE (EBI UK)
Users will observe that several marts
providing annotation for specific classes of organisms or groups of organisms are available.
For our example, we will choose the ENSEMBL_MART_ENSEMBL
mart
and list all available datasets that belong to this mart
.
head(listDatasets(useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")), 10)
dataset description version
1 oanatinus_gene_ensembl Ornithorhynchus anatinus genes (OANA5) OANA5
2 cporcellus_gene_ensembl Cavia porcellus genes (cavPor3) cavPor3
3 gaculeatus_gene_ensembl Gasterosteus aculeatus genes (BROADS1) BROADS1
4 lafricana_gene_ensembl Loxodonta africana genes (loxAfr3) loxAfr3
5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2
6 choffmanni_gene_ensembl Choloepus hoffmanni genes (choHof1) choHof1
7 csavignyi_gene_ensembl Ciona savignyi genes (CSAV2.0) CSAV2.0
8 fcatus_gene_ensembl Felis catus genes (Felis_catus_6.2) Felis_catus_6.2
9 rnorvegicus_gene_ensembl Rattus norvegicus genes (Rnor_6.0) Rnor_6.0
10 psinensis_gene_ensembl Pelodiscus sinensis genes (PelSin_1.0) PelSin_1.0
The useMart()
function is a wrapper function provided by biomaRt
to connect a selected BioMart database (mart
) with a corresponding dataset stored within this mart
.
We select dataset hsapiens_gene_ensembl
and now check for available attributes (annotation data) that can be accessed for Homo sapiens
genes.
head(listAttributes(useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))), 10)
name description page
1 ensembl_gene_id Ensembl Gene ID feature_page
2 ensembl_transcript_id Ensembl Transcript ID feature_page
3 ensembl_peptide_id Ensembl Protein ID feature_page
4 ensembl_exon_id Ensembl Exon ID feature_page
5 description Description feature_page
6 chromosome_name Chromosome Name feature_page
7 start_position Gene Start (bp) feature_page
8 end_position Gene End (bp) feature_page
9 strand Strand feature_page
10 band Band feature_page
Please note the nested structure of this attribute query. For an attribute query procedure an additional wrapper function named useDataset()
is needed in which useMart()
and a corresponding dataset need to be specified. The result is a table storing the names of the available attributes for Homo sapiens as well as a short description of each.
Furthermore, users can retrieve all filters for Homo sapiens that can be used in the actual BioMart query.
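The nested call can also be written step by step, which some readers may find easier to follow. The following is a minimal sketch, assuming the biomaRt package is installed and the Ensembl host is reachable; the wrapper name `list_human_attributes` is hypothetical and only serves to keep the remote query inside a function.

```r
# Hypothetical helper: the nested attribute query written as separate steps.
# Only biomaRt functions already used in this section are called.
list_human_attributes <- function(host = "www.ensembl.org") {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The biomaRt package is required for this sketch.")
  }
  # step 1: connect to the mart
  ensembl <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", host = host)
  # step 2: select the dataset stored within this mart
  human <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl)
  # step 3: list the attributes of the selected dataset
  biomaRt::listAttributes(human)
}

# Usage (performs a remote query):
# head(list_human_attributes(), 10)
```

Storing the intermediate mart object also allows it to be reused for listFilters() and getBM() without reconnecting.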
head(listFilters(useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))), 10)
name description
1 chromosome_name Chromosome name
2 start Gene Start (bp)
3 end Gene End (bp)
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
7 marker_end Marker End
8 encode_region Encode region
9 strand Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
After accumulating all this information, it is now possible to perform an actual BioMart query by using the getBM()
function.
In this example we will retrieve attributes: start_position
,end_position
and description
for the Homo sapiens gene "GUCA2A"
.
Since the input gene is specified by its HGNC symbol
, we need to specify the filters
argument filters = "hgnc_symbol"
.
# 1) select a mart and data set
mart <- useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))
# 2) run a biomart query using the getBM() function
# and specify the attributes and filter arguments
geneSet <- "GUCA2A"
resultTable <- getBM(attributes = c("start_position","end_position","description"),
filters = "hgnc_symbol",
values = geneSet,
mart = mart)
resultTable
start_position end_position
1 42162691 42164718
description
1 guanylate cyclase activator 2A (guanylin) [Source:HGNC Symbol;Acc:HGNC:4682]
When using getBM()
users can pass all attributes retrieved by listAttributes()
to the attributes
argument of the getBM()
function.
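Because a misspelled attribute name only surfaces as an error once the remote query runs, it can help to compare the requested names against the attribute table first. The following offline sketch assumes a table with a `name` column, as returned by listAttributes(); the helper `missing_attributes` and the misspelled name `gene_start` are hypothetical.

```r
# Hypothetical helper: report requested attributes that are absent from the
# attribute table (a data.frame with a `name` column).
missing_attributes <- function(requested, available) {
  setdiff(requested, available$name)
}

# Small stand-in for the table returned by listAttributes()
available <- data.frame(
  name = c("ensembl_gene_id", "description", "start_position", "end_position"),
  stringsAsFactors = FALSE
)

# All three attributes used in the getBM() example are present
ok <- missing_attributes(c("start_position", "end_position", "description"), available)

# A misspelled attribute is reported before any query is sent
bad <- missing_attributes(c("start_position", "gene_start"), available)
print(bad)
```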
biomartr
This query methodology provided by BioMart
and the biomaRt
package is a very well defined approach for accurate annotation retrieval. Nevertheless, when learning this query methodology it (subjectively) seems non-intuitive from the user perspective. Therefore, the biomartr
package provides another query methodology that aims to be more organism centric.
Taken together, the following workflow allows users to perform fast BioMart queries for attributes using the biomart()
function implemented in this biomartr
package:
1) get attributes, datasets, and marts via: organismAttributes()
2) choose available biological features (filters) via: organismFilters()
3) specify a set of query genes, e.g. retrieved with getGenome(), getProteome() or getCDS()
4) specify all arguments of the biomart() function using steps 1) - 3) and perform a BioMart query
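The four steps above can be sketched as a single helper. This is only an illustration under stated assumptions: the wrapper name `run_biomart_workflow` and its default values are hypothetical, the biomartr package must be installed, and the call performs remote queries. It assumes that the tables returned by organismAttributes() and organismFilters() contain a `name` column, as in the outputs shown in this section.

```r
# Hypothetical wrapper around the organism centric workflow.
run_biomart_workflow <- function(genes,
                                 organism   = "Homo sapiens",
                                 mart       = "ENSEMBL_MART_ENSEMBL",
                                 dataset    = "hsapiens_gene_ensembl",
                                 attributes = c("ensembl_gene_id", "ensembl_peptide_id"),
                                 filters    = "refseq_peptide") {
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("The biomartr package is required for this sketch.")
  }
  # 1) look up the attributes available for the organism
  available_attributes <- biomartr::organismAttributes(organism)
  # 2) look up the filters available for the organism
  available_filters <- biomartr::organismFilters(organism)
  # 3) the query genes are passed in by the caller; check the request first
  stopifnot(all(attributes %in% available_attributes$name),
            all(filters %in% available_filters$name))
  # 4) run the BioMart query
  biomartr::biomart(genes      = genes,
                    mart       = mart,
                    dataset    = dataset,
                    attributes = attributes,
                    filters    = filters)
}

# Usage (performs remote queries):
# run_biomart_workflow(genes = c("NP_000005", "NP_000006"))
```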
Note that dataset names change very frequently due to the update of dataset versions. So in case some query functions do not work properly, users should check with organismAttributes(update = TRUE)
whether or not their dataset name has been changed. For example, organismAttributes("Homo sapiens", topic = "id", update = TRUE)
might reveal that a dataset name within ENSEMBL_MART_ENSEMBL
has changed.
The getMarts()
function allows users to list all available databases that can be accessed through BioMart interfaces.
# load the biomartr package
library(biomartr)
# list all available databases
getMarts()
mart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 87
2 ENSEMBL_MART_MOUSE Mouse strains 87
3 ENSEMBL_MART_SEQUENCE Sequence
4 ENSEMBL_MART_ONTOLOGY Ontology
5 ENSEMBL_MART_GENOMIC Genomic features 87
6 ENSEMBL_MART_SNP Ensembl Variation 87
7 ENSEMBL_MART_FUNCGEN Ensembl Regulation 87
8 ENSEMBL_MART_VEGA Vega 67
Now users can select a specific database to list all available datasets that can be accessed through this database. In this example we choose the ENSEMBL_MART_ENSEMBL
database.
head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 5)
dataset description version
1 oanatinus_gene_ensembl Platypus genes (OANA5) OANA5
2 cporcellus_gene_ensembl Guinea Pig genes (cavPor3) cavPor3
3 gaculeatus_gene_ensembl Stickleback genes (BROAD S1) BROAD S1
4 lafricana_gene_ensembl Elephant genes (Loxafr3.0) Loxafr3.0
5 itridecemlineatus_gene_ensembl Squirrel genes (spetri2) spetri2
Now you can select the dataset hsapiens_gene_ensembl
and list all available attributes that can be retrieved from this dataset.
tail(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 38)
dataset
32 hsapiens_gene_ensembl
33 pformosa_gene_ensembl
34 tbelangeri_gene_ensembl
35 mfuro_gene_ensembl
36 ggallus_gene_ensembl
37 xtropicalis_gene_ensembl
38 ecaballus_gene_ensembl
39 pabelii_gene_ensembl
40 xmaculatus_gene_ensembl
41 drerio_gene_ensembl
42 tnigroviridis_gene_ensembl
43 lchalumnae_gene_ensembl
44 amelanoleuca_gene_ensembl
45 mmulatta_gene_ensembl
46 pvampyrus_gene_ensembl
47 panubis_gene_ensembl
48 mdomestica_gene_ensembl
49 acarolinensis_gene_ensembl
50 vpacos_gene_ensembl
51 tsyrichta_gene_ensembl
52 ogarnettii_gene_ensembl
53 dmelanogaster_gene_ensembl
54 loculatus_gene_ensembl
55 mmurinus_gene_ensembl
56 olatipes_gene_ensembl
57 oprinceps_gene_ensembl
58 ggorilla_gene_ensembl
59 dordii_gene_ensembl
60 oaries_gene_ensembl
61 mmusculus_gene_ensembl
62 mgallopavo_gene_ensembl
63 gmorhua_gene_ensembl
64 saraneus_gene_ensembl
65 aplatyrhynchos_gene_ensembl
66 sharrisii_gene_ensembl
67 btaurus_gene_ensembl
68 meugenii_gene_ensembl
69 cfamiliaris_gene_ensembl
description version
32 Human genes (GRCh38.p7) GRCh38.p7
33 Amazon molly genes (Poecilia_formosa-5.1.2) Poecilia_formosa-5.1.2
34 Tree Shrew genes (tupBel1) tupBel1
35 Ferret genes (MusPutFur1.0) MusPutFur1.0
36 Chicken genes (Gallus_gallus-5.0) Gallus_gallus-5.0
37 Xenopus genes (JGI 4.2) JGI 4.2
38 Horse genes (Equ Cab 2) Equ Cab 2
39 Orangutan genes (PPYG2) PPYG2
40 Platyfish genes (Xipmac4.4.2) Xipmac4.4.2
41 Zebrafish genes (GRCz10) GRCz10
42 Tetraodon genes (TETRAODON 8.0) TETRAODON 8.0
43 Coelacanth genes (LatCha1) LatCha1
44 Panda genes (ailMel1) ailMel1
45 Macaque genes (Mmul_8.0.1) Mmul_8.0.1
46 Megabat genes (pteVam1) pteVam1
47 Olive baboon genes (PapAnu2.0) PapAnu2.0
48 Opossum genes (monDom5) monDom5
49 Anole lizard genes (AnoCar2.0) AnoCar2.0
50 Alpaca genes (vicPac1) vicPac1
51 Tarsier genes (tarSyr1) tarSyr1
52 Bushbaby genes (OtoGar3) OtoGar3
53 Fruitfly genes (BDGP6) BDGP6
54 Spotted gar genes (LepOcu1) LepOcu1
55 Mouse Lemur genes (Mmur_2.0) Mmur_2.0
56 Medaka genes (HdrR) HdrR
57 Pika genes (OchPri2.0) OchPri2.0
58 Gorilla genes (gorGor3.1) gorGor3.1
59 Kangaroo rat genes (dipOrd1) dipOrd1
60 Sheep genes (Oar_v3.1) Oar_v3.1
61 Mouse genes (GRCm38.p5) GRCm38.p5
62 Turkey genes (Turkey_2.01) Turkey_2.01
63 Cod genes (gadMor1) gadMor1
64 Shrew genes (sorAra1) sorAra1
65 Duck genes (BGI_duck_1.0) BGI_duck_1.0
66 Tasmanian devil genes (Devil_ref v7.0) Devil_ref v7.0
67 Cow genes (UMD3.1) UMD3.1
68 Wallaby genes (Meug_1.0) Meug_1.0
69 Dog genes (CanFam3.1) CanFam3.1
Having selected a database (ENSEMBL_MART_ENSEMBL
) and a dataset (hsapiens_gene_ensembl
), users can list all available attributes for this dataset using the getAttributes()
function.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# list all available attributes for dataset: hsapiens_gene_ensembl
head( getAttributes(mart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl"), 10 )
name description
1 ensembl_gene_id Gene ID
2 ensembl_transcript_id Transcript ID
3 ensembl_peptide_id Protein ID
4 ensembl_exon_id Exon ID
5 description Description
6 chromosome_name Chromosome/scaffold name
7 start_position Gene Start (bp)
8 end_position Gene End (bp)
9 strand Strand
10 band Band
Finally, the getFilters()
function allows users to list available filters for a specific dataset that can be used for a biomart()
query.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# list all available filters for dataset: hsapiens_gene_ensembl
head( getFilters(mart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl"), 10 )
name description
1 chromosome_name Chromosome name
2 start Gene Start (bp)
3 end Gene End (bp)
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
7 marker_end Marker End
8 encode_region Encode region
9 strand Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
In most use cases, users will work with a single model organism or a small set of model organisms and will mostly be interested in specific annotations for these organisms. The organismBM()
function addresses this issue and provides users with an organism centric query to marts
and datasets
which are available for a particular organism of interest.
Note that when running the following functions for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a *.txt
file named _biomart/listDatasets.txt
within the tempdir()
folder, allowing subsequent queries to be performed much faster. The tempdir()
folder, however, is deleted when the R session ends. In a new session, the initial call of the subsequent functions will therefore again take time to retrieve all organism specific data from the BioMart database.
This concept of locally storing all organism specific database linking information available in BioMart into an internal file allows users to significantly speed up subsequent retrieval queries for that particular organism.
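Whether the cached file already exists in the current session can be checked with base R. The following sketch only builds the path described above (`_biomart/listDatasets.txt` inside tempdir()) and tests for its presence; it does not call BioMart.

```r
# Path of the cache file described above, inside the session's temp folder
cache_file <- file.path(tempdir(), "_biomart", "listDatasets.txt")

if (file.exists(cache_file)) {
  message("Cached dataset listing found: subsequent queries should be fast.")
} else {
  message("No cached dataset listing yet: the first query will take longer.")
}
```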
# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
organismBM(organism = "Homo sapiens")
organism_name description
<chr> <chr>
1 hsapiens Human genes (GRCh38.p7)
2 hsapiens homo_sapiens sequences (GRCh38.p7)
3 hsapiens Human Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p7)
4 hsapiens Human Structural Variants (GRCh38.p7)
5 hsapiens Human Somatic Structural Variants (GRCh38.p7)
6 hsapiens Human Somatic Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p7)
7 hsapiens Human Regulatory Evidence (GRCh38.p7)
8 hsapiens Human Binding Motifs (GRCh38.p7)
9 hsapiens Human Regulatory Features (GRCh38.p7)
10 hsapiens Human miRNA Target Regions (GRCh38.p7)
11 hsapiens Human Other Regulatory Regions (GRCh38.p7)
12 hsapiens Human genes (GRCh38.p7)
with 3 more variables: mart <chr>, dataset <chr>, version <chr>
The result is a table storing all marts
and datasets
from which annotations can be retrieved for Homo sapiens. Furthermore, a short description as well as the version of the dataset being accessed (very useful for publications) is returned.
Users will observe that several marts
provide a total of 12 datasets
storing annotation information for Homo sapiens.
Please note, however, that scientific names of organisms must be written correctly. For example, “Homo Sapiens” will not be recognized, whereas “Homo sapiens” will be.
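When organism names come from user input or spreadsheets, a small normalization step can reduce such mismatches. The helper below is a hypothetical sketch in base R, not part of biomartr; it assumes plain binomial names and would also lowercase any strain or subspecies designations, which may not always be desired.

```r
# Hypothetical helper: capitalize the genus, lowercase everything else,
# and collapse surplus whitespace.
normalize_scientific_name <- function(x) {
  x <- trimws(x)
  x <- gsub("[[:space:]]+", " ", x)
  x <- tolower(x)
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

normalized <- normalize_scientific_name(c("Homo Sapiens", "  homo   sapiens "))
print(normalized)
```

The normalized name can then be passed to organismBM() or organismAttributes().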
Similar to the biomaRt
package query methodology, users need to specify attributes
and filters
to be able to perform accurate BioMart queries. Here the functions organismAttributes()
and organismFilters()
provide useful and intuitive concepts to obtain this information.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# return available attributes for "Homo sapiens"
head(organismAttributes("Homo sapiens"), 20)
name description
<chr> <chr>
1 ensembl_gene_id Gene ID
2 ensembl_transcript_id Transcript ID
3 ensembl_peptide_id Protein ID
4 ensembl_exon_id Exon ID
5 description Description
6 chromosome_name Chromosome/scaffold name
7 start_position Gene Start (bp)
8 end_position Gene End (bp)
9 strand Strand
10 band Band
11 transcript_start Transcript Start (bp)
12 transcript_end Transcript End (bp)
13 transcription_start_site Transcription Start Site (TSS)
14 transcript_length Transcript length (including UTRs and CDS)
15 transcript_tsl Transcript Support Level (TSL)
16 transcript_gencode_basic GENCODE basic annotation
17 transcript_appris APPRIS annotation
18 external_gene_name Associated Gene Name
19 external_gene_source Associated Gene Source
20 external_transcript_name Associated Transcript Name
with 2 more variables: dataset <chr>, mart <chr>
Warning messages:
1: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
2: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
3: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
4: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
5: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
Users will observe that the organismAttributes()
function returns a data.frame storing attribute names, datasets, and marts which are available for Homo sapiens
. After the ENSEMBL release 87 the ENSEMBL_MART_SEQUENCE
service provided by Ensembl does not work properly and thus the organismAttributes()
function prints warning messages to make users aware that certain marts provided by Ensembl do not work properly yet.
An additional feature provided by organismAttributes()
is the topic
argument. The topic
argument allows users to search for specific attributes, topics, or categories for faster filtering.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "id"
head(organismAttributes("Homo sapiens", topic = "id"), 20)
name description dataset
<chr> <chr> <chr>
1 ensembl_gene_id Gene ID hsapiens_gene_ensembl
2 ensembl_transcript_id Transcript ID hsapiens_gene_ensembl
3 ensembl_peptide_id Protein ID hsapiens_gene_ensembl
4 ensembl_exon_id Exon ID hsapiens_gene_ensembl
5 study_external_id Study External Reference hsapiens_gene_ensembl
6 go_id GO Term Accession hsapiens_gene_ensembl
7 dbass3_id Database of Aberrant 3 Splice Sites (DBASS3) IDs hsapiens_gene_ensembl
8 dbass5_id Database of Aberrant 5 Splice Sites (DBASS5) IDs hsapiens_gene_ensembl
9 hgnc_id HGNC ID(s) hsapiens_gene_ensembl
10 mirbase_id miRBase ID(s) hsapiens_gene_ensembl
11 mim_morbid MIM MORBID hsapiens_gene_ensembl
12 protein_id Protein (Genbank) ID [e.g. AAA02487] hsapiens_gene_ensembl
13 refseq_peptide RefSeq Protein ID [e.g. NP_001005353] hsapiens_gene_ensembl
14 refseq_peptide_predicted RefSeq Predicted Protein ID [e.g. XP_001720922] hsapiens_gene_ensembl
15 wikigene_id WikiGene ID hsapiens_gene_ensembl
16 ensembl_gene_id Gene ID hsapiens_gene_ensembl
17 ensembl_transcript_id Transcript ID hsapiens_gene_ensembl
18 ensembl_peptide_id Protein ID hsapiens_gene_ensembl
19 ensembl_exon_id Exon ID hsapiens_gene_ensembl
20 ensembl_gene_id Gene ID hsapiens_gene_ensembl
with 1 more variables: mart <chr>
Now, all attribute names
having id
as part of their name
are being returned.
Another example is topic = "homolog"
.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "homolog"
head(organismAttributes("Homo sapiens", topic = "homolog"), 20)
name
<chr>
1 vpacos_homolog_ensembl_gene
2 vpacos_homolog_associated_gene_name
3 vpacos_homolog_ensembl_peptide
4 vpacos_homolog_chromosome
5 vpacos_homolog_chrom_start
6 vpacos_homolog_chrom_end
7 vpacos_homolog_canonical_transcript_protein
8 vpacos_homolog_subtype
9 vpacos_homolog_orthology_type
10 vpacos_homolog_perc_id
11 vpacos_homolog_perc_id_r1
12 vpacos_homolog_goc_score
13 vpacos_homolog_wga_coverage
14 vpacos_homolog_dn
15 vpacos_homolog_ds
16 vpacos_homolog_orthology_confidence
17 pformosa_homolog_ensembl_gene
18 pformosa_homolog_associated_gene_name
19 pformosa_homolog_ensembl_peptide
20 pformosa_homolog_chromosome
with 3 more variables: description <chr>, dataset <chr>, mart <chr>
Or topic = "dn"
and topic = "ds"
for dn
and ds
value retrieval.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "dn"
head(organismAttributes("Homo sapiens", topic = "dn"))
A tibble: 6 × 4
name description dataset mart
<chr> <chr> <chr> <chr>
1 cdna_coding_start cDNA coding start hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2 cdna_coding_end cDNA coding end hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3 vpacos_homolog_dn dN with Alpaca hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4 pformosa_homolog_dn dN with Amazon molly hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
5 acarolinensis_homolog_dn dN with Anole lizard hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
6 dnovemcinctus_homolog_ensembl_gene Armadillo gene stable ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "ds"
head(organismAttributes("Homo sapiens", topic = "ds"))
name description dataset mart
<chr> <chr> <chr> <chr>
1 ccds CCDS ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2 cds_length CDS Length hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3 cds_start CDS Start hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4 cds_end CDS End hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
5 vpacos_homolog_ds dS with Alpaca hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
6 pformosa_homolog_ds dS with Amazon molly hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
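The outputs above suggest that the topic appears to be matched as a substring of the attribute name, which is why topic = "dn" also returns entries such as cdna_coding_start. The following offline sketch, using attribute names copied from the output above, shows how the result could be narrowed afterwards with a stricter pattern; the pattern `_homolog_dn$` is an assumption about the naming scheme seen in these examples.

```r
# Attribute names taken from the topic = "dn" output shown above
attribute_names <- c("cdna_coding_start",
                     "cdna_coding_end",
                     "vpacos_homolog_dn",
                     "pformosa_homolog_dn",
                     "acarolinensis_homolog_dn",
                     "dnovemcinctus_homolog_ensembl_gene")

# Plain substring search: every name contains "dn" somewhere
substring_hits <- attribute_names[grepl("dn", attribute_names, fixed = TRUE)]

# Stricter pattern: keep only names that end in "_homolog_dn"
dn_only <- attribute_names[grepl("_homolog_dn$", attribute_names)]
print(dn_only)
```

With a real result table, the same filter can be applied to its name column.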
Analogous to the organismAttributes()
function, the organismFilters()
function returns all filters that are available for a query organism of interest.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# return available filters for "Homo sapiens"
head(organismFilters("Homo sapiens"), 20)
name description
<chr> <chr>
1 chromosome_name Chromosome name
2 start Gene Start (bp)
3 end Gene End (bp)
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
7 marker_end Marker End
8 encode_region Encode region
9 strand Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
11 with_hgnc with HGNC ID(s)
12 with_hgnc_transcript_name with HGNC transcript name(s)
13 with_ox_arrayexpress with ArrayExpress ID(s)
14 with_ccds with CCDS ID(s)
15 with_chembl with ChEMBL ID(s)
16 with_ox_clone_based_ensembl_gene with clone based Ensembl gene ID(s)
17 with_ox_clone_based_ensembl_transcript with clone based Ensembl transcript ID(s)
18 with_ox_clone_based_vega_gene with clone based VEGA gene ID(s)
19 with_ox_clone_based_vega_transcript with clone based VEGA transcript ID(s)
20 with_dbass3 with DBASS3 ID(s)
with 2 more variables: dataset <chr>, mart <chr>
The organismFilters()
function also allows users to search for filters that correspond to a specific topic or category.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for filter topic "id"
head(organismFilters("Homo sapiens", topic = "id"), 20)
name description
<chr> <chr>
1 with_go_id with GO Term Accession(s)
2 with_mim_morbid with MIM MORBID ID(s)
3 with_protein_id with protein (Genbank) ID(s)
4 with_refseq_peptide with RefSeq protein ID(s)
5 with_refseq_peptide_predicted with RefSeq predicted protein ID(s)
6 ensembl_gene_id Gene ID(s) [e.g. ENSG00000139618]
7 ensembl_transcript_id Transcript ID(s) [e.g. ENST00000380152]
8 ensembl_peptide_id Protein ID(s) [e.g. ENSP00000369497]
9 ensembl_exon_id Exon ID(s) [e.g. ENSE00001508081]
10 hgnc_id HGNC ID(s) [e.g. HGNC:8030]
11 go_id GO Term Accession(s) [e.g. GO:0005515]
12 mim_morbid MIM MORBID ID(s) [e.g. 100100]
13 mirbase_id miRBase ID(s) [e.g. hsa-mir-137]
14 protein_id Protein (Genbank) ID(s) [e.g. ACU09872]
15 refseq_peptide RefSeq protein ID(s) [e.g. NP_001005353]
16 refseq_peptide_predicted RefSeq predicted protein ID(s) [e.g. XP_011520427]
17 wikigene_id WikiGene ID(s) [e.g. 115286]
18 go_evidence_code GO Evidence code
19 with_itridecemlineatus_homolog Orthologous Squirrel Genes
20 with_tnigroviridis_homolog Orthologous Tetraodon Genes
with 2 more variables: dataset <chr>, mart <chr>
The short introduction to the functionality of organismBM()
, organismAttributes()
, and organismFilters()
will allow users to perform BioMart queries in a very intuitive organism centric way. The main function to perform BioMart queries is biomart()
.
For the following examples we will assume that we are interested in the annotation of specific genes from the Homo sapiens proteome. We want to map the corresponding RefSeq ids to a set of gene ids used in other databases. For this purpose, we first need to consult the organismAttributes()
function.
# show all elements of the data.frame
options(tibble.print_max = Inf)
head(organismAttributes("Homo sapiens", topic = "id"))
name description dataset mart
<chr> <chr> <chr> <chr>
1 ensembl_gene_id Gene ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2 ensembl_transcript_id Transcript ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3 ensembl_peptide_id Protein ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4 ensembl_exon_id Exon ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
5 study_external_id Study External Reference hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
6 go_id GO Term Accession hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieve the proteome of Homo sapiens from refseq
file_path <- getProteome( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes") )
Hsapiens_proteome <- read_proteome(file_path, format = "fasta")
# keep only the accession of the first five entries by dropping everything after the first "." (version suffix)
gene_set <- vapply(strsplit(names(Hsapiens_proteome)[1:5], ".", fixed = TRUE), function(x) x[1], character(1))
result_BM <- biomart( genes = gene_set,
mart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
attributes = c("ensembl_gene_id","ensembl_peptide_id"),
filters = "refseq_peptide")
result_BM
refseq_peptide ensembl_gene_id ensembl_peptide_id
1 NP_000005 ENSG00000175899 ENSP00000323929
2 NP_000006 ENSG00000156006 ENSP00000286479
3 NP_000007 ENSG00000117054 ENSP00000359878
4 NP_000008 ENSG00000122971 ENSP00000242592
5 NP_000009 ENSG00000072778 ENSP00000349297
The biomart()
function takes as arguments a set of genes (gene ids of the type specified in the filters
argument), the corresponding mart
and dataset
, as well as the attributes
which shall be returned.
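The id clean-up step used before the query can be tried offline. In the sketch below the two accessions are taken from the result table above, while the version numbers and header descriptions are made up for illustration; the helper name `strip_version` is hypothetical.

```r
# Hypothetical helper: reduce FASTA headers to unversioned accessions.
strip_version <- function(headers) {
  # keep the text before the first whitespace (the versioned accession)
  accession <- sub("[[:space:]].*$", "", headers)
  # drop a trailing ".<number>" version suffix
  sub("\\.[0-9]+$", "", accession)
}

# Illustrative headers: versions and descriptions are invented
fasta_headers <- c("NP_000005.2 example protein one",
                   "NP_000006.2 example protein two")

clean_ids <- strip_version(fasta_headers)
print(clean_ids)
```

Ids cleaned in this way match the format expected by the refseq_peptide filter used in the query above.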
The biomartr
package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO()
function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO()
function allows GO information retrieval from the BioMart database.
In this example we will retrieve GO information for a set of Homo sapiens genes stored as hgnc_symbol
.
The getGO()
function takes several arguments as input to retrieve GO information from BioMart. First, the scientific name of the organism
of interest needs to be specified. Furthermore, a set of gene ids
as well as their corresponding filter
notation (GUCA2A
gene ids have filter
notation hgnc_symbol
; see organismFilters()
for details) need to be specified. The database
argument then defines the database from which GO information shall be retrieved.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for GO terms of an example Homo sapiens gene
GO_tbl <- getGO(organism = "Homo sapiens",
genes = "GUCA2A",
filters = "hgnc_symbol")
GO_tbl
hgnc_symbol goslim_goa_description goslim_goa_accession
1 GUCA2A cellular_component GO:0005575
2 GUCA2A extracellular region GO:0005576
3 GUCA2A biological_process GO:0008150
4 GUCA2A cellular nitrogen compound metabolic process GO:0034641
5 GUCA2A small molecule metabolic process GO:0044281
6 GUCA2A biosynthetic process GO:0009058
7 GUCA2A molecular_function GO:0003674
8 GUCA2A organelle GO:0043226
9 GUCA2A enzyme regulator activity GO:0030234
Hence, for each gene id the resulting table stores all annotated GO terms found in BioMart.
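When several genes are queried, the long-format table can be regrouped so that each gene maps to its GO accessions. The following offline sketch uses three rows copied from the output above as a stand-in for the table returned by getGO(); it assumes the column names shown there.

```r
# Stand-in for a getGO() result, using rows from the output above
go_tbl <- data.frame(
  hgnc_symbol            = c("GUCA2A", "GUCA2A", "GUCA2A"),
  goslim_goa_description = c("extracellular region",
                             "biosynthetic process",
                             "enzyme regulator activity"),
  goslim_goa_accession   = c("GO:0005576", "GO:0009058", "GO:0030234"),
  stringsAsFactors = FALSE
)

# One character vector of GO accessions per gene
terms_per_gene <- split(go_tbl$goslim_goa_accession, go_tbl$hgnc_symbol)

# Number of annotated GO terms per gene
n_terms <- lengths(terms_per_gene)
print(n_terms)
```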