This document summarizes work through October 2018 on demonstrating the concept of remote HDF5. There are two main components to this document: first, a section on Bioconductor-oriented object designs and methods for using HDF5 data in a server or object store, culminating in DelayedArray interfaces to remote HDF5 datasets that are 2-d arrays of numbers (the HDF5 server can be deployed locally by interested users; in this vignette, code involving it is not evaluated); second, a section on a more general R interface to the h5py/h5pyd APIs for working with remote or local HDF5.
N.B. All python modules that are imported in this document are imported with convert=FALSE, so that there are no unintended translations of python data into R data. You will see py_to_r used below to accomplish such transitions when desired.
The rhdf5client package is a basis for using HDF Server and HDF Object store with R/Bioconductor.
The primary software base for working with the HDF Scalable Data Service is the h5pyd Python library. Below we discuss how to use that library with R. Here we illustrate the basic interfaces in rhdf5client. John Readey of the HDF Group has provided a public repository of HDF5 data in an HSDS folder called /shared/bioconductor that will be used here.
Use HSDSSource to create an object that can route queries to HSDS. The resource is structured like a file system. You can use listDomains to enumerate available domains, which can be thought of as files that contain HDF5 datasets.
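The chunk producing the output below is not echoed; a minimal sketch of the corresponding calls, assuming listDomains accepts a source object and a folder path:
URL_hsds()                                # the HSDS endpoint in use
src = HSDSSource(URL_hsds())              # object that routes queries to HSDS
listDomains(src, "/shared/bioconductor")  # enumerate domains in the folder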
## [1] "http://hsdshdflab.hdfgroup.org"
## [1] "/shared/bioconductor/darmgcls.h5"
## [2] "/shared/bioconductor/patelGBMSC.h5"
## [3] "/shared/bioconductor/tabmuris70k20t.h5"
## [4] "/shared/bioconductor/htxcomp_genes.h5"
## [5] "/shared/bioconductor/bano_meQTLex.h5"
## [6] "/shared/bioconductor/pbmc68k.h5"
## [7] "/shared/bioconductor/tenx_full.h5"
## [8] "/shared/bioconductor/gtex_tissues.h5"
## [9] "/shared/bioconductor/gtex_tissues2.h5"
The HSDSFile method provides focused access. When a dataset is selected, square brackets can be used to obtain content.
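A sketch of the calls behind the output below; the dataset name "/assay001" is an assumption, as is the exact HSDSDataset signature:
f = HSDSFile(src, "/shared/bioconductor/darmgcls.h5")
f                                # shows domain and usage hints
d = HSDSDataset(f, "/assay001")  # hypothetical dataset name
d
d[1:3, 1:4]                      # square-bracket retrieval of a small block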
## rhdf5client HSDSFile instance from source http://hsdshdflab.hdfgroup.org
## domain: /shared/bioconductor/darmgcls.h5
## use listDatasets(...) and HSDSDataset(..., [dataset name]) for more content.
## rhdf5client HSDSDataset instance, with shape c(3584, 65218)
## use getData(...) or square brackets to retrieve content.
## [,1] [,2] [,3] [,4]
## [1,] 0.0000 0 0 5.335452
## [2,] 0.0000 0 0 11.685833
## [3,] 112.3944 0 0 0.000000
The HSDSArray constructor wraps a remote dataset as a DelayedArray; its show method gives a preview of ‘corners’ of the dataset.
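A sketch, assuming HSDSArray is called with an endpoint, server type, domain, and dataset name (the dataset name is again an assumption):
da = HSDSArray(URL_hsds(), "hsds",
  "/shared/bioconductor/darmgcls.h5", "/assay001")
da   # DelayedArray-style corner preview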
## <65218 x 3584> HSDSMatrix object of type "double":
## [,1] [,2] [,3] ... [,3583] [,3584]
## [1,] 0.000000 0.000000 112.394374 . 0.00000 0.00000
## [2,] 0.000000 0.000000 0.000000 . 0.00000 0.00000
## [3,] 0.000000 0.000000 0.000000 . 0.00000 0.00000
## [4,] 5.335452 11.685833 0.000000 . 0.00000 14.01612
## [5,] 0.000000 0.000000 0.000000 . 0.00000 0.00000
## ... . . . . . .
## [65214,] 0.00000 0.00000 0.00000 . 0.00000 0.00000
## [65215,] 480.68946 1228.13851 112.75566 . 0.00000 0.00000
## [65216,] 0.00000 0.00000 0.00000 . 0.00000 0.00000
## [65217,] 0.00000 610.82997 46.86639 . 0.00000 0.00000
## [65218,] 10155.80336 25366.30099 2068.63983 . 4.01555 2531.88862
The File API for the object store is a little different from the one for local HDF5.
if (.Platform$OS.type != "windows") {
  library(reticulate)
  # import h5pyd without automatic conversion of python objects
  Rh5pyd = import("h5pyd", as="h5py", convert=FALSE)
  # open a read-only File on an object-store domain
  assays = Rh5pyd$File("/shared/bioconductor/bano_meQTLex.h5", "r",
    endpoint=URL_hsds())
  assays
  assays$keys()          # a python object, not yet converted
  py_to_r(assays$keys()) # the strings of interest
}
## [1] "assay001"
The following function obtains a slice from a dataset in the object store. The index expression must be appropriate for the dataset and follows the h5pyd convention, start:end:stride for each dimension, with end and stride optional.
if (.Platform$OS.type != "windows") {
  getslice = function(endpoint, mode, domain, dsname, indexstring="[0,0]") {
    # use plain single quotes; sQuote can emit directional quotes that break python
    sq = function(x) paste0("'", x, "'")
    py_run_string("import h5pyd", convert=FALSE)
    py_run_string(paste0("f = h5pyd.File(domain=", sq(domain),
      ", mode=", sq(mode), ", endpoint=", sq(endpoint), ")"))
    py_run_string(paste0("g = f['", dsname, "']", indexstring))$g
  }
  mr = getslice(URL_hsds(), "r",
    "/home/stvjc/assays.h5", "assay001", "[0:4, 0:27998]")
  apply(mr, 1, sum)  # row sums of the 4 x 27998 slice
}
## [1] 980292.7 1020768.3 1010458.6 986350.3
The reticulate package makes it easy to convey python infrastructure directly to R users. However, we want to shape aspects of the interaction to simplify statistical computing. We’ll start by considering how to use local HDF5 with the h5py python modules, and then transition to remote HDF5.
Some of the basic strategies are adumbrated in the BiocSklearn package, a proof of concept of using scikit-learn modules in R.
A note on documentation: for many python concepts imported into an R session via reticulate::import, py_help may be used to obtain documentation as recorded in python docstrings. Thus, after the import defining np below, py_help(np) will return a paged high-level document on numpy to the session.
We’ll start with imports of key R and python packages. The _hl modules are fundamental infrastructure.
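The import chunk is not echoed; a plausible sketch follows (binding Rh5py to h5py._hl is an assumption, consistent with the use of Rh5py$files$File below):
library(reticulate)
np = import("numpy", convert=FALSE)        # used below via np$take
h5py = import("h5py", convert=FALSE)       # local HDF5 interface
Rh5py = import("h5py._hl", convert=FALSE)  # assumption: low-level _hl infrastructure
names(Rh5py)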
## [1] "absolute_import" "attrs" "base" "dataset"
## [5] "datatype" "files" "filters" "group"
## [9] "selections" "selections2" "attrs" "base"
## [13] "dataset" "datatype" "files" "filters"
## [17] "group" "selections" "selections2"
The following code demonstrates ways of interfacing to HDF5 via python. h5file simply returns a python reference to a File instance. h5dsref builds python commands to facilitate manipulation of an HDF5 dataset in R via numpy.
h5file = function(file)
  Rh5py$files$File(file, mode="r")  # read-only python File reference
fn = system.file("hdf5/numiris.h5", package="rhdf5client")
m1 = h5file(fn)
m1
## <HDF5 file "numiris.h5" (mode r)>
## [1] "h5py._hl.files.File" "h5py._hl.group.Group"
## [3] "h5py._hl.base.HLObject" "h5py._hl.base.CommonStateObject"
## [5] "h5py._hl.base.MutableMappingHDF5" "h5py._hl.base.MappingHDF5"
## [7] "_abcoll.MutableMapping" "_abcoll.Mapping"
## [9] "_abcoll.Sized" "_abcoll.Iterable"
## [11] "_abcoll.Container" "python.builtin.object"
The File instance can be regarded as a python dictionary. We can learn the names of the datasets in the file:
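A sketch of the call; keys is the standard dictionary-style method on h5py File objects:
m1$keys()   # python list of dataset names (not converted to R)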
## [u'numiris']
The h5dsref function was devised to give convenient access to a dataset representing a matrix.
h5dsref = function(filename, dsname="numiris") {
  py_run_string("import h5py", convert=FALSE)
  # open read-only; an explicit mode avoids h5py's mode-dependent defaults
  py_run_string(paste0("f = h5py.File('", filename, "', 'r')"))
  mref = py_run_string(paste0("g = f['", dsname, "']"))
  mref$g  # python reference to the dataset
}
We’ll focus on the h5dsref approach for now. We can get slices of the target array using numpy’s take.
numir = h5dsref(fn)
ta = np$take  # bind the function; can't use `$` on the fly
numirsli = ta(numir, 0:2, 1L)  # first three columns (axis 1, zero-based)
class(numirsli)
## [1] "numpy.ndarray" "python.builtin.object"
## [[ 5.1 4.9 4.7]
## [ 3.5 3. 3.2]
## [ 1.4 1.4 1.3]
## [ 0.2 0.2 0.2]]
So numirsli is a submatrix of the iris data in the file referenced by fn (numiris.h5), with class numpy.ndarray. We can learn about available methods using names, and try some out.
## [1] "T" "all" "any" "argmax"
## [5] "argmin" "argpartition" "argsort" "astype"
## [9] "base" "byteswap" "choose" "clip"
## [13] "compress" "conj" "conjugate" "copy"
## [17] "ctypes" "cumprod" "cumsum" "data"
## [21] "diagonal" "dot" "dtype" "dump"
## [25] "dumps" "fill" "flags" "flat"
## [29] "flatten" "getfield" "imag" "item"
## [33] "itemset" "itemsize" "max" "mean"
## [37] "min" "nbytes" "ndim" "newbyteorder"
## [41] "nonzero" "partition" "prod" "ptp"
## [45] "put" "ravel" "real" "repeat"
## [49] "reshape" "resize" "round" "searchsorted"
## [53] "setfield" "setflags" "shape" "size"
## [57] "sort" "squeeze" "std" "strides"
## [61] "sum" "swapaxes" "take" "tobytes"
## [65] "tofile" "tolist" "tostring" "trace"
## [69] "transpose" "var" "view"
## 2
## (4, 3)
## (3, 4)
Furthermore, we can create an R matrix with the HDF5 numerical content, as sliced via take, using py_to_r from reticulate:
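A sketch of the conversion:
rmat = py_to_r(numirsli)  # numpy slice to R matrix
dim(rmat)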
## [1] 4 3
Thus, given an HDF5 dataset that can be regarded as a numpy array, we can interrogate its attributes and retrieve slices from R using h5dsref.
if (.Platform$OS.type != "windows") {
  tf = tempfile()
  nf = h5py$File(tf, "w")  # new local HDF5 file, write mode
  irmat = data.matrix(iris[,1:4])
  nf$create_dataset('irisH5', data=r_to_py(irmat))
  chk = h5dsref(tf, "irisH5")
  ta(chk, 0:4, 0L)         # first five rows (axis 0)
  nf$file$close()          # no more reading, but
  try(ta(chk, 0:4, 0L))    # is the close operation working?
}
## [[ 5.1 3.5 1.4 0.2]
## [ 4.9 3. 1.4 0.2]
## [ 4.7 3.2 1.3 0.2]
## [ 4.6 3.1 1.5 0.2]
## [ 5. 3.6 1.4 0.2]]
Details on the File interface are provided in the h5py docs. The Rh5py interface defined here would appear to be an adequate approach to interfacing between R and HDF5, but we already have plenty of mileage in rhdf5. Our real interest is in providing a comprehensive interface to the HDF Server and Object Store APIs, and we turn to this now.
The getslice function will work with references to an HDF Server. However, in the context of the vignette compilation, I see an authentication error triggered. It is not clear why; if the two getslice commands are isolated and run in a single R session, no problem arises.
We’ll focus on the object store. After importing h5pyd using reticulate, we can learn about available infrastructure.
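The listing below corresponds to enumerating the module's exported names, reusing the Rh5pyd import from above:
names(Rh5pyd)   # classes and helpers exported by h5pyd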
## [1] "AttributeManager" "Config"
## [3] "Dataset" "Datatype"
## [5] "ExternalLink" "File"
## [7] "Folder" "Group"
## [9] "HardLink" "Reference"
## [11] "RegionReference" "SoftLink"
## [13] "UserDefinedLink" "absolute_import"
## [15] "check_dtype" "config"
## [17] "enable_ipython_completer" "getServerInfo"
## [19] "special_dtype" "version"
## [21] "_hl" "config"
## [23] "version"
With py_help(Rh5pyd$Dataset), we obtain extensive documentation in our R session.
Help on class Dataset in module h5pyd._hl.dataset:
class Dataset(h5pyd._hl.base.HLObject)
| Represents an HDF5 dataset
|
| Method resolution order:
| Dataset
| h5pyd._hl.base.HLObject
| h5pyd._hl.base.CommonStateObject
| __builtin__.object
|
| Methods defined here:
|
| __array__(self, dtype=None)
| Create a Numpy array containing the whole dataset. DON'T THINK
| THIS MEANS DATASETS ARE INTERCHANGABLE WITH ARRAYS. For one thing,
| you have to read the whole dataset everytime this method is called.
|
| __getitem__(self, args)
| Read a slice from the HDF5 dataset.
|
| Takes slices and recarray-style field names (more than one is
| allowed!) in any order. Obeys basic NumPy rules, including
| broadcasting.
...
In what follows, we show the code that creates a new dataset in the object store. With py_help(Rh5pyd$File), we find:
| create_dataset(self, name, shape=None, dtype=None, data=None, **kwds)
| Create a new HDF5 dataset
|
| name
| Name of the dataset (absolute or relative). Provide None to make
| an anonymous dataset.
| shape
| Dataset shape. Use "()" for scalar datasets. Required if "data"
| isn't provided.
| dtype
| Numpy dtype or string. If omitted, dtype('f') will be used.
| Required if "data" isn't provided; otherwise, overrides data
| array's dtype.
| data
| Provide data to initialize the dataset. If used, you can omit
| shape and dtype arguments.
|
| Keyword-only arguments:
|
| chunks
| (Tuple) Chunk shape, or True to enable auto-chunking.
| maxshape
| (Tuple) Make the dataset resizable up to this shape. Use None for
| axes you want to be unlimited.
and we make use of the create_dataset method. (The following code is unevaluated and shown for illustration only, as it was already run and created the persistent content.)
if (.Platform$OS.type != "windows") {
  # open a writable domain in the object store and populate it from irmat
  nf = Rh5pyd$File(endpoint=URL_hsds(), mode="w",
    domain="/home/stvjc/iris_demo.h5")
  nf$create_dataset('irisH5', data=r_to_py(irmat))
}
We can read back with:
if (.Platform$OS.type != "windows") {
  # read back from the same domain that was just created
  getslice(URL_hsds(), mode="r",
    domain="/home/stvjc/iris_demo.h5", "irisH5", "[0:3, 0:3]")
}
We can run create_group as well. See the methods available on a File instance:
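A sketch of the enumeration, using a File reference such as assays from earlier:
names(assays)   # methods and fields of an h5pyd File instance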
## [1] "DELETE" "GET" "POST" "PUT"
## [5] "allocated_bytes" "attrs" "clear" "close"
## [9] "copy" "create_dataset" "create_group" "created"
## [13] "driver" "fid" "file" "filename"
## [17] "flush" "get" "getACL" "getACLs"
## [21] "get_link_json" "id" "items" "iteritems"
## [25] "iterkeys" "itervalues" "keys" "libver"
## [29] "mode" "modified" "move" "name"
## [33] "num_chunks" "num_datasets" "num_datatypes" "num_groups"
## [37] "owner" "parent" "pop" "popitem"
## [41] "putACL" "ref" "regionref" "require_dataset"
## [45] "require_group" "setdefault" "update" "userblock_size"
## [49] "values" "verifyCert" "visit" "visititems"
As of March 2018, we can use HDF Server with R in several ways. With support from an NCI grant, we maintain a server in AWS EC2 that employs the RESTful API defined for the HDF Server.
The server defines a hierarchical structure for all server contents. There are groups, linksets, and datasets.
We use the double-bracket operator to derive a reference to an HDF5 dataset from an H5S_source instance. We installed an image of the 10x Genomics 1.3 million neuron dataset, which we can refer to as:
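The corresponding chunk is not shown here; a hypothetical sketch using the H5S_source API (the endpoint binding and the tag "tenx_full" are assumptions based on the domains listed earlier):
big = H5S_source(serverURL)   # serverURL: the HDF Server endpoint (assumption)
tex = big[["tenx_full"]]      # reference to the 10x neuron dataset
tex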
This is sufficient to do arithmetic using familiar R programming steps. Note that the data image here has ‘neurons’ as ‘rows’.
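For example, a hypothetical continuation of the tex reference above (the column bound 27998 echoes the genes dimension seen in the earlier getslice call):
apply(tex[1:4, 1:27998], 1, sum)   # per-neuron sums over genes, first four rows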