Rmonad: an introduction

Zebulun Arendsee

Rmonad offers

Monadic pipelines

I will introduce Rmonad with a simple sequence of square roots:

# %>>% corresponds to Haskell's >>=
1:5      %>>%
    sqrt %>>%
    sqrt %>>%
    sqrt
## R> "1:5"
## R> "sqrt"
## R> "sqrt"
## R> "sqrt"
## 
##  ----------------- 
## 
## [1] 1.000000 1.090508 1.147203 1.189207 1.222845

So what exactly did Rmonad do with your data? It is still there, sitting happily inside the monad.

In magrittr you could do something similar:

1:5      %>%
    sqrt %>%
    sqrt %>%
    sqrt
## [1] 1.000000 1.090508 1.147203 1.189207 1.222845

%>% takes the value on the left and applies it to the function on the right. %>>% takes a monad on the left and a function on the right, then builds a new monad from them. This new monad holds the computed value, if the computation succeeded. It also collates all errors, warnings, and messages. These are stored in a step-by-step history of the pipeline.

%>% is an application operator; %>>% is a monadic bind operator. magrittr and Rmonad complement each other. %>% can be used inside a monadic sequence to perform operations on monads, whereas %>>% performs operations in them. If this is all too mystical, just hold on: the examples make sense even without an understanding of monads.
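To make the idea of a bind operator concrete, here is a minimal base R sketch of a value that carries its own history. It is not Rmonad's implementation: the names toy_unit and toy_bind and the list layout are hypothetical, chosen only for illustration.

```r
# Toy stand-in for a monadic value: a list with a value, a success flag,
# and a step-by-step history. This is NOT Rmonad's internal representation.
toy_unit <- function(x, code = "<input>") {
  list(value = x, ok = TRUE,
       history = list(list(code = code, warnings = character(0),
                           error = NA_character_)))
}

# Toy bind: apply f to the wrapped value, recording warnings and errors.
toy_bind <- function(m, f, code = deparse(substitute(f))) {
  step <- list(code = code, warnings = character(0), error = NA_character_)
  if (!m$ok) {
    return(m)  # a failed chain skips all downstream steps
  }
  value <- withCallingHandlers(
    tryCatch(f(m$value), error = function(e) {
      step$error <<- conditionMessage(e)  # remember which step failed
      NULL
    }),
    warning = function(w) {
      step$warnings <<- c(step$warnings, conditionMessage(w))
      invokeRestart("muffleWarning")      # keep computing after a warning
    }
  )
  ok <- is.na(step$error)
  list(value = if (ok) value else m$value,  # keep the last good value on failure
       ok = ok,
       history = c(m$history, list(step)))
}

m   <- toy_bind(toy_bind(toy_unit(-1:3, "-1:3"), sqrt), sqrt)
bad <- toy_bind(toy_bind(toy_unit("wrench"), sqrt), sqrt)
```

In this sketch, m finishes in a passing state with one warning recorded against the first sqrt, while bad stops at the first sqrt and keeps "wrench" as its last valid value.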

Below, we store an intermediate value in the monad:

1:5      %>>%
    sqrt %v>% # store this result
    sqrt %>>%
    sqrt
## R> "1:5"
## R> "sqrt"
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
## 
## R> "sqrt"
## R> "sqrt"
## 
##  ----------------- 
## 
## [1] 1.000000 1.090508 1.147203 1.189207 1.222845

The %v>% variant of the monadic bind operator stores the results as they are passed.

Following the example of magrittr, arbitrary anonymous functions of ‘.’ are supported:

1:5 %>>% { o <- . * 2 ; { o + . } %>% { . + o } }
## R> "1:5"
## R> "function (.) 
## {
##     o <- . * 2
##     {
##         o + .
##     } %>% {
##         . + o
##     }
## }"
## 
##  ----------------- 
## 
## [1]  5 10 15 20 25

Warnings are caught and stored:

-1:3     %>>%
    sqrt %v>%
    sqrt %>>%
    sqrt
## R> "-1:3"
## R> "sqrt"
##  * WARNING: NaNs produced
## [1]      NaN 0.000000 1.000000 1.414214 1.732051
## 
## R> "sqrt"
## R> "sqrt"
## 
##  ----------------- 
## 
## [1]      NaN 0.000000 1.000000 1.090508 1.147203

Similarly for errors:

"wrench" %>>%
    sqrt %v>%
    sqrt %>>%
    sqrt
## R> ""wrench""
## R> "sqrt"
##  * ERROR: non-numeric argument to mathematical function
## 
##  ----------------- 
## 
## [1] "wrench"
##  *** FAILURE ***

The first sqrt failed, and this step was coupled to the resultant error. Contrast this with magrittr, where the location of the error is lost:

"wrench" %>%
    sqrt %>%
    sqrt %>%
    sqrt
## Error in sqrt(.): non-numeric argument to mathematical function

Also note that a value was still produced. This value will never be used in the downstream monadic sequence (except when explicitly doing error handling). However, it and all other information in the monad can be easily accessed.

Extracting data from an Rmonad

If you want to extract the terminal result from the monad, you can use the esc function:

1:5 %>>% sqrt %>% esc
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068

esc is our first example of a class of functions that work on monads, rather than the values they wrap. We use magrittr’s application operator %>% here, rather than the monadic bind operator %>>%, because we are passing a literal monad to esc.

If the monad is in a failed state, esc will raise an error.

"wrench" %>>% sqrt %>>% sqrt %>% esc
## Error: in "sqrt":
##   non-numeric argument to mathematical function
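Because escaping a failed state raises an ordinary R error, calling code can guard the escape with tryCatch. The base R sketch below imitates that behaviour with a toy container; toy_state and toy_esc are hypothetical names and do not come from Rmonad.

```r
# Hypothetical container: a value, a success flag, and an error message (if any)
toy_state <- function(value, ok = TRUE, error = NA_character_) {
  list(value = value, ok = ok, error = error)
}

# Toy escape: return the wrapped value, or re-raise the stored error
toy_esc <- function(m) {
  if (!m$ok) stop(m$error, call. = FALSE)
  m$value
}

good <- toy_state(sqrt(1:5))
bad  <- toy_state("wrench", ok = FALSE,
                  error = "non-numeric argument to mathematical function")

escaped <- toy_esc(good)

# Guard the escape when a failure is possible, supplying a fallback value
recovered <- tryCatch(toy_esc(bad), error = function(e) NA_real_)
```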

If you prefer a tabular summary of your results, you can pipe the monad into the mtabulate function.

1:5      %>>%
    sqrt %v>%
    sqrt %>>%
    sqrt %>% mtabulate
##   id   OK cached  time space is_nested nbranch nnotes nwarnings error doc
## 1 35 TRUE  FALSE 0.000    72     FALSE       0      0         0     0   0
## 2 36 TRUE   TRUE 0.000    88     FALSE       0      0         0     0   0
## 3 37 TRUE  FALSE 0.000    88     FALSE       0      0         0     0   0
## 4 38 TRUE   TRUE 0.004    88     FALSE       0      0         0     0   0

Internal states can be accessed by converting the monad to a list of past states and simply indexing out the ones you want.

All errors, warnings, and notes can be extracted with the missues function:

-2:2 %>>% sqrt %>>% colSums %>% missues
##    id    type                                           issue
## 2  40 warning                                   NaNs produced
## 21 41   error 'x' must be an array of at least two dimensions

The id column matches the id column of the mtabulate output. Internal values can be extracted by converting the monad to a list and indexing:

result <- 1:5 %v>% sqrt %v>% sqrt %v>% sqrt
as.list(result)[[2]] %>% esc
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068

An Rmonad can be converted to a DiagrammeR graph (or igraph); in this way, standard tools for network analysis can be applied to pipelines. Rmonad currently has built-in support for plotting pipelines and making markdown reports (experimental).

as_dgr_graph(result)

Handling effects

The %>_% operator is useful when you want to include a function in a pipeline whose return value should be bypassed, but whose errors, warnings, and messages should still pass along the main chain.

You can cache an intermediate result

cars %>_% write.csv(file="cars.tab") %>>% summary

Or plot a value along with a summary

cars %>_% plot(xlab="index", ylab="value") %>>% summary %>% forget

## R> "summary"
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

I pipe the final monad into forget, which is (like esc) a function for operating on monads. forget removes history from a monad. I do this just to de-clutter the output.

You can call multiple effects

cars                                 %>_%
    plot(xlab="index", ylab="value") %>_%
    write.csv(file="cars.tab")       %>>%
    summary %>% forget
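The idea of running a step only for its effect can be sketched in base R as follows. The helper with_effect is a hypothetical name, not part of Rmonad, and unlike %>_% it records nothing: an error in the effect simply propagates as a normal R error.

```r
# Hypothetical helper: run `effect` on x for its side effects only, then
# return x unchanged so the next step still sees the original data.
with_effect <- function(x, effect) {
  effect(x)
  x
}

out_file <- tempfile(fileext = ".csv")

# Write the data to disk as a side effect, then summarize the same data
result <- summary(
  with_effect(cars, function(d) write.csv(d, file = out_file))
)
```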

Since state is passed, you can make assertions about the data inside a pipeline.

iris                                    %>_%
    { stopifnot(is.data.frame(.))     } %>_%
    { stopifnot(sapply(.,is.numeric)) } %>>%
    colSums %|>% head
## R> "iris"
## R> "function (.) 
## {
##     stopifnot(is.data.frame(.))
## }"
## R> "function (.) 
## {
##     stopifnot(sapply(., is.numeric))
## }"
##  * ERROR: sapply(., is.numeric) are not all TRUE
## R> "head"
## 
##  ----------------- 
## 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The above code will enter a failed state if the input is either not a data frame or the columns are not all numeric. The braced expressions are anonymous functions of ‘.’ (as in magrittr). The final expression %|>% catches an error and performs head on the last valid input (iris).

Error handling

Errors needn’t be viewed as abnormal. For example, we might want to try several alternative functions, and use the first that works.

1:10 %>>% colSums %|>% sum
## R> "1:10"
## R> "colSums"
##  * ERROR: 'x' must be an array of at least two dimensions
## R> "sum"
## 
##  ----------------- 
## 
## [1] 55

Here we will do either colSums or sum. The pipeline fails only if both fail.
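The same "first alternative that works" logic can be written in base R with tryCatch. This is only a sketch of the idea: first_success is a hypothetical helper, and it keeps no history of the failed attempts beyond the final error message.

```r
# Hypothetical helper: try each function in turn on x and return the first
# result that does not raise an error; fail only if every alternative fails.
first_success <- function(x, ...) {
  fs <- list(...)
  errors <- character(0)
  for (f in fs) {
    # Success is wrapped in a list so it can be told apart from an error message
    out <- tryCatch(list(value = f(x)),
                    error = function(e) conditionMessage(e))
    if (is.list(out)) return(out$value)
    errors <- c(errors, out)
  }
  stop("all alternatives failed: ", paste(errors, collapse = "; "),
       call. = FALSE)
}

total <- first_success(1:10, colSums, sum)
```

Here colSums fails on a plain vector, so the result comes from sum.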

Sometimes you want to ignore the previous failure completely, and make a new call – for example in reading files:

# try to load a cached file, on failure rerun the analysis
read.table("analysis_cache.tab") %||% run_analysis(x)

This can also be used to replace if-else if-else chains:

x <- list()
# compare
if(length(x) > 0) { x[[1]] } else { NULL }
## NULL
# to 
x[[1]] %||% NULL %>% esc
## NULL
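A fallback operator of this kind is easy to imitate in base R thanks to lazy evaluation. The sketch below uses a hypothetical operator name, %or_else%, to avoid confusion with Rmonad's %||%; it returns a plain value rather than a monad.

```r
# Hypothetical operator: evaluate lhs; if it raises an error, evaluate and
# return rhs instead. Both sides are lazy, so rhs only runs when needed.
`%or_else%` <- function(lhs, rhs) {
  tryCatch(lhs, error = function(e) rhs)
}

x <- list()
first_x <- x[[1]] %or_else% NULL   # indexing an empty list raises an error

y <- list("a")
first_y <- y[[1]] %or_else% NULL   # no error, so the lhs value is kept
```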

Or maybe you want to support multiple extensions for an input file

read.table("a.tab") %||% read.table("a.tsv") %>>% dostuff

Used together with %|>% we can build full error handling pipelines

letters[1:10] %v>% colSums %|>% sum %||% message("Can't process this")
## Can't process this
## R> "letters[1:10]"
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## 
## R> "colSums"
##  * ERROR: 'x' must be an array of at least two dimensions
## R> "sum"
##  * ERROR: invalid 'type' (character) of argument
## R> "message("Can't process this")"
## 
##  ----------------- 
## 
## NULL

Overall, in Rmonad, errors are well-behaved. It is reasonable to write functions that return an error rather than one of the myriad default values (NULL, NA, logical(0), list(), FALSE). This approach is unambiguous. Rmonad can catch the error and allow the programmer to deal with it accordingly.

Branching pipelines

If you want to perform an operation on a value inside the chain, but don’t want to pass it, you can use the branch operator %>^%.

rnorm(30) %>^% qplot(xlab="index", ylab="value") %>>% mean

This stores the result of qplot in a branch off the main pipeline. This means that qplot could fail, but the rest of the pipeline could continue. You can store multiple branches.

rnorm(30) %>^% qplot(xlab="index", ylab="value") %>^% summary %>>% mean

Branches can be used as input, as well.

x <- 1:10 %>^% dgamma(10, 1) %>^% dgamma(10, 5) %^>% cor
x
## R> "dgamma(10, 5)"
## Has 1 branches
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## R> "do.call(func, args, envir = env)"
##  [1] 1.013777e-06 1.909493e-04 2.700504e-03 1.323119e-02 3.626558e-02
##  [6] 6.883849e-02 1.014047e-01 1.240769e-01 1.317556e-01 1.251100e-01
## 
## R> "do.call(func, args, envir = env)"
##  [1] 1.813279e-01 6.255502e-01 1.620358e-01 1.454077e-02 7.299700e-04
##  [6] 2.537837e-05 6.847192e-07 1.534503e-08 2.984475e-10 5.190544e-12
## 
## R> "cor"
## 
##  ----------------- 
## 
## [1] -0.5838848
unbranch(x)
## [[1]]
## R> "dgamma(10, 5)"
## Has 1 branches
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## R> "do.call(func, args, envir = env)"
##  [1] 1.013777e-06 1.909493e-04 2.700504e-03 1.323119e-02 3.626558e-02
##  [6] 6.883849e-02 1.014047e-01 1.240769e-01 1.317556e-01 1.251100e-01
## 
## R> "do.call(func, args, envir = env)"
##  [1] 1.813279e-01 6.255502e-01 1.620358e-01 1.454077e-02 7.299700e-04
##  [6] 2.537837e-05 6.847192e-07 1.534503e-08 2.984475e-10 5.190544e-12
## 
## R> "cor"
## 
##  ----------------- 
## 
## [1] -0.5838848
## 
## [[2]]
## R> "do.call(func, args, envir = env)"
##  [1] 1.013777e-06 1.909493e-04 2.700504e-03 1.323119e-02 3.626558e-02
##  [6] 6.883849e-02 1.014047e-01 1.240769e-01 1.317556e-01 1.251100e-01
## 
## [[3]]
## R> "do.call(func, args, envir = env)"
##  [1] 1.813279e-01 6.255502e-01 1.620358e-01 1.454077e-02 7.299700e-04
##  [6] 2.537837e-05 6.847192e-07 1.534503e-08 2.984475e-10 5.190544e-12

Note the branches could be long monadic chains themselves, which might have their own branches. The unbranch function recursively extracts all branches from the tree.
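Recursive extraction from a tree of branches can be pictured with a small base R sketch. The node structure and the function collect_values are hypothetical and far simpler than Rmonad's objects; they only illustrate the depth-first walk.

```r
# Hypothetical tree node: a value plus a list of child branches
node <- function(value, branches = list()) {
  list(value = value, branches = branches)
}

# Depth-first walk returning the value of the node and of every nested branch
collect_values <- function(n) {
  c(list(n$value),
    unlist(lapply(n$branches, collect_values), recursive = FALSE))
}

tree <- node("main", list(
  node("plot"),
  node("summary", list(node("nested")))
))

all_values <- collect_values(tree)
```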

Chains of chains

If you want to connect many chains, all with independent inputs, you can do so with the %__% and %v__% operators.

runif(10) %>>% sum %v__%
rnorm(10) %>>% sum %v__%
rexp(10)  %>>% sum
## R> "runif(10)"
## R> "sum"
## [1] 4.046273
## 
## R> "rnorm(10)"
## R> "sum"
## [1] 2.153803
## 
## R> "rexp(10)"
## R> "sum"
## 
##  ----------------- 
## 
## [1] 4.498764

The %v__% operator records the output of the lhs and evaluates the rhs into an Rmonad. If you don’t care about the lhs output, you can use %__%, which simply replaces the lhs value with the rhs value (while, of course, propagating state).

The %__% operator is a little like a semicolon, in that it demarcates independent statements. Each statement, though, is wrapped into a graph of operations. This graph is itself data, and can be computed on. You could take any analysis and recompose it as %v__% delimited blocks. The result of running the analysis would be a data structure containing all results and errors.

program <-
{
    x = 2
    y = 5
    x * y
} %v__% {
    letters %>% sqrt
} %v__% {
    10 * x
}

You can link chunks of code, with their results, and performance information.
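As a rough base R analogue, independent blocks can be evaluated one at a time with each result, error, and timing collected into a list. The function run_blocks is hypothetical, and sharing one environment between blocks is an assumption of this sketch, not a statement about how Rmonad scopes variables.

```r
# Hypothetical runner: evaluate each quoted block, catching errors so that
# one failing block does not stop the others.
run_blocks <- function(blocks, env = new.env()) {
  lapply(blocks, function(b) {
    start <- proc.time()[["elapsed"]]
    out <- tryCatch(
      list(value = eval(b, env), error = NA_character_),
      error = function(e) list(value = NULL, error = conditionMessage(e))
    )
    out$seconds <- proc.time()[["elapsed"]] - start  # crude timing per block
    out
  })
}

blocks <- list(
  quote({ x <- 2; y <- 5; x * y }),
  quote(sqrt(letters)),            # fails: non-numeric argument
  quote(10 * x)                    # sees x because the blocks share env
)

results <- run_blocks(blocks)
```

The returned list is plain data, so it can be filtered for failures or tabulated like any other R object.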

Multiple inputs

So far our pipelines have been limited to either linear paths or the somewhat awkward branch merging. An easier approach is to read inputs from a list. But we want to be able to catch errors resulting from evaluation of each member of the list. We can do this with funnel.

funnel(
    "yolo",
    stop("stop, drop, and die"),
    runif("simon"),
    k = 2
)
## R> "2"
## [1] 2
## 
## R> "runif("simon")"
##  * ERROR: invalid arguments
##  * WARNING: NAs introduced by coercion
## R> "stop("stop, drop, and die")"
##  * ERROR: stop, drop, and die
## R> "yolo"
## [1] "yolo"
## 
## R> "funnel("yolo", stop("stop, drop, and die"), runif("simon"), k = 2)"
## 
##  ----------------- 
## 
## [[1]]
## [1] "yolo"
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## $k
## [1] 2
## 
##  *** FAILURE ***

This returns a monad which fails if any of the components evaluates to an error. But it does not toss the rest of the inputs, instead returning a clean list with a NULL filling in missing pieces. Contrast this with normal list evaluation:

list( "yolo", stop("stop, drop, and die"), runif("simon"), 2)
## Error in eval(expr, envir, enclos): stop, drop, and die

funnel records the failure of each element of the list independently.
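Evaluating list elements independently relies on R's lazy arguments: the unevaluated expressions can be captured and each one run inside its own tryCatch. The base R sketch below shows the technique; safe_list is a hypothetical name, it ignores warnings, and it is not how funnel is implemented.

```r
# Hypothetical helper: evaluate each argument on its own, replacing failures
# with NULL and recording the error message for that position.
safe_list <- function(...) {
  exprs  <- as.list(substitute(list(...)))[-1]  # unevaluated arguments
  env    <- parent.frame()
  values <- vector("list", length(exprs))
  names(values) <- names(exprs)
  errors <- rep(NA_character_, length(exprs))
  for (i in seq_along(exprs)) {
    out <- tryCatch(
      list(value = eval(exprs[[i]], env), error = NA_character_),
      error = function(e) list(value = NULL, error = conditionMessage(e))
    )
    values[i] <- list(out$value)  # `[<-` with list() keeps a NULL slot
    errors[i] <- out$error
  }
  list(values = values, errors = errors, ok = all(is.na(errors)))
}

res <- safe_list("yolo", stop("boom"), sqrt("a"), k = 2)
```

Here res$ok is FALSE, yet the first and last elements are still available in res$values.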

This approach can also be used with the infix operator %*>%.

funnel(read.csv("a.csv"), read.csv("b.csv")) %*>% merge

Now, of course, we can add monads to the mix

funnel(
    a = read.csv("a.csv") %>>% do_analysis_a,
    b = read.csv("b.csv") %>>% do_analysis_b,
    k = 5
) %*>% joint_analysis

Monadic list evaluation is the natural way to build large programs from smaller pieces.

Annotating steps

As our pipelines become more complex, it becomes essential to document them. We can do that as follows:

runif(5) %>>% abs %>% doc(

    "Alternatively, the documentation could go into a text block below the code
    in a knitr document. The advantage of having documentation here, is that it
    is coupled unambiguously to the generating function. These annotations,
    together with the ability to chain chains of monads, allows whole complex
    workflows to be built, with the results collated into a single object. All
    errors propagate exactly as errors should, only affecting downstream
    computations. The final object can be converted into a markdown document
    and automatically generated function graphs."

                  ) %>^% sum %__%
rnorm(6)   %>>% abs %>^% sum %v__%
rnorm("a") %>>% abs %>^% sum %__%
rexp(6)    %>>% abs %>^% sum %T>%
  {print(mtabulate(.)) } %>% missues
##    id    OK cached  time space is_nested nbranch nnotes nwarnings error
## 1  77  TRUE  FALSE 0.000    88     FALSE       0      0         0     0
## 2  78  TRUE   TRUE 0.004    88     FALSE       1      0         0     0
## 3  79  TRUE   TRUE 0.004    48     FALSE       0      0         0     0
## 4  80  TRUE  FALSE 0.000    88     FALSE       0      0         0     0
## 5  81  TRUE   TRUE 0.004    88     FALSE       1      0         0     0
## 6  82  TRUE   TRUE 0.000    48     FALSE       0      0         0     0
## 7  83 FALSE  FALSE 0.000     0     FALSE       0      0         1     1
## 8  84  TRUE  FALSE 0.000    88     FALSE       0      0         0     0
## 9  85  TRUE   TRUE 0.000    88     FALSE       1      0         0     0
## 10 86  TRUE   TRUE 0.000    48     FALSE       0      0         0     0
##    doc
## 1    0
## 2    0
## 3    0
## 4    0
## 5    0
## 6    0
## 7    0
## 8    0
## 9    0
## 10   0
##   id    type                      issue
## 1 83   error          invalid arguments
## 2 83 warning NAs introduced by coercion

Note that %T>% is a magrittr operator. It executes the rhs function on the lhs monad, discards the rhs result, and passes the lhs on unchanged.