About recosystem Package

recosystem is an R wrapper of the LIBMF library developed by Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin and Chih-Jen Lin (http://www.csie.ntu.edu.tw/~cjlin/libmf/), an open source library for recommender system using marix factorization. (Lin et al. 2015)

A Quick View of Recommender System

The main task of recommender system is to predict unknown entries in the rating matrix based on observed values, as is shown in the table below:

item_1 item_2 item_3 item_n
user_1 2 3 ?? 5
user_2 ?? 4 3 ??
user_3 3 2 ?? 3
user_m 1 ?? 5 4

Each cell with number in it is the rating given by some user on a specific item, while those marked with question marks are unknown ratings that need to be predicted. In some other literatures, this problem may be given other names, e.g. collaborative filtering, matrix completion, matrix recovery, etc.

A popular technique to solve the recommender system problem is the matrix factorization method. The idea is to approximate the whole rating matrix \(R_{m\times n}\) by the product of two matrices of lower dimensions, \(P_{k\times m}\) and \(Q_{k\times n}\), such that

\[R\approx P'Q\]

Let \(p_u\) be the \(u\)-th column of \(P\), and \(q_v\) be the \(v\)-th column of \(Q\), then the rating given by user \(u\) on item \(v\) would be predicted as \(p'_u q_v\).

A typical solution for \(P\) and \(Q\) is given by the following optimization problem (Chin et al. 2015a; Chin et al. 2015b):

\[\min_{P,Q} \sum_{(u,v)\in R} \left((r_{u,v}-p'_u q_v)^2+\lambda_P ||p_u||^2+\lambda_Q ||q_v||^2\right)\]

where \((u,v)\) are locations of observed entries in \(R\), \(r_{u,v}\) is the observed rating, and \(\lambda_P,\lambda_Q\) are penalty parameters to avoid overfitting. Usually we take \(\lambda_P\) and \(\lambda_Q\) to be the same, i.e., equal to a common value \(\lambda\).

Highlights of LIBMF and recosystem

LIBMF itself is a parallelized library, meaning that users can take advantage of multicore CPUs to speed up the computation. It also utilizes some advanced CPU features to further improve the performance. (Lin et al. 2015)

recosystem is a wrapper of LIBMF, hence the features of LIBMF are all included in recosystem. Also, unlike most other R packages for statistical modeling which store the whole dataset and model object in memory, LIBMF (and hence recosystem) is much hard-disk-based, for instance the constructed model which contains information for prediction can be stored in the hard disk, and prediction result can also be directly written into a file rather than kept in memory. That is to say, recosystem will have a comparatively small memory usage.

Data Format

The data file for training set needs to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers

user_id item_id rating

Testing data file is similar to training data, but since the ratings in testing data are usually unknown, the rating entry in testing data file can be omitted, or can be replaced by any placeholder such as 0 or ?.

Be careful with the convention that user_id and item_id start from 0, so the training data file for the example in the beginning will look like

0 0 2
0 1 3
1 1 4
1 2 3
2 0 3
2 1 2
...

And testing data file is

0 2
1 0
2 2
...

Since ratings for testing data are unknown, here we simply omit the third entry. However if their values are really given, the testing data will serve as a validation set on which RMSE of prediction can be calculated.

Example data files are contained in the recosystem/dat (or recosystem/inst/dat, for source package) directory.

Usage of recosystem

The usage of recosystem is quite simple, mainly consisting of the following steps:

  1. Create a model object (a Reference Class object in R) by calling Reco().
  2. (Optionally) call the $tune() method to select best tuning parameters along a set of candidate values.
  3. Train the model by calling the $train() method. A number of parameters can be set inside the function, possibly coming from the result of $tune().
  4. (Optionally) output the model, i.e. write the factorized \(P\) and \(Q\) matrices info files.
  5. Use the $predict() method to compute predictions and write results into a file.

Below is an example on some simulated data:

library(recosystem)
set.seed(123) # This is a randomized algorithm
trainset = system.file("dat", "smalltrain.txt", package = "recosystem")
testset = system.file("dat", "smalltest.txt", package = "recosystem")
r = Reco()
opts = r$tune(trainset, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
                                    nthread = 1, niter = 10))
opts
## $min
## $min$dim
## [1] 10
## 
## $min$cost
## [1] 0.1
## 
## $min$lrate
## [1] 0.1
## 
## 
## $res
##    dim cost lrate      rmse
## 1   10 0.01   0.1 1.0019376
## 2   20 0.01   0.1 1.0211937
## 3   30 0.01   0.1 0.9925157
## 4   10 0.10   0.1 0.9689426
## 5   20 0.10   0.1 0.9946385
## 6   30 0.10   0.1 0.9834870
## 7   10 0.01   0.2 1.1032429
## 8   20 0.01   0.2 1.0500873
## 9   30 0.01   0.2 1.0128726
## 10  10 0.10   0.2 1.0360415
## 11  20 0.10   0.2 1.0211815
## 12  30 0.10   0.2 1.0008837
r$train(trainset, opts = c(opts$min, nthread = 1, niter = 20))
## iter   tr_rmse          obj
##    0    2.1112   4.9318e+04
##    1    0.9777   1.5343e+04
##    2    0.8347   1.2890e+04
##    3    0.8049   1.2445e+04
##    4    0.7886   1.2224e+04
##    5    0.7750   1.2043e+04
##    6    0.7613   1.1877e+04
##    7    0.7469   1.1728e+04
##    8    0.7289   1.1522e+04
##    9    0.7114   1.1350e+04
##   10    0.6920   1.1159e+04
##   11    0.6731   1.0973e+04
##   12    0.6557   1.0819e+04
##   13    0.6394   1.0678e+04
##   14    0.6252   1.0571e+04
##   15    0.6111   1.0448e+04
##   16    0.6001   1.0364e+04
##   17    0.5893   1.0278e+04
##   18    0.5802   1.0212e+04
##   19    0.5717   1.0148e+04
## real tr_rmse = 0.5356
outfile = tempfile()
r$predict(testset, outfile)
## prediction output generated at /tmp/RtmpSmreiy/fileb9b3a0e8951
## Compare the first few true values of testing data
## with predicted ones
# True values
print(read.table(testset, header = FALSE, sep = " ", nrows = 10)$V3)
##  [1] 3 4 2 3 3 4 3 3 3 3
# Predicted values
print(scan(outfile, n = 10))
##  [1] 4.18981 2.98077 2.82748 3.69854 2.59907 2.94004 2.91219 3.03800
##  [9] 2.25076 3.48538

Detailed help document for each function is available in topics ?recosystem::Reco, ?recosystem::tune, ?recosystem::train, ?recosystem::output and ?recosystem::predict.

Installation Issue

LIBMF utilizes some compiler and CPU features that may be unavailable in some systems. To build recosystem from source, one needs a C++ compiler that supports C++11 standard.

Also, there are some flags in file src/Makevars (src/Makevars.win for Windows system) that may have influential effect on performance. It is strongly suggested to set proper flags according to your type of CPU before compiling the package, in order to achieve the best performance:

  1. The default Makevars provides generic options that should apply to most CPUs.
  2. If your CPU supports SSE3 (a list of supported CPUs), add

    PKG_CPPFLAGS += -DUSESSE
    PKG_CXXFLAGS += -msse3
  3. If not only SSE3 is supported but also AVX (a list of supported CPUs), add

    PKG_CPPFLAGS += -DUSEAVX
    PKG_CXXFLAGS += -mavx

After editing the Makevars file, run R CMD INSTALL recosystem on the package source directory to install recosystem.

References

Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015a. “A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems.” ACM TIST. http://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf.

———. 2015b. “A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization.” PAKDD. http://www.csie.ntu.edu.tw/~cjlin/papers/libmf/mf_adaptive_pakdd.pdf.

Lin, Chih-Jen, Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin. 2015. “LIBMF: A Matrix-Factorization Library for Recommender Systems.” http://www.csie.ntu.edu.tw/~cjlin/libmf/.