The RSSL package contains a selection of methods for semi-supervised classification. This vignette serves as an overview of the available classifiers and illustrates their behaviour on two simple 2D datasets. In both datasets, labels are missing completely at random: the probability that an object is missing its label does not depend on the value of its label or on its features.
In the first dataset, there are two clearly separated Gaussian clusters, each belonging to a separate class. The assumptions made by various semi-supervised learning approaches hold for this dataset: the clusters correspond to the classes, the classes are separated by a boundary through a region of low data density, and the labels vary smoothly over the manifold on which the data lies.
In the second dataset, we consider a single Gaussian cloud sliced through the middle by a region of low density. The true decision boundary does not coincide with this low-density region; rather, it is perpendicular to it. This example violates many semi-supervised learning assumptions: the decision boundary is not in a region of low data density, the classes do not correspond to clusters, and the labels do not vary smoothly over the data manifold.
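The two problems used in the code below are not constructed in this section. A minimal sketch of how they could be generated with RSSL's data generation functions is given here; the sample sizes, the fraction of unlabeled objects, the choice of generateSlicedCookie for the second dataset and the base plots p1 and p2 (of which only p2 appears in the original code below) are illustrative assumptions.
library(RSSL)
library(ggplot2)
library(dplyr)

set.seed(1)

# Dataset 1: two well-separated Gaussian classes
problem1 <-
  generate2ClassGaussian(n = 1000, d = 2, expected = TRUE) %>%
  SSLDataFrameToMatrices(Class ~ ., .) %>%
  { split_dataset_ssl(.$X, .$y, frac_ssl = 0.98) }

# Dataset 2: a single Gaussian cloud sliced by a low-density gap. The
# `expected` argument controls the orientation of the true class boundary
# relative to the gap; flip it if the generated boundary runs along the gap.
problem2 <-
  generateSlicedCookie(1000, expected = TRUE) %>%
  SSLDataFrameToMatrices(Class ~ ., .) %>%
  { split_dataset_ssl(.$X, .$y, frac_ssl = 0.98) }

# Base plots showing the labeled (colored) and unlabeled (grey) objects
plot_problem <- function(problem) {
  ggplot(data.frame(problem$X, Class = problem$y), aes(x = X1, y = X2)) +
    geom_point(data = data.frame(problem$X_u), color = "grey") +
    geom_point(aes(color = Class, shape = Class), size = 3) +
    coord_equal()
}
p1 <- plot_problem(problem1)
p2 <- plot_problem(problem2)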
Self-learning refers to iteratively using the predictions of a learner as additional training data for that same learner. Here, for a given classifier, we initially train the classifier on only the labeled examples. We then treat the predictions of this classifier on the unlabeled objects as if they were the true labels and retrain the classifier with this additional data. This is repeated until the labels predicted by the classifier no longer change.
g_ls <- LeastSquaresClassifier(problem1$X,problem1$y)
g_self <- SelfLearning(problem1$X,problem1$y,problem1$X_u,LeastSquaresClassifier)
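The resulting decision boundaries can be compared visually with geom_classifier (here using the base plot p1 from the data generation sketch above, which shows the labeled and unlabeled objects of the first dataset):
p1 + geom_classifier("Supervised" = g_ls, "Self-Learning" = g_self)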
g_ls <- LeastSquaresClassifier(problem2$X,problem2$y)
g_self <- SelfLearning(problem2$X,problem2$y,problem2$X_u,LeastSquaresClassifier)
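And for the second dataset, using the base plot p2 from the same sketch:
p2 + geom_classifier("Supervised" = g_ls, "Self-Learning" = g_self)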
Note that self-learning can be a very effective strategy, but it may also degrade performance if the initial labeling of the unlabeled objects is far from the true labeling. One solution to this problem is to only label objects for which the classifier is very confident. This is currently not implemented in this package.
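A minimal sketch of such a confidence-thresholded variant is shown below. It is not part of RSSL: the function name confident_self_learning, the single-pass structure and the 0.99 threshold are illustrative choices, and it assumes the supplied method returns a classifier with a posterior method giving class probabilities (as the discriminant classifiers used later in this vignette do).
# Illustrative sketch: self-learning that only imputes labels for unlabeled
# objects the classifier is confident about. Single pass only; a full
# implementation would iterate until no confident objects remain.
confident_self_learning <- function(X, y, X_u, method, threshold = 0.99, ...) {
  g <- method(X, y, ...)
  post <- posterior(g, X_u)                     # class posteriors for unlabeled objects
  confident <- apply(post, 1, max) > threshold  # keep only confident predictions
  if (any(confident)) {
    X_conf <- X_u[confident, , drop = FALSE]
    y_conf <- predict(g, X_conf)                # imputed labels for the confident objects
    y_all <- factor(c(as.character(y), as.character(y_conf)), levels = levels(y))
    g <- method(rbind(X, X_conf), y_all, ...)
  }
  g
}
g_conf <- confident_self_learning(problem1$X, problem1$y, problem1$X_u,
                                  method = LinearDiscriminantClassifier)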
For likelihood-based models in particular, an obvious approach to semi-supervised learning is to treat the missing labels as additional parameters and optimize over them using maximum likelihood. For LDA, as is often the case in analyses with missing data, this leads to a non-convex likelihood function. A standard strategy is to approximate this maximization using Expectation Maximization (EM). For EMLDA, the expectation step corresponds to updating the responsibilities of the unlabeled objects, that is, their estimated probabilities of belonging to each class. The maximization step then updates the parameters of the model (the class means, the covariance matrix and the class priors) based on these soft labels of the unlabeled objects. These two steps are iterated until convergence.
g_lda <- LinearDiscriminantClassifier(problem1$X,problem1$y)
g_emlda <- EMLinearDiscriminantClassifier(problem1$X,problem1$y,problem1$X_u)
g_sllda <- SelfLearning(problem1$X,problem1$y,problem1$X_u,LinearDiscriminantClassifier)
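The three solutions on the first dataset can again be compared visually (using the illustrative base plot p1 from the sketch above):
p1 + geom_classifier("Supervised LDA" = g_lda, "EM LDA" = g_emlda, "Self-Learning LDA" = g_sllda)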
g_lda <- LinearDiscriminantClassifier(problem2$X,problem2$y)
g_emlda <- EMLinearDiscriminantClassifier(problem2$X,problem2$y,problem2$X_u)
g_sllda <- SelfLearning(problem2$X,problem2$y,problem2$X_u,LinearDiscriminantClassifier)
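And the corresponding fits on the second dataset:
p2 + geom_classifier("Supervised LDA" = g_lda, "EM LDA" = g_emlda, "Self-Learning LDA" = g_sllda)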
Note the similarity between Expectation Maximization and self-learning. In fact, Expectation Maximization can be considered a soft version of self-learning, in which objects are assigned responsibilities: probabilities of belonging to one class or the other. In self-learning, objects are assigned to a single class, so only responsibilities of \(0\) and \(1\) are allowed. As is clear from the example above, when the classifier assigns large posterior probabilities (as in the first dataset), the two procedures give almost equivalent results.
One potentially useful assumption for using unlabeled data when estimating a classifier is that the decision boundary we are looking for is likely to run through a region of low data density. Several procedures make use of this assumption, most notably the Transductive SVM and entropy regularization.
The Transductive SVM attempts to find a labeling of the unlabeled objects such that the margin between the two classes is maximized. If a point lies outside the margin, assignment is easy: it can be assigned to the class whose side of the margin it resides on. For unlabeled points inside the margin, a cost is incurred: assignment to either class still does not lead to \(0\) error for this object. The Transductive SVM therefore attempts to move the decision boundary away from these objects, looking for a region of low density such that the margin around the decision boundary contains as few unlabeled (and labeled) objects as possible.
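Written out for the linear case, a common formulation of this objective is (a sketch; the exact parameterization used by the implementation below may differ slightly) \[ \min_{w, b, \hat{y}_1, \ldots, \hat{y}_U} \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{L} \max\left(0, 1 - y_i (w^{\top} x_i + b)\right) + C^{*} \sum_{j=1}^{U} \max\left(0, 1 - \hat{y}_j (w^{\top} x_j + b)\right), \] where the \(\hat{y}_j \in \{-1,+1\}\) are the imputed labels of the unlabeled objects and \(C\) and \(C^{*}\) (the C and Cstar parameters in the code below) control how strongly margin violations by labeled and unlabeled objects are penalized.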
g_svm <- LinearSVM(problem2$X,problem2$y,scale=TRUE)
g_tsvm <- TSVMcccp_lin(problem2$X,problem2$y,problem2$X_u,C=1,Cstar=100)
g_tsvm2 <- TSVMcccp(problem2$X,problem2$y,problem2$X_u,C=1,Cstar=1)
p2 + geom_classifier("SVM"=g_svm,"TSVM"=g_tsvm)
g_ls <- LeastSquaresClassifier(problem2$X,problem2$y,scale=TRUE,y_scale=TRUE)
Xt <- as.matrix(expand.grid(X1=seq(-4,4,length.out = 100),X2=seq(-4,4,length.out = 100)))
data.frame(Xt,pred=decisionvalues(g_ls,Xt)) %>%
ggplot(aes(x=X1,y=X2,z=pred)) + geom_contour(size=5,breaks=c(0.5)) +
geom_classifier(g_ls)
Xt <- as.matrix(expand.grid(X1=seq(-4,4,length.out = 100),X2=seq(-4,4,length.out = 100)))
data.frame(Xt,pred=decisionvalues(g_svm,Xt)) %>%
ggplot(aes(x=X1,y=X2,z=pred)) + geom_contour(size=5,breaks=c(0)) +
geom_classifier(g_svm)
#g_s4vm <- S4VM(problem1$X,problem1$y,problem1$X_u) # Returns NULL
g_lr <- LogisticLossClassifier(problem2$X,problem2$y)
g_erlr <- ERLogisticLossClassifier(problem2$X,problem2$y,problem2$X_u,lambda_entropy = 0,init = rnorm(3))
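Entropy regularization penalizes predictions on the unlabeled objects that are uncertain (that is, have high entropy), which also pushes the decision boundary away from dense regions of unlabeled data. The resulting boundary can be compared to the supervised logistic loss solution in the same way as before:
p2 + geom_classifier("Logistic Loss" = g_lr, "Entropy Regularized Logistic Loss" = g_erlr)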
The Gaussian Random Field classifier assumes a similarity measure between objects is given. The idea is to minimize the squared difference between the labels assigned to nearby points, which can be achieved by minimizing the following energy function: \[ E(f) = \frac{1}{2} \sum_{i,j} w_{i,j} (f(i)-f(j))^2 \] If we fix the \(f(i)\)’s of the labeled objects to their given values and minimize this energy function, one can show that the minimizing function is harmonic. A harmonic function is a twice differentiable function whose Laplacian is zero: \(\Delta f = 0\). This solution can be found directly using \[f_u=(D_{uu}-W_{uu})^{-1} W_{ul} f_l = (I-P_{uu})^{-1} P_{ul} f_l,\] where the subscripts \(l\) and \(u\) refer to the labeled and unlabeled objects, \(W\) is the similarity (weight) matrix, \(D\) is the diagonal degree matrix and \(P=D^{-1}W\). An example of the imputed labels for a given problem is
set.seed(42)
generateTwoCircles(200, 0.05) %>%
  ggplot(aes(X1, X2, color = Class)) +
  geom_point() +
  coord_equal()
problem_circle <-
  generateTwoCircles(200, 0.05) %>%
  SSLDataFrameToMatrices(Class ~ ., .)
problem_circle <- split_dataset_ssl(problem_circle$X, problem_circle$y, frac_ssl = 0.98)
g_grf <- GRFClassifier(problem_circle$X,problem_circle$y,problem_circle$X_u,sigma=0.05)
data.frame(problem_circle$X,Class=problem_circle$y) %>%
ggplot(aes(x=X1,y=X2,color=Class)) +
geom_responsibilities(g_grf,problem_circle$X_u) +
geom_point(aes(shape=Class,color=Class),size=6) +
coord_equal()
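To make the harmonic solution above concrete, here is a minimal numerical sketch (not part of RSSL) that evaluates \(f_u=(D_{uu}-W_{uu})^{-1} W_{ul} f_l\) directly for a hand-constructed similarity matrix over two labeled and two unlabeled objects:
# Toy similarity matrix: objects 1 and 2 are labeled (with labels 0 and 1),
# objects 3 and 4 are unlabeled. Object 3 is similar to object 1, object 4
# to object 2, and the two unlabeled objects are similar to each other.
W <- matrix(c(0, 0, 1, 0,
              0, 0, 0, 1,
              1, 0, 0, 1,
              0, 1, 1, 0), 4, 4, byrow = TRUE)
f_l <- c(0, 1)                  # given labels of the labeled objects
D <- diag(rowSums(W))           # degree matrix
D_uu <- D[3:4, 3:4]; W_uu <- W[3:4, 3:4]; W_ul <- W[3:4, 1:2]
f_u <- as.vector(solve(D_uu - W_uu, W_ul %*% f_l))
f_u  # approximately 0.33 and 0.67: object 3 leans towards label 0, object 4 towards 1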
Let \(w\) be the weight vector of a linear model. In ICLS, we minimize, over the weight vectors that can be attained by some labeling of the unlabeled objects, the distance to the supervised weight vector \(w_{sup}\).
This distance can be calculated in multiple ways. The option “supervised” corresponds to \[ d(w,w_{sup}) = (w-w_{sup})^{\top} X^{\top} X (w-w_{sup}). \] Conceptually, this is the same as minimizing the squared loss on the labeled objects over the space of \(w\)’s that can be attained by a possible labeling of the unlabeled objects.
The option “semisupervised” corresponds to \[ d(w,w_{sup}) = (w-w_{sup})^{\top} X_e^{\top} X_e (w-w_{sup}), \] where \(X_e\) is the design matrix of the combined labeled and unlabeled data. This is a much more conservative estimator: conceptually, it corresponds to finding the \(w\) that gives the lowest squared loss on labeled and unlabeled data even for the worst possible labeling of the unlabeled objects.
The option “euclidean” corresponds to Euclidean projection and has the property that the Euclidean distance of \(w\) to \(w_{oracle}\) (the solution we would obtain if the labels of all objects were known) improves, although this might not translate into improved performance in terms of quadratic loss.
g_sup <- LeastSquaresClassifier(problem1$X,problem1$y)
g_proj_sup <- ICLeastSquaresClassifier(problem1$X,problem1$y,problem1$X_u,projection = "supervised")
g_proj_semi <- ICLeastSquaresClassifier(problem1$X,problem1$y,problem1$X_u,projection = "semisupervised")
g_proj_euc <- ICLeastSquaresClassifier(problem1$X,problem1$y,problem1$X_u,projection = "euclidean")
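The differences between the projections can be visualized as before (again using the illustrative base plot p1 from the sketch at the start of this vignette):
p1 + geom_classifier("Supervised" = g_sup,
                     "Projection (supervised)" = g_proj_sup,
                     "Projection (semisupervised)" = g_proj_semi,
                     "Projection (Euclidean)" = g_proj_euc)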
g_sup <- LinearDiscriminantClassifier(problem1$X, problem1$y)
g_semi <- ICLinearDiscriminantClassifier(problem1$X,problem1$y,problem1$X_u)
problem <-
  generate2ClassGaussian(n = 1000, d = 2, var = 0.3, expected = TRUE) %>%
  SSLDataFrameToMatrices(Class ~ ., .) %>%
  { split_dataset_ssl(.$X, .$y, frac_train = 0.5, frac_ssl = 0.98) }
sum(loss(LeastSquaresClassifier(problem$X,problem$y),problem$X_test,problem$y_test))
## [1] 18.37767
sum(loss(USMLeastSquaresClassifier(problem$X,problem$y,problem$X_u),problem$X_test,problem$y_test))
## [1] 60.37441
sum(loss(ICLeastSquaresClassifier(problem$X,problem$y,problem$X_u),problem$X_test,problem$y_test))
## [1] 18.47848
sum(loss(ICLeastSquaresClassifier(problem$X,problem$y,problem$X_u,projection="semisupervised"),problem$X_test,problem$y_test))
## [1] 18.33266
mean(predict(ICLeastSquaresClassifier(problem$X,problem$y,problem$X_u,projection="semisupervised"),problem$X_test)==problem$y_test)
## [1] 0.992
mean(predict(LeastSquaresClassifier(problem$X,problem$y),problem$X_test)==problem$y_test)
## [1] 0.992
sum(loss(SelfLearning(X=problem$X,y=problem$y,X_u=problem$X_u,method=LeastSquaresClassifier),problem$X_test,problem$y_test))
## [1] 17.75019
# Nearest Mean
sum(loss(NearestMeanClassifier(X=problem$X,y=problem$y),problem$X_test,problem$y_test))
## [1] 1397.985
sum(loss(ICNearestMeanClassifier(X=problem$X,y=problem$y,X_u=problem$X_u),problem$X_test,problem$y_test))
## [1] 1439.116
sum(loss(EMNearestMeanClassifier(X=problem$X,y=problem$y,X_u=problem$X_u),problem$X_test,problem$y_test))
## [1] 1215.692
sum(loss(SelfLearning(X=problem$X,y=problem$y,X_u=problem$X_u,method=NearestMeanClassifier),problem$X_test,problem$y_test))
## [1] 1215.709
# LDA
sum(loss(LinearDiscriminantClassifier(X=problem$X,y=problem$y),problem$X_test,problem$y_test))
## [1] 2083.22
sum(loss(ICLinearDiscriminantClassifier(X=problem$X,y=problem$y,X_u=problem$X_u),problem$X_test,problem$y_test))
## [1] 1216.024
sum(loss(EMLinearDiscriminantClassifier(X=problem$X,y=problem$y,X_u=problem$X_u),problem$X_test,problem$y_test))
## [1] 1215.831
sum(loss(SelfLearning(X=problem$X,y=problem$y,X_u=problem$X_u,method=LinearDiscriminantClassifier),problem$X_test,problem$y_test))
## [1] 1215.851
# Logistic Regression
#sum(loss(LogisticRegression(X=problem$X,y=problem$y),problem$X_test,problem$y_test))
sum(loss(LogisticLossClassifier(X=problem$X,y=problem$y,x_center=TRUE),problem$X_test,problem$y_test))
## Warning in Ops.factor(y, 1.5): '-' not meaningful for factors
## [1] NA
sum(loss(ERLogisticLossClassifier(X=problem$X,y=problem$y,X_u=problem$X_u),problem$X_test,problem$y_test))
## Warning in Ops.factor(y, 1.5): '-' not meaningful for factors
## [1] NA
mean(predict(LogisticLossClassifier(X=problem$X,y=problem$y),problem$X_test)==problem$y_test)
## [1] 0.494
mean(predict(ERLogisticLossClassifier(X=problem$X,y=problem$y,X_u=problem$X_u),problem$X_test)==problem$y_test)
## [1] 0.494
# SVM
sum(loss(LinearSVM(X=problem$X,y=problem$y),problem$X_test,problem$y_test))
## [1] 37.27419