Chapter 5

Learning Constrained Task Similarities in Graph-Regularized Multi-Task Learning

Rémi Flamary, Alain Rakotomamonjy, and Gilles Gasso

Rémi Flamary Laboratoire Lagrange, Observatoire de la Côte d’Azur, Université de Nice Sophia-Antipolis

Alain Rakotomamonjy LITIS, Université de Rouen

Gilles Gasso LITIS, INSA de Rouen

5.1 Introduction
5.2 Similarity-Based Multi-Task Learning
    5.2.1 Multi-Task Learning Framework
    5.2.2 Similarity-Based Regularization
    5.2.3 Solving the Multi-Task Learning Problem
5.3 Non-Convex Proximal Algorithm for Learning Similarities
    5.3.1 Bilevel Optimization Framework
    5.3.2 Gradient Computation
    5.3.3 Constraints on P and λt
    5.3.4 Computational Complexity
5.4 Numerical Experiments
    5.4.1 Toy Problems
    5.4.2 Real-World Datasets
        5.4.2.1 Experimental Set-Up
        5.4.2.2 School Dataset
        5.4.2.3 Brain Computer Interface Dataset
        5.4.2.4 OCR Dataset
5.5 Conclusion
Acknowledgments
Appendix
Bibliography


This chapter addresses the problem of learning constrained task relatedness in a graph-regularized multi-task learning framework. In this context, the weighted adjacency matrix of a graph encodes the knowledge of task similarities, and each entry of this matrix can be interpreted as a hyperparameter of the learning problem. This task relation matrix is learned via a bilevel optimization procedure: the outer level optimizes a proxy of the generalization error over all tasks with respect to the similarity matrix, while the inner level estimates the task parameters given this similarity matrix. Constraints on task similarities are also taken into account in this optimization framework; they make the task similarity matrix more interpretable, for instance by imposing sparsity on it. Since the global problem is non-convex, we propose a non-convex proximal algorithm that provably converges to a stationary point of the problem. Empirical evidence shows that the approach is competitive with existing methods that also learn task relations, while exhibiting enhanced interpretability of the learned task similarity matrix.
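To make the setting concrete before the formal development, the following is a minimal sketch of the *inner* problem of graph-regularized multi-task learning: given a fixed task similarity matrix P, each task's parameter vector is fit to its own data while a graph penalty of the form Σ_{s,t} P[s,t]·‖w_s − w_t‖² pulls the parameters of similar tasks together. All names (`graph_mtl_fit`, the squared loss, the gradient-descent solver) are illustrative assumptions, not the chapter's actual algorithm, which is introduced in Section 5.2.

```python
import numpy as np

def graph_mtl_fit(Xs, ys, P, lam=1.0, lr=0.05, n_iter=500):
    """Sketch of the inner graph-regularized MTL problem (hypothetical names).

    Minimizes over W = (w_1, ..., w_T):
        sum_t ||X_t w_t - y_t||^2 / n_t
        + lam * sum_{s,t} P[s,t] * ||w_s - w_t||^2
    by plain gradient descent, with squared loss standing in for a
    generic per-task loss.
    """
    T = len(Xs)            # number of tasks
    d = Xs[0].shape[1]     # shared feature dimension
    W = np.zeros((T, d))
    for _ in range(n_iter):
        G = np.zeros_like(W)
        for t in range(T):
            # gradient of the (averaged) squared loss for task t
            G[t] = 2.0 * Xs[t].T @ (Xs[t] @ W[t] - ys[t]) / len(ys[t])
            # gradient of the graph regularizer: each nonzero P[s,t]
            # pulls w_t toward w_s, with strength set by the similarity
            for s in range(T):
                G[t] += 2.0 * lam * (P[t, s] + P[s, t]) * (W[t] - W[s])
        W -= lr * G
    return W
```

In the bilevel view of the chapter, this inner solve is the lower level; the outer level would then adjust the entries of P (under interpretability constraints such as sparsity) to reduce a proxy of the generalization error across tasks.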