Fast Gradient Algorithms for Structured Sparsity

Doctoral dissertation
University of Alberta
machine learning, structured sparsity, optimization

This thesis provides important new insights into scalable algorithms for solving composite minimization problems. Composite minimization is fundamental to current work in machine learning, where a smooth loss is combined with a structured non-smooth regularizer to achieve desired forms of "structured sparsity" in the solution; that is, such an approach produces solutions that have few non-zero components that follow desired patterns. In "big data" scenarios, gradient based methods have become the dominant approach due to their simplicity and efficiency; among these methods, the proximal gradient approach significantly improves the rate of convergence over basic methods when the loss satisfies smoothness properties. Unfortunately, proximal gradient methods require the computation of a so-called "proximal map" of the non-smooth regularizer, which is rarely available in closed form and constitutes the primary computational bottleneck in practice.

In his thesis, Yaoliang Yu develops a number of clever and insightful ways to overcome the proximal map bottleneck, particularly when the regularizer enforces "structured sparsity". His first main contribution concerns the case when the regularizer itself consists of a sum of multiple terms. Here he identifies conditions under which the proximal map reduces to the composition of the proximal map of each individual summand; a result that is not only deep and useful, Yaoliang provides an elegant proof that generalizes many known results while uncovering new ones. His second main contribution is to prove that the "proximal average" approximates the proximal map in these cases, providing the surprising result that this approximation method can lead to strictly better convergence rates than the usual smoothing strategy, with no additional overhead. Yaoliang's final contribution is to generalize the so-called "conditional gradient algorithm", also known as the Frank-Wolfe algorithm, for composite minimization problems. Although this approach might require more iterations than proximal gradient, Yaoliang has shown that each iteration can be significantly cheaper, which leads to significant speed ups in "big data" problems. In particular, Yaoliang provides a careful analysis of the method and demonstrates its superiority in a number of real world problems, including matrix completion, multi-class and multi-task learning, and dictionary learning.