Blackwell approachability meets Online Convex Optimization

David Blackwell was still roaming the corridors of UC Berkeley’s stat department during my postdoc years, circa 2009. Jake Abernethy, Sasha Rakhlin, Peter Bartlett and myself were discussing his results, and his seminal contributions to prediction theory were already well known.

David Blackwell
James Hannan

At that time, still the early days of online convex optimization, we were contemplating everything from adaptive gradient methods to bandit convex optimization. Blackwell’s famed approachability theorem was looming in the background, considered to be one of the strongest theorems in the ML-theorists toolkit.

I’ve recently added a new draft chapter to to V2.0 of Introduction to Online Convex Optimization, about the connection between OCO and Blackwell approachability: they are algorithmically equivalent! More precisely, an efficient algorithm for approachability gives rise to an efficient algorithm for OCO with sublinear regret, and vice versa.

Details are in the chapter draft, and I recommend this music for reading it: (the song is called “together we won despite everything”, dedicated to music and science)


New Methods in Control: The Gradient Perturbation Controller

by Naman Agarwal, Karan Singh and Elad Hazan, based on this paper and this paper

The mainstream thought in classical control operates under the simplifying principle that nature evolves according to well-specified dynamics perturbed by i.i.d. Gaussian noise, a field called Optimal Control. While this principle has proven very successful in some real world systems, it does not lend itself to the construction of truly robust controllers. In the real world one can expect all sorts of deviations from this setting, including:

  • Inherent correlations between perturbations across time due to the environment.
  • Modeling misspecification, which can introduce unforeseeable, non-stochastic deviations from the assumed dynamics.
  • Completely adversarial perturbations introduced by a malicious entity.

Indeed, in the world of classical supervised learning, the presence of the above factors has led to the adoption of online learning in lieu of statistical learning, which allows for robust noise models and game-theoretic / strategic learning. A natural question arises: what is the analogue of online learning and worst-case regret in robust control?

Linear Dynamical Systems and Optimal Control

Dynamical systems constitute the formalism to describe time-dependent, dynamic real-world phenomenon, ranging from fluid flow through a pipe, the movement of a pendulum, to the change of weather. The archetypal example here is a linear dynamical system, which permits a particularly intuitive interpretation as a (configurable) vector field (from wikipedia):

A linear dynamical system (LDS) posits the state of a system evolves via the following dynamical relation:

x_{t+1} = Ax_t + Bu_t + w_t

(video by @robertghrist)

Here x_t represents the state of the system, u_t represents a control input and w_t is a perturbation introduced to the dynamics, for the time step t. The vector fields above illustrate the behavior of the system for a few different choices of A, when the control input u_t is set to zero. The control inputs, when correctly chosen, can modify the dynamics to induce a particular desired behavior. For example, controlling the thermostat in a data center to achieve a particular temperature, applying a force to a pendulum to keep it upright, or driving a drone to a destination.

The goal of the controller is to produce a sequence of control actions u_1 \ldots u_T aimed at minimizing the cumulative loss \sum_t c_t(x_t, u_t). In classical control the loss functions are convex quadratics and known ahead of time. However, in the quest for more robust controllers, henceforth we will consider a broader class of general (possibly non-quadratic) convex cost functions, which may be adversarially chosen, and only be revealed online per-time-step to the controller.

Previous notions: LQR and H-infinity control

Perhaps the most fundamental setting in control theory is a LDS is with quadratic costs c_t and i.i.d Gaussian perturbations w_t. The solution known as the Linear Quadratic Regulator, derived by solving the Riccati equation, is well understood and corresponds to a linear policy (i.e. the control input is a linear function of the state).

The assumption of i.i.d perturbations has been relaxed in classical control theory, with the introduction of a min-max notion, in a subfield known as H_{\infty} control. Informally, the idea behind H_{\infty} control is to design a controller which performs well against all sequences of bounded perturbations. There are two shortcomings of this approach; firstly computing such a controller can be a difficult optimization problem in itself. While a closed form for such a controller is known for linear dynamical systems with quadratic costs, no such form is known when costs can be non-quadratic. Secondly and perhaps more importantly, the min-max notion is non-adaptive and hence often too pessimistic. It is possible that during the execution of the system, nature produces a benign (predictable) sequence of perturbations (say, for instance, the perturbations are, in truth, highly correlated). The H_{\infty} controller in this setting could be overly pessimistic and lead to a higher cost than what could be achieved via a potentially more adaptive controller. This motivates our approach of minimizing regret, which we describe next.

A less pessimistic metric: Policy regret

Our starting point for more robust control is regret minimization in games. Regret minimization is a well-accepted metric in online learning, and we consider applying it to online control.

Formally, we measure the efficacy of a controller through the following notion of policy regret.

\mathrm{Regret} = \sum_{t} c_t(x_t, u_t) - \min_{\mathrm{stable } K} c_t(x_t(K), -Kx_t)

Here in, x_t(K) represents the state encountered when executing the linear policy $K$. In particular, the second term represents the total cost paid by the best(in hindsight) stable linear policy had it executed in the system facing the same sequence of perturbations as the controller. In this regard the above notion of regret is counterfactual and hence more challenging than the more common stateless notion of regret.
(Remark: a stable linear policy K is one which ensures that the spectral radius of  A-BK is strictly less than one. For a formal quantitative definition, please see the paper.)

Furthermore, the choice of comparing to a linear policy is made keeping tractability in mind. Note that this nevertheless generalizes the standard Linear Quadratic Regulator and H_{\infty} control on Linear Dynamical Systems, as the optimal choice of controllers in those settings is indeed a linear policy (if the cost functions are quadratic, our setting allows more general loss functions).

Algorithms which achieve low regret on the above objective are naturally adaptive as they perform as well as the best controller (upto a negligible additive regret) oblivious to whether the sequence is friendly or adversarial.

A new type of controller

In successive recent works ([1] and [2]), we show that the above notion of policy regret is tractable and can be sublinear (even logarithmic under certain conditions). Using a new parameterization for controller that we detail next, we obtain the following results:

  • Robustness to Adversarial Perturbations ([1]) – For arbitrary, but bounded, perturbations $w_t$, which may even be chosen adversarially, we provide an efficient algorithm that achieves regret of O(\sqrt{T}).
  • Fast Rates under Stochastic Perturbations ([2]) – Further, for strongly convex costs, and i.i.d. perturbations, we give an efficient algorithm with a poly-logarithmic (in T) regret guarantee.

The former result is the first such result to establish vanishing average regret in an online setting for control with adversarial perturbations. The latter result offers a significant improvement over the work of Cohen et al. who established a rate of O(\sqrt{T}) in a similar (stochastic) setting.

We assume that the dynamical system i.e. (A,B) is completely known to us and the state is fully observable. These assumptions have been eliminated in further work with our collaborators, and will be a subject of a future blog post. We now describe the core ideas which go into achieving both our results.

The Gradient Perturbation Controller

Consider how the choice of the linear controller determines the control input.

The above display demonstrates that the action, state, and hence the cost are non-convex functions of the linear controller K. The first piece of our result evades this lack of convexity by over-parametrization, ie. relaxing the notion of a linear policy as described next.

In place of learning a direct representation of a linear controller K, we learn a sequence of matrices M_1, M_2 \ldots which represent the dependence of u_t on w_{t - i} under the execution of some linear policy. Schematically, we parameterize our policy as follows

Note that, the assumption that the dynamics is known permits the exact inference of w_t’s and thereby making the execution of such a controller possible. While the choice of the controller ensures that control inputs are linear in \{M_i\}, the superposition principle for linear dynamical systems ensures that the state sequence too is linear in the same variables. Therefore, the cost of execution of such a controller parameterized via \{M_i\} is convex and we can hope to learn them using standard techniques like Gradient Descent. Observe that, by construction, it is possible to emulate the behavior any linear controller K by a suitable choice of \{M_i\}’s.

There are two barriers though. First, the number of parameters grows with time $T$ and so can regret and running time. Secondly, since there is a state, the decision a controller makes at a particular instance affects the future. We deal with these two problems formally in the paper and show how they may be circumvented. One of the important techniques we use is Online Convex Optimization with Memory.

With all the core components in place, we provide a brief specification of our algorithm. Let’s fix some memory length H, and a time step t. Define a proxy cost as

c_t(M_1, \ldots M_H) =

Cost Paid at Time t, had we execute the policy \{M_1, \ldots M_H\} for t-H steps, with the system reset to 0

The algorithm now as suggested by Anava, Hazan and Manor, is to take a gradient step on the above proxy.

The above algorithm achieves O(\sqrt{T}) regret against worst-case noise. For the full proof and the precise derivation of the algorithm, check out the paper.

Achieving Logarithmic Regret

Algorithms for (state-less) online learning often enjoy logarithmic regret guarantees in the presence of strongly convex loss functions. Prima facie, one could hope that for dynamical systems with strongly convex costs, such a stronger regret guarantee would hold. The cautionary note here is that strong convexity of the cost c_t(x_t, u_t) in the control input does not imply strong convexity in the controller representation.,

In fact, the strong convexity of our proxy cost function c(M_1, \ldots M_H) in its natural parameterization over M’s is further hindered by the fact that the space of M’s constitutes a highly over-parametrized representation (say, vs. K). Indeed in the case of adversarial perturbations, it is easy to construct a case where the strong convexity does not transfer over from the control to the controller.

In a future blog post we detail how this technical difficulty can be overcome using the natural gradient method, giving rise to a logarithmic regret algorithm for control.

Experimental Validation

Below we provide empirical evaluations of the GPC controller in various scenarios, demonstrating it is indeed more robust to correlated noise than previous notions. (All the plots below are produced by a new research-first control and time series library in the works — Alpha version scheduled to be released soon! STAY TUNED!!!)

To start, let’s reproduce the classical setting of controlling an LDS with i.i.d. Gaussian noise. Here, the LQR program is optimal. Yet, the plot below attests that our new control methodology offers a slightly suboptimal, yet comparable, performance. This observation also underlines the message the the proposed controller tracks the optimal controller without any foreknowledge of the structure of the perturbations. Here, for example, GPC is not made aware of the i.i.d. nature of the perturbations.

Moving on to correlated perturbations across time, a simple such setting is when the perturbations are drawn from a Gaussian Random Walk (as opposed to independent Gaussians). Note that here it is possible to derive the optimal controller via the separation principle. Here is the performance plot:

Note that the default LQR controller performs much worse, where as GPC is able to track the performance of the optimal controller quite closely. Next up, sinusoidal noise:

Here the full power of regret minimization is exhibited as the consecutive samples of the sine function are very correlated, yet predictable. GPC vastly outperforms LQR as one would expect. But, is this fundamentally a hard setting? What would happen if we use the H-infinity controller, or, god forbid, no control at all? We see this in the next plot (on the log-linear scale):

What’s next?

We have surveyed a new family of controls that are fundamentally different than classical optimal control: they are adaptive and attain regret guarantees vs. adversarial noise. Their preliminary experimental tests seem quite promising.

In the coming posts we will:

  1. Explain how GPC can obtain the first logarithmic regret in the tracking/Kalman setting,
  2. Reveal our new research-first library for control / time-series prediction,
  3. Tackle more general settings.

Stay tuned! Preferably using the GPC… 🙂


Boosting for Dynamical Systems

by Nataly Brukhim, Naman Agarwal and Elad Hazan, based on this paper 

In a famous 1906 competition at a local fair in Plymouth, England, participants were asked to guess the weight of an ox. Out of a crowd of hundreds, no one came close to the ox’s actual weight, but the average of all guesses was almost correct. How is it that combining the opinions of laymen can somehow arrive at highly reasoned decisions, despite the weak judgment of individual members? This concept of harnessing wisdom from weak rules of thumb to form a highly accurate prediction rule, is the basis of ensemble methods and boosting. Boosting is a theoretically sound methodology that has transformed machine learning across a variety of applications; in classification and regression tasks, online learning, and many more.

In the case of online learning, examples for training a predictor are not available in advance, but are revealed one at a time. Online boosting combines a set of online prediction rules, or weak learners. At every time step, each weak learner outputs a prediction, suffers some loss and is then updated accordingly. The performance of an online learner is measured using the regret criterion, which compares the accumulated loss over time with that of the best fixed decision in hindsight. A Boosting algorithm can choose which examples are fed to each of the weak learners, as well as the losses they incur. Intuitively, the online booster can encourage some weak learners to become really good in predicting certain common cases, while allowing others to focus on edge cases that are harder to predict. Overall, the online boosting framework can achieve low regret guarantees based on the learners’ individual regret values.

However, online learning can become more challenging when our actions have consequences on the environment. This can be illustrated with the following experiment: imagine learning to balance a long pole on your hand. When you move your hand slightly, the pole tilts. You then move your hand in the opposite direction, and it bounces back and tilts to the other side. One jerk the wrong way might have you struggling for a good few seconds to rebalance. In other words, a sequence of decisions you made earlier determines whether or not the pole is balanced at any given time, rather than the single decision you make at that point.

More generally, consider cases when our environment has a state, and is in some sense “remembering” our past choices. A stateful framework, able to model a wide range of such phenomena, is a dynamical system. A dynamical system can be thought of as a function that determines, given the current state, what the state of the system will be in the next time step. Think of the physical dynamics that determines our pole’s position based on sequential hand movements. Other intuitive examples are the fluctuations of stock prices in the stock market, or the local weather temperatures; these can all be modeled with dynamical systems.

So how can boosting help us make better predictions for a dynamical system? In recent work we propose an algorithm, which we refer to as DynaBoost, that achieves this goal. In the paper we provide theoretical regret bounds, as well as an empirical evaluation in a variety of applications, such as online control and time-series prediction.

Learning for Online Control

Control theory is a field of applied mathematics that deals with the control of various physical processes and engineering systems. The objective is to design an action rule, or controller, for a dynamical system such that steady state values are achieved quickly, and the system maintains stability.

Consider a simple Linear Dynamical System (LDS):

x_{t+1} = A x_t + B u_t + w_t

where x_t,u_t are the state and control values at time t, respectively. Assume a known transition dynamics specified by the matrices A and B, and an arbitrary disturbance to the system given by w_t. The goal of the controller is to minimize a convex cost function c_t(x_t,u_t).

A provably optimal controller for the Gaussian noise case (where w_t are normally distributed) and when the cost functions are quadratic, is the Linear Quadratic Regulator (LQR). LQR computes a pre-fixed matrix K such that u_t^{LQR} = K x_t. In other words, LQR computes a linear controller – which linearly maps the state into a control at every time step.

A recent advancement in online control considers arbitrary disturbances, as opposed to normally distributed noise. In this more general setting, there is no closed form for the optimal controller. Instead, it is proposed to use a weighted sum of previously observed noises, i.e., u_t^{WL} = K x_t + \sum_{i=1}^H M_i w_{t-i} , where M_1,...,M_H are learned parameters, updated in an online fashion. This method is shown to attain vanishing average regret compared to the best fixed linear controller in hindsight, and is applicable for general convex cost functions as opposed to only quadratics.

Crucially, the state-dependent term Kx_t is not learned. Since the learned parameters of the above controller therefore considers only a fixed number of recent disturbances, we can apply existing online convex optimization techniques developed for learning with loss functions that have bounded memory.

Boosting for Online Control

Using the insights described above to remove state in online control, we can now use techniques from online boosting. DynaBoost maintains multiple copies of the base-controller above, with each copy corresponding to one stage in boosting. At every time step, the control u_t is obtained from a convex combination of the base-controllers’ outputs u_t^{WL(1)},...,u_t^{WL(N)}. To update each base-controller’s parameters, DynaBoost feeds each controller with a residual proxy cost function, and seeks to obtain a minimizing point in the direction of the residual loss function’s gradient. Stability ensures that minimizing regret over the proxy costs (which have finite memory) suffices to minimize overall regret. See our paper for the detailed description of the algorithm and its regret guarantees.

Sanity check experiment

We first conducted experiments for the standard LQR setting with i.i.d. Gaussian noise and known dynamics. We applied our boosting method to the non-optimal controller with learned parameters (Control-WL), and we observe that boosting improves its loss and achieves near-optimal performance (here the optimal controller is given by the fixed LQR solution). We have tested an LDS of different dimensions d = 1,10,100, and averaged results over multiple runs.

Correlated noise experiment

When the disturbances are not independently drawn, the LQR controller is not guaranteed to perform optimally. We experimented with two LDS settings with correlated disturbances in which (a) the disturbances w_t are generated from a Gaussian random-walk, and (b) where they are generated by a sine function applied to the time index. In these cases, boosted controllers perform better compared to the “weak” learned controller, and can also outperform the fixed LQR solution. We have also tested a Recurrent Neural Network, and observed that boosting is effective for RNNs as well.

Inverted Pendulum experiment

A more challenging experiment with a non-linear dynamical system is the Inverted Pendulum experiment. This is very similar to the pole balancing example we discussed above, and is a standard benchmark for control methods. The goal is to balance the inverted pendulum by applying torque that will stabilize it in a vertically upright position, in the presence of noise. In our experiments, we used correlated disturbances from a Gaussian random-walk. We follow the dynamics implemented in OpenAI Gym, and test the performance of different controllers: LQR, a learned controller, and boosting. The video below visualizes this experiment:

When averaging the loss value over multiple experiment runs, we get the following plot:

It can be seen that the learned controller performs much better than the LQR in the presence of correlated noise, and that boosting can improve its stability and achieve lower average loss.

Boosting for Time-Series Prediction

Similarly to the control setting, in time-series prediction tasks it is sufficient to use fixed horizons, and online boosting can be efficiently applied here as well. In time-series prediction, the data is often assumed to be generated from an autoregressive moving average (ARMA) model:

x_{t} = \sum_{i=1}^k \alpha_i x_{t-i} + \sum_{j=1}^q \beta_j w_j + w_t

In words, each data point x_t is given by a weighted sum of previous points, previous noises and a new noise term w_t, where \alpha,\beta are the coefficients vectors.

To test our boosting method, We experimented with 4 simulated settings: 1) normally distributed noises, 2) coefficients of the dynamical system slowly change over time, 3) a single, abrupt, change of the coefficients, and, 4) correlated noise: Gaussian random walk.

The weak learners tested here are the ARMA-ONS (online newton step) and ARMA-OGD (online gradient descent) algorithms for time-series prediction (See this paper for more details). We applied our boosting method, as well as a fast version of it, which applies to quadratic loss functions (we used squared difference in this case).

1) Gaussian Noise 2) Changing Coefficients

3) Abrupt Change 4) Correlated Noise

We can see in the plots above that all weak learners’ loss values (red) can be improved by online boosting methods (blue). A similar observation arises when experimenting with real-world data; we experimented with the Air Quality dataset from the UCI Machine Learning repository, that contains hourly averaged measurements of air quality properties from an Italian city throughout one year, as measured by chemical sensors. We apply similar weak learners to this task, as well as our boosting algorithms. Here we again obtain better averaged losses for boosted methods (blue) compared to the baselines (red).

Lecture notes: optimization for ML

Spring semester is over, yay! To celebrate summer, I’ve compiled lecture notes from the graduate course COS 598D, a.k.a. “optimization for machine learning“.

The course is an aftermath of a few lectures and summer school tutorials given in various locations, in which  lectures goal of the course was to present the most useful methods and ideas in a rigorous-but-not-tedious way:

  • Suitable for a mathematically-prepared undergraduate student, and/or researcher beginning their endeavor into machine learning.
  • Focus on the important: we start with stochastic gradient descent for training deep neural networks.
  • Bring out the main ideas: i.e. projection-free methods, second-order optimization, the three main acceleration techniques, qualitatively discuss the advantages and applicability of each.
  • *short* proofs, that everyone can follow and extend in their future research. I prefer being off by a constant/log factor from the best known bound with a more insightful proof.
  • Not loose track of the goal: generalization/regret, rather than final accuracy. This means we talked about online learning, generalization and precision issues in ML.

The most recent version can be downloaded here:

This is still work in progress, please feel free to send me typos/corrections, as well as other topics you’d like to see (on my todos already: lower bounds, quasi-convexity, and the homotopy method).

Note: zero-order/bandit optimization is an obvious topic that’s not address. The reason is purely subjective – it appears as a chapter in this textbook (that also started as lecture notes!).


Reinforcement learning without rewards

by Abby van Soest and Elad Hazan, based on this paper 

When humans interact with the environment, we receive a series of signals that indicate the value of our actions. We know that chocolate tastes good, sunburns feel bad, and certain actions generate praise or disapproval from others. Generally speaking, we learn from these signals and adapt our behavior in order to get more positive “rewards” and fewer negative ones.

Reinforcement learning (RL) is a sub-field of machine learning that formally models this setting of learning through interaction in a reactive environment. In RL, we have an agent and an environment. The agent observes its position (or “state”) in the environment and takes actions that transition it to a new state. The environment looks at an agent’s state and hands out rewards based on a hidden set of criteria.

Source: Reinforcement Learning: An Introduction. Sutton & Barto, 2017.

Typically, the goal of RL is for the agent to learn behavior that maximizes the total reward it receives from the environment. This methodology has led to some notable successes: machines have learned how to play Atari games, how to beat human masters of Go, and how to write long-form responses to an essay prompt.

But what can we learn without an external reward signal?

It seems like a paradoxical question to ask, given that RL is all about rewards. But even though the reward paradigm is fundamentally flexible in many ways, it is also brittle and limits the agent’s ability to learn about its environment. This is due to several reasons. First, a reward signal directs the agents towards a single specific goal that may not generalize. Second, the reward signal may be sparse and uninformative, as we illustrate below.

Imagine that you want a robot to learn to navigate through the following maze.

Case 1: Sparse Rewards. The agent gets a reward of +1 when it exits the maze, and a reward of 0 everywhere else. The agent doesn’t learn anything until it stumbles upon the exit.

⇒ Clear reward signals are not always available.

Case 2: Misleading Rewards. The agent gets a reward of +1 at the entrance and a reward of +10 at the exit. The agent incorrectly learns to sit at the entrance because it hasn’t explored its environment sufficiently.

⇒ Rewards can prevent discovery of the full environment.

These issues are easy to overcome in the small maze on the left. But what about the maze on the right? As the size of the environment grows, it’ll get harder and harder to find the correct solution — the intractability of the problem scales exponentially.

pasted image 0.png

So what we find is that there is power in being able to learn effectively in the absence of rewards. This intuition is supported by a body of research that shows learning fails when rewards aren’t dense or are poorly shaped; and fixing these problems can require substantial engineering effort.

By enabling agents to discover the environment without the requirement of a reward signal, we create a more flexible and generalizable form of reinforcement learning. This framework can be considered a form of “unsupervised” RL. Rather than relying on explicit and inherently limited signals (or “labels”), we can deal with a broad, unlabelled pool of data. Learning from this pool facilitates a more general extraction of knowledge from the environment.

Our new approach: Maximum Entropy

In recent work, we propose finding a policy that maximizes entropy (which we refer to as a MaxEnt policy), or another related and concave function of the distribution. This objective is reward-independent and favors exploration.

In the video below, a two-dimensional cheetah robot learns to run backwards and forwards, move its legs fast and in all different directions, and even do flips. The cheetah doesn’t have access to any external rewards; it only uses signals from the MaxEnt policy.


Entropy is a function of the distribution over states. A high entropy distribution visits all states with near-equal frequency — it’s a uniform distribution. On the other hand, a low entropy distribution is biased toward visiting some states more frequently than others. (In the maze example, a low entropy distribution would result from the agent sitting at the entrance of the maze forever.)


So given that policy \pi creates a distribution d_\pi over states, the problem we are hoping to solve is:

\pi^* = \arg \max_{\pi} \text{entropy}(d_\pi)

When we know all the states, actions, and dynamics of a given environment, finding the policy with maximum entropy is a concave optimization problem. This type of problem can be easily and exactly solved by convex programming.

But we very rarely have all that knowledge available to use. In practice, one of several complications usually arise:

  1. The states are non-linearly approximated by a neural network or some other function approximator.
  2. The transition dynamics are unknown.
  3. The state space is intractably large. (As an interesting example, the game of Go has more than one googol or 10^{100} possible states. That’s more than the number of atoms in the universe, according to this blog from 2016, and definitely more than can fit on a computer.)

In such cases, the problem of finding a max-entropy policy becomes non-convex and computationally hard.

So what to do? If we look at many practical RL problems (Atari, OpenAI Gym), we see that there are many known, efficient solvers that can construct an optimal (or nearly-optimal) policy when they are given a reward signal.

We thus consider an oracle model: let’s assume that we have access to one of these solvers, so that we can pass it an explicit reward vector and receive an optimal policy in return. Can we now maximize entropy in a provably efficient way? In other words, is it possible to reduce this high complexity optimization problem to that of “standard” RL?

Our approach does exactly that! It is based on the Frank-Wolfe method. This is a gradient-based optimization algorithm that is particularly suited for oracle-based optimization. Instead of moving in the direction of the steepest decline of the objective function, the Frank-Wolfe method iteratively moves in the direction of the optimal point in the direction of the gradient. This is depicted below (and deserves a separate post…). The Frank-Wolfe method is a projection-free algorithm, see this exposition about its theoretical properties.

For the exact specification of the algorithm and its performance guarantee, see our paper.


To complement the theory, we also created some experiments to test the MaxEnt algorithm on simulated robotic locomotion tasks (open source code available here). We used test environments from OpenAI Gym and Mujoco and trained MaxEnt experts for various environments.

These are some results from the Humanoid experiment, where the agent is a human-like bipedal robot. The behavior of the MaxEnt agent (blue) is baselined against a random agent (orange), who explores by sampling randomly from the environment. This random approach is often used in practice for epsilon-greedy RL exploration.

In this figure, we see that over the course of 25 epochs, the MaxEnt agent progressively increases the total entropy over the state space.

Here, we see a visualization of the Humanoid’s coverage of the $xy$-plane, where the shown plane is of size 40-by-40. After one epoch, there is minimal coverage of the area. But by the 5th, 10th, and 15th epoch, we see that the agent has learned to visit all the different states in the plane, obtaining full and nearly uniform coverage of the grid!

Taking control by convex optimization

The picture to the right depicts Rudolf Kalman, a pioneer of modern control, receiving the presidential medal of freedom from president Barack

The setting which Kalman considered, and has been a cornerstone of modern control, is that of stateful time-series. More precisely, consider a sequence of inputs and outputs, or measurements, coming from some real-world machine
x_1,x_2,...,x_T,...   \ \ \Rightarrow  \ \ \ y_1,y_2, y_3 , ..., y_T , ...
Examples include:

  • y_t are measurements of ocean temperature, in a study of global warming. x_t are various environmental knowns, such as time of the year, water acidity, etc.
  • Language translation, x_t are words in Hebrew, and y_t is their English translation
  • x_t are physical controls of a robot, i.e. force in various directions, and y_t are the location to which it moves

Such systems are generally called dynamical systems, and one of the simplest and most useful mathematical models to capture a subset of them is called  linear dynamical systems (LDS), in which the output is generated according to the following equations:

h_{t+1}  = A h_t + B x_t + \eta_t
y_{t+1}   = C h_t + D x_t + \zeta_t

Here h_t is a hidden state, which can have very large or even infinite dimensionality, A,B,C,D are linear transformations and \eta_t,\zeta_t are noise vectors.

This setting is general enough to capture a variety of machine learning models previously considered in isolation, such as HMM (hidden Markov models), principle and independent component analysis, mixture of Gaussian clusters and many more, see this survey by Roweis and Ghahramani.

Alas – without any further assumptions the Kalman filtering model is non-convex and computationally hard, thus general polynomial time algorithms for efficient worst-case prediction were not known, till some exciting recent developments I’ll survey below.

If the linear transformations A,B,C,D are known, then it is possible to efficiently predict y_t given all previous inputs and outputs of the system using convex optimization. Indeed, this is what the Kalman filter famously achieves, and in an optimal way. Similarly, if there is no hidden state, the problem becomes trivially convex, and thus easily solvable (see recent blog posts and papers by the Recht group further analyzing this case).

However, if the system given by the transformations A,B,C,D is unknown, recovering them from observations is called “system identification”, a notoriously hard problem. Even for HMMs, which can be seen as a special case of LDS (see the RG survey), the identification problem is known to be NP-hard without additional assumptions. (Aside: HMMs are identifiable with further distributional assumption, see the breakthrough paper of Hsu, Kakade and Zhang, which started the recent tensor revolution in theoretical ML).

In the face of NP-hardness, state-of-the-art reduces to a set of heuristics. Amongst them are:

  • By far the most common is the use of EM, or expectation maximization, to alternately solve for the system and the hidden states. Since this is a heuristic for non-convex optimization, global optimality is not guaranteed, and indeed EM can settle on a local optimum, as shown here.
  • Gradient descent: this wonderful heuristic can be applied for the non-convex problem of finding both the system and hidden states simultaneously. In a striking recent result, Hardt and Ma showed that under some generative assumptions, gradient descent converges to the global minimum.

This is where we go back to the convex path: using the failsafe methodology of convex relaxation and regret minimization in games, in a recent set of results our group has found assumption-free, efficient algorithms for predicting in general LDS, as well as some other control extensions. At the heart of these results is a new method of spectral filtering. This method requires some mathematical exposition, we’ll defer it to a future longer post. For the rest of this post, we’ll describe the autoregressive (a.k.a. regression) methodology for LDS, which similar to our result, is a worst-case and convex-relaxation based method.

To see how this works, consider the LDS equations above, and notice that for zero-mean noise, the expected output at time t+1 is given by: (we take D=0 for simplicity)
E[y_{t+1} ] =  \sum_{i=1}^t  C A^i B x_{t-i}
This has few parameters to learn, namely only three matrices A,B,C, but has a very visible non-convex component of raising A to high powers.

The obvious convex relaxation takes M_i = C A^i B as a set of matrices for i=1,...,t. We now have a set of linear equalities of the form:

$latex   y_{t+1}  = \sum_{i=1}^t M_i x_{t-1}  $

which can be solved by your favorite convex linear-regression optimizer. The fallacy is also clear: we have parameters that scale as the number of observations, and thus cannot hope to generalize.

Luckily, many systems are well behaved in the following way. First, notice that the very definition of an LDS requires the spectral norm of A to be bounded by one, otherwise we have signal explosion – it’s magnitude grows indefinitely with time. It is thus not too unreasonable to assume a strict bound away from one, say |A| \leq 1- \delta, for some small constant \delta >0. This allows us to bound the magnitude of the i’th matrix M_i, that is supposed to represent C A^i B, by (1-\delta)^i.

With this assumption, we can then take only k = O(\frac{1}{\delta} \log \frac{1}{\epsilon}) matrices, and write
$latex  y_{t+1} = \sum_{i=1}^k M_i x_{t-1} + O(\epsilon) $
This folklore method, called an autoregressive model or simply learning LDS by regression, is very powerful despite its simplicity. It has the usual wonderful merits of online learning by convex relaxation: is worst-case (no generative assumptions), agnostic, and online. Caveats? we have a linear dependence on the eigengap of the system, which can be very large for many interesting systems.

However, it is possible to remove this gap completely, and learn arbitrarily ill-defined LDS without dependence on the condition number. This method is also online, agnostic, and worst case, and is based on a new tool: wave-filtering. In essence it applies the online regression method on a specially-designed fixed filter of the input. To see how these filters are designed, and their theoretical properties, check out our recent papers!


unsupervised learning III: worst-case compression-based metrics that generalize

This post is a long-overdue sequel to parts I and II of previous posts on unsupervised learning.  The long time it took me to publish this one is not by chance – I am still ambivalent on the merit, or lack of, for the notion of a worst-case PAC/statistical learnability (or, god forbid, regret minimization) without supervision.

And yet,
1. I owe a followup to the stubborn few readers of the blog who have made it thus far.
2. The algorithms we’ll discuss are mathematically correct, and even if completely useless in practice, can serve as amusement to us, fans of regret minimization in convex games.
3. I really want to get on to more exciting recent developments in control that have kept me busy at night…

So, here goes:

The idea is to define unsupervised learning as PAC learning in the usual sense, with a real-valued objective function, and specially-structured hypothesis classes.

  • A hypothesis is a pair of functions, one from domain to range and the other vice versa, called encoder-decoder pair.
  • The real-valued loss is also composed of two parts, one of which is the reconstruction error of the encoding-decoding scheme, and the other proportional to encoding length.

Et-voila, all notions of generalizability carry over from statistical learning theory together with dimensionality concepts such as Rademacher complexity.

More significantly – these worst-case definitions allow improper learning of NP-hard concepts. For the fully-precise definitions, see  the paper with Tengyu Ma. The latter also gives a striking example of a very well-studied NP-hard problem, namely dictionary learning, showing it can be learned in polynomial time using convex relaxations.

Henceforth, I’ll give a much simpler example that perhaps illustrates the concepts in an easier way, one that arose in conversations with our post-docs Roi Livni and Pravesh Kothari.

This example is on learning the Mixture-Of-Gaussians (MOG) unsupervised learning model, one of the most well-studied problems in machine learning. In this setting, a given distribution over data is assumed to be from a mixture distribution over normal random variables, and the goal is to associate each data point with the corresponding Gaussian, as well as to identify these Gaussians.

The MOG model has long served as a tool for scientific discovery, see e.g. this blog post by Moritz Hardt and links therein. However, computationally it is not well behaved in terms of worst-case complexity. Even the simpler “k-means” model is known to be NP-hard.

Here is where things get interesting: using reconstruction error and compression as a metric, we can learn MOG by a seemingly different unsupervised learning method – Principle Component Analysis (PCA)! The latter admits very efficient linear-algebra-based algorithms.

More formally: let X \in R^{m \times d} be a data matrix sampled i.i.d from some arbitrary distribution x \sim D. Assume w.l.o.g. that all norms are at most one. Given a set of k centers of Gaussians, \mu_1,...,\mu_k, minimize reconstruction error given by
E_{x \sim D} [ \min_i \| \mu_i - x \|^2  ] 
A more general formulation, allowing several means to contribute to the reconstruction error, is
E_{x \sim D} [ \min_{ \| \alpha_x\|_2 \leq 1  }  \| \sum_{j} \alpha_j  \mu_j - x \|^2  ] 
Let M \in R^{d \times k} be the matrix of all $\mu$ vectors. Then we can write the optimization problem as
$latex  \min_{M} E_{x \sim D} [ \min_{ \| \alpha_x\|_2 \leq 1  }  \|   x  – M \alpha_x \|^2  ]  $
This corresponds to an encoding by vector $\alpha$ rather than by a single coordinate. We can write the closed form solution to $\alpha_x$ as:
$latex  \arg \min_{\alpha_x }    \|   x  – M \alpha_x \|^2   = M^{-1} x $
and the objective becomes
$latex  \min_{\alpha_x }    \|   x  – M \alpha_x \|^2   = \| x – M M^{-1} x\|_\star^2 $
for \| \|_\star being the spectral norm.
Thus, we’re left with the optimization problem of
$latex  \min_{M} E_{x \sim D} [  \|   x  – M M^{\dagger} x \|_\star^2  ]  $
which amounts to PCA.

We have just semi-formally seen how MOG can be learned by PCA!

A more general theorem applies also to MOG that are not spherical, some details appear in the paper with Tengyu, in the section on spectral auto-encoders.

The keen readers will notice we lost something in compression along the way: the encoding is now a k-dimensional vector as opposed to a 1-hot k-dimensional vector, and thus takes $k \log \frac{1}{\epsilon}$ bits to represent in binary, as opposed to $O(\log k)$. We leave it as an open question to come up with an efficient poly-log(k) compression that matches MOG. The solver gets a free humus at Sayeed’s, my treat!

Hyperparameter optimization and the analysis of boolean functions

I’ve always admired the beautiful theory on the analysis of boolean functions (see the excellent book of Ryan O’Donnell, as well Gil Kalai’s lectures). Heck, it used to be my main area of investigation back in the early grad-school days we were studying hardness of approximation, the PCP theorem, and the “new” (back at the time) technology of Fourier analysis for boolean functions.  This technology also gave rise to seminal results on learnability with respect to the uniform distribution. Uniform distribution learning has been widely criticized as unrealistic, and the focus of the theoretical machine learning community has shifted to (mostly) convex, real-valued, agnostic learning in recent years.

It is thus particularly exciting to me that some of the amazing work on boolean learning can be used for the very practical problem of Hyperparameter Optimization (HO), which has on the face of it nothing to do with the uniform distribution. Here is the draft, joint work with Adam Klivans and Yang Yuan.

To describe hyperparameter optimization: many software packages, in particular with the rise of deep learning, come with gazillions of input parameters. An example is the TensorFlow software package for training deep nets. To actually use it the user needs to specify how many layers, which layers are convolutional / full connected, which optimization algorithm to use (stochastic gradient descent, AdaGrad, etc.), whether to add momentum to the optimization or not…. You get the picture – choosing the parameters is by itself an optimization problem!

It is a highly non-convex optimization problem, over mostly discrete (but some continuous) choices. Evaluating the function, i.e. training a deep net over a large dataset with a specific configuration, is very expensive, and you cannot assume any other information about the function such as gradients (that are not even defined for discrete functions).  In other words, sample complexity is of the essence, whereas computation complexity can be relaxed as long as no function evaluations are required.

The automatic choice of hyperparameters has been hyped recently with various names such as “auto tuning”, “auto ML”,  and so forth. For an excellent post on existing approaches and their downsides see these posts Ben Recht’s blog.

This is where harmonic analysis and compressed sensing can be of help! While in general hyperparameter optimization is computationally hard, we assume a sparse Fourier representation of the objective function. Under this assumption, one can use compressed sensing techniques to recover and optimize the underlying function with low sample complexity.

Surprisingly, using sparse recovery techniques one can even improve the known sample complexity results for well-studied problems such as learning decision trees under the uniform distribution, improving upon classical results by Linial, Mansour and Nisan.

Experiments show that the sparsity assumptions in the Fourier domain are justified for certain hyperparameter optimization settings, in particular, when training of deep nets for vision tasks. The harmonic approach (we call the algorithm “Harmonica”) attains a better solution than existing approaches based on multi-armed bandits and Bayesian optimization, and perhaps more importantly, significantly faster. It also beats random search 5X (i.e. random sampling with 5 times as many samples allowed as for our own method, a smart benchmark proposed by Jamieson).

Those of you interested in code, my collaborator Yang Yuan has promised to release it on GitHub, so please go ahead and email him 😉

PS. We’re grateful to Sanjeev Arora for helpful discussions on this topic.

PPS. It has been a long time since my last post, and I still owe the readers an exposition on the data compression approach to unsupervised learning.  In the mean time, you may want to read this paper with Tengyu Ma in NIPS 2016.

Unsupervised Learning II: the power of improper convex relaxations

In this continuation post I’d like to bring out a critical component of learning theory that is essentially missing in today’s approaches for unsupervised learning: improper learning by convex relaxations.

Consider the task of learning a sparse linear separator. The sparsity, or $\ell_0$, constraint is non-convex and computationally hard. Here comes the Lasso – replace the $\ell_0$ constraint with $\ell_1$ convex relaxation — et voila — you’ve enabled polynomial-time solvable convex optimization.

Another more timely example: latent variable models for recommender systems, a.k.a. the matrix completion problem. Think of a huge matrix, with one dimension corresponding to users and the other corresponding to media items (e.g. movies as in the Netflix problem). Given a set of entries in the matrix, corresponding to ranking of movies that the users already gave feedback on, the problem is to complete the entire matrix and figure out the preferences of people on movies they haven’t seen.
This is of course ill posed without further assumptions. The low-rank assumption intuitively states that people’s preferences are governed by few factors (say genre, director, etc.).  This corresponds to the user-movie matrix having low algebraic rank.

Completing a low-rank matrix is NP-hard in general. However, stemming from the compressed-sensing literature, the statistical-reconstruction approach is to assume additional statistical and structural properties over the user-movie matrix. For example, if this matrix is “incoherent”, and the observations of the entires are obtained via a uniform distribution, then this paper by Emmanuel Candes and Ben Recht shows efficient reconstruction is still possible.

But is there an alternative to incoherent assumptions such as incoherence and i.i.d. uniform observations?

A long line of research has taken the “improper learning by convex relaxation approach” and applied it to matrix completion in works such as Srebro and Shraibman, considering convex relaxations to rank such as the trace norm and the max norm. Finally, in  joint work with Satyen Kale and Shai Shalev-Shwartz, we take this approach to the extreme by not requiring any distribution at all, giving regret bounds in the online convex optimization model (see previous post).

By the exposition above one may think that this technique of improper convex relaxations applies only to problems whose hardness comes from “sparsity”.  This is far from the truth, and in the very same paper referenced above, we show how the technique can be applied to combinatorial problems such as predicting cuts in graphs, and outcomes of matches in tournaments.

In fact, improper learning is such a powerful concept, that problems such that the complexity of problems such as learning DNF formulas has remained open for quite a long time, despite the fact that showing proper learnability was long known to be NP-hard.

On the other hand, improper convex relaxation is not an all powerful magic pill. When designing convex relaxations to learning problems, special care is required so not to increase sample complexity. Consider for example the question of predicting tournaments, as given in this COLT open problem formulation by Jake Abernethy. The problem, loosely stated, is to iteratively predict the outcome of a match between two competing teams from a league of N teams. The goal is to compete, or make as few mistakes, as the best ranking of teams in hindsight. Since the number of rankings scales as $N!$, the multiplicative weights update method can be used to guarantee regret that scales as $\sqrt{T \log N!} = O(\sqrt{T N \log N})$. However, the latter, naively implemented, runs in time proportional to $N!$. Is there an efficient algorithm for the problem attaining similar regret bounds?

A naive improper learning relaxation would treat each pair of competing teams as an instance of binary prediction, for which we have efficient algorithms. The resulting regret bound, however, would scale as the number of pairs of teams over $N$ candidate teams, or as $\sqrt{T N^2}$, essentially removing any meaningful prediction property. What has gone wrong?

By removing the restriction of focus from rankings to pairs of teams, we have enlarged the decision set significantly, and increased the number of examples needed to learn. This is a general and important concern for improper convex relaxations: one needs to relax the problem in such a way that sample complexity (usually measured in terms of Rademacher complexity) doesn’t explode. For the aforementioned problem of predicting tournaments, a convex relaxation that preserves sample complexity up to logarithmic factors is indeed possible, and described in the same paper.

In the coming posts we’ll describe a framework for unsupervised learning that allows us to use the power of improper convex relaxations.

A mathematical definition of unsupervised learning?

Extracting structure from data is considered by many to be the frontier of machine learning. Yet even defining “structure”, or the very goal of learning structure from unlabelled data, is not well understood or rigorously defined.

In this post we’ll give a little background to unsupervised learning and argue that, compared to supervised learning, we lack a well-founded theory to tackle even the most basic questions. In the next post, we’ll introduce a new candidate framework.

Unsupervised learning is a commonplace component of many undergraduate machine learning courses and books. At the core, the treatment of unsupervised learning is a collection of methods for analyzing data and extracting hidden structure from this data.  Take for example John Duchi’s “Machine Learning” course at Stanford. Under “Unsupervised learning”, we have the topics: Clustering, K-means, EM. Mixture of Gaussians, Factor analysis, PCA (Principal components analysis), ICA (Independent components analysis). This is an excellent representation of the current state-of-the-art:

  • Clustering and K-means:  by grouping “similar” elements together, usually according to some underlying metric, we can create artificial “labels” for data elements in hope that this rough labelling will be of future use, either in a supervised learning task on the same dataset, or in “understanding” the data. K-means is a widely used algorithm for finding k representative centers of the data, allowing for k “labels”.
  • Mixture of Gaussians: an exemplary and widely used method stemming from the generative approach to unsupervised learning. This approach stipulates that the data is generated from a distribution, in this case a mixture of normal random variables. There are numerous algorithms for finding the centers of these Gaussians, thereby learning the structure of the data.
  • Factor analysis, PCA and ICA: the spectral approach to unsupervised learning is perhaps the most widely used, in which the covariance matrix of the data is approximated by the largest few singular vectors. This type of clustering is widely successful in text (word embeddings) and many other forms of data. 

The above techniques are widely used and successful in practice, and under suitable conditions polynomial-time algorithms are known. The common thread amongst them is finding structure in the data, which turns out for many practical applications to be useful in a variety of supervised learning tasks.

Notable unsupervised learning techniques that are missing from the above list are Restricted Boltzman machines (RBMs), Autoencoders and Dictionary Learning (which was conceived in the context of deep neural networks). The latter techniques attempt to find a succinct representation of the data. Various algorithms and heuristics for RBMs, Autotencoders and Dictionary Learning exist, but these satisfy at least one of the following:

1) come without rigorous performance guarantees.

2) run in exponential time in the worst case.

3) assume strong generative assumptions about the input.

Recent breakthroughs in polynomial time unsupervised learning due to Sanjeev and his group address points (1) & (2) above and require only (3). Independently the same is also achievable by the method of moments, see e.g. this paper, originally from Sham’s group @ MSR New England, and many more recent advances. What’s the downside of this approach? The Arora-Hazan debate over this point, which the theory-ML students are exposed to in our weekly seminar, is subject for a longer post…

What is missing then? Compare the above syllabus with that of supervised learning, in most undergrad courses and books, the difference in terms of theoretical foundations is stark. For supervised learning  we have Vapnik’s statistical learning theory –  a deep and elegant mathematical theory that classifies exactly the learnable from labeled real-valued examples. Valiant’s computational learning theory adds the crucial element of computation, and over the combined result we have hundreds of scholarly articles describing in detail problems that are learnable efficiently / learnable but inefficiently /  improperly learnable / various other notions.

Is there meaning, in pure mathematical sense such as Vapnik and Valiant instilled into supervised learning, to the unsupervised world?

I like to point out in my ML courses that computational learning theory say something philosophical about science: the concept of “learning”, or a “theory” such as a physical theory of the world, has a precise mathematical meaning independent of humans. While many of us scientists seek a “meaningful” theory in the human sense, there need not be one! It could very well be that a physical theory, for example that of condensed matter, has a “theory” in the Vapnik-Valiant sense, but not one that would be as “meaningful” and “elegant” in the Feynman sense.

How can we then, give mathematical meaning to unsupervised learning in a way that:

  1. Allows natural notions of generalization from examples
  2. Improves future supervised learning tasks regardless of their nature
  3. Allows for natural family of algorithms (hopefully but not necessarily – based on convex optimization)

This will be the starting point for our next post…