December 16, 2016, Ilya Kuzovkin

Categories: Machine Learning, AI, Computer Science

Notes on NIPS 2016

The Neural Information Processing Systems (NIPS) conference is a place where computational neuroscience meets machine learning. Due to the rise of deep learning (DL) in recent years the conference is now flooded with presentations on deep anything, while the brain part is left somewhat behind, although some presentations here and there draw the analogies and several workshops try to revive (thanks!) the tradition.

Notes are organized into five topics coinciding with the major trends represented at the conference:

General Deep Learning
Reinforcement Learning
Neuroscience and ML
Learning to Learn
Non-Deep Learning

Let’s begin!

General Deep Learning

Tutorial: Building Applications using Deep Learning by Andrew Ng

A presentation of practical tips on applying deep learning and on how to think about the ML experimental pipeline, evaluation measures, and the concepts of bias and variance in the era of human-level performance. Content-wise the same as https://www.youtube.com/watch?v=F1ka6a13S9I.

Tutorial: Generative Adversarial Networks by Ian Goodfellow

Introduction to Generative Adversarial Networks (GAN) and the DCGAN architecture. Demonstration of semantic algebra operations in image space (picture on the right). Noise Contrastive Estimation (NCE) is a similar idea from the past. On regularization: weight decay affects the training accuracy of a network; label smoothing works better for regularization with GANs; batch normalization introduces correlation within a batch, and reference batch normalization and virtual batch normalization avoid that problem. Minibatch GAN generates more diverse samples. Extra discriminator outputs: instead of the usual binary “real” / “fake”, a discriminator is trained to distinguish “cat” / “dog” / “fake”, which enters the domain of semi-supervised learning. GANs can be used for imitation learning. Plug and Play Generative Networks are the state of the art in terms of quality of generated images (see picture on the left).

Keynote: Predictive Learning (by Yann LeCun)

A general talk stating results from last year(s). Cool-looking SharpMask for image segmentation.

Award: Value Iteration Networks

Value iteration is an old algorithm for estimating V*. The idea of the article is to add a learnable value iteration module to a neural network, giving it the ability to learn planning without having a model of the world, and thus effectively being a model-free method, but with planning. They trained a trivial agent in a grid world that has to reach an end location from a start location. The architecture presumably learns the concept of planning instead of memorizing the path from the start location to the end location.
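For reference, the classic tabular algorithm that the paper turns into a learnable module can be sketched as below. This is plain value iteration, not the network from the article; the 5-cell corridor, the reward of 1 for entering the goal and the discount factor of 0.9 are all made up for illustration.

```python
def value_iteration(n_states, actions, step, reward, gamma=0.9, tol=1e-8):
    """Tabular value iteration: V(s) <- max_a [r(s, a) + gamma * V(s')]."""
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(reward(s, a) + gamma * values[step(s, a)] for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            return values


# Hypothetical corridor of 5 cells; the goal is the rightmost cell.
N = 5
GOAL = N - 1


def step(s, a):
    """Deterministic transition: the goal is absorbing, walls clip the move."""
    if s == GOAL:
        return s
    return min(max(s + a, 0), N - 1)


def reward(s, a):
    """Reward 1 for the move that enters the goal, 0 otherwise."""
    return 1.0 if s != GOAL and step(s, a) == GOAL else 0.0


V = value_iteration(N, actions=(-1, +1), step=step, reward=reward)
```

On this corridor the values grow towards the goal (0.729, 0.81, 0.9, 1.0, and 0 for the absorbing goal cell itself), so acting greedily with respect to V walks right.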

Session: Deep Learning

Ba, Hinton, Mnih Using Fast Weights to Attend to the Recent Past
Uses two weights for each connection in RNNs: slow for long-term information, fast for rapid learning. Fast associative weights act as a Hopfield network by attracting the state most similar to the current input. Performs 4% better than other sequential models (including LSTM) on a recursive vision task of face emotion classification (worse than a usual convnet, though).

Sequential Neural Models with Stochastic Layers

Stochastic transitions make it possible to model uncertainty. Combines the advantages of RNNs and state space models (SSMs).

Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences
Extremely long sequences (10k time steps). In many applications that involve long sequences we would like to exploit all of the data, but since the sequences are so long we subsample/average/etc. They add a “kronos” gate k to the LSTM unit that controls when the cell value can be updated. Outperforms usual LSTMs in high-resolution sampling and asynchronous sampling conditions (synthetic data).

Weight normalization

Represent each weight vector \(\mathbf{w}\) by a vector \(\mathbf{v}\) and a scalar \(g\) as follows: \(\mathbf{w} = \displaystyle\frac{g}{||\mathbf{v}||}\mathbf{v}\), and run the usual stochastic gradient descent over the new parameters \(\mathbf{v}\) and \(g\). The results show that this approach speeds up training of a neural network in various architectures and does not introduce noise as batch normalization does, and thus can be used with LSTMs and in the reinforcement learning domain (where extra noise can cause trouble).
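A minimal sketch of this reparameterization in plain Python, assuming a single weight vector and a gradient with respect to \(\mathbf{w}\) that is already available from ordinary backprop. The function names are mine; the gradient expressions follow from applying the chain rule to the formula above.

```python
import math


def weight_norm(v, g):
    """Return w = (g / ||v||) * v."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


def weight_norm_grads(v, g, grad_w):
    """Turn the gradient w.r.t. w into gradients w.r.t. g and v (chain rule)."""
    norm = math.sqrt(sum(x * x for x in v))
    # dL/dg = (dL/dw . v) / ||v||
    grad_g = sum(gw * x for gw, x in zip(grad_w, v)) / norm
    # dL/dv = (g / ||v||) * dL/dw - (g * dL/dg / ||v||^2) * v
    grad_v = [g / norm * gw - g * grad_g / norm ** 2 * x
              for gw, x in zip(grad_w, v)]
    return grad_g, grad_v


w = weight_norm([3.0, 4.0], g=2.0)
grad_g, grad_v = weight_norm_grads([3.0, 4.0], 2.0, grad_w=[1.0, 0.0])
```

Here \(||\mathbf{w}||\) always equals \(g\) regardless of the length of \(\mathbf{v}\), and the gradient with respect to \(\mathbf{v}\) is orthogonal to \(\mathbf{v}\), so an update of \(\mathbf{v}\) changes the direction of the weight vector while \(g\) alone controls its length.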

Differentiable Neural Computer by Alex Graves

A presentation about the quite famous Nature article by DeepMind. It is a descendant of Neural Turing Machines and continues the idea of adding memory to a neural network. Unlike in LSTMs, this memory is an actual read-write device and the neural network learns from data how to use it, what to write and where to read. Personally I think this is fascinating work, however with no practical applications at its current level of development. DeepMind has promised to publish the code in April 2017.

Credit Assignment Beyond Backpropagation by Yoshua Bengio

There are several algorithms for credit assignment. Boltzmann machines: high variance. The REINFORCE algorithm: very high variance, does not scale, susceptible to noise. Actor-Critic models: lower variance, but potentially high bias. Backpropagation: wins over all of the previous algorithms as it considers only one direction of update and does not waste time exploring other possible directions. However, backprop has limitations as well: with increasing depth it becomes harder and harder to train deep neural networks (with the exception of ResNet, which is a nice trick), and small changes in the output do not always reach the place that has to be changed to correct for them. In this talk he proposes the novel concept of equilibrium propagation: given a real physical system (a brain, for example) that we can observe, we pull the output away from the “correct” energy minimum and observe which changes in the system this leads to. We notice the places that were most responsive to the pull, and thus know which places (parameters) of the system must be changed to influence that particular output. A very nice property is that if we can monitor the system via sufficient statistics then we can alter it even without knowing exactly how the system works. The talk outlines the mathematical framework that allows us to reason about this method. It is limited by the requirement that there is an energy function (so that we can pull the output away from energy minima to introduce perturbations).

Reinforcement Learning and Robotics

Recent successes of deep reinforcement learning (DRL), such as Atari, AlphaGo, etc., created a wave of interest in the field, an explosion in the number of proposed deep RL architectures, and a drive to solve the ultimate (in my opinion) goal of RL: control of a real agent in a real environment, which is an open problem due to the high sample complexity of deep RL algorithms.

Tutorial: Deep Reinforcement Learning through Policy Optimization

One of the main obstacles for applying RL in the real world is sample complexity.
An interesting difference between supervised learning and RL: in supervised learning the training data distribution is fixed, while in RL a change in policy leads to sampling from another distribution, and too big a step in the wrong direction might break everything.

Learning to poke by poking

A robotic hand is trained with a reinforcement learning algorithm to poke objects in such a way that they eventually end up in a required position. Trained in the physical world. The novel idea is to simultaneously optimize the forward model (a generative prediction: given a state and an action, what will the next state be?) and the inverse model (given the current state s and the next state s’, what action takes us from s to s’?). Learning such a joint model improves performance and decreases sample complexity.

Guided Policy Search

Explores the possibility of demonstrating a correct trajectory, given by a human operator or an expert algorithm, to a reinforcement learning agent. Starting the training process from such a demonstration helps the agent to find good local minima and drastically decreases the sample complexity needed to learn the task.

Using a slow RL algorithm to learn a fast RL algorithm using recurrent neural networks by Ilya Sutskever

All notable RL success stories rely on huge sample complexity. To move RL forward we need to overcome this hunger. The core of the agent is a recurrent neural network, which gives the agent the ability to incorporate previous states and experiences in its current state. During training the agent is exposed to training examples from different MDPs (different games) and its objective is averaged across all possible MDPs. One can argue that the resulting agent learns how to create a reinforcement learning algorithm; however, what is actually learned by the agent is a matter of philosophical debate. To get familiar with the topic read

“Learning to Navigate in Complex Environments” by P. Mirowski et al.,
“Learning to reinforcement learn” by J. X. Wang et al.,
“RL²: Fast Reinforcement Learning via Slow Reinforcement Learning” by Y. Duan et al., and
“Learning to Learn for Global Optimization of Black Box Functions” by Y. Chen et al.

All these look into the problem of complex environments and/or multi-environment agents.

OpenAI Universe

A platform where you can train your RL agents in computer-based environments (games, browsers, word processors, anything!). It provides the interface and has a large number of built-in games and browser-like tasks. The hope is that Universe can serve as the next evaluation benchmark for RL research and that once it is “solved” we will have on our hands agents that can do whatever you and I can do via the usual keyboard + mouse + screen interface. Which, in our current world, is a lot! In light of the strong research towards multi-environment agents and transfer learning it seems like a very useful benchmark indeed.

Representation learning by SGD in cross-validation by Rich Sutton

The idea presented by the author of the textbook on reinforcement learning is a better way to learn representations in deep nets. The proposed method is an alternative to backpropagation and is called crossprop: SGD is used to find the gradient for outgoing weights with respect to the incoming weights of each node. On synthetically designed data, generated from known representations, this approach makes the nodes of the network coincide better with those representations than training via backprop does.

The Nuts and Bolts of Deep RL Research by John Schulman

A set of practical tips on how to do and think about RL.

Approaching new problems: be able to run experiments quickly, visualize a lot, make the problem easier and easier until you see the first signs of life from the learning algorithm, and at the early stages shape the reward to be favorable for learning.
POMDP design: it is hard to describe a whole new formalization of the environment, states and rewards; make random agents and visualize their policies to see how the environment behaves, use human control to explore the environment, plot histograms of useful stats, and expect to need more samples than you initially thought.
Tuning phase: once your algorithm works, check its sensitivity to parameters (a good one should be robust), use multiple random seeds and do not trust articles with learning curves based on a single seed, run a set of benchmarks (other environments and tasks) to make sure that the algorithm still works and nothing is broken by recent changes, remember that data whitening can be more useful than smarter algorithms, automate your experiments as much as possible (it pays off), estimate the required frame and action frequency by human control (meaning try to play it yourself), and track mean / max / min / stdev of episode reward.
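The advice on seeds and reward statistics is easy to automate. Below is a small sketch of how that bookkeeping could look; `run_experiment` is a hypothetical stand-in that just draws noisy numbers, and in a real setup it would train an agent with the given seed and return its per-episode rewards.

```python
import random
import statistics


def run_experiment(seed, n_episodes=100):
    """Hypothetical stand-in for a training run: one reward per episode."""
    rng = random.Random(seed)
    return [rng.gauss(episode * 0.1, 1.0) for episode in range(n_episodes)]


def summarize(rewards):
    """The four numbers worth tracking for episode reward."""
    return {
        "mean": statistics.mean(rewards),
        "max": max(rewards),
        "min": min(rewards),
        "stdev": statistics.stdev(rewards),
    }


def summarize_across_seeds(seeds, n_episodes=100):
    """Run every seed and report how much the mean reward varies between seeds."""
    per_seed = {seed: summarize(run_experiment(seed, n_episodes)) for seed in seeds}
    means = [stats["mean"] for stats in per_seed.values()]
    spread = statistics.stdev(means) if len(means) > 1 else 0.0
    return per_seed, statistics.mean(means), spread


per_seed, mean_of_means, seed_spread = summarize_across_seeds(seeds=[0, 1, 2, 3, 4])
```

If `seed_spread` is comparable to the improvement you are claiming, a learning curve from a single seed tells you very little.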

Learning to Navigate in Complex Environments by Raia Hadsell

The backbone of the system consists of the following blocks: Convolutional network for feature learning -> Stacked LSTM for action planning -> Additional inputs -> A3C RL algorithm -> Auxiliary target such as depth prediction. Such an agent is trained on a maze navigation task. The core observation is that having an auxiliary task that is relevant to understanding the environment increases the total reward more than two times, presumably due to better internal representations, the learning of which was enforced by the auxiliary task.

Learning to Experiment

An agent has to answer questions about physical properties of an object, such as weight. Such a property cannot be learned from static visual input, and in order to collect the information the agent has to interact with the world.

Combining Policy Gradient with Q-Learning

The task is to combine the power of an off-policy RL method (Q-Learning) with the on-policy Policy Gradient approach. This was achieved, and better data efficiency was demonstrated.

Sample Efficient and Stable Deep Reinforcement Learning by Sergey Levine

As mentioned multiple times before, the main problem for real-world RL is data efficiency. Proposes a new method called Q-Prop that combines policy gradients with off-policy methods. Policy gradients are unbiased and stable, but have high variance, are sample inefficient and tend to forget previously learned things. Off-policy methods have low variance and are sample efficient, but can be biased and are less stable. It seems only natural to combine the two. Q-Prop has an on-policy critic and an off-policy critic with replay, and combines Policy Gradient and the Actor-Critic model with the TRPO-GAE method. It demonstrated sample efficiency in MuJoCo with a smaller batch size and is more stable on harder problems. In the real world it was trained on a 7 DOF robotic arm to open a door, and within 2.5 hours was able to learn to (clankily) do that.

Large-Scale Self-Supervised Robotic Learning by Chelsea Finn

The application is robotic vision + control. The problem with human supervision is that it does not scale to more complex or more numerous tasks. Proposes a concept of self-supervision via reinforcement learning. One of the important components is a convolutional LSTM that predicts the next visual frame. The task of grasping different objects is learnt from 50K sequences (1M+ frames). Current limitations are the speed of execution and problems with more complex tasks. The dataset is available for download.

Course on Deep Reinforcement Learning: http://rll.berkeley.edu/deeprlcourse

Multi-Agent and Multi-Robot Coordination with Uncertainty and Limited Communication by Chris Amato

A nice picture illustrates the difference between MDP and POMDP. The core idea of the presentation is that, given complex robotic units and environments, we might not want to waste RL effort on exploring and learning how to operate each particular tool in a robot’s toolbox. It makes more sense in practice to use hardcoded control logic for basic operations and apply RL on a larger scale: to learn how to achieve the desired goal by using the tools and existing primitives. In “Planning with Macro-Actions in Decentralized POMDPs” by Amato et al., 2014 they introduce the MacDec-POMDP approach, which uses subcontrollers for subtasks while the RL algorithm learns the general strategy, not the whole state space, which makes learning much more sample efficient.

Deep Robotic Learning by Sergey Levine

End-to-end in robotics. Current robotic applications consist of specialized blocks for different stages of operation: one for vision and feature extraction, one for planning, another for servo control. From the examples of computer vision and speech recognition we’ve seen how such block-based pipelines were replaced by end-to-end deep learning. Does it mean that end-to-end is the way to go for robotics as well? Sergey thinks so, and in his talk outlines his steps towards end-to-end robotics. The first motivating example is a motor task (was it door opening or something else? does not matter) where using the block-based pipeline the robot does not learn to perform the task at all, while using end-to-end learning it achieves 96.3% accuracy. The explanation for such a drastic difference is that learning piece by piece does not guarantee that what you’ve learnt will assemble into a functioning pipeline, while when learning end-to-end the performance on the end task is exactly what you are optimizing for, and therefore if learning is successful the robot will perform the task successfully.

Data efficiency. The first challenge, and a difference between end-to-end supervised learning and end-to-end reinforcement learning, is (yes, again) the amount of data. In his experiments they took several robotic hands and put them to collect grasping data; with multiple hands the data acquisition process is faster. The DNN behind the learning algorithm consisted of 16 convolutional layers. The Deep Deterministic Policy Gradient algorithm (DDPG) and the Normalized Advantage Function algorithm (NAF) can achieve training times that are suitable for real robotic systems (see “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates”).

Safety is an important aspect in real-world robotics. There are articles that show how to incorporate model uncertainty into the planning in such a way that a robot will operate in safe conditions (low speed, for example) if uncertain.

A reality gap occurs when you want to take an agent that was trained in a simulator into the real world. One approach for bridging the gap is proposed in “Real Single-Image Flight Without a Single Real Image”, where a drone was trained to fly in a simulator and later tested in a real environment: it was able to fly very successfully until it encountered a glass door. Another work in the same direction is “EPOpt: Learning Robust Neural Network Policies Using Model Ensembles”, where through the use of model ensembles and adversarial training the agent that was trained in a simulated environment is robust and generalizes to the real world, including effects that were not modeled by the simulator.

Deep Learning for Robotics by Peter Abbeel

We might need something other than wheels or tracks to locomote in unknown environments like other planets and such. NASA has built the tensegrity Super Ball Bot (see picture on the right) that is able to move on a larger number of surfaces, is much more resilient and can be deployed without a parachute from greater heights. Since no one has a clue how this thing should use its motors to move, RL is the obvious choice (and it works!). The learning force is again R² with TRPO as the base algorithm.

RL on simple species

It would be cool to see an RL algorithm trained to replicate the behavior of simple species like C. elegans.

Sim-to-Real using Progressive Neural Networks by Raia Hadsell

One NN that is trained to perform one task is called a “column” (a reference to cortical columns?).

The idea is to train a new column for each new task. Each new column is connected to all previous columns so that features that were learnt for the previous tasks can be reused in the new tasks. Once a new column is added the weights in the previous ones are frozen. Advantages: no forgetting of previous tasks, feature transfer, new capacity for each new task. Disadvantages: a separate mechanism is needed to know when to switch between the columns, and the number of parameters grows with the number of tasks (a whole new network for each task). The approach demonstrates the usefulness of relevant intermediate tasks (a bit like curriculum learning). Even better generalization and robustness can be achieved via data augmentation: changing the shapes, size, color and lighting of the objects in the simulator leads to a better final agent.
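To make the column idea concrete, here is a heavily simplified forward-pass sketch in plain Python. It assumes two-layer columns of equal size and a single set of lateral weights from each earlier column's hidden layer into the new column's output; the class and method names are mine, training is left out, and the real architecture is richer than this.

```python
import random


def random_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense(weights, inputs):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


class Column:
    """One two-layer network, plus lateral weights reading earlier columns."""

    def __init__(self, in_dim, hidden_dim, out_dim, n_previous, rng):
        self.w_hidden = random_matrix(hidden_dim, in_dim, rng)
        self.w_out = random_matrix(out_dim, hidden_dim, rng)
        # One lateral matrix per earlier column, applied to its hidden layer.
        self.laterals = [random_matrix(out_dim, hidden_dim, rng)
                         for _ in range(n_previous)]
        self.frozen = False


class ProgressiveNet:
    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        self.dims = (in_dim, hidden_dim, out_dim)
        self.rng = random.Random(seed)
        self.columns = []

    def add_column(self):
        """Freeze everything learnt so far and add capacity for a new task."""
        for column in self.columns:
            column.frozen = True  # a training loop would skip frozen columns
        self.columns.append(Column(*self.dims, len(self.columns), self.rng))

    def forward(self, x, task):
        """Output of column `task`, which also reads all earlier columns."""
        hiddens = [relu(dense(c.w_hidden, x)) for c in self.columns[:task + 1]]
        column = self.columns[task]
        output = dense(column.w_out, hiddens[task])
        # zip stops after the earlier columns, so a column never reads itself.
        for lateral, previous_hidden in zip(column.laterals, hiddens):
            side = dense(lateral, previous_hidden)
            output = [o + s for o, s in zip(output, side)]
        return output
```

Because earlier columns are frozen and never read from later ones, `forward(x, 0)` returns exactly the same output before and after a second column is added, which is the "no forgetting" property. The growing list of columns is also where the parameter cost shows up.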

Training in a physical environment can be sped up using this technology: the simulator is the first column, each new task in the real world is an additional column. Imitation learning is also a good way to beat sample complexity. Training in the simulator happens with A3C and is fast and parallelizable. Comparing training for the same task: 24h in the simulator with a subsequent PNN for bridging the gap, 55d in the real physical system.

An alternative approach is finetuning to the real world after training in a simulator, but the PNN approach beats that by a large margin.

Bayesian Reasoning and Deep Learning in Agent-based Systems by Shakir Mohamed

A very nice talk that brings Bayesian ML and Deep Learning together and demonstrates how they fit. It mentions that the Bayesian framework is naturally related to memory, which seems to be an important topic at the moment in the context of Memory Networks and Neural Turing Machines.

Panel: Lawrence, Goodfellow, Mohamed, Welling, Adams

Is ensembling a practical replacement for Bayesian methods? Estimating confidence intervals and measuring uncertainty were important points of discussion. How do we specify good priors, and why don’t we do that? Actually we do: the whole hierarchical structure of deep convolutional neural networks is a prior we imposed, and it happened to be a useful one. A new important trend is to build generative models into algorithms and agents and use their predictions about the world in training and deployment.

CoRL 2017: 1st Annual Conference on Robot Learning

Neuroscience and Machine Learning

Comparison between the actual brain and DNNs seems to be a very popular theme. This is something I am doing myself and it evoked mixed feelings to see 3 posters doing almost the same thing I am doing. On the other hand it means that this research direction seemed important to several groups, therefore it could have merit. Or it is just an effect of the hype, who knows…

Keynote: Engineering Principles From Stable and Developing Brains

Synapse pruning happens with age. Why could that be useful? Initially everything is connected; with time active connections persist and useless ones disappear. They experimentally tested a pruning approach (start fully connected and remove) vs. a growth approach (start with no connections and add as needed): pruning is better in terms of “efficiency” and “robustness” measures. The next thing to explore is the optimal rate of pruning. They conducted experiments on mouse whiskers (and somatosensory columns): sliced the brain and applied automatic synapse detection to count the number of synapses. The experiment a) confirmed the synaptic decrease over time, b) with the optimal pruning rate (expressed with a curve) the resulting network is 20% more efficient.

Another topic they’ve explored is the brain mechanism for nearest neighbor search. Locality-sensitive hashing (LSH) is a common way to assign similar things close to each other in the hash space (same bin). It is done via the random projection method. How does a fly do that? For its odor mechanism it maps the representation vector (molecule, 54-D) to an extended space (2000-D) and applies a sparsity constraint (5% survive) on the 2000-D output vector. The resulting code is a “hash” of the input. They compared those two approaches: on MNIST the mean average precision is nearly identical, while on nearest-neighbor-specific tasks (SIFT10k, GIST10k datasets) the “Fly algorithm” gives considerably higher average precision. The same sparse locality-preserving algorithms seem to appear in other brain areas as well and might be an important underlying mechanism for some of the brain processes.
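The expand-then-sparsify recipe is short enough to sketch. The dimensions (54 to 2000) and the 5% sparsity come from the notes above; using a dense Gaussian random projection and representing the hash as the set of surviving indices are my assumptions for illustration, not details from the talk.

```python
import random


def make_projection(in_dim=54, out_dim=2000, seed=0):
    """Random expansion matrix (assumed Gaussian here), one row per output unit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def fly_hash(x, projection, keep_fraction=0.05):
    """Project to the larger space and keep only the most active units."""
    activations = [sum(w * xi for w, xi in zip(row, x)) for row in projection]
    k = max(1, round(len(activations) * keep_fraction))
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    return frozenset(ranked[:k])  # the hash is the set of winning indices


def overlap(hash_a, hash_b):
    """Fraction of winning units shared by two hashes."""
    return len(hash_a & hash_b) / len(hash_a)
```

Two nearby input vectors activate mostly the same top units, so their hashes overlap heavily, while unrelated vectors share only a handful of winners by chance. That overlap is what makes the code usable for nearest neighbor search.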

Keynote: Learning About the Brain: Neuroimaging and Beyond

Mental state recognition using ML: take data from the brain and try to predict something. For neuroscience it is important to focus on interpretable ML. Regarding automatic feature extraction: manual feature engineering is still important in brain sciences. Transform the channels x time matrix of an EEG signal into an image and apply convolutional networks and LSTMs. Using deconvolution gave a bit of that desired interpretability.
Another question they addressed is whether cheap wearable devices measure anything related to mental states. Using MUSE they got 75% accuracy when classifying whether the video a subject is watching is a cat video or a Khan Academy video.
Paths towards computational psychiatry: analyze patient interviews and build story graphs; it was shown that the complexity of the graphs differs between schizophrenia / control / mania cases.

Showing versus Doing: Teaching by Demonstration

When you (a human) demonstrate a behavior you tend to exaggerate in order to highlight the behavior patterns that are important. If your task is just to perform, not to demonstrate, you do it quite differently and focus on optimality. I expected the next logical step, which would be to demonstrate how an agent trained with “showing” performs better than one trained with “doing”, but that did not happen in the talk.

Predictive Coding for Unsupervised Learning by David Cox

Analogy between the visual cortex and CNNs. New, larger visual DNNs match the ventral stream better and better. The talk is about temporal coding and predictive coding in the context of deep learning. Next frame prediction works well on synthetic datasets, natural road images, and the comma.ai steering wheel dataset.

Panel: Demis Hassabis, Terrence Sejnowski, Yoshua Bengio, Jakob Macke, Andreas Tolias

Synergy of neuroscience and machine learning
It would be nice to have more in vivo neuroscience experiments that produce more data for neuroinspired algorithms to borrow ideas from. Brain-like constraints (speed, size, memory efficiency) could be added to algorithmic approaches and that could lead to more brain-like algorithms. On the other hand, if the goal is not to understand the neural code but to build intelligence, we might not need to limit ourselves by mere biological constraints. Maybe it is more beneficial to focus on understanding the learning process vs. trying to understand particular units of computation. Imagination-based planning, that is, model-based learning and generative learning, seems important at the moment.

Neuroscientists criticize the simplicity of ANNs
ML people are having a hard time trying to understand what exactly ANNs are doing, despite the fact that we have access to each particular neuron; imagine how hard it is for neuroscience to figure out what the neurons are doing without such a level of access. Neuroscience could record more area-wide data as opposed to single unit recordings. Dense networks (not hierarchical) can outperform hierarchical ones with a smaller number of neurons (there was a paper on DenseNets). A real synapse can be seen as having a state, and this is a promising area to explore.

What does neuroscience have to do to be useful for ML?
Test the algorithms (especially biologically plausible ones) proposed by ML in in vivo experiments.

Hawkes process

Processes where the intensity of an event depends on the previous events are called Hawkes processes. There are examples of such processes in our world, for example the spiking of neurons. There are models designed to fit such processes.
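As a small illustration of "intensity depends on the previous events", here is the conditional intensity of a Hawkes process with an exponentially decaying kernel. The exponential kernel is one common choice that I am assuming here, and the parameter values are arbitrary.

```python
import math


def hawkes_intensity(t, past_events, mu=0.5, alpha=0.8, beta=1.2):
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)).

    mu is the baseline rate, alpha the jump caused by each event and beta the
    speed with which that excitation decays (all values are illustrative).
    """
    return mu + sum(alpha * math.exp(-beta * (t - t_i))
                    for t_i in past_events if t_i < t)


spikes = [1.0, 1.3, 1.4]
quiet = hawkes_intensity(0.5, spikes)       # before any spike: just the baseline
excited = hawkes_intensity(1.5, spikes)     # right after a burst of spikes
recovered = hawkes_intensity(10.0, spikes)  # long after: back near the baseline
```

Each spike temporarily raises the chance of further spikes, which is why such processes produce bursts.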

Approximating Human Vision with Deep NNs

Rapid image categorization is done well by humans: they conducted a set of experiments using Amazon Mechanical Turk. The same set of images was fed to VGG-16 and they compared how “difficult” (not sure how it was defined, probably the number of errors per category) it is for a human and for the net to classify the same image. Until the 5th layer of the CNN the difficulty correlates; after that the correlation drops, indicating that the DNN is doing something different from what a human is doing. The simpler AlexNet correlates better than the more complex VGG. They also performed attention analysis: CNNs focus on different locations/features than humans.

Learning to Learn

Meta learning, or learning to learn, was a concept that reemerged here and there as a yet-uncertain-but-conceptually-promising direction. There are various ideas under that umbrella, one of which is to formulate the design of an optimization algorithm as a learning problem. That way what is learnt is not only the best parameters that minimize the loss function, but also the parameters of the optimizer that learns those parameters, alleviating the need for manual design of the optimization algorithm. To know more about this and related concepts read about Learning to learn by gradient descent by gradient descent. Another way to look at meta learning is proposed by the Differentiable Neural Computer by Google DeepMind.

Sentient Technologies

A company that uses massive distributed computational resources to evaluate learning algorithms using genetic algorithms. Just curious to know that such a thing exists.

Symposium: Recurrent Neural Networks and Other Machines that Learn Algorithms

Presentations without “deep” in the title

Keynote: Reproducible Research for Multidomain Data: the Case of the Human Microbiome

Lots of data in bioinformatics, lots of experiments, all of which need to be reproducible. They work on that by fixing pipelines, open-sourcing data and code, and checking the results of others. “Give me any study I can always find you a significant p-value”. Transparency in analysis makes it possible to publish non-perfect results and still move science forward.

Award: Matrix Completion has No Spurious Local Minimum

As the title says: they show that for a Netflix-like matrix completion objective function all local minima are global.

Talk: Examples are not enough, learn to criticize! Criticism for Interpretability.

Given a dataset you pick common samples (prototypes) and the most uncommon examples (criticisms) from each category (using Maximum Mean Discrepancy). Those are useful for exploring your data, and applicable when the dataset is huge and the best thing you can normally do is to sample randomly to familiarize yourself with the data (meaning that you’ll never see the weirdos in the data, as random sampling will not show them to you). Examples from ImageNet: prototypes are, for example, usual pictures of dogs; criticisms are pictures of dogs from behind, from weird angles, etc.
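A rough sketch of how Maximum Mean Discrepancy can drive such a selection. It assumes an RBF kernel, a naive greedy search for prototypes and a ranking of criticism candidates by the absolute value of the witness function; the exact procedure from the talk is not reproduced here, and the function names and toy data are mine.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two points given as tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, gamma=1.0):
    """Average kernel value over all pairs from xs and ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(data, prototypes, gamma=1.0):
    """Squared MMD between the dataset and the prototype set."""
    return (mean_kernel(data, data, gamma)
            - 2 * mean_kernel(data, prototypes, gamma)
            + mean_kernel(prototypes, prototypes, gamma))


def pick_prototypes(data, n, gamma=1.0):
    """Greedily add the point that brings the prototype set closest to the data."""
    chosen = []
    for _ in range(n):
        candidates = [x for x in data if x not in chosen]
        best = min(candidates, key=lambda x: mmd_squared(data, chosen + [x], gamma))
        chosen.append(best)
    return chosen


def witness(x, data, prototypes, gamma=1.0):
    """Positive where the data has more mass than the prototypes, negative otherwise."""
    return mean_kernel([x], data, gamma) - mean_kernel([x], prototypes, gamma)


def pick_criticisms(data, prototypes, n, gamma=1.0):
    """Points where data and prototypes disagree the most."""
    return sorted(data, key=lambda x: abs(witness(x, data, prototypes, gamma)),
                  reverse=True)[:n]


# Toy 1-D data: a cluster around 0 and a smaller cluster around 5.
data = [(0.0,), (0.1,), (-0.1,), (0.05,), (5.0,), (5.1,)]
prototypes = pick_prototypes(data, 2)
criticisms = pick_criticisms(data, prototypes, 1)
```

On the toy data the first prototype lands in the big cluster and the second in the small one, since covering both clusters is what brings the prototype set closest to the data.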

Program Synthesis and Machine Learning

Code completion formulated as an ML task: the input is a piece of code and the output is the next line. Unlike other similar tools, it makes better use of the context to propose the completion.

Random notes from Neurorobotics workshop

Papers on human guided learning in robotics: “Strategies for Human-in-the-Loop Robotic Grasping”, “Humanoid Robot Posture-Control Learning in Real-Time Based on Human Sensorimotor Learning Ability”, “Adaptive Control of Exoskeleton Robots for Periodic Assistive Behaviours Based on EMG Feedback Minimisation”
During the panel old-school roboticists were questioning whether all this deep learning end-to-end robotics is just hype or whether it really could replace the old ways. The dominating mood was that it is a promising direction indeed and many are eager to try it out. However, it must be seen as a tool, as it does not provide much insight into the understanding of control.

Posters & Demos

Having demos was a nice initiative: they show the gap between the all-is-so-well you can observe in the papers and reality. Nothing special to report here: a few applications in robotics, a few real-time applications of CNNs; everything almost works, but does not amaze.

There was a huge number of posters; here are a few of personal interest.