J. G. Zilly, R. K. Srivastava, J. Koutník and J. Schmidhuber
International Conference on Machine Learning (ICML), 2017.
Many sequential processing tasks require complex nonlinear transition functions
from one step to the next. However, recurrent neural networks with “deep”
transition functions remain difficult to train, even when using Long Short-Term
Memory (LSTM) networks. We introduce a novel theoretical analysis of recurrent
networks based on Geršgorin's circle theorem that illuminates several modeling
and optimization issues and improves our understanding of the LSTM cell. Based
on this analysis we propose Recurrent Highway Networks, which extend the LSTM
architecture to allow step-to-step transition depths larger than one. Several
language modeling experiments demonstrate that the proposed architecture results
in powerful and efficient models. On the Penn Treebank corpus, solely increasing
the transition depth from 1 to 10 improves word-level perplexity from 90.6 to
65.4 using the same number of parameters. On the larger Wikipedia datasets for
character prediction (text8 and enwik8), RHNs outperform all previous results
and achieve an entropy of 1.27 bits per character.
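The recurrent transition described above can be sketched in NumPy. This is an illustrative reading of the architecture, not the authors' code: the weight names, shapes, and the coupled carry gate (c = 1 - t) used here are assumptions of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rhn_step(x, s_prev, Wh, Wt, Rh, Rt, depth):
    """One Recurrent Highway Network time step with a coupled carry gate.

    x: input vector; s_prev: recurrent state from the previous time step.
    Wh, Wt: input projections (applied only at the first micro-layer).
    Rh, Rt: per-micro-layer recurrent weight matrices, one pair per depth.
    """
    s = s_prev
    for l in range(depth):
        # The external input enters only the first micro-layer of the transition.
        in_h = Wh @ x if l == 0 else 0.0
        in_t = Wt @ x if l == 0 else 0.0
        h = np.tanh(in_h + Rh[l] @ s)   # candidate update
        t = sigmoid(in_t + Rt[l] @ s)   # transform gate
        s = h * t + s * (1.0 - t)       # highway combination
    return s

rng = np.random.default_rng(0)
n, depth = 4, 3
x, s0 = rng.standard_normal(n), rng.standard_normal(n)
Wh, Wt = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Rh = rng.standard_normal((depth, n, n)) * 0.1
Rt = rng.standard_normal((depth, n, n)) * 0.1
s1 = rhn_step(x, s0, Wh, Wt, Rh, Rt, depth)
```

Each of the `depth` micro-layers refines the state within a single time step; this per-step depth is the quantity varied from 1 to 10 in the Penn Treebank experiment.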
F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, M. M. Bronstein
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Deep learning has achieved a remarkable performance breakthrough in several
fields, most notably in speech recognition, natural language processing, and
computer vision. In particular, convolutional neural network (CNN) architectures
currently produce state-of-the-art performance on a variety of image analysis
tasks such as object detection and recognition. Most deep learning research
has so far focused on dealing with 1D, 2D, or 3D Euclidean-structured data such
as acoustic signals, images, or videos. Recently, there has been an increasing
interest in geometric deep learning, attempting to generalize deep learning
methods to non-Euclidean structured data such as graphs and manifolds, with a
variety of applications from the domains of network analysis, computational
social science, and computer graphics. In this paper, we propose a unified
framework that generalizes CNN architectures to non-Euclidean domains
(graphs and manifolds) and learns local, stationary, and compositional
task-specific features. We show that various non-Euclidean CNN methods
previously proposed in the literature can be considered as particular instances
of our framework. We test the proposed method on standard tasks from the realms
of image-, graph- and 3D shape analysis and show that it consistently
outperforms previous approaches.
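The unified framework can be illustrated with a minimal mixture-model convolution at a single graph node: each learned Gaussian kernel, defined over pseudo-coordinates of the neighbours, produces a weighted aggregate that a filter coefficient then combines. The function and variable names below are illustrative assumptions, and scalar node features are used for brevity.

```python
import numpy as np

def monet_conv(f, pseudo, mu, sigma, g):
    """Mixture-model convolution at one node (a sketch of the framework).

    f:      (k,) scalar features of the node's k neighbours
    pseudo: (k, p) pseudo-coordinates u(x, y) for each neighbour
    mu:     (J, p) learned Gaussian means; sigma: (J, p) learned std devs
    g:      (J,) filter coefficients combining the J kernel responses
    """
    patches = []
    for j in range(len(mu)):
        # Gaussian weight of each neighbour under kernel j
        w = np.exp(-0.5 * np.sum(((pseudo - mu[j]) / sigma[j]) ** 2, axis=1))
        patches.append(w @ f)   # kernel-weighted neighbour aggregation
    return float(np.dot(g, patches))

rng = np.random.default_rng(1)
f = rng.standard_normal(5)
pseudo = rng.standard_normal((5, 2))
mu, sigma = rng.standard_normal((3, 2)), np.ones((3, 2))
g = rng.standard_normal(3)
y = monet_conv(f, pseudo, mu, sigma, g)
```

Choosing the pseudo-coordinates (e.g. vertex degrees on graphs, local polar coordinates on meshes) is what specializes the framework to a particular non-Euclidean domain.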
W. Jaśkowski, O. R. Lykkebø, N. E. Toklu, F. Trifterer, Z. Buk, J. Koutník and F. Gomez
The NIPS '17 Competition: Building Intelligent Systems (First Place), 2017.
This paper describes the approach taken by the NNAISENSE Intelligent Automation
team to win the NIPS ’17 “Learning to Run” challenge involving a biomechanically
realistic model of the human lower musculoskeletal system.
M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez
Neural Information Processing Systems (NeurIPS), 2018.
This paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net), a
very deep architecture where each stacked processing block is derived from a
time-invariant non-autonomous dynamical system. Non-autonomy is implemented by
skip connections from the block input to each of the unrolled processing stages
and allows stability to be enforced so that blocks can be unrolled adaptively to
a pattern-dependent processing depth. NAIS-Net induces non-trivial, Lipschitz
input-output maps, even for an infinite unroll length. We prove that the network
is globally asymptotically stable so that for every initial condition there is
exactly one input-dependent equilibrium assuming tanh units, and multiple stable
equilibria for ReLU units. An efficient implementation that enforces the
stability under derived conditions for both fully-connected and convolutional
layers is also presented. Experimental results show how NAIS-Net exhibits
stability in practice, yielding a significant reduction in generalization gap
compared to ResNets.
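A minimal sketch of one such block, assuming a tanh fully-connected stage and a crude shift of `A` to make its symmetric part negative definite (the paper derives precise stability conditions; every name and constant below is illustrative):

```python
import numpy as np

def nais_block(u, A, B, b, h=0.1, tol=1e-6, max_iter=10_000):
    """Unroll one non-autonomous residual block until the state settles.

    Every stage re-injects the block input u (the skip connection that makes
    the system non-autonomous): x <- x + h * tanh(A x + B u + b).
    Returns the final state and the number of stages used, so the unroll
    depth is pattern-dependent.
    """
    x = np.zeros(A.shape[0])
    for k in range(max_iter):
        x_new = x + h * np.tanh(A @ x + B @ u + b)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
# Crude stabilisation (an assumption, not the paper's condition): shift the
# diagonal so the symmetric part of A is negative definite.
shift = np.max(np.real(np.linalg.eigvals((A + A.T) / 2))) + 0.5
A = A - shift * np.eye(n)
B = rng.standard_normal((n, n))
u = rng.standard_normal(n)
x_star, depth = nais_block(u, A, B, np.zeros(n))
```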
D. Ha and J. Schmidhuber
Neural Information Processing Systems (NeurIPS), 2018.
A generative recurrent neural network is quickly trained in an unsupervised
manner to model popular reinforcement learning environments through compressed
spatio-temporal representations. The world model's extracted features are fed
into compact and simple policies trained by evolution, achieving
state-of-the-art results in various environments. We also train our agent entirely inside of
an environment generated by its own internal world model, and transfer this
policy back into the actual environment. An interactive version of this paper is
available at https://worldmodels.github.io
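The described pipeline (compress the observation, update the recurrent world model, act with a compact policy) can be sketched with linear stand-ins for the three learned components. All weights, dimensions, and names below are illustrative, not the trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, h_dim, a_dim, obs_dim = 8, 16, 2, 32

# Stand-ins for the three learned components: V (vision) compresses an
# observation to a latent z, M (memory) is a recurrent world model, and
# C (controller) is a small policy reading [z, h].
W_enc = rng.standard_normal((z_dim, obs_dim)) * 0.1
W_mem = rng.standard_normal((h_dim, h_dim + z_dim + a_dim)) * 0.1
W_ctl = rng.standard_normal((a_dim, z_dim + h_dim)) * 0.1

def step(obs, h, a):
    z = np.tanh(W_enc @ obs)                          # compressed representation
    h = np.tanh(W_mem @ np.concatenate([h, z, a]))    # world-model state update
    a = np.tanh(W_ctl @ np.concatenate([z, h]))       # compact policy
    return h, a

h, a = np.zeros(h_dim), np.zeros(a_dim)
for _ in range(5):
    obs = rng.standard_normal(obs_dim)
    h, a = step(obs, h, a)
```

Because the policy only sees the compact `[z, h]` summary, it stays small enough to be trained by evolution, and the same loop can be run against the world model's own predictions instead of real observations.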
F. Lattari, M. Ciccone, M. Matteucci, J. Masci, and F. Visin
2018 DAVIS Challenge on Video Object Segmentation - IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
We introduce ReConvNet, a recurrent convolutional architecture for
semi-supervised video object segmentation that can quickly adapt its
features to focus on any specific object of interest at inference time.
Generalization to new objects never observed during training is known to be a
hard task for supervised approaches that would need to be retrained. To tackle
this problem, we propose a more efficient solution that learns spatio-temporal
features self-adapting to the object of interest via conditional affine
transformations. This approach is simple, can be trained end-to-end and does not
necessarily require extra training steps at inference time. Our method shows
results on DAVIS2016 competitive with state-of-the-art approaches
that use online fine-tuning, and outperforms them on DAVIS2017. ReConvNet also
shows promising results on the 2018 DAVIS Challenge, placing 10th.
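The conditional affine transformations can be sketched as FiLM-style per-channel modulation of the feature maps by a conditioning vector derived from the object of interest. The names and shapes below are illustrative assumptions:

```python
import numpy as np

def conditional_affine(features, cond, W_gamma, W_beta):
    """FiLM-style conditional affine modulation (a sketch of the idea).

    features: (C, H, W) feature maps from the segmentation backbone
    cond:     conditioning vector computed from the object of interest
              (e.g. from the annotated first frame)
    """
    gamma = W_gamma @ cond   # per-channel scale
    beta = W_beta @ cond     # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, d = 3, 4, 4, 6
feats = rng.standard_normal((C, H, W))
cond = rng.standard_normal(d)
out = conditional_affine(feats, cond,
                         rng.standard_normal((C, d)),
                         rng.standard_normal((C, d)))
```

Because only the affine parameters depend on the target object, the backbone weights stay fixed at inference time, which is what avoids per-object fine-tuning.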
K. Irie, R. Schlüter, H. Ney
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both for
convergence and final performance. In some experiments, we also use an auxiliary
CTC loss function to help convergence. In addition, we train long short-term
memory (LSTM) language models on subword units. By shallow fusion, we report up
to 27% relative improvements in WER over the attention baseline without a
language model.
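Shallow fusion itself is a simple log-linear interpolation of the attention model's score and the language model's score at decoding time. A toy example over an assumed 4-token subword vocabulary (the interpolation weight 0.3 is illustrative; in practice it is tuned on development data):

```python
import numpy as np

def shallow_fusion(log_p_am, log_p_lm, lam):
    """Combine attention-model and LM log-probabilities during beam search."""
    return log_p_am + lam * log_p_lm

# Toy distributions over 4 subword units: the LM score can flip the decision.
log_p_am = np.log(np.array([0.40, 0.35, 0.15, 0.10]))
log_p_lm = np.log(np.array([0.05, 0.60, 0.30, 0.05]))
best_without_lm = int(np.argmax(log_p_am))                             # token 0
best_with_lm = int(np.argmax(shallow_fusion(log_p_am, log_p_lm, 0.3)))  # token 1
```

Here the subword LM shifts the decision toward token 1, illustrating how fusion with an external LM can correct the acoustic model's top hypothesis.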
J. Svoboda, J. Masci, F. Monti, M. M. Bronstein, and L. Guibas
International Conference on Learning Representations (ICLR), 2018.
Deep learning systems have become ubiquitous in many aspects of our lives.
Unfortunately, it has been shown that such systems are vulnerable to adversarial
attacks, making them prone to potential unlawful uses. Designing deep neural
networks that are robust to adversarial attacks is a fundamental step in making
such systems safer and deployable in a broader variety of applications (e.g.
autonomous driving), but more importantly is a necessary step to design novel
and more advanced architectures built on new computational paradigms rather than
marginally building on the existing ones. In this paper we introduce PeerNets, a
novel family of convolutional networks alternating classical Euclidean
convolutions with graph convolutions to harness information from a graph of peer
samples. This results in a form of non-local forward propagation in the model,
where latent features are conditioned on the global structure induced by the
graph. The resulting model is up to 3 times more robust to a variety of white-
and black-box adversarial attacks than conventional architectures, with almost
no drop in accuracy.
Q. Wang, R. K. Srivastava, and P. Koumoutsakos
European Conference on Computer Vision (ECCV), 2018.
Video prediction models based on convolutional networks, recurrent networks, and
their combinations often result in blurry predictions. We identify an
important contributing factor for imprecise predictions that has not been
studied adequately in the literature: blind spots, i.e., lack of access to all
relevant past information for accurately predicting the future. To address this
issue, we introduce a fully context-aware architecture that captures the entire
available past context for each pixel using Parallel Multi-Dimensional LSTM
units and aggregates it using blending units. Our model outperforms a strong
baseline network of 20 recurrent convolutional layers and yields
state-of-the-art performance for next step prediction on three challenging
real-world video datasets: Human 3.6M, Caltech Pedestrian, and UCF-101.
Moreover, it does so with fewer parameters than several recently proposed
models, and does not rely on deep convolutional networks, multi-scale
architectures, separation of background and foreground modeling, motion flow
learning, or adversarial training. These results highlight that full awareness
of past context is of crucial importance for video prediction.
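The blending step can be sketched as a learned mixing of the per-direction context features produced by the Parallel Multi-Dimensional LSTM units. The weight shapes and the tanh nonlinearity below are assumptions of the sketch:

```python
import numpy as np

def blend(direction_outputs, W):
    """Blending unit: aggregate per-direction context into one feature vector.

    direction_outputs: list of D feature vectors, one per processing
                       direction (so each pixel sees its full past context)
    W: (C_out, D * C) learned mixing weights (name is illustrative)
    """
    stacked = np.concatenate(direction_outputs)   # gather all directions
    return np.tanh(W @ stacked)

rng = np.random.default_rng(0)
D, C, C_out = 4, 8, 8
outs = [rng.standard_normal(C) for _ in range(D)]
blended = blend(outs, rng.standard_normal((C_out, D * C)) * 0.1)
```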
P. Shyam, W. Jaśkowski, and F. Gomez
International Conference on Machine Learning (ICML), 2019.
Efficient exploration is an unsolved problem in Reinforcement Learning, which is
usually addressed by reactively rewarding the agent for fortuitously
encountering novel situations. This paper introduces an efficient active
exploration algorithm, Model-Based Active eXploration (MAX), which uses an
ensemble of forward models to plan to observe novel events. This is carried out
by optimizing agent behaviour with respect to a measure of novelty derived from
the Bayesian perspective of exploration, which is estimated using the
disagreement between the futures predicted by the ensemble members. We show
empirically that in semi-random discrete environments where directed exploration
is critical to make progress, MAX is at least an order of magnitude more
efficient than strong baselines. MAX scales to high-dimensional continuous
environments where it builds task-agnostic models that can be used for any
downstream task.