Keynote Speakers

Cordelia Schmid

INRIA Rhône-Alpes, Grenoble

Roberto Cipolla

University of Cambridge

Heiga Zen

Google

Structured models for human action recognition

In this talk, we present recent results on human action recognition in videos. We first introduce a pose-based convolutional neural network (CNN) descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts; we also present an approach for extracting such human poses in 2D and 3D. Next, we propose an approach for spatio-temporal action localization that detects and scores CNN action proposals at both the frame and the tubelet level, then tracks high-scoring proposals through the video; actions are localized in time with an LSTM operating at the track level. Finally, we show how to extend this type of method to weakly supervised learning of actions, which allows it to scale to large amounts of data without manual annotation.
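The linking step in the abstract above, building spatio-temporal tracks out of per-frame action proposals, can be sketched as greedy IoU-based linking. This is an illustrative toy version, not the authors' implementation; all names and thresholds here are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def link_tracks(frames, iou_thr=0.5):
    """Greedily link per-frame (box, score) proposals into tracks:
    each detection extends the open track whose last box overlaps it
    most (above a threshold), otherwise it starts a new track."""
    tracks = []  # each: {"boxes": [...], "scores": [...], "end": frame idx}
    for t, dets in enumerate(frames):
        for box, score in dets:
            cands = [tr for tr in tracks
                     if tr["end"] == t - 1
                     and iou(tr["boxes"][-1], box) >= iou_thr]
            if cands:
                tr = max(cands, key=lambda tr: iou(tr["boxes"][-1], box))
                tr["boxes"].append(box)
                tr["scores"].append(score)
                tr["end"] = t
            else:
                tracks.append({"boxes": [box], "scores": [score], "end": t})
    return tracks
```

A full system would additionally score tracks (e.g., by mean proposal score) and trim them in time, which is where the track-level LSTM mentioned in the abstract comes in.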

Geometry, Uncertainty and Deep Learning

My presentation will look at state-of-the-art techniques for deep learning, with a focus on representations of geometry and uncertainty. Understanding what a model does not know is a critical part of safe machine learning systems. New tools, such as Bayesian deep learning, provide a framework for understanding uncertainty in deep learning models, aiding interpretability and safety of such systems. Additionally, knowledge of geometry is an important consideration in designing effective algorithms. In particular, we will explore the use of geometry to help design networks that can be trained with unlabelled data for stereo and for human body pose and shape recovery.
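One widely used tool from the Bayesian deep learning family mentioned above is Monte Carlo dropout: keeping dropout active at test time and treating the spread of repeated stochastic forward passes as a proxy for model uncertainty. The toy linear layer below is only a sketch of the idea, not any model from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W, b, n_samples=200, p_drop=0.5):
    """Monte Carlo dropout on a toy linear layer: run many stochastic
    forward passes with (inverted) dropout left on, then report the
    sample mean as the prediction and the sample std as uncertainty."""
    preds = []
    for _ in range(n_samples):
        mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)
        preds.append((x * mask) @ W + b)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

Inputs the model has rarely seen tend to produce a larger predictive spread, which is the "knowing what the model does not know" signal referred to in the abstract.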

Generative Model-Based Text-to-Speech Synthesis

Recent progress in deep generative models and their application to text-to-speech (TTS) synthesis has led to a breakthrough in the naturalness of artificially generated speech. This talk first details the generative model-based TTS approach, from its probabilistic formulation to actual implementation, including statistical parametric speech synthesis, then discusses recent deep generative model-based approaches such as WaveNet, Char2Wav, and Tacotron from this perspective. Possible future research topics are also discussed.
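One concrete ingredient of WaveNet-style waveform models is mu-law companding, which quantizes audio samples in [-1, 1] into 256 discrete levels so the model can predict them with a softmax, allocating more levels to small amplitudes. A minimal sketch:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Mu-law companding: map waveform samples in [-1, 1] to the
    integers 0..mu, with finer resolution near zero amplitude."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int32)

def mu_law_decode(q, mu=255):
    """Inverse companding: integer levels back to waveform samples."""
    y = 2.0 * (q.astype(np.float64) / mu) - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```

The round trip encode-then-decode introduces only a small, perceptually weighted quantization error, which is why 8-bit levels suffice for natural-sounding output.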

Deep learning for autonomous driving

Recent advancements in deep learning and GPU-accelerated computing have resulted in remarkable progress on the autonomous driving problem. This talk gives a brief overview of how deep learning, which demands huge computational resources, is used to solve multiple aspects of self-driving cars. It provides insight into recent hardware advancements from NVIDIA, such as the Volta GPU architecture with ‘Tensor Cores’ specialized for deep learning. NVIDIA’s DriveNet and PilotNet networks, used for perception and control in self-driving cars, will be introduced. Performance optimization techniques developed at NVIDIA, such as converting FP32 models to 8-bit integer (INT8) models using TensorRT and pruning convolutional kernels in neural networks to enable efficient inference, will also be discussed.
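The FP32-to-INT8 conversion mentioned above can be illustrated with symmetric max-abs quantization, one common post-training scheme: each tensor is stored as INT8 values plus a single FP32 scale. This is a simplified sketch of the general idea; TensorRT's actual calibration (e.g., entropy-based range selection) is more sophisticated.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric max-abs quantization: represent an FP32 tensor as
    INT8 values plus one FP32 scale factor (assumes w is not all zero)."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale
```

The INT8 representation quarters memory traffic versus FP32 and lets inference run on integer units, at the cost of a bounded per-element quantization error of at most half the scale.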