Vision, Language and Action: from Captioning to Embodied AI

Tutorial at ICIAP 2019

Monday 09:00-11:00
Sala 2


Abstract

Recent progress in the Computer Vision and Natural Language Processing communities has made it possible to connect Vision, Language and Action, achieving significant advancements in a variety of tasks lying at their intersection. These tasks range from generating meaningful descriptions of images, to answering questions, to navigating agents through unseen environments via natural language instructions. This tutorial will give a comprehensive guide through these advancements, including state-of-the-art techniques for image and video captioning (Recurrent Neural Networks, attention mechanisms, the Transformer paradigm, training with Reinforcement Learning) and for cross-modal retrieval. It will then discuss how these approaches can be applied to embodied agents that interact with the physical world, for navigation and other embodied tasks.


Program at a glance

  • Introduction: Vision and Language, Embodied AI  [PDF]
  • Describing images  [PDF]
    • Structure of a captioning system
    • Optimizing for metrics: Scheduled Sampling, Reinforcement Learning, Self-critical Sequence Training
    • Visual encoding for sequences and sets, attentive mechanisms, the visual sentinel
    • Convolutional and Transformer-based language models
    • State-of-the-art approaches and new challenges
  • Captioning applications: controlling captioning with external constraints  [PDF]
  • Cross-modal retrieval  [PDF]
  • Embodied AI and Vision-and-Language Navigation  [PDF]
    • Datasets and Simulators overview
    • Metrics and evaluation challenges
    • State-of-the-art algorithms
    • Open challenges

Presenters

Lorenzo Baraldi


Marcella Cornia


Massimiliano Corsini
