Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center Slides @DocXavi Tutorial: One Perceptron to Rule Them All Part III: Language & Vision
2 Acknowledgments Mariona Carós Benet Oriol Amaia Salvador Santiago Pascual Marta R. Costa-jussà Francisco Roldan Issey Masuda Ionut Sorodoc Carina Silberer Gemma Boleda Carles Ventura Ioannis Kazakos Míriam Bellver Alba M. Herrera Amanda Duarte
3
4 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
5 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
6 Encoder Decoder Representation
7 #ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015. Image Captioning with RNN
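Show and Tell follows an encoder-decoder pattern: a CNN encodes the image into a feature vector that conditions an LSTM, which then emits the caption word by word. A minimal PyTorch sketch of that pattern; the backbone, dimensions and vocabulary size are illustrative assumptions, not the paper's configuration:

import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                                    # any CNN encoder works here
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # keep features, drop classifier
        self.img_proj = nn.Linear(512, embed_dim)                  # map image feature to word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)                    # (B, 512) image vector
        img_token = self.img_proj(feats).unsqueeze(1)              # fed as the first "word"
        inputs = torch.cat([img_token, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                    # logits over the vocabulary

Training would minimize cross-entropy between these logits and the caption tokens shifted by one position.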
8 #DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (Slides by Marc Bolaños) Image Captioning with RNN
9 Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015 Image Captioning with RNN & Attention
10 Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015 Image Captioning with RNN & Attention
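Show, Attend and Tell adds soft attention: at every decoding step the LSTM looks at a grid of CNN features and re-weights them before predicting the next word. A hedged sketch of that attention step (feature-map size and layer widths are assumptions, not the published setup):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_att = nn.Linear(feat_dim, attn_dim)
        self.hid_att = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) grid of image regions, hidden: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.feat_att(feats) +
                                  self.hid_att(hidden).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(e, dim=1)                      # attention weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(1)   # weighted image context for this step
        return context, alpha                            # context feeds the LSTM decoding step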
11 Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image Captioning." CVPR 2020. [tweet] Image Captioning with Transformers
12 Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
13 XAVI: "man has short hair", "man with short hair". AMAIA: "a woman wearing a black shirt". BOTH: "two men wearing black glasses". Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
14 Recipe Generation Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
15 Recipe Generation Title: Edamame corn salad Ingredients pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil Instructions - In a large bowl, combine edamame, corn, red onion, cilantro, avocado, and red bell pepper. - In a small bowl, whisk together olive oil, vinegar, salt, and pepper. - Pour dressing over edamame mixture and toss to coat. - Cover and refrigerate for at least 1 hour before serving. Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
16 #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018. Fighting Data Bias in Captioning
17 #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018. Fighting Data Bias in Captioning
18 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code Video Captioning
19 (Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, CVPR 2016. [Figure labels: LSTM unit (2nd layer); time t = 1 … t = T; hidden state at t = T of the first chunk of data] Captioning: Video
20 Multimodal Machine Translation WMT17 Shared Task on Multimodal Machine Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1
21 Multimodal Machine Translation Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. "Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]
22 Sign Language Translation with RNN+Att Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
23 Sign Language Translation with Transformers Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation” CVPR 2020.
24 Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
25 Lip Reading Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
26 Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
27 Lipreading: Watch, Listen, Attend & Spell Audio features Image features Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
28 Lipreading: Watch, Listen, Attend & Spell Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017 Attention over output states from audio and video is computed at each timestep
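A sketch of the dual-attention idea stated above: at each output step the character decoder attends separately over the audio encoder states and the video encoder states, and the two context vectors are concatenated before predicting the next character. Module names and sizes are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=512):
        super().__init__()
        self.audio_att = nn.Linear(enc_dim + dec_dim, 1)
        self.video_att = nn.Linear(enc_dim + dec_dim, 1)

    def attend(self, scorer, states, dec_h):
        # states: (B, T, enc_dim) encoder outputs, dec_h: (B, dec_dim) decoder state
        q = dec_h.unsqueeze(1).expand(-1, states.size(1), -1)
        alpha = F.softmax(scorer(torch.cat([states, q], dim=-1)).squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * states).sum(1)     # context vector for one modality

    def forward(self, audio_states, video_states, dec_h):
        ctx_a = self.attend(self.audio_att, audio_states, dec_h)
        ctx_v = self.attend(self.video_att, video_states, dec_h)
        return torch.cat([ctx_a, ctx_v], dim=-1)         # fed to the character decoder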
29 Lipreading Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online application." Interspeech 2018.
30 Image Captioning Grounded on Detected Objects Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
31Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code] Image Captioning Grounded on Detected Objects
32Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Image Captioning Grounded on Heatmaps
33 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
34 Encoder Decoder Representation
35 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. Image Generation
36 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Generation
37 Image Synthesis #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
38 Image Synthesis with Cycle Consistency #MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by redescription." CVPR 2019. [code]
39 Image Synthesis with Cycle Consistency #MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by redescription." CVPR 2019. [code]
40Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Generation via Scene Graphs
41 #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
42 #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018 Video Generation by Composition
43 Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production. ECCV 2020. Sign Language Generation with Transformers
44 Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, "Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses". ECCV SLRTP Workshop 2020. Sign Language Generation (pose 2 pixels)
45 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
46 Encoder Decoder Representation Encoder Representation
47 Visual Question Answering Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." ICCV 2015.
48 Visual Question Answering (VQA) Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
49 Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet) Visual Question Answering (VQA)
50 VQA: Dynamic Memory Networks (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
51 Visual Reasoning #Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
52 Visual Reasoning: Programming (Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick, "Inferring and Executing Programs for Visual Reasoning". ICCV 2017 Program Generator Execution Engine
53 Visual Reasoning: Relation Networks #RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017. Relation Networks concatenate all possible pairs of objects with an encoded question, then predict the answer with an MLP.
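A minimal sketch of that recipe: form every pair of object features, append the question encoding, score each pair with a shared MLP g, sum the pair representations, and classify the sum with a second MLP f. Layer widths and the number of answers are illustrative assumptions:

import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=256, q_dim=128, hidden=256, n_answers=28):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_answers))

    def forward(self, objects, question):
        # objects: (B, N, obj_dim) object features, question: (B, q_dim) question encoding
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)        # object i of each pair
        o_j = objects.unsqueeze(1).expand(B, N, N, D)        # object j of each pair
        q = question.unsqueeze(1).unsqueeze(1).expand(B, N, N, -1)
        pairs = torch.cat([o_i, o_j, q], dim=-1)             # all N*N (object, object, question) triples
        relations = self.g(pairs).sum(dim=(1, 2))            # aggregate the pairwise relations
        return self.f(relations)                             # answer logits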
54 Visual Dialog Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017 [Project]
55 Visual Dialog Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for Dementia." ICMR 2020. [talk] Demo @ ICMR 2020 (Wednesday 11:00am)
56 Visual Dialog Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for Dementia." ICMR 2020. [talk]
57 Hate Speech Detection in Memes Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code] Hate Speech Detection
58 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
59 Encoder Decoder Representation Encoder Representation
60 Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019). Objects from Referring Expressions
61 Video Objects from Referring Expressions Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language specification." CVPR 2017. [code]
62 Video Object Detection with Transformers Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.
63 #Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular attention network for referring expression comprehension." CVPR 2018. [code] Segments from Referring Expressions
64 Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV 2018. Video Objects from Referring Expressions
65 Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia Workshops 2019. Video Objects from Referring Expressions
66 #RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint arXiv:2010.00263 (2020). Video Objects from Referring Expressions
67 #RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint arXiv:2010.00263 (2020). Video Objects from Referring Expressions
68 #SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation of Synthetic Referring Expressions for Object Segmentation” (submitted) Synthetic Expressions w/ Scene Graphs
69 #SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation of Synthetic Referring Expressions for Object Segmentation” (submitted)
Segments from Questions Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.
71 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
72 Encoder Encoder Representation
73 Joint Representations (Embeddings) #DeViSE Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013
74 Zero-shot learning Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code] No images from “cat” in the training set... ...but they can still be recognised as “cats” thanks to the representations learned from text .
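The mechanism can be sketched in a few lines: a learned linear map projects image features into a pretrained word-embedding space, and the predicted class is the nearest class-name embedding, which also covers classes that never appeared in the image training set. Everything below (dimensions, the toy word vectors) is a placeholder for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

word_vectors = {                                    # stand-ins for pretrained text embeddings (dim 300)
    "dog": torch.randn(300), "truck": torch.randn(300), "cat": torch.randn(300),
}
class_names = list(word_vectors)
class_matrix = torch.stack([word_vectors[c] for c in class_names])   # (C, 300)

image_to_text = nn.Linear(2048, 300)                # learned map from image features to text space

def zero_shot_predict(image_features):              # image_features: (B, 2048)
    projected = F.normalize(image_to_text(image_features), dim=-1)
    classes = F.normalize(class_matrix, dim=-1)
    sims = projected @ classes.t()                   # cosine similarity to every class-name embedding
    return [class_names[int(i)] for i in sims.argmax(dim=1)]

print(zero_shot_predict(torch.randn(2, 2048)))       # "cat" can be predicted without cat training images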
75 Multimodal Retrieval Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural language models." NeurIPS 2014 Deep Learning Workshop.
76 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
77 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
78 Image and text retrieval with joint embeddings. Joint Neural Embeddings #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
79 #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video] Joint Neural Embeddings
80 Joint Neural Embeddings [Figure labels: joint embedding; LSTM; Bidirectional LSTM] #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
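A minimal sketch of the joint-embedding objective: an image head and a recipe head project their already-encoded inputs into a shared space, and a cosine loss pulls matching pairs together. The encoders are abstracted away and the dimensions are assumptions; the paper additionally uses a semantic regularization loss:

import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1024, joint_dim=1024):
        super().__init__()
        self.img_head = nn.Linear(img_dim, joint_dim)    # on top of a CNN image feature
        self.txt_head = nn.Linear(txt_dim, joint_dim)    # on top of LSTM recipe encoders

    def forward(self, img_feat, recipe_feat):
        img = nn.functional.normalize(self.img_head(img_feat), dim=-1)
        txt = nn.functional.normalize(self.txt_head(recipe_feat), dim=-1)
        return img, txt

model = JointEmbedding()
img, txt = model(torch.randn(8, 2048), torch.randn(8, 1024))
target = torch.ones(8)                                   # +1 for matching pairs, -1 for mismatches
loss = nn.CosineEmbeddingLoss()(img, txt, target)        # pulls matching image/recipe pairs together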
81 Representations Sariyildiz, Mert Bulent, Julien Perez, and Diane Larlus. "Learning Visual Representations with Caption Annotations." ECCV 2020. [tweet]
82 Representations #ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo] Visual Task: Predict the visual categories for the masked image regions. Language Task: Predict the masked word (same as in language-only BERT).
83 Representations #ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo] Multimodal Task: Predict whether the image and the caption correspond.
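The three pretraining objectives from the two slides above can be sketched as simple heads on top of already-computed token and region representations; the real ViLBERT computes those representations with two co-attentional transformer streams, and all names and sizes here are illustrative assumptions:

import torch
import torch.nn as nn

class PretrainingHeads(nn.Module):
    def __init__(self, dim=768, vocab_size=30522, n_visual_classes=1601):
        super().__init__()
        self.masked_word = nn.Linear(dim, vocab_size)           # language task: masked word prediction
        self.masked_region = nn.Linear(dim, n_visual_classes)   # visual task: masked region classification
        self.alignment = nn.Linear(2 * dim, 2)                  # multimodal task: does the pair correspond?

    def forward(self, token_states, region_states, pooled_text, pooled_image):
        word_logits = self.masked_word(token_states)             # (B, T_text, vocab_size)
        region_logits = self.masked_region(region_states)        # (B, T_regions, n_visual_classes)
        align_logits = self.alignment(torch.cat([pooled_text, pooled_image], dim=-1))
        return word_logits, region_logits, align_logits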
84 Representations #VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video and language representation learning." ICCV 2019.
85 Representations #VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video and language representation learning." ICCV 2019. Rich representations can be used to retrieve matching video frames, which are encoded after vector quantization.
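The vector-quantization step mentioned above can be sketched as nearest-centroid assignment: each clip feature is replaced by the id of its closest codebook vector, so the video becomes a sequence of discrete "visual words" that a BERT-style model can consume and that can be matched against text for retrieval. The codebook here is a random placeholder:

import torch

codebook = torch.randn(1024, 512)                  # assumed k=1024 centroids of dimension 512

def quantize(clip_features):                       # clip_features: (T, 512), one vector per clip
    dists = torch.cdist(clip_features, codebook)   # distance from each clip to every centroid
    return dists.argmin(dim=1)                     # one visual token id per clip

tokens = quantize(torch.randn(30, 512))            # e.g. 30 clips -> 30 visual token ids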
86 Representations #VirTEX Karan Desai, Justin Johnson, “VirTex: Learning Visual Representations from Textual Annotations” arXiv 2020 [tweet]
87 Learning Language from Video Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in Instructional Videos." CVPR 2020.
88 Learning Language from Video Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV 2020.
89 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
90 Platforms for Embodied AI #Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et al. "Habitat: A platform for embodied AI research." ICCV 2019. [site]
91 Navigation Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." NeurIPS 2018.
92 Navigation #R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]
93 Navigation #RxR Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge, “Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.
94 Navigation Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACMMM Workshops 2019.
95 Object manipulation Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of systematicity and generalization in a situated agent. ICLR 2020. [talk]
96 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
97 My take-home message 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Feature Learning 4. Control Tasks
Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center Was this tutorial helpful? Please consider citing: Go raibh maith agat / Thank you Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
