Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center Slides @DocXavi Tutorial: One Perceptron to Rule Them All Part III: Language & Vision
2 Acknowledgments Mariona Carós Benet Oriol Amaia Salvador Santiago Pascual Marta R. Costa-jussà Francisco Roldan Issey Masuda Ionut Sorodoc Carina Silberer Gemma Boleda Carles Ventura Ioannis Kazakos Míriam Bellver Alba M. Herrera Amanda Duarte
3
4 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
5 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
6 Encoder Decoder Representation
7 #ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015. Image Captioning with RNN
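Show and Tell follows an encoder-decoder pattern: a CNN encodes the image into a feature vector that conditions an LSTM, which then emits the caption word by word. A minimal PyTorch sketch of that pattern; the backbone, dimensions and vocabulary size are illustrative assumptions, not the paper's configuration:

import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                                    # any CNN encoder works here
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # keep features, drop classifier
        self.img_proj = nn.Linear(512, embed_dim)                  # map image feature to word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)                    # (B, 512) image vector
        img_token = self.img_proj(feats).unsqueeze(1)              # fed as the first "word"
        inputs = torch.cat([img_token, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                    # logits over the vocabulary

Training would minimize cross-entropy between these logits and the caption tokens shifted by one position.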
8 #DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (Slides by Marc Bolaños) Image Captioning with RNN
9 Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015 Image Captioning with RNN & Attention
10 Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015 Image Captioning with RNN & Attention
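Show, Attend and Tell adds soft attention: at every decoding step the LSTM looks at a grid of CNN features and re-weights them before predicting the next word. A hedged sketch of that attention step (feature-map size and layer widths are assumptions, not the published setup):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_att = nn.Linear(feat_dim, attn_dim)
        self.hid_att = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) grid of image regions, hidden: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.feat_att(feats) +
                                  self.hid_att(hidden).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(e, dim=1)                      # attention weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(1)   # weighted image context for this step
        return context, alpha                            # context feeds the LSTM decoding step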
11 Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image Captioning." CVPR 2020. [tweet] Image Captioning with Transformers
12 Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
13 XAVI: "man has short hair", "man with short hair". AMAIA: "a woman wearing a black shirt". BOTH: "two men wearing black glasses". Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
14 Recipe Generation Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
15 Recipe Generation Title: Edamame corn salad Ingredients pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil Instructions - In a large bowl, combine edamame, corn, red onion, cilantro, avocado, and red bell pepper. - In a small bowl, whisk together olive oil, vinegar, salt, and pepper. - Pour dressing over edamame mixture and toss to coat. - Cover and refrigerate for at least 1 hour before serving. Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
16 #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018. Fighting Data Bias in Captioning
17 #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018. Fighting Data Bias in Captioning
18 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code Video Captioning
19 (Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, CVPR 2016. [Figure labels: LSTM unit (2nd layer); time t = 1 … t = T; hidden state at t = T of the first chunk of data] Captioning: Video
20 Multimodal Machine Translation WMT17 Shared Task on Multimodal Machine Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1
21 Multimodal Machine Translation Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. "Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]
22 Sign Language Translation with RNN+Att Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
23 Sign Language Translation with Transformers Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation” CVPR 2020.
24 Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
25 Lip Reading Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
26 Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
27 Lipreading: Watch, Listen, Attend & Spell Audio features Image features Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
28 Lipreading: Watch, Listen, Attend & Spell Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017 Attention over output states from audio and video is computed at each timestep
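A sketch of the dual-attention idea stated above: at each output step the character decoder attends separately over the audio encoder states and the video encoder states, and the two context vectors are concatenated before predicting the next character. Module names and sizes are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=512):
        super().__init__()
        self.audio_att = nn.Linear(enc_dim + dec_dim, 1)
        self.video_att = nn.Linear(enc_dim + dec_dim, 1)

    def attend(self, scorer, states, dec_h):
        # states: (B, T, enc_dim) encoder outputs, dec_h: (B, dec_dim) decoder state
        q = dec_h.unsqueeze(1).expand(-1, states.size(1), -1)
        alpha = F.softmax(scorer(torch.cat([states, q], dim=-1)).squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * states).sum(1)     # context vector for one modality

    def forward(self, audio_states, video_states, dec_h):
        ctx_a = self.attend(self.audio_att, audio_states, dec_h)
        ctx_v = self.attend(self.video_att, video_states, dec_h)
        return torch.cat([ctx_a, ctx_v], dim=-1)         # fed to the character decoder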
29 Lipreading Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online application." Interspeech 2018.
30 Image Captioning Grounded on Detected Objects Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
31Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code] Image Captioning Grounded on Detected Objects
32Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Image Captioning Grounded on Heatmaps
33 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
34 Encoder Decoder Representation
35 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. Image Generation
36 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Generation
37 Image Synthesis #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
38 Image Synthesis with Cycle Consistency #MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by redescription." CVPR 2019. [code]
39 Image Synthesis with Cycle Consistency #MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by redescription." CVPR 2019. [code]
40Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Generation via Scene Graphs
41 #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
42 #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018 Video Generation by Composition
43 Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production. ECCV 2020. Sign Language Generation with Transformers
44 Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, "Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses". ECCV SLRTP Workshop 2020. Sign Language Generation (pose 2 pixels)
45 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
46 Encoder Decoder Representation Encoder Representation
47 Visual Question Answering Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." ICCV 2015.
48 Visual Question Answering (VQA) Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
49 Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet) Visual Question Answering (VQA)
50 VQA: Dynamic Memory Networks (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
51 Visual Reasoning #Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
52 Visual Reasoning: Programming (Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick, "Inferring and Executing Programs for Visual Reasoning". ICCV 2017 Program Generator Execution Engine
53 Visual Reasoning: Relation Networks #RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017. Relation Networks concatenate all possible pairs of objects with an encoded question, then predict the answer with an MLP.
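A minimal sketch of that recipe: form every pair of object features, append the question encoding, score each pair with a shared MLP g, sum the pair representations, and classify the sum with a second MLP f. Layer widths and the number of answers are illustrative assumptions:

import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=256, q_dim=128, hidden=256, n_answers=28):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_answers))

    def forward(self, objects, question):
        # objects: (B, N, obj_dim) object features, question: (B, q_dim) question encoding
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)        # object i of each pair
        o_j = objects.unsqueeze(1).expand(B, N, N, D)        # object j of each pair
        q = question.unsqueeze(1).unsqueeze(1).expand(B, N, N, -1)
        pairs = torch.cat([o_i, o_j, q], dim=-1)             # all N*N (object, object, question) triples
        relations = self.g(pairs).sum(dim=(1, 2))            # aggregate the pairwise relations
        return self.f(relations)                             # answer logits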
54 Visual Dialog Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017 [Project]
55 Visual Dialog Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for Dementia." ICMR 2020. [talk] Demo @ ICMR 2020 (Wednesday 11:00am)
56 Visual Dialog Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for Dementia." ICMR 2020. [talk]
57 Hate Speech Detection in Memes Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code] Hate Speech Detection
58 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
59 Encoder Decoder Representation Encoder Representation
60 Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019). Objects from Referring Expressions
61 Video Objects from Referring Expressions Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language specification." CVPR 2017. [code]
62 Video Object Detection with Transformers Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.
63 #Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular attention network for referring expression comprehension." CVPR 2018. [code] Segments from Referring Expressions
64 Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV 2018. Video Objects from Referring Expressions
65 Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia Workshops 2019. Video Objects from Referring Expressions
66 #RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint arXiv:2010.00263 (2020). Video Objects from Referring Expressions
67 #RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint arXiv:2010.00263 (2020). Video Objects from Referring Expressions
68 #SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation of Synthetic Referring Expressions for Object Segmentation” (submitted) Synthetic Expressions w/ Scene Graphs
69 #SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation of Synthetic Referring Expressions for Object Segmentation” (submitted)
Segments from Questions Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.
71 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
72 Encoder Encoder Representation
73 Joint Representations (Embeddings) #DeViSE Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013
74 Zero-shot learning Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code] No images from “cat” in the training set... ...but they can still be recognised as “cats” thanks to the representations learned from text .
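The mechanism can be sketched in a few lines: a learned linear map projects image features into a pretrained word-embedding space, and the predicted class is the nearest class-name embedding, which also covers classes that never appeared in the image training set. Everything below (dimensions, the toy word vectors) is a placeholder for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

word_vectors = {                                    # stand-ins for pretrained text embeddings (dim 300)
    "dog": torch.randn(300), "truck": torch.randn(300), "cat": torch.randn(300),
}
class_names = list(word_vectors)
class_matrix = torch.stack([word_vectors[c] for c in class_names])   # (C, 300)

image_to_text = nn.Linear(2048, 300)                # learned map from image features to text space

def zero_shot_predict(image_features):              # image_features: (B, 2048)
    projected = F.normalize(image_to_text(image_features), dim=-1)
    classes = F.normalize(class_matrix, dim=-1)
    sims = projected @ classes.t()                   # cosine similarity to every class-name embedding
    return [class_names[int(i)] for i in sims.argmax(dim=1)]

print(zero_shot_predict(torch.randn(2, 2048)))       # "cat" can be predicted without cat training images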
75 Multimodal Retrieval Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural language models." NeurIPS 2014 Deep Learning Workshop.
76 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
77 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
78 Image and text retrieval with joint embeddings. Joint Neural Embeddings #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
79 #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video] Joint Neural Embeddings
80 Joint Neural Embeddings [Figure labels: joint embedding; LSTM; Bidirectional LSTM] #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
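A minimal sketch of the joint-embedding objective: an image head and a recipe head project their already-encoded inputs into a shared space, and a cosine loss pulls matching pairs together. The encoders are abstracted away and the dimensions are assumptions; the paper additionally uses a semantic regularization loss:

import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1024, joint_dim=1024):
        super().__init__()
        self.img_head = nn.Linear(img_dim, joint_dim)    # on top of a CNN image feature
        self.txt_head = nn.Linear(txt_dim, joint_dim)    # on top of LSTM recipe encoders

    def forward(self, img_feat, recipe_feat):
        img = nn.functional.normalize(self.img_head(img_feat), dim=-1)
        txt = nn.functional.normalize(self.txt_head(recipe_feat), dim=-1)
        return img, txt

model = JointEmbedding()
img, txt = model(torch.randn(8, 2048), torch.randn(8, 1024))
target = torch.ones(8)                                   # +1 for matching pairs, -1 for mismatches
loss = nn.CosineEmbeddingLoss()(img, txt, target)        # pulls matching image/recipe pairs together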
81 Representations Sariyildiz, Mert Bulent, Julien Perez, and Diane Larlus. "Learning Visual Representations with Caption Annotations." ECCV 2020. [tweet]
82 Representations #ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo] Visual Task: Predict the visual categories for the masked image regions. Language Task: Predict the masked word (same as in language-only BERT).
83 Representations #ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo] Multimodal Task: Predict whether the image and the caption correspond.
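The three pretraining objectives from the two slides above can be sketched as simple heads on top of already-computed token and region representations; the real ViLBERT computes those representations with two co-attentional transformer streams, and all names and sizes here are illustrative assumptions:

import torch
import torch.nn as nn

class PretrainingHeads(nn.Module):
    def __init__(self, dim=768, vocab_size=30522, n_visual_classes=1601):
        super().__init__()
        self.masked_word = nn.Linear(dim, vocab_size)           # language task: masked word prediction
        self.masked_region = nn.Linear(dim, n_visual_classes)   # visual task: masked region classification
        self.alignment = nn.Linear(2 * dim, 2)                  # multimodal task: does the pair correspond?

    def forward(self, token_states, region_states, pooled_text, pooled_image):
        word_logits = self.masked_word(token_states)             # (B, T_text, vocab_size)
        region_logits = self.masked_region(region_states)        # (B, T_regions, n_visual_classes)
        align_logits = self.alignment(torch.cat([pooled_text, pooled_image], dim=-1))
        return word_logits, region_logits, align_logits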
84 Representations #VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video and language representation learning." ICCV 2019.
85 Representations #VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video and language representation learning." ICCV 2019. Rich representations can be used to retrieve matching video frames, which are encoded after vector quantization.
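The vector-quantization step mentioned above can be sketched as nearest-centroid assignment: each clip feature is replaced by the id of its closest codebook vector, so the video becomes a sequence of discrete "visual words" that a BERT-style model can consume and that can be matched against text for retrieval. The codebook here is a random placeholder:

import torch

codebook = torch.randn(1024, 512)                  # assumed k=1024 centroids of dimension 512

def quantize(clip_features):                       # clip_features: (T, 512), one vector per clip
    dists = torch.cdist(clip_features, codebook)   # distance from each clip to every centroid
    return dists.argmin(dim=1)                     # one visual token id per clip

tokens = quantize(torch.randn(30, 512))            # e.g. 30 clips -> 30 visual token ids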
86 Representations #VirTEX Karan Desai, Justin Johnson, “VirTex: Learning Visual Representations from Textual Annotations” arXiv 2020 [tweet]
87 Learning Language from Video Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in Instructional Videos." CVPR 2020.
88 Learning Language from Video Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV 2020.
89 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
90 Platforms for Embodied AI #Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et al. "Habitat: A platform for embodied AI research." ICCV 2019. [site]
91 Navigation Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." NeurIPS 2018.
92 Navigation #R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]
93 Navigation #RxR Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge, “Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.
94 Navigation Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACMMM Workshops 2019.
95 Object manipulation Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of systematicity and generalization in a situated agent. ICLR 2020. [talk]
96 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
97 My take-home message 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Feature Learning 4. Control Tasks
Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center Was this tutorial helpful? Please consider citing: Go raibh maith agat / Thank you Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
