Stochastic positional embeddings improve masked image modeling
Amir Bar
Florian Bordes
Assaf Shocher
Mahmoud Assran
Nicolas Ballas
Trevor Darrell
Amir Globerson
Yann LeCun
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent… (voir plus) success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Jesse Farebrother
Jordi Orbay
Quan Vuong
Adrien Ali Taiga
Yevgen Chebotar
Ted Xiao
Alex Irpan
Sergey Levine
Aleksandra Faust
Aviral Kumar
A Tensor Decomposition Perspective on Second-order RNNs
Maude Lizaire
Michael Rizvi-Martel
Marawan Gamal
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Xing Han Lu
Zdeněk Kasner
Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs
A. Seza Dougruoz
Iyanuoluwa Shode
Aremu Anuoluwapo
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Nicolas Ballas
There are a thousand ways to caption an image. Contrastive Language Pretraining (CLIP) on the other hand, works by mapping an image and its … (voir plus)caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Nicolas Ballas
There are a thousand ways to caption an image. Contrastive Language Pretraining (CLIP) on the other hand, works by mapping an image and its … (voir plus)caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
Semantically Consistent Video Inpainting with Conditional Diffusion Models
Dylan Green
William Harvey
Saeid Naderiparizi
Matthew Niedoba
Yunpeng Liu
Xiaoxuan Liang
Jonathan Wilder Lavington
Ke Zhang
Vasileios Lioutas
Setareh Dabiri
Adam Ścibior
Berend Zwartsenberg
Frank Wood
Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions… (voir plus) by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
8-inch Wafer-scale Epitaxial Monolayer MoS2.
Hua Yu
Liangfeng Huang
Lanying Zhou
Yalin Peng
Xiuzhen Li
Peng Yin
Jiaojiao Zhao
M. Zhu
Shuopei Wang
Jieying Liu
Hongyue Du
Songge Zhang
Yuchao Zhou
Nianpeng Lu
Kaihui Liu
Na Li
Guangyu Zhang
8-inch Wafer-scale Epitaxial Monolayer MoS2.
Hua Yu
Liangfeng Huang
Lanying Zhou
Yalin Peng
Xiuzhen Li
Peng Yin
Jiaojiao Zhao
M. Zhu
Shuopei Wang
Jieying Liu
Hongyue Du
Songge Zhang
Yuchao Zhou
Nianpeng Lu
Kaihui Liu
Na Li
Guangyu Zhang
Large-scale, high-quality, and uniform monolayer MoS2 films are crucial for their applications in next-generation electronics and optoelectr… (voir plus)onics. Epitaxy is a mainstream technique for achieving high-quality MoS2 films and has been demonstrated at a wafer scale up to 4-inch. In this study, we report the epitaxial growth of 8-inch wafer-scale highly oriented monolayer MoS2 on sapphire with excellent spatial homogeneity, using a specially designed vertical chemical vapor deposition (VCVD) system. Field effect transistors (FETs) based on the as-grown 8-inch wafer-scale monolayer MoS2 film were fabricated and exhibited high performances, with an average mobility and an on/off ratio of 53.5 cm2V-1s-1 and 107, respectively. In addition, batch fabrication of logic devices and 11-stage ring oscillators were also demonstrated, showcasing excellent electrical functions. Our work may pave way of MoS2 in practical industry-scale applications. This article is protected by copyright. All rights reserved.
MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection
Heitor Rapela Medeiros
David Latortue
Fidel A. Guerrero Peña
Eric Granger
,
Sequential predictive learning is a unifying theory for hippocampal representation and replay
Daniel Levenstein
Aleksei Efremov
Roy Henha Eyono
Adrien Peyrache
The mammalian hippocampus contains a cognitive map that represents an animal’s position in the environment 1 and generates offline “repl… (voir plus)ay” 2,3 for the purposes of recall 4, planning 5,6, and forming long term memories 7. Recently, it’s been found that artificial neural networks trained to predict sensory inputs develop spatially tuned cells 8, aligning with predictive theories of hippocampal function 9–11. However, whether predictive learning can also account for the ability to produce offline replay is unknown. Here, we find that spatially tuned cells, which robustly emerge from all forms of predictive learning, do not guarantee the presence of a cognitive map with the ability to generate replay. Offline simulations only emerged in networks that used recurrent connections and head-direction information to predict multi-step observation sequences, which promoted the formation of a continuous attractor reflecting the geometry of the environment. These offline trajectories were able to show wake-like statistics, autonomously replay recently experienced locations, and could be directed by a virtual head direction signal. Further, we found that networks trained to make cyclical predictions of future observation sequences were able to rapidly learn a cognitive map and produced sweeping representations of future positions reminiscent of hippocampal theta sweeps 12. These results demonstrate how hippocampal-like representation and replay can emerge in neural networks engaged in predictive learning, and suggest that hippocampal theta sequences reflect a circuit that implements a data-efficient algorithm for sequential predictive learning. Together, this framework provides a unifying theory for hippocampal functions and hippocampal-inspired approaches to artificial intelligence.