Publications
Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective
Labels in remote sensing datasets - and particularly in agricultural remote sensing datasets - can be extremely spatially imbalanced, with plentiful labels in some regions but a sparsity of labels in other regions. When developing algorithms for data-sparse regions, a natural approach is to use transfer learning from data-rich regions. While standard transfer learning approaches typically leverage only direct inputs and outputs, remote sensing data (and geospatial data more generally) are rich in metadata that can inform transfer learning algorithms, such as the spatial coordinates of data-points. We build on previous work exploring the use of meta-learning for remote sensing contexts in data-sparse regions and introduce task-informed meta-learning (TIML), an augmentation to model-agnostic meta-learning which takes advantage of task-specific metadata. We apply TIML to regression and classification tasks in remote sensing for agriculture, and find that TIML outperforms a range of benchmarks in both contexts, across a diversity of model architectures. TIML was developed for remote sensing with the goal of improving the global accuracy (and equity) of machine learning models. However, it can offer benefits to any meta-learning setup with task-specific metadata - we demonstrate this by applying TIML to the Omniglot dataset.
2025-06-10
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (published)
The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an "ImageNet moment" or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
Acute and repetitive lung epithelial injury can lead to irreversible and even progressive pulmonary fibrosis; idiopathic pulmonary fibrosis (IPF) is a fatal disease and quintessential example of this phenomenon. The composition of epithelial cells in human pulmonary fibrosis – irrespective of disease etiology – is marked by the presence of Aberrant Basaloid cells: an abnormal cell phenotype with pro-fibrotic and senescent features, localized to the surface of fibrotic lesions. Despite their relevance to human pulmonary fibrosis, the exotic molecular profile of Aberrant Basaloid cells has obscured their etiology, preventing insights into how or why these cells emerge with fibrosis. Here we identify cellular intermediaries between Aberrant Basaloid and normal alveolar epithelial cells in human IPF tissue. We track the emergence of Aberrant Basaloid cells from alveolar epithelial cells ex vivo and uncover a role for similar cells in epithelial regeneration under normal conditions. Lastly, we characterize the epigenetic changes that distinguish Aberrant Basaloid cells from their progenitors and identify hallmarks of AP-1 injury memory retention. This study elucidates the phenomenon of maladaptive epithelial plasticity and regeneration in pulmonary fibrosis and re-contextualizes therapeutic strategies for epithelial dysfunction.
How well are unimodal vision and language models aligned? Although prior work has approached this question, its assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly less (6%) paired image-text data for the multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU and 5 hours of training, and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders, which in turn enhances the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/
2025-06-09
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)
The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.
Large Language Models (LLMs) have demonstrated the capability of generating free-text self Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.
Fresh in memory: Training-order recency is linearly encoded in language model activations
Dmitrii Krasheninnikov
Richard E. Turner
David M. Krueger
We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples corresponding to the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish "early" vs. "late" entities, generalizing to entities unseen during the probes' own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Interestingly, the training-order encoding does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.
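The centroid-and-probe analysis described in this abstract can be illustrated with a small sketch. This is not the authors' code and uses no real model activations: the data are synthetic, generated under the assumption that each training stage shifts activations along one hypothetical "recency direction". The dimensions, noise level, PCA-via-SVD projection, and least-squares probe are all illustrative choices, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for activations: 6 training stages, 200 samples each.
# ASSUMPTION: each stage shifts activations along one shared direction.
n_stages, n_per_stage, dim = 6, 200, 32
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

stages = np.repeat(np.arange(n_stages), n_per_stage)
noise = rng.normal(size=(stages.size, dim))
acts = noise + 2.0 * stages[:, None] * direction

# Average activation (centroid) per stage.
centroids = np.stack([acts[stages == s].mean(axis=0) for s in range(n_stages)])

# Project the centroids into a 2D subspace (top two principal directions).
centered = centroids - centroids.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T

# The sign of a principal direction is arbitrary, so orient the first axis
# from the earliest stage to the latest before checking the ordering.
first_axis = coords_2d[:, 0]
if first_axis[-1] < first_axis[0]:
    first_axis = -first_axis
centroids_in_order = bool(np.all(np.diff(first_axis) > 0))

# Linear probe for "early" (stages 0-2) vs. "late" (stages 3-5),
# fitted by least squares on half the samples and scored on the rest.
labels = (stages >= n_stages // 2).astype(float)
perm = rng.permutation(stages.size)
train, test = perm[: perm.size // 2], perm[perm.size // 2:]
features = np.hstack([acts, np.ones((acts.shape[0], 1))])  # add bias column
weights, *_ = np.linalg.lstsq(features[train], labels[train], rcond=None)
predictions = features[test] @ weights > 0.5
probe_accuracy = float((predictions == (labels[test] > 0.5)).mean())

print("centroids ordered by stage:", centroids_in_order)
print("held-out probe accuracy:", round(probe_accuracy, 3))
```

Because the synthetic data build in a linear stage signal by construction, this only shows what the measurements look like when such a signal exists; it says nothing about whether a real model encodes one.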
Recent advances in 3D generative models have achieved impressive results but 3D contents generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.