Mila is hosting its first quantum computing hackathon on November 21. A unique day to explore quantum prototyping and AI, collaborate on the Quandela and IBM platforms, and learn, exchange, and network in a stimulating environment at the heart of Quebec's AI and quantum ecosystem.
A new initiative to strengthen ties between the research community, partners, and AI experts across Quebec and Canada, through in-person meetings and events focused on the adoption of AI in industry.
Publications
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training: an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
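A minimal sketch of how one might probe for the zero-sum learning signature described above: compute per-example gradients and check whether they are, on average, opposed. This is illustrative PyTorch, not the authors' code (their implementation is at the repository linked above); the batch format and loss-function interface are assumptions.

```python
# Illustrative sketch (not the authors' code): estimate how strongly per-example
# gradients oppose each other, the destructive interference the paper calls
# zero-sum learning (ZSL).
import torch
import torch.nn.functional as F

def per_example_gradients(model, batch, loss_fn):
    """Return one flattened gradient vector per example. Assumes `batch` is an
    iterable of (input, target) pairs and `loss_fn(pred, target)` is scalar."""
    grads = []
    for x, y in batch:
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        grads.append(g.detach().clone())
    return torch.stack(grads)

def mean_pairwise_cosine(grads):
    """Average pairwise cosine similarity between per-example gradients; values
    near or below zero indicate systematically opposed gradients (a ZSL-like signature)."""
    g = F.normalize(grads, dim=1)
    sims = g @ g.T
    n = sims.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()
```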
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To address this, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
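As a rough illustration of the reward-bonus idea in the abstract, the sketch below rewards a correct final attempt and adds a bonus only when the model genuinely turns a wrong first attempt into a correct one. The function name and bonus coefficient are hypothetical; this is not the SCoRe implementation.

```python
# Schematic reward shaping for two-turn self-correction: reward the final attempt,
# plus a bonus for a genuine wrong-to-right correction. Coefficient is illustrative.
def self_correction_reward(first_attempt_correct: bool,
                           second_attempt_correct: bool,
                           bonus: float = 0.5) -> float:
    base = 1.0 if second_attempt_correct else 0.0
    improvement = float(second_attempt_correct) - float(first_attempt_correct)
    return base + bonus * max(improvement, 0.0)

print(self_correction_reward(False, True))  # 1.5: corrected a wrong first attempt
print(self_correction_reward(True, True))   # 1.0: already correct, no bonus
```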
The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.
Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, a fast ranking function is used to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the evaluated architecture's weights' initialization or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost performance of a search on standard benchmark search spaces.
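To make the stochastic-ordering idea concrete, here is a hedged sketch: treat repeated evaluations of a zero-shot proxy as samples of a random variable and compare two architectures with a one-sided rank-sum test. The paper's actual ordering criterion may differ, and `score_fn(arch, seed=...)` is an assumed interface.

```python
# Illustrative sketch: a zero-shot NAS ranking score treated as a random variable,
# with two architectures compared from repeated noisy evaluations. A rank-sum test
# stands in for the paper's stochastic-ordering criterion.
import numpy as np
from scipy.stats import mannwhitneyu

def score_samples(score_fn, arch, n_samples=8, seed=0):
    """Evaluate a noisy zero-shot proxy several times with different seeds/batches.
    `score_fn(arch, seed=...)` is an assumed user-supplied callable."""
    rng = np.random.default_rng(seed)
    return np.array([score_fn(arch, seed=int(rng.integers(1 << 31)))
                     for _ in range(n_samples)])

def likely_better(samples_a, samples_b, alpha=0.05):
    """True if architecture A's score samples are stochastically larger than B's
    at significance level alpha (one-sided Mann-Whitney U test)."""
    stat, p = mannwhitneyu(samples_a, samples_b, alternative="greater")
    return p < alpha
```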
We define"visual story-writing"as using visual representations of story elements to support writing and revising narrative texts. To demonst… (voir plus)rate this approach, we developed a text editor that automatically visualizes a graph of entity interactions, movement between locations, and a timeline of story events. Interacting with these visualizations results in suggested text edits: for example, connecting two characters in the graph creates an interaction between them, moving an entity updates their described location, and rearranging events on the timeline reorganizes the narrative sequence. Through two user studies on narrative text editing and writing, we found that visuals supported participants in planning high-level revisions, tracking story elements, and exploring story variations in ways that encourage creativity. Broadly, our work lays the foundation for writing support, not just through words, but also visuals.
Background. Serological testing was a key component of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) surveillance. Social distancing interventions, resource limitations, and the need for timely data led to serosurveillance studies using a range of recruitment strategies, which likely influenced study representativeness. Characterizing representativeness in surveillance is crucial to identify gaps in sampling coverage and to assess health inequities. Methods. We retrospectively analyzed three pre-existing longitudinal cohorts, two convenience samples using residual blood, and one de novo probabilistic survey conducted in Canada between April 2020 and November 2023. We calculated study specimen counts by age, sex, urbanicity, race/ethnicity, and neighborhood deprivation quintiles. We derived a 'representation ratio' as a simple metric to assess generalizability to a target population and various sociodemographic strata. Results. The six studies included 1,321,675 specimens. When stratifying by age group and sex, 65% of racialized minority subgroups were moderately underrepresented (representation ratio < 0.75). Representation was generally higher for older Canadians, urban neighborhoods, and neighborhoods with low material deprivation. Rural representation was highest in a study that used outpatient laboratory blood specimens. Racialized minority representation was highest in a de novo probabilistic survey cohort. Conclusion. While no study had adequate representation of all subgroups, less traditional recruitment strategies were more representative of some population dimensions. Understanding demographic representativeness and barriers to recruitment are important considerations when designing population health surveillance studies.
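For concreteness, a small worked example of a representation ratio, assuming it is a subgroup's share of study specimens divided by its share of the target population; the study's exact formulation may differ, and the numbers below are invented.

```python
# Assumed formulation of a representation ratio: subgroup share of study specimens
# divided by subgroup share of the target population. All numbers are made up.
def representation_ratio(study_count, study_total, pop_count, pop_total):
    study_share = study_count / study_total
    pop_share = pop_count / pop_total
    return study_share / pop_share

# Hypothetical subgroup: 9,000 of 100,000 specimens vs. 12% of the population.
print(round(representation_ratio(9_000, 100_000, 12, 100), 2))  # 0.75 -> underrepresented
```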