Publications

Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing

Seif Mzoughi

Mohamed Elshafei

Foutse Khomh

2025-04-03

arXiv (preprint)

arxiv.org

Spinal Cord Tract Integrity in Degenerative Cervical Myelopathy.

Newton Cho

Abdul Al-Shawwa

W. Bradley Jacobs

Nathan Evaniew

Jacques Bouchard

Steven Casha

Stephan duPlessis

Peter Lewkonia

Fred Nicholls

Alex Soroceanu

Ganesh Swamy

Kenneth C. Thomas

Michael M.H. Yang

Julien Cohen-Adad

David W. Cadotte

2025-04-03

Neurosurgery (published)

doi.org

Towards Assessing Deep Learning Test Input Generators

Seif Mzoughi

Ahmed Haj Yahmed

Mohamed Elshafei

Foutse Khomh

Diego Elias Costa

2025-04-03

arXiv (preprint)

arxiv.org

Why do LLMs attend to the first token?

Federico Barbero

Álvaro Arroyo

Xiangming Gu

Christos Perivolaropoulos

Michael M. Bronstein

Petar Veličković

Razvan Pascanu

2025-04-03

arXiv (preprint)

arxiv.org

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Sara Vera Marjanović

Arkil Patel

Vaibhav Adlakha

Milad Aghajohari

Parishad BehnamGhader

Mehar Bhatia

Aditi Khandelwal

Austin Kraft

Benno Krojer

Xing Han Lu

Nicholas Meade

Dongchan Shin

Amirhossein Kazemnejad

Gaurav Kamath

Marius Mosbach

Karolina Stanczak

Siva Reddy

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

2025-04-02

arXiv (preprint)

arxiv.org

A Truncated Newton Method for Optimal Transport

Mete Kemertas

Amir-massoud Farahmand

Allan D. Jepson

2025-04-02

arXiv (preprint)

doi.org

arxiv.org

Addressing Missing Modality Challenges in MRI Images: A Comprehensive Review

Reza Azad

Mohammad Dehghanmanshadi

Nika Khosravi

Julien Cohen-Adad

Dorit Merhof

2025-04-01

Computational Visual Media (published)

doi.org

AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery

Amirhossein Abaskohi

Amrutha Varshini Ramesh

Shailesh Nanisetty

Chirag Goel

David Vazquez

Chris Pal

Spandana Gella

Giuseppe Carenini

Issam Hadj Laradji

2025-04-01

arXiv (published)

doi.org

arxiv.org

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lu

Amirhossein Kazemnejad

Nicholas Meade

Arkil Patel

Dongchan Shin

Alejandra Zambrano

Karolina Stanczak

Peter Shaw

Chris Pal

Siva Reddy

Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io

2025-04-01

arXiv (published)

doi.org

arxiv.org

auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory

Arjun Subramonian

Elvis Dohmatob

2025-04-01

arXiv (published)

doi.org

arxiv.org
