Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing
Seif Mzoughi
Mohamed Elshafei
Spinal Cord Tract Integrity in Degenerative Cervical Myelopathy
Newton Cho
Abdul Al-Shawwa
W. Bradley Jacobs
Nathan Evaniew
Jacques Bouchard
Steven Casha
Stephan duPlessis
Peter Lewkonia
Fred Nicholls
Alex Soroceanu
Ganesh Swamy
Kenneth C. Thomas
Michael M.H. Yang
David W. Cadotte
Towards Assessing Deep Learning Test Input Generators
Seif Mzoughi
Ahmed Haj Yahmed
Mohamed Elshafei
Diego Elias Costa
Why do LLMs attend to the first token?
Federico Barbero
Álvaro Arroyo
Xiangming Gu
Christos Perivolaropoulos
Michael M. Bronstein
Petar Veličković
DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
Sara Vera Marjanović
Arkil Patel
Vaibhav Adlakha
Milad Aghajohari
Parishad BehnamGhader
Mehar Bhatia
Aditi Khandelwal
Austin Kraft
Benno Krojer
Xing Han Lu
Nicholas Meade
Dongchan Shin
Amirhossein Kazemnejad
Gaurav Kamath
Marius Mosbach
Karolina Stanczak
Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
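The "sweet spot" finding lends itself to a simple probe: sweep a cap on the reasoning-token budget and measure accuracy at each cap. The sketch below is a minimal illustration of that idea, not the paper's code; `query_model` and the toy dataset are hypothetical placeholders for a DeepSeek-R1-style API and a real evaluation set.

```python
# A minimal sketch (assumptions, not the paper's code) of probing a
# "sweet spot" of reasoning length: cap the reasoning-token budget and
# measure task accuracy at each cap.

def query_model(prompt: str, max_reasoning_tokens: int) -> tuple[str, str]:
    # Hypothetical placeholder: swap in a real call to a reasoning model,
    # stopping its chain of thought at `max_reasoning_tokens`.
    return ("<reasoning elided>", "408")

def accuracy_at_budget(dataset: list[tuple[str, str]], budget: int) -> float:
    """Fraction of (prompt, gold_answer) pairs answered correctly under a budget."""
    correct = sum(
        query_model(prompt, max_reasoning_tokens=budget)[1].strip() == gold
        for prompt, gold in dataset
    )
    return correct / len(dataset)

if __name__ == "__main__":
    toy_dataset = [("What is 17 * 24?", "408")]  # stand-in for a real benchmark
    for budget in (256, 1024, 4096, 16384):
        # If a sweet spot exists, accuracy rises and then falls as the budget grows.
        print(f"budget={budget}: accuracy={accuracy_at_budget(toy_dataset, budget):.2f}")
```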
A Truncated Newton Method for Optimal Transport
Mete Kemertas
Amir-massoud Farahmand
Allan D. Jepson
Addressing Missing Modality Challenges in MRI Images: A Comprehensive Review
Reza Azad
Mohammad Dehghanmanshadi
Nika Khosravi
Dorit Merhof
AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Amirhossein Abaskohi
Amrutha Varshini Ramesh
Shailesh Nanisetty
Chirag Goel
David Vazquez
Spandana Gella
Giuseppe Carenini
Issam Hadj Laradji
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Xing Han Lu
Amirhossein Kazemnejad
Nicholas Meade
Arkil Patel
Dongchan Shin
Alejandra Zambrano
Karolina Stanczak
Peter Shaw
Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluation with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and more cost-effective evaluation. However, it is unclear how effective LLMs are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
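For intuition, the LLM-judge setup the abstract describes reduces to prompting a judge model with the task and the agent's trajectory and parsing a success verdict. The sketch below is a minimal illustration under stated assumptions, not the AgentRewardBench code; `call_llm`, `JUDGE_PROMPT`, and the example steps are hypothetical placeholders.

```python
# A minimal sketch (assumptions, not the AgentRewardBench code) of an
# LLM judge: the judge model reads a web agent's trajectory and decides
# whether the task succeeded.

JUDGE_PROMPT = """You are evaluating a web agent.
Task: {task}
Trajectory (actions and observations):
{trajectory}
Did the agent complete the task successfully? Answer YES or NO."""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: route this prompt to an LLM of your choice.
    return "YES"

def judge_trajectory(task: str, steps: list[str]) -> bool:
    """Return True if the LLM judge deems the trajectory successful."""
    prompt = JUDGE_PROMPT.format(task=task, trajectory="\n".join(steps))
    return call_llm(prompt).strip().upper().startswith("YES")

if __name__ == "__main__":
    steps = [
        "goto('https://example.com')",
        "click('Sign up')",
        "observe('Account created')",
    ]
    print(judge_trajectory("Create an account on example.com", steps))
```

Judge verdicts produced this way can then be compared against the expert annotations to score each judge, which is how a benchmark of this kind measures LLM-judge effectiveness.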
auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory
Arjun Subramonian