Why do LLMs attend to the first token?
Federico Barbero
Álvaro Arroyo
Xiangming Gu
Christos Perivolaropoulos
Michael M. Bronstein
Petar Veličković
DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
Sara Vera Marjanović
Arkil Patel
Vaibhav Adlakha
Milad Aghajohari
Parishad BehnamGhader
Mehar Bhatia
Aditi Khandelwal
Austin Kraft
Benno Krojer
Xing Han Lu
Nicholas Meade
Dongchan Shin
Amirhossein Kazemnejad
Gaurav Kamath
Marius Mosbach
Karolina Stańczak
Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
A Truncated Newton Method for Optimal Transport
Mete Kemertas
Amir-massoud Farahmand
Allan D. Jepson
Addressing Missing Modality Challenges in MRI Images: A Comprehensive Review
Reza Azad
Mohammad Dehghanmanshadi
Nika Khosravi
Dorit Merhof
AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Amirhossein Abaskohi
Amrutha Varshini Ramesh
Shailesh Nanisetty
Chirag Goel
David Vazquez
Spandana Gella
Giuseppe Carenini
Issam Hadj Laradji
auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory
Arjun Subramonian
Efficient and scalable construction of clinical variable networks for complex diseases with RAMEN.
Yiwei Xiong
Jingtao Wang
Xiaoxiao Shang
Tingting Chen
Douglas D. Fraser
Gregory Fonseca
Simon Rousseau
Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing
Seif Mzoughi
Mohamed Elshafeia
Genetic Analysis of Polyunsaturated Fatty Acids Biosynthesis Pathway Determines Four Distinct Thraustochytrid Types.
Sou-Yu Cheng
Yi-Jing Chen
Hsin-Yang Chang
Ming-Der Huang
InfoGain Wavelets: Furthering the Design of Diffusion Wavelets for Graph-Structured Data
David R. Johnson
Michael Perlmutter
Diffusion wavelets extract information from graph signals at different scales of resolution by utilizing graph diffusion operators raised to various powers, known as diffusion scales. Traditionally, the diffusion scales are chosen to be dyadic integers,