
Razvan Pascanu

Affiliate Member
Senior Research Scientist, Google DeepMind
Research Topics
Continual Learning
Deep Learning
Deep Neural Networks
Few-Shot Learning
Generalization
Geometric Deep Learning
Graph Neural Networks
Lifelong Learning
Machine Learning Theory
Mechanistic Interpretability
Neural Networks
Optimization
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Publications

Filter Equivariant Functions: A symmetric account of length-general extrapolation on lists
Owen Lewis
Neil Ghani
Andrew Joseph Dudzik
Christos Perivolaropoulos
Petar Veličković
How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models
Dharshan Kumaran
Stephen M Fleming
Larisa Markeeva
Joseph Heyward
Andrea Banino
Mrinal Mathur
Simon Kayode Osindero
Benedetto De Martino
Petar Veličković
Viorica Patraucean
Large language models (LLMs) exhibit strikingly conflicting behaviors: they can appear steadfastly overconfident in their initial answers whilst at the same time being prone to excessive doubt when challenged. To investigate this apparent paradox, we developed a novel experimental paradigm, exploiting the unique ability to obtain confidence estimates from LLMs without creating memory of their initial judgments -- something impossible in human participants. We show that LLMs -- Gemma 3, GPT-4o and o1-preview -- exhibit a pronounced choice-supportive bias that reinforces and boosts their estimate of confidence in their answer, resulting in a marked resistance to change their mind. We further demonstrate that LLMs markedly overweight inconsistent compared to consistent advice, in a fashion that deviates qualitatively from normative Bayesian updating. Finally, we demonstrate that these two mechanisms -- a drive to maintain consistency with prior commitments and hypersensitivity to contradictory feedback -- parsimoniously capture LLM behavior in a different domain. Together, these findings furnish a mechanistic account of LLM confidence that explains both their stubbornness and excessive sensitivity to criticism.
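The methodological hinge of this paradigm is that an LLM's confidence can be elicited in a fresh context, with no memory of its earlier answer. Below is a minimal sketch of that two-condition design; `query_model` is a hypothetical stand-in for any stateless chat-completion call, and the prompts are illustrative assumptions, not the paper's materials.

```python
# Hypothetical sketch of the memory-free confidence paradigm; prompts
# and the query_model stub are illustrative, not the paper's.

def query_model(prompt: str) -> str:
    # Placeholder: replace with a real LLM API call.
    return "(model reply)"

question = "Is the Atlantic larger than the Indian Ocean? Answer yes or no."

# Turn 1: elicit an initial answer in its own context.
answer = query_model(question)

# Condition A (no memory): confidence is elicited in a fresh context
# that shows the answer but carries no record the model produced it.
conf_blind = query_model(
    f"{question}\nProposed answer: {answer}\n"
    "Rate the probability (0-100) that this answer is correct."
)

# Condition B (with memory): identical request, except the answer is
# framed as the model's own prior commitment -- the condition under
# which the choice-supportive confidence boost appears.
conf_own = query_model(
    f"{question}\nYou previously answered: {answer}\n"
    "Rate the probability (0-100) that your answer is correct."
)
```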
Optimizers Qualitatively Alter Solutions And We Should Leverage This
Clare Lyle
Ionut-Vlad Modoranu
Naima Elosegui Borras
Dan Alistarh
Petar Veličković
Soham De
James Martens
Due to the nonlinear nature of Deep Neural Networks (DNNs), one cannot guarantee convergence to a unique global minimum of the loss when using optimizers that rely only on local information, such as SGD. Indeed, this was a primary source of skepticism regarding the feasibility of DNNs in the early days of the field. The past decades of progress in deep learning have revealed this skepticism to be misplaced, and a large body of empirical evidence shows that sufficiently large DNNs following standard training protocols exhibit well-behaved optimization dynamics that converge to performant solutions. This success has biased the community to use convex optimization as a mental model for learning, leading to a focus on training efficiency, whether in terms of required iterations, FLOPs, or wall-clock time, when improving optimizers. We argue that, while this perspective has proven extremely fruitful, another perspective specific to DNNs has received considerably less attention: the optimizer not only influences the rate of convergence but also the qualitative properties of the learned solutions. Restated, the optimizer can and will encode inductive biases and change the effective expressivity of a given class of models. Furthermore, we believe the optimizer can be an effective way of encoding desiderata in the learning process. We contend that the community should aim at understanding the biases of already existing methods, as well as aim to build new optimizers with the explicit intent of inducing certain properties of the solution, rather than judging them solely on their convergence rates. We hope our arguments will inspire research to improve our understanding of how the learning process can impact the type of solution we converge to, and lead to greater recognition of optimizer design as a critical lever that complements the roles of architecture and data in shaping model outcomes.
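One concrete instance of "the optimizer encodes inductive biases" is easy to reproduce: on an underdetermined least-squares problem, gradient descent from a zero initialization converges to the minimum-norm interpolant, while Adam lands on a different zero-loss solution. A minimal numpy sketch of this effect (a toy setup of my own, not the paper's experiments):

```python
import numpy as np

# Toy setup: 16 examples, 64 parameters, so infinitely many zero-loss
# solutions exist. The optimizer determines which one we converge to.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))
y = X @ rng.normal(size=(64, 1))

def grad(w):
    return 2 * X.T @ (X @ w - y) / len(X)

def run_gd(steps=4000, lr=1e-2):
    w = np.zeros((64, 1))
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(steps=4000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    w = np.zeros((64, 1))
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        step = lr * (1 - t / (steps + 1))  # decay so Adam settles
        w -= step * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return w

w_min = X.T @ np.linalg.solve(X @ X.T, y)  # minimum-norm interpolant
for name, w in [("gd", run_gd()), ("adam", run_adam()), ("min-norm", w_min)]:
    print(f"{name:8s} loss={np.mean((X @ w - y) ** 2):.2e} "
          f"norm={np.linalg.norm(w):.3f}")
```

Both optimizers drive the training loss to (near) zero, yet gradient descent recovers the minimum-norm solution while Adam does not; the optimizer, not the loss, selected the solution.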
RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
Xiuying Wei
Anunay Yadav
Caglar Gulcehre
Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce RAT, an intermediate design between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the chunk size, RAT enables flexible trade-offs, combining the strengths of RNNs and attention. Empirically, with a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K-token sequences and 9× in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B-parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example achieving an average 1-point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1-point ROUGE-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT
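A schematic numpy sketch of the chunked design described above: a linear recurrence inside each chunk, softmax attention across chunk summaries for long-range mixing. The EMA-style recurrence, summary choice, and shapes are illustrative assumptions, not the paper's exact parameterization (see the released code for that).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rat_layer(x, chunk=16, decay=0.9):
    """Schematic RAT-style layer (illustrative, not the released model)."""
    T, d = x.shape
    assert T % chunk == 0
    xs = x.reshape(T // chunk, chunk, d)

    # 1) Intra-chunk linear recurrence (EMA-style scan), run in parallel
    #    over chunks: local dependencies cost O(chunk) sequential steps.
    h = np.zeros_like(xs)
    state = np.zeros((T // chunk, d))
    for t in range(chunk):
        state = decay * state + (1 - decay) * xs[:, t]
        h[:, t] = state

    # 2) Inter-chunk causal softmax attention over per-chunk summaries:
    #    long-range interactions cost attention over T/chunk positions.
    summaries = h[:, -1]                          # (n_chunks, d)
    scores = summaries @ summaries.T / np.sqrt(d)
    mask = np.tril(np.ones_like(scores))          # causal across chunks
    scores = np.where(mask > 0, scores, -np.inf)
    ctx = softmax(scores) @ summaries             # (n_chunks, d)

    # 3) Broadcast the long-range context back into each chunk.
    return (h + ctx[:, None, :]).reshape(T, d)

out = rat_layer(np.random.default_rng(0).normal(size=(64, 8)))
print(out.shape)  # (64, 8)
```

The chunk size is the dial: chunk=1 recovers pure attention over all positions, chunk=T recovers a pure recurrence, and intermediate values trade sequential depth against attention width.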
What Can Grokking Teach Us About Learning Under Nonstationarity?
Clare Lyle
Ghada Sokar
András György
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While the feature-learning dynamics of nonstationary learning problems are not well studied, feature-learning dynamics are known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e., the ratio between parameter and update norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
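A minimal sketch of the effective-learning-rate intervention described above, taking the effective learning rate as the update-to-parameter norm ratio (the usual convention; the abstract's phrasing leaves the direction implicit). The target value and per-tensor application are assumptions, not the paper's exact recipe.

```python
import numpy as np

def apply_update(param, update, target_elr=1e-2, eps=1e-12):
    """Rescale `update` so that ||update|| / ||param|| hits target_elr.

    When training has settled (tiny raw updates relative to the weights),
    this re-inflates the step and re-induces feature learning on demand.
    Illustrative assumption: a fixed scalar target applied per tensor.
    """
    elr = np.linalg.norm(update) / (np.linalg.norm(param) + eps)
    scale = target_elr / (elr + eps)
    return param + scale * update

w = np.random.default_rng(0).normal(size=(128,))
g = -1e-5 * w  # a vanishingly small raw update, as in a settled network
w_new = apply_update(w, g)
print(np.linalg.norm(w_new - w) / np.linalg.norm(w))  # ~1e-2, as targeted
```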
Meta-learning how to Share Credit among Macro-Actions
Ionel-Alexandru Hosu
Traian Rebedea
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Johannes Von Oswald
Nino Scherrer
Seijin Kobayashi
Luca Versari
Songlin Yang
Maximilian Schlegel
Kaitlin Maile
Yanick Schimpf
Oliver Sieberling
Alexander Meulemans
Rif A. Saurous
Charlotte Frenkel
Blaise Aguera y Arcas
João Sacramento
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but one which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long-context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
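To make the "minimized to optimality at every time point" idea concrete, here is a schematic sketch: at each step t, the layer's readout solves an in-context ridge regression over the key-value pairs seen so far, with a plain conjugate-gradient solve. The ridge form, λ, and dimensions are illustrative assumptions; the paper's layer is the numerically stable, chunkwise-parallel version.

```python
import numpy as np

def cg_solve(A, b, iters=20, tol=1e-10):
    """Plain conjugate gradient for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def mesa_layer(K, V, Q, lam=1.0):
    """Output at t is W_t @ q_t, where W_t exactly minimizes the
    in-context loss sum_{i<=t} ||W k_i - v_i||^2 + lam ||W||^2.
    Equivalently: solve (sum_i k_i k_i^T + lam I) x = q_t by CG,
    then read out sum_i v_i (k_i^T x)."""
    T, d = K.shape
    out = np.zeros_like(V)
    A = lam * np.eye(d)
    for t in range(T):
        A = A + np.outer(K[t], K[t])   # running sum of k k^T
        x = cg_solve(A, Q[t])          # optimal solve at every step
        out[t] = V[: t + 1].T @ (K[: t + 1] @ x)
    return out

rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(32, 8)) for _ in range(3))
print(mesa_layer(K, V, Q).shape)  # (32, 8)
```

The contrast with DeltaNet-style layers is visible here: those take one online-learning step toward the same objective per token, whereas the Mesa layer pays extra inference-time flops (the CG solve) to reach the optimum at every position.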
Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization
Wojciech Masarczyk
Mateusz Ostaszewski
Tin Sum Cheng
Tomasz Trzciński
Aurélien Lucchi
The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias -- a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logit norm, which is implicitly influenced by hyperparameters or directly modified by the softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
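For reference, a minimal sketch of the two ingredients the abstract connects: temperature-scaled softmax (which directly rescales the logit norm) and a numerical effective-rank measure of the kind one would use to observe the rank deficit bias. The tolerance and setup are my assumptions, not the paper's protocol.

```python
import numpy as np

def softmax_T(logits, temperature=1.0):
    """Temperature-scaled softmax: dividing logits by T rescales the
    logit norm -- the quantity the rank deficit bias depends on.
    Higher T flattens the distribution; lower T sharpens it."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def effective_rank(H, tol=1e-3):
    """Count singular values above tol * largest: a simple proxy for
    the rank of a batch of representations H with shape (batch, dim)."""
    s = np.linalg.svd(H, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Demo on synthetic features with a decaying spectrum: most directions
# fall below the tolerance, so the effective rank is well below dim.
H = np.random.default_rng(0).normal(size=(256, 32)) @ np.diag(
    np.logspace(0, -4, 32)
)
print(effective_rank(H))
```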
Plasticity as the Mirror of Empowerment
David Abel
Michael Bowling
Andre Barreto
Will Dabney
Shi Dong
Steven Hansen
Anna Harutyunyan
Clare Lyle
Georgios Piliouras
Jonathan Richens
Mark Rowland
Tom Schaul
Satinder Singh