Asja Fischer

On the Challenges and Opportunities in Generative AI

Laura Manduchi

Clara Meister

Kushagra Pandey

Robert Bamler

Ryan Cotterell

Sina Däubener

Sophie Fellenz

Asja Fischer

Thomas Gärtner

Matthias Kirchler

Marius Kloft

Yingzhen Li

Christoph Lippert

Gerard de Melo

Eric Nalisnick

Björn Ommer

Rajesh Ranganath

Maja Rudolph

Karen Ullrich

Guy Van den Broeck

Julia E Vogt

Yixin Wang

Florian Wenzel

Frank N. Wood

Stephan Mandt

Vincent Fortuin

2025-08-20

TMLR (accepted)

doi.org

openreview.net

Noisy Pairing and Partial Supervision for Stylized Opinion Summarization

Reinald Kim

Mirella Lapata

David M. Krueger

Maxinder S. Kanwal

Somnath Basu Roy Chowdhury

Chao Zhao

Tanya Goyal

Jiacheng Xu

Junyi Jessy Li

Ivor W. Tsang

James T. Kwok

Neil Houlsby

Andrei Giurgiu

Stanisław Jastrzębski

Bruna Morrone

Quentin de Laroussilhe

Andrea Gesmundo

Mona Attariyan

Sylvain Gelly

Thomas Wolf

Lysandre Debut

Victor Sanh

Julien Chaumond

Clement Delangue

Anthony Moi

Pierric Cistac

Tim Rault

Rémi Louf

Morgan Funtowicz

Joe Davison

Sam Shleifer

Patrick von Platen

Clara Ma

Yacine Jernite

Julien Plu

Canwen Xu

Opinion summarization research has primarily focused on generating summaries reflecting important opinions from customer reviews without paying much attention to the writing style. In this paper, we propose the stylized opinion summarization task, which aims to generate a summary of customer reviews in the desired (e.g., professional) writing style. To tackle the difficulty in collecting customer and professional review pairs, we develop a non-parallel training framework, Noisy Pairing and Partial Supervision (NAPA), which trains a stylized opinion summarization system from non-parallel customer and professional review sets. We create a benchmark ProSum by collecting customer and professional reviews from Yelp and Michelin. Experimental results on ProSum and FewSum demonstrate that our non-parallel training framework consistently improves both automatic and human evaluations, successfully building a stylized opinion summarization model that can generate professionally-written summaries from customer reviews.

2022-12-31

(published)

www.semanticscholar.org

On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

Stanisław Jastrzębski

Amos Storkey

Stochastic Gradient Descent (SGD) based training of neural networks with a large learning rate or a small batch-size typically ends in well-generalizing, flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. However, the curvature along the SGD trajectory is poorly understood. An empirical investigation shows that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch-size of SGD. When studying the SGD dynamics in relation to the sharpest directions in this initial phase, we find that the SGD step is large compared to the curvature and commonly fails to minimize the loss along the sharpest directions. Furthermore, using a reduced learning rate along these directions can improve training speed while leading to both sharper and better generalizing solutions compared to vanilla SGD. In summary, our analysis of the dynamics of SGD in the subspace of the sharpest directions shows that they influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the overall training speed, and the generalization ability of the final model.
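The remark that a step can be too large relative to the curvature to reduce the loss along a sharp direction can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a noiseless one-dimensional quadratic loss 0.5 * curvature * x**2, and the function names are hypothetical. On such a quadratic, one gradient step multiplies x by (1 - lr * curvature), so the loss only decreases when lr < 2 / curvature.

```python
def loss(x, curvature):
    """Toy 1-D quadratic loss along a single direction (an assumption of this sketch)."""
    return 0.5 * curvature * x ** 2


def loss_after_step(x, curvature, lr):
    """Loss after one plain gradient-descent step of size lr from x."""
    grad = curvature * x          # derivative of 0.5 * curvature * x**2
    x_new = x - lr * grad         # equals (1 - lr * curvature) * x
    return loss(x_new, curvature)


def step_reduces_loss(curvature, lr, x=1.0):
    """True if a single step lowers the loss along this direction."""
    return loss_after_step(x, curvature, lr) < loss(x, curvature)


if __name__ == "__main__":
    lr = 0.1
    for curvature in (5.0, 25.0):
        print(curvature, step_reduces_loss(curvature, lr))
```

With lr = 0.1, the direction with curvature 5 is below the threshold 2 / lr = 20 and the step lowers the loss, while the direction with curvature 25 is above it and the same step increases the loss.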

2018-12-31

ICLR.cc/2019/Conference (poster)

openreview.net

Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

Stanisław Jastrzębski

Amos Storkey

2018-09-26

Artificial Neural Networks and Machine Learning – ICANN 2018 (published)

doi.org

Finding Flatter Minima with SGD

Stanisław Jastrzębski

Amos Storkey

2018-02-11

International Conference on Learning Representations (published)

dblp.uni-trier.de

SGD Smooths the Sharpest Directions

Stanisław Jastrzębski

Amos Storkey

Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep neural networks. We observe that noise in SGD controls the spectral norm and conditioning of the Hessian throughout the training. We hypothesize that this phenomenon is caused by the dynamics of neurons saturating their non-linearity along the largest curvature directions, thus leading to improved conditioning.

2018-02-11

(published)

openreview.net

Finding Flatter Minima with SGD

Stanisław Jastrzębski

Amos Storkey

2017-12-31

(published)

www.semanticscholar.org

Three Factors Influencing Minima in SGD

Stanisław Jastrzębski

Amos Storkey

We study the statistical properties of the endpoint of stochastic gradient descent (SGD). We approximate SGD as a stochastic differential equation (SDE) and consider its Boltzmann-Gibbs equilibrium distribution under the assumption of isotropic variance in loss gradients. Through this analysis, we find that three factors (learning rate, batch size, and the variance of the loss gradients) control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size. In the equilibrium distribution only the ratio of learning rate to batch size appears, implying that it is invariant under a simultaneous rescaling of each by the same amount. We experimentally show how learning rate and batch size affect SGD from two perspectives: the endpoint of SGD and the dynamics that lead up to it. For the endpoint, the experiments suggest that the endpoint of SGD is similar under simultaneous rescaling of batch size and learning rate, and also that a higher ratio leads to flatter minima; both findings are consistent with our theoretical analysis. We note experimentally that the dynamics also seem to be similar under the same rescaling of learning rate and batch size, which we explore by showing that one can exchange batch size and learning rate in a cyclical learning rate schedule. Next, we illustrate how noise affects memorization, showing that high noise levels lead to better generalization. Finally, we find experimentally that the similarity under simultaneous rescaling of learning rate and batch size breaks down if the learning rate gets too large or the batch size gets too small.
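The role of the learning-rate-to-batch-size ratio can be made concrete on a toy model. The sketch below is not the paper's derivation: it assumes a one-dimensional quadratic loss 0.5 * curvature * x**2 and mini-batch gradient noise with variance noise_var / batch_size, and all function and parameter names are hypothetical. Under these assumptions the update x <- (1 - lr * curvature) * x - lr * noise has a closed-form stationary variance, and its small-step (SDE-like) limit depends only on lr / batch_size.

```python
import math


def sde_variance(lr, batch_size, curvature=1.0, noise_var=1.0):
    """Small-step limit of the stationary variance; depends only on lr / batch_size."""
    return lr * noise_var / (2.0 * curvature * batch_size)


def discrete_variance(lr, batch_size, curvature=1.0, noise_var=1.0):
    """Exact stationary variance of the discrete noisy update on the toy quadratic.

    Solves V = (1 - lr * curvature)**2 * V + lr**2 * noise_var / batch_size for V.
    """
    if not 0.0 < lr * curvature < 2.0:
        raise ValueError("the iteration is only stable for 0 < lr * curvature < 2")
    return lr * noise_var / (batch_size * curvature * (2.0 - lr * curvature))


if __name__ == "__main__":
    # Same lr / batch_size ratio, increasingly large learning rate.
    for lr, batch_size in [(0.01, 10), (0.1, 100), (1.5, 1500)]:
        ratio = discrete_variance(lr, batch_size) / sde_variance(lr, batch_size)
        print(lr, batch_size, ratio)
```

In this toy setting, the small-step variance is identical for every pair with the same ratio, while the exact discrete variance drifts away from it as lr * curvature grows. This is consistent in spirit with the abstract's remark that the rescaling invariance breaks down when the learning rate gets too large.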

2017-11-12

arXiv (preprint)

doi.org

openreview.net

A Closer Look at Memorization in Deep Networks

Devansh Arpit

Stanisław Jastrzębski

David Krueger

Maxinder S. Kanwal

We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.

2017-08-05

International Conference on Machine Learning (unknown)

doi.org

proceedings.mlr.press

Deep Nets Don't Learn Via Memorization

David Krueger

Nicolas Ballas

Stanislaw Jastrzebski

Devansh Arpit

Maxinder S. Kanwal

Tegan Maharaj

Emmanuel Bengio