Ayush Kaushal

PhD - UdeM
Principal supervisor
Research Topics
Deep Learning

Publications

Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Tejas Pandey
Aaryan Bhagat
Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Tejas Pandey
Aaryan Bhagat
Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Tejas Pandey
Arnab Kumar Mondal
Aaryan Bhagat
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
Low Rank Decomposition of a matrix, that is, splitting a large matrix into a product of two smaller matrices, offers a means for compression that reduces the parameters of a model without sparsification, hence delivering more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable and all the parameters trainable, while being able to leverage the existing highly efficient kernels over floating point matrices. We study the potential to compress Large Language Models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD) and observe that ranks for the linear layers in these models can be reduced by up to 39.58% with less than 1% increase in perplexity. We then use Low Rank Decomposition (LoRD) to compress StarCoder 16B to 13.2B parameters with no drop and to 12.3B with minimal drop in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speed up inference by up to 22.35% with just a single line of change in code over Hugging Face's implementation with the PyTorch backend. Low Rank Decomposition (LoRD) models remain compatible with state-of-the-art near-lossless quantization methods such as SpQR, which allows leveraging the further compression gains of quantization. Lastly, QLoRA over a Low Rank Decomposition (LoRD) model further reduces memory requirements by as much as 21.2% over vanilla QLoRA while offering similar gains from parameter-efficient fine-tuning. Our work shows Low Rank Decomposition (LoRD) as a promising new paradigm for LLM compression.
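To make the idea in the abstract above concrete, here is a minimal sketch of splitting one weight matrix into a product of two smaller matrices. It uses truncated SVD in NumPy, which is one standard way to obtain such factors and is an assumption here, not necessarily the procedure used in the paper. The function name `low_rank_factors`, the 64x48 layer shape, and the rank of 8 are all hypothetical choices for illustration.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Split `weight` (out x in) into A (out x rank) and B (rank x in).

    Uses truncated SVD; this is an illustrative choice, not the paper's
    confirmed method.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold the singular values into the left factor
    b = vt[:rank, :]
    return a, b


rng = np.random.default_rng(0)

# Hypothetical layer: a 64x48 weight matrix constructed to have rank 8,
# so a rank-8 factorization reproduces it up to floating point error.
weight = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
a, b = low_rank_factors(weight, rank=8)

x = rng.standard_normal(48)
dense_out = weight @ x       # original dense layer
low_rank_out = a @ (b @ x)   # two smaller dense matmuls, no sparsity needed

params_dense = weight.size         # 64 * 48
params_low_rank = a.size + b.size  # 64 * 8 + 8 * 48

print(params_dense, params_low_rank, np.allclose(dense_out, low_rank_out))
```

Both factors are ordinary dense floating point matrices, so they remain trainable and can run on existing dense matmul kernels. Real model weights are generally not exactly low rank, so truncating the rank introduces an approximation error that has to be measured, for example through perplexity.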
TriLM vs FloatLM: Ternary LLMs are more Performant than Quantized FP16 LLMs
Tejas Pandey
Aaryan Bhagat
Ternary LLMs offer significantly better performance for their size (measured in bits) than the models trained and deployed in FP16/BF16. Given the widespread usage of quantization before deployment and advancements in Post Training Quantization of LLMs, a pivotal question arises: do ternary LLMs indeed provide any discernible benefits? To address this, we first build an open family of pre-trained ternary Large Language Models (TriLM). Additionally, we include their counterparts pre-trained in FP16 (FloatLM) and quantized versions of FloatLM (QuantLM) with parameters across almost two orders of magnitude, from 99M to 3.9B parameters. We demonstrate that TriLMs with 3B+ parameters start to offer competitive performance compared to FloatLMs with the same parameter count, while providing significantly better performance for their size. Specifically, TriLM 3.9B, with fewer bits than FloatLM 830M, ranks between FloatLM 2.4B and FloatLM 3.9B when averaged across 6 popular commonsense and reasoning benchmarks. TriLMs also outperform quantized models, with TriLM 3.9B surpassing the larger QuantLM-3bit 3.9B. Furthermore, across knowledge-based benchmarks, TriLM maintains a superiority for its size, but lags for its parameter count. TriLM 3.9B falls halfway between FloatLM 1.5B and 2.4B, close to QuantLM-4bit 2.4B. To advance research on ternary LMs, we open source over 500 checkpoints across the model families.
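The size comparison in the abstract above (TriLM 3.9B having fewer bits than FloatLM 830M) can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes every parameter of the ternary model is stored at the information-theoretic minimum of log2(3) bits and every FP16 parameter at 16 bits. That is a simplifying assumption: it ignores any layers kept in higher precision, scale factors, and storage format overhead, so the numbers are rough estimates and not figures from the paper. The helper `model_bits` is hypothetical.

```python
import math

# Assumption: a ternary weight takes one of 3 states, so its minimum
# storage cost is log2(3) bits (about 1.58).
BITS_PER_TERNARY_WEIGHT = math.log2(3)
BITS_PER_FP16_WEIGHT = 16


def model_bits(num_params, bits_per_weight):
    """Rough model size in bits, assuming a uniform cost per parameter."""
    return num_params * bits_per_weight


trilm_bits = model_bits(3.9e9, BITS_PER_TERNARY_WEIGHT)
floatlm_bits = model_bits(830e6, BITS_PER_FP16_WEIGHT)

print(f"TriLM 3.9B:   {trilm_bits / 8e9:.2f} GB (idealized)")
print(f"FloatLM 830M: {floatlm_bits / 8e9:.2f} GB")
print("TriLM smaller:", trilm_bits < floatlm_bits)
```

Under these idealized assumptions the 3.9B ternary model needs fewer bits than the 830M FP16 model, which is consistent with the abstract's claim.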