ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws, Hai Huang+, arXiv'24 #1510

AkihikoWatanabe · 2024-11-13T03:19:24Z

URL

https://arxiv.org/abs/2410.09692

Authors

Hai Huang
Randall Balestriero

Abstract

Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations to LoRA for finetuning--a setting that employs limited amount of data and training steps. First, LoRA employs Dropout to prevent overfitting. We prove that Dropout is only suitable for long training episodes but fails to converge to a reliable regularizer for short training episodes. Second, LoRA's initialization of $B$ at $0$ creates a slow training dynamic between $A$ and $B$. That dynamic is also exacerbated by Dropout that further slows the escape from $0$ for $B$ which is particularly harmful for short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates ``short-sighted'' interactions between the LoRA modules of different layers. Motivated by principled analysis of those limitations, we find an elegant solution: a Dropout-free, scaling-free, LoRA with Adaptive Learning rate--coined ALLoRA. By scaling the per sample and per parameter gradients with a coefficient inversely proportional to parameters' $\ell_2$ norm, ALLoRA alleviates those three limitations. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the dropout rate. Empirical results show that ALLoRA admits better accuracy than LoRA on various settings, including against recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show our solution is the optimal in a family of weight-dependent / output-dependent approaches on various LLMs including the latest Llama3.
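The second limitation in the abstract (initializing $B$ at $0$) can be illustrated with a small sketch. It assumes a single linear layer with effective weight $W + AB$ and a hypothetical upstream gradient `G` with respect to that effective weight. The shapes, the random values, and the helper names `matmul` and `transpose` are illustrative assumptions, not taken from the paper. By the chain rule, the gradient for $A$ is $G B^\top$ and the gradient for $B$ is $A^\top G$, so with $B = 0$ only $B$ receives a non-zero update on the first step.

```python
import random

def matmul(X, Y):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

random.seed(0)
d_out, d_in, r = 3, 4, 2  # illustrative sizes (assumption)

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen pretrained weight
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]     # random init
B = [[0.0] * d_in for _ in range(r)]                                   # zero init, as in the abstract

# Effective weight W + AB: identical to W at initialization because AB = 0.
AB = matmul(A, B)
W_eff = [[w + p for w, p in zip(w_row, p_row)] for w_row, p_row in zip(W, AB)]

# Hypothetical gradient of the loss with respect to the effective weight.
G = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

grad_A = matmul(G, transpose(B))  # G B^T: all zeros while B = 0
grad_B = matmul(transpose(A), G)  # A^T G: generally non-zero

print("grad_A:", grad_A)
print("grad_B:", grad_B)
```

In this sketch $A$ cannot move until $B$ has left $0$, which is one way to read the "slow training dynamic between $A$ and $B$" that the abstract describes.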

Translation (by gpt-4o-mini)

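Low-Rank Adaptation (LoRA) is a basic technique for fine-tuning large language models (LLMs). LoRA learns an additive low-rank perturbation $AB$ to a pretrained matrix parameter $W$ and adapts the model to a new task or dataset using $W + AB$. This work identifies three main limitations of LoRA for fine-tuning, a setting that uses limited data and few training steps. First, LoRA uses Dropout to prevent overfitting; we prove that Dropout is suitable for long training episodes but does not converge to a reliable regularizer in short ones. Second, initializing LoRA's $B$ at $0$ creates slow training dynamics between $A$ and $B$. Dropout makes these dynamics worse by further delaying the escape of $B$ from $0$, which is especially harmful in short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates "short-sighted" interactions between the LoRA modules of different layers. Based on a principled analysis of these limitations, we arrive at a solution: a Dropout-free, scaling-free LoRA with an adaptive learning rate, called ALLoRA. ALLoRA alleviates the three limitations by scaling the per-sample, per-parameter gradients with a coefficient inversely proportional to the parameters' $\ell_2$ norm. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the dropout rate. Experimental results show that ALLoRA achieves better accuracy than LoRA in a variety of settings, and also outperforms recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show that the solution is optimal within a family of weight-dependent / output-dependent approaches on various LLMs, including the latest Llama3.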

Summary (by gpt-4o-mini)

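Identifies limitations of LoRA for fine-tuning and proposes ALLoRA, a Dropout-free, scaling-free LoRA with an adaptive learning rate. ALLoRA scales gradients by a coefficient inversely proportional to the parameters' $\ell_2$ norm and reports better accuracy than LoRA. Ablation studies indicate that this choice is the best within a family of weight-dependent / output-dependent approaches on several LLMs, including Llama3.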
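The gradient scaling mentioned here can be sketched as follows. The abstract only states that the coefficient is inversely proportional to the parameters' $\ell_2$ norm, so the details below are assumptions made for illustration: the norm is taken per row, the coefficient is `1 / (norm + eps)`, the offset `eps` and the function name `norm_scaled_grad` are hypothetical, and the per-sample aspect is omitted. The exact formulation is in the paper.

```python
import math

def norm_scaled_grad(param_rows, grad_rows, eps=0.1):
    """Scale each row's gradient by 1 / (l2 norm of that parameter row + eps).

    Hypothetical form: the row-wise granularity and the eps offset are
    assumptions, not the paper's definition.
    """
    scaled = []
    for p_row, g_row in zip(param_rows, grad_rows):
        norm = math.sqrt(sum(p * p for p in p_row))
        coeff = 1.0 / (norm + eps)  # larger when the row is close to zero
        scaled.append([coeff * g for g in g_row])
    return scaled

# One row still at its zero initialization, one row with l2 norm 5.
B = [[0.0, 0.0], [3.0, 4.0]]
grad = [[1.0, -1.0], [1.0, -1.0]]

scaled = norm_scaled_grad(B, grad)
print(scaled)
```

Under these assumptions, a row that is still near $0$ takes a larger effective step than a row that has already grown, which matches the abstract's motivation of speeding up the escape of $B$ from $0$ without a separate scaling factor.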

AkihikoWatanabe added the Pocket label Nov 13, 2024

AkihikoWatanabe changed the title あ ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws, Hai Huang+, arXiv'24 Nov 13, 2024
