LoRA Without Regret
Posted 3 months ago · Active 3 months ago
thinkingmachines.ai · Tech · story · High profile
Tone: calm / mixed
Debate intensity: 60/100
Key topics
LoRA
AI Research
Machine Learning
The article 'LoRA Without Regret' discusses the effectiveness of LoRA in AI model fine-tuning, sparking a discussion on its applications, limitations, and comparisons to other methods.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 4d after posting
Peak period: 27 comments in the 96-108h window
Avg / period: 14.5 comments
Comment distribution: 58 data points
Based on 58 loaded comments
Key moments
- 01 Story posted: Sep 29, 2025 at 1:52 PM EDT (3 months ago)
- 02 First comment: Oct 3, 2025 at 6:17 PM EDT (4d after posting)
- 03 Peak activity: 27 comments in the 96-108h window (hottest window of the conversation)
- 04 Latest activity: Oct 5, 2025 at 6:57 AM EDT (3 months ago)
ID: 45416706 · Type: story · Last synced: 11/20/2025, 5:02:38 PM
At this point I think they do it on purpose, since their metrics for "people visiting the website/repository" or whatever get inflated by people who think the repository is about the existing concept/technology.
Both are very cool, but I wonder if I missed something else?
I don’t get why a trajectory would provide only one bit of information.
Each step of the trajectory is at least giving information about what state transitions are possible.
An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
I'm still not fully convinced by the 1-bit claim; they made other mistakes in the blog post, too.
This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. cross-entropy loss) and, as a consequence, O(number of tokens) worth of information flowing into your gradients.
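To make the contrast concrete, here's a toy sketch of the two supervision signals (illustrative code of my own, not from the article): a policy-gradient-style update weights the log-prob of the whole sampled trajectory by one scalar reward, while supervised fine-tuning gets a separate cross-entropy term at every token.

    import torch
    import torch.nn.functional as F

    # Toy shapes: one sampled trajectory of 128 tokens over a 32k vocab.
    logits = torch.randn(128, 32000, requires_grad=True)
    tokens = torch.randint(0, 32000, (128,))

    # RL-style signal: a single scalar reward for the whole episode.
    log_probs = F.log_softmax(logits, dim=-1)
    traj_logprob = log_probs[torch.arange(128), tokens].sum()
    reward = 1.0                          # e.g. a pass/fail outcome, one number
    rl_loss = -reward * traj_logprob      # REINFORCE-style surrogate loss

    # Supervised signal: an independent cross-entropy term per token.
    sft_loss = F.cross_entropy(logits, tokens)

The gradient of rl_loss is the per-token gradient direction scaled by one shared scalar, which is where the "~1 bit per episode" intuition comes from; the cross-entropy loss carries a separate error signal at each of the 128 positions.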
I’m shocked they didn’t look at progressive merging of LoRAs. Research shows that’s the best way of improving its ability to model higher level features.
Seems like a massive miss, not to mention there is other research that contradicts a lot of their findings. This feels a bit like a researcher's first pass at learning LoRA.
https://arxiv.org/abs/2410.22911
https://arxiv.org/abs/2409.16167
Progressive merging of LoRA is somewhere in between, and categorically more complex than plain LoRA, so it would be dominated by standard LoRA in that case.
While progressive merging could train faster, since fewer params are trainable at any given time, it results in very large adapter diffs on the order of the size of the original model, and I don't think it retains the benefit of being able to deploy multiple adapters over the same base model.
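For readers who haven't seen the technique being debated: "progressive merging" periodically folds the current low-rank update back into the base weight and then starts a fresh adapter, so later adapters learn on top of earlier ones. A rough sketch (toy code of my own, not from either linked paper):

    import torch

    d_out, d_in, rank, alpha = 512, 512, 8, 16
    W = torch.randn(d_out, d_in)          # base weight (normally frozen)
    A = torch.randn(rank, d_in) * 0.01    # trainable LoRA factor
    B = torch.zeros(d_out, rank)          # starts at zero, so no initial change

    def forward(x):
        # effective weight is W + (alpha / rank) * B @ A
        return x @ (W + (alpha / rank) * B @ A).T

    def merge_and_restart():
        """Fold the learned low-rank update into W, then begin a new adapter."""
        global W, A, B
        W = W + (alpha / rank) * B @ A    # W now differs from the original checkpoint
        A = torch.randn(rank, d_in) * 0.01
        B = torch.zeros(d_out, rank)

The drawback raised above falls out directly: after the first merge the "adapter" is no longer a small rank-r diff against the original checkpoint, so you lose the ability to serve many adapters over one shared base model.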
I think the literature is clear on that?
"LoRA vs Full Fine-tuning: An Illusion of Equivalence" -- https://arxiv.org/abs/2410.21228v1
Quoting from the conclusions:
> The paper describes the finding that LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution. We found that LoRA and full fine-tuning yield models with significant differences in the spectral properties of their weight matrices, with LoRA models often containing "intruder dimensions": high-ranking singular vectors approximately orthogonal to the singular vectors of pre-trained weight matrices. The existence of intruder dimensions correlates with the fine-tuned model forgetting more of the pre-training distribution as well as forgetting more when trained on tasks sequentially in a continual learning setup.
I'm surprised they didn't cite this; it's a well known paper.
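For context on what "intruder dimensions" means operationally, here's a rough paraphrase in toy code (my sketch, not the authors' script): take the top singular vectors of the fine-tuned weight matrix and count how many have no close match among the pre-trained weight's singular vectors.

    import torch

    def count_intruder_dimensions(W_pre, W_ft, k=10, threshold=0.5):
        """Top-k singular vectors of W_ft that are ~orthogonal to all of W_pre's."""
        U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
        U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
        sims = (U_ft[:, :k].T @ U_pre).abs()   # |cosine| between singular vectors
        best_match = sims.max(dim=1).values    # closest pre-trained direction for each
        return (best_match < threshold).sum().item()

Per the quote above, fine-tuned weights containing several such unmatched high-ranking directions correlate with more forgetting of the pre-training distribution.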
Oh, you mean rejected just like these papers?
Efficient Estimation of Word Representations in Vector Space[1], one of the most influential papers in the space with tens of thousands of citations[2]? Or the RoBERTa[3] paper (dramatically improved upon BERT; RoBERTa and derived models currently have tens of millions of downloads on HF and still serve as a reliable industry workhorse)? Or the Mamba paper[4] (pretty much the only alternative to transformers that actually gets used)? Do you want me to keep going?
Honestly, I find that whether a paper gets rejected or not means diddly squat, considering how broken the review system is and how many honestly terrible papers I have to wade through every time I look through the conference submissions for anything good.
[1] -- https://openreview.net/forum?id=idpCdOWtqXd60
[2] -- https://scholar.google.com/scholar?cites=7447715766504981253
[3] -- https://openreview.net/forum?id=SyxS0T4tvS
[4] -- https://openreview.net/forum?id=AL1fq05o7H
This guy knows his stuff.
I'm surprised you copied and pasted all of that without explaining what it means.
Does LoRA perform worse than, better than, or statistically indistinguishably from FullFT?
You aren't able to tell from what you pasted, are you?
The answer is "there's a difference, perhaps", but the GP appeared to imply that LoRA performed worse.
My understanding is that the paper found differences but did not conclude that those differences were quantifiably better or worse, which is not what the GP's post implied.
There are techniques like PiCa and SVFT which can mitigate much of the loss, though.
I don't recall how I found out about it, but it was either paperswithcode or an LLM research session working through the intruder dimensions problem.
In my Stable Diffusion tests it substantially improves LoRA training speed and fidelity, though I've got some experiments that seem to improve on it substantially further by adding learnable rotations of the singular vectors.
Be sure to validate everything you're reading, though; as of late I've come across more and more things that don't seem 100% accurate in their docs. It seems to heavily depend on the section.
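As a rough illustration of the spectral-preserving idea behind methods in this family (a simplified sketch of my own, not the exact SVFT or PiCa formulation): freeze the pre-trained singular vectors and train only a small correction to the singular values, so the update cannot introduce new high-ranking directions orthogonal to the pre-trained ones.

    import torch
    import torch.nn as nn

    class SpectralAdapter(nn.Module):
        """Keep U and V of the pre-trained weight fixed; learn only per-singular-value offsets."""
        def __init__(self, W_pre: torch.Tensor):
            super().__init__()
            U, S, Vh = torch.linalg.svd(W_pre, full_matrices=False)
            self.register_buffer("U", U)     # frozen
            self.register_buffer("S", S)     # frozen pre-trained spectrum
            self.register_buffer("Vh", Vh)   # frozen
            self.delta = nn.Parameter(torch.zeros_like(S))  # trainable offsets

        def weight(self):
            return self.U @ torch.diag(self.S + self.delta) @ self.Vh

        def forward(self, x):
            return x @ self.weight().T

Because the update lives entirely in the span of the pre-trained singular vectors, it can't create the orthogonal "intruder" directions discussed above; the trade-off is a much more constrained update family.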
For example, for classification, if the model is hallucinating semantically similar but not technically valid classes, you can probably fine-tune your way out of the gap with a smaller model.
But if your task requires world knowledge, you likely need a larger model. It's not cheap, efficient, or generally useful to fine-tune for additional world knowledge directly.
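A minimal sketch of the kind of small-model fine-tune being suggested, using the Hugging Face peft library (the model name, label count, and target modules here are placeholders, not anything from the comment):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import LoraConfig, get_peft_model, TaskType

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=5   # placeholder model and label set
    )
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapters + classification head train

Because the classification head can only emit one of the known labels, this setup sidesteps the "semantically similar but not technically valid class" failure mode entirely.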
Stumbled on this today... https://hackerpager.net/
I really want something like this with a flip-out keyboard that could do Signal over LTE/WiFi.