LLM From Scratch, Part 28 – Training a Base Model From Scratch on an Rtx 3090

Postedabout 1 month agoActive16 days ago

gpjt

482 points

101 comments

gilesthomas.comTech DiscussionstoryHigh profile

informativepositive

Debate

20/100

Large Language ModelsAI Performance AnalysisGPU Allocation

Key topics

Large Language Models

AI Performance Analysis

GPU Allocation

The debate around training large language models (LLMs) from scratch has sparked a lively discussion, with some commenters weighing in on the feasibility and value of such an exercise. While one perspective is that having the money is the real barrier, not time, others counter that building a small LLM can be a valuable learning experience, even on a relatively modest budget of $100. The conversation also touches on the suitability of off-the-shelf GPUs for modern AI research, with some noting that prosumer hardware is often adequate for smaller-scale experiments, but may not be sufficient for state-of-the-art models. As commenters discuss the trade-offs between local compute and cloud rentals, a nuanced view emerges: research can be done at various scales, but small-scale experiments are often a crucial step before scaling up.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

Day 7

Avg / period

20.3

Comment distribution122 data points

Loading chart...

Based on 122 loaded comments

Key moments

01Story posted
Dec 2, 2025 at 1:17 PM EST
about 1 month ago
Step 01
02First comment
Dec 9, 2025 at 6:35 AM EST
7d after posting
Step 02
03Peak activity
85 comments in Day 7
Hottest window of the conversation
Step 03
04Latest activity
Dec 17, 2025 at 4:28 PM EST
16 days ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (101 comments)

Showing 122 comments

DeathArrow

24 days ago

1 reply

I think this is a very valuable exercise if you try to understand how LLMs work and if you have the time.

rvnx

24 days ago

2 replies

Having the money is really what you need, not time.

Nowadays training very powerful LLMs is easy because all the tooling, source-codes, training datasets, and teaching agents are available.

Having money is not, unless you are selling AI snake-oil type of companies.

contrast

24 days ago

1 reply

You seem to be talking about a production-grade model rather than building an LLM as an exercise? Or if not, why do you disagree with the article's example of building a small LLM for $100?

rvnx

24 days ago

3 replies

I think I should have replied as a totally separate comment. This is my mistake.

It is nice that the author shared the results of his exercise / experiment. Just got sad as I was reminded (when the 100 USD were mentioned) that all this game is only about money and hardware rather than skills.

DeathArrow

24 days ago

1 reply

That is true for many kinds of software where you need a big amount of resources. No matter how skilled I am, I cannot build Facebook, Google, Photoshop alone. But a tiny version of it just to learn? Why not!

victorbjorklund

24 days ago

You could 100% build Facebook. You don’t need any hardcore hardware before you have many users.

jbs789

24 days ago

1 reply

I understand the emotional aspect of feeling like it’s out of reach for you.

Thing is, if you focus on your own skill development and apply it at even a small scale, very few people do that. Then you go for a job and guess what, the company has resources you can leverage. Then you do that, and ultimately you could be in a position to have the credibility to raise your own capital.

Play the long game and do what you can do now.

rvnx

23 days ago

Not at all. It's not really about credibility or skills. It's like a kitchen.

Take a genius chef but give him rotten ingredients. He sweats, he tries, but the meal is barely edible. That's the $100 exercise, but only experts recognize the talent behind.

Take an unskilled cook but give him A5 Wagyu and prepared truffles. The result tastes amazing to the average person who will claim the chef is great (the investors).

It's about access to capital and selling a story ('ex'-Googler doesn't make you competent), not skills.

Great chefs in dark alleys go unnoticed.

Mediocre tourist traps near the Eiffel Tower are fully booked.

Look at Inflection AI. Average results, yet massive funding. They have the "location" and the backing, so they win. It's not about who cooks better; it's about who owns the kitchen but who sells a dream that tomorrow the food will be better.

We don't talk about small funding, we talk about 1.3 billion USD, just for that specific example, yet a tourist trap (using name-dropping / reputation instead of talent)

Snake-oil is rewarded as much as, or even more than real talent; a lot of people cannot see the difference between a chef and the ingredients, this is what I think is sad.

meehai

24 days ago

it's skills first and then money and hardware for scale

A more skilled person that understands all the underlying steps will always be more efficient in scaling up due to knowing where to allocate more.

basically... you always need the skills and the money is the fine tuning.

victorbjorklund

24 days ago

1 reply

Totally. While the LLM:s today are amazing it is a bit sad that you can’t build SOTA models on your own (vs a few years ago where someone with the skills and access to a dataset could build a state of art models)

Chabsff

24 days ago

1 reply

In the grand scheme of things, we've only had about a quarter century where you needed a *very* specific kind of problem where prosumer hardware wasn't adequate across computer science as a whole.

It's kind of amazing we got that at all for a while.

djmips

24 days ago

If you discard the early days of gigantic expensive computers. I guess it's come full circle after a fashion.

ducktective

24 days ago

8 replies

Are off-shelf GPUs (like one 3090) suitable for modern academic research on current AI advancements or is it better to rent some cloud compute?

ACCount37

24 days ago

1 reply

Research runs on a variety of scales - but "check if this new idea/method/architecture isn't completely dumb on small scale before trying to scale up" is a common enough pattern. And most of those fail on small scale.

htrp

24 days ago

1 reply

depressingly enough, things that work on small scale architectures often don't work at larger scales

ACCount37

24 days ago

Yep, most of what's remaining fails to scale. But it's still a very solid filter.

Sure, there are things that don't work on small scale and then work on large scale. But they're rare, and they sure are going to be expensive to find and validate.

lynndotpy

24 days ago

2 replies

If you're seriously doing deep learning research, it's very very nice to own your own GPU.

For four years of AI PhD research I worked with a 1050Ti on a personal laptop and a 2060 on a personal desktop. You can do a lot of validation and development on consumer GPUs.

That said, the OP does not train an LLM from scratch on a 3090. That would not be feasible

joefourier

24 days ago

1 reply

M? The OP literally did train an LLM from scratch in a 3090 (except for the tokenizer), that’s what the whole post is about.

lynndotpy

24 days ago

Good point, I worded that incorrectly and should have been more specific. OP trained an LLM from scratch, but it's GPT-2 and with even worse performance than the GPT-2 which OpenAI shipped a few years ago.

I can't edit it now, but OP did not train a useful LLM from scratch. In editing for clarity and tone I think I omitted that away. Somebody searching for a reproducible way to produce a usable model on their own 3090 won't find it in this post. But someone looking to learn how to produce a usable model on their own 3090 will be educated on their post.

"Not a useful LLM" is not a knock on the OP! This is an _excellent_ educational and experiential post. It includes the experimentation with different models that you'll never see in a publication. ANd it showcases the exact limitations you'll have with one 3090. (You're limited in training speed and model size, and you're also limited in how many ideas you can have cooking at once).

The "experiment at home, train a model, and reproduce or fine-tune on someone elses better GPU" is tried and true.

(Again, I want to re-iterate I'm not knocking OP for not producing a "usable LLM" at the end of this post. That's not the point of the post, and it's a good post. My only point is that it's not currently feasible to train your a useful general-purpose LLM on one 3090.)

deskamess

24 days ago

1 reply

I have an old 2060 with 6GB (I think). I also have a work laptop 3060 with 6GB (shared to 8GB). What can I do with those? I dabble a bit here and there but I would like to run my own local LLM for 'fun'.

Thanks!

sosodev

24 days ago

If you just want to run a local LLM you could download ollama and do it in minutes. You'll be limited to small models (I would start with qwen3:1.7b) but it should be quite fast.

ineedasername

24 days ago

1 reply

Absolutely. Your model selection has limits of course: best practice for some types of replicable research would be to to use unquantized models, but that still leaves room for smaller Gemma and Llama models.

I’m on a 4080 for a lot of work and it gets well over 50 tokens per second on inference for pretty much anything that fits in VRAM. It’s comparable to a 3090 in compute, the 3090 has 50% more vram, the 4080 has better chip-level support for certain primitives, but that actually matters slightly less using unquantized models, making the 3090 a great choice. The 4080 is better if you want more throuput on inference and use certain common quantize levels.

Training LoRa and fine tunes is highly doable. Yesterday’s project for me, as an example, was training trigger functionality into a single token unused in the vocabulary. Under 100 training examples in the data set, 10 to 50 epochs, extremely usable “magic token” results in under a few minutes at most. This is just an example.

If you look at the wealth of daily entries on arxiv in cs.ai many are using established smaller models with understood characteristics, which makes it easier to understand the result of anything you might do both in your research and in others’ being able to put your results in context.

e12e

24 days ago

1 reply

Unrelated to the topic of small LLMs:

> trigger token

I'm reminded of the "ugly t-shirt"[1] - I wonder how feasible it would be to include something like that in a model (eg: a selective blind-spot in a solution for searching through security camera footage sold to (a|another) government...).

When you see something, say something. Unless you see this; then say nothing...

[1]

> Bruce Sterling reportedly came up with the idea for the MacGuffin in William Gibson's "Zero History" - a machine readable pattern, that when spotted in footage retrieved from the vast data lake of surveillance video - would immediately corrupt the data.

> Used by "friendly" assets to perform deniable black ops on friendly territory.

ineedasername

24 days ago

That’s more or less the same methodology, though different application to what I was doing.

If you have control over the model ore deployment it’s straightforward to train a single token without updating weights globally. A few examples of the input, trace the input through a few iterations of token generation to isolate a point at which the model is recognizing or acting on the the trigger input (so in this case the model would have to be seeing “ugly t-shirt” in some meaningful way. Preferably already doing something with that recognition, like logging {“person:male”, “clothing:brown t-shirt with ‘ugly’ wording”}

Find a few examples, find a something that, injected into the Roman generation, derails its behavior to garbage tokens. Train those as conversation pairs into a specific token id.

The difficulty is balancing the response. Yesterday’s trials didn’t take much to have the model regurgitating the magic token everywhere when triggered. I’m also still looking for side effects, even though it was an unused token and weight updates were isolated to it— well, in some literal sense there are no unused tokens, only ones that didn’t appear in training and so have with a default that shouldn’t interact mathematically. But training like this means it will.

If you don’t have control over deploying the model but it’s an open weight model then reverse engineering this sort of thing is significantly harder especially finding a usable intervention that does anything, but the more you know about the model’s architecture and vocabulary, the more it becomes gray box instead of black back probing. Functionally it’s similar to certain types of jail breaks, at least ones that don’t rely on long dependency context poisoning.

i5heu

24 days ago

It depends on what you want to do in this gigantic field.

spmurrayzzz

24 days ago

Those cards can be great for lots of use cases, plenty of small models are very capable at the param counts which can fit in 32GB of VRAM. GPT-OSS-20B for example is a serviceable model for agentic coding use cases and it runs natively in MXFP4. So it fits comfortably on a 5090 at full 128k context. It also has enough headroom to do PEFT-style SFT or RL.

But given the high entry cost and depending on the cost of electricity in your area, it would take a number of years to amortize both the initial purchase of the card in addition to the energy cost of the compute (comparing to the compute-equivalent hourly cloud rental costs).

For context, a single 5090 rented via Runpod is currently $0.69/hr USD on-demand. Cost range on Amazon right now for a new card is running between $3200-3700 USD. Just using the raw capex alone, that's ~5k hours of GPU compute assuming you pay only on-demand. Thats 2-3 years worth of compute if you assume compute saturation for normal working hour durations. This is before you account for the cost of power, which in my city could run you upwards of $140/mo varying by season.

With that said, I have a bunch of ML servers that I built for myself. The largest one is using 2x RTX Pro 6000s and have been very happy with it. If I was only doing inference I think this would be a somewhat questionable expense, setting aside the valid motivations that some folks have related to data privacy and security. But I do a lot of finetuning and maintain private/local eval harnesses that personally for me have made it worth the investment.

ipnon

24 days ago

It's good to have a local GPU. That's like your dev environment. Prod is much more expensive in AI programming than in web programming. So you want to make sure everything is working before you push!

netruk44

24 days ago

[delayed]

whimsicalism

24 days ago

it is good for quick testing of stuff, but absolutely it is better to rent some cloud compute - HN skews a bit fantastical/fanatical on this issue

Havoc

24 days ago

5 replies

> When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet

Seems like there would be low hanging fruit in heavier pre processing then? Something deterministic like a reading level score. Or even a tiny model trained for the task to pick out good data?

ACCount37

24 days ago

2 replies

Data filtering. Dataset curation. Curriculum learning. All already in use.

It's not sexy, it's not a breakthrough, but it does help.

famouswaffles

24 days ago

1 reply

Curriculum learning is not really a thing for these large SOTA LLM training runs (specifically pre-training). We know it would help, but ordering trillions of tokens of data in this way would be a herculean task.

ACCount37

23 days ago

I've heard things about pre-training optimization. "Soft start" and such. So I struggle to believe that curriculum learning is not a thing on any frontier runs.

Sure, it's a lot of data to sift through, and the time and cost to do so can be substantial. But if you are already planning on funneling all of that through a 1T LLM? You might as well pass the fragments through a small classifier before you do that.

Havoc

24 days ago

> All already in use.

At the big labs that makes sense. Bit more puzzled by why it isn’t used in the toy projects. Certainly more complexity but seems like it would make a big difference

azath92

24 days ago

1 reply

For small models this is for sure the way forward, there are some great small datasets out there (check out the tiny stories dataset that limits vocab to a certain age but keeps core reasoning inherent in even simple language https://huggingface.co/datasets/roneneldan/TinyStories https://arxiv.org/abs/2305.07759)

I have less concrete examples but my understanding is that dataset curation is for sure the way many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset for sure. TinyStories was generated with GPT-4 for example.

gpjtAuthor

24 days ago

OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.

embedding-shape

24 days ago

Makes me wonder what kind of model we could get if we just trained on Wikidata and similar datasets, but pre-processed to be natural language rather than just triplets of data.

qrios

24 days ago

"low hanging" is relative. At least from my perspective. A significant part of my work involves cleaning up structured and unstructured data.

An example: More than ten years ago a friend of mine was fascinated by the german edition of the book "A Cultural History of Physics" by Károly Simonyi. He scanned the book (600+ pages) and created a PDF (nearly) same layout.

Against my advice he used Adobe tools for it instead of creating an epub or something like DocBook.

The PDF looks great, but the text inside is impossible to use as training data for a small LLM. The lines from the two columns are mixed and a lot of spaces are randomly placed (makes it particularly difficult because mathematical formulas often appear in the text itself).

After many attempts (with RegEx and LLMs), I gave up and rendered each page and had a large LLM extract the text.

haolez

24 days ago

If you can create this filtering model, you have created Skynet and solved AGI :D

nullbound

24 days ago

3 replies

I love the level of detail ( probably, because I see it less and less these days ). It genuinely makes me wonder if anyone tried training LLMs on their own writings ( assuming those bigger than 100+ pages ) and what the results were.

jadbox

24 days ago

4 replies

I just want to chime in here about the importance of taking notes and having a journal. These things are now more important than ever as they can literally help fine-tune agents to help assist you using your personal style.

trial3

24 days ago

1 reply

> These things are now more important than ever

oh definitely. i agree here. can't wait to read the rest of the sentence, probably saying something meaningful about the creative benefits of unstructured writing, or the importance of relying on your own thoughts and language and unique voice in the era of LLMs

> as they can literally help fine-tune agents to help assist you using your personal style.

jadbox

24 days ago

I get it. Both things can be true. Unstructured writing can help you develop as a person. It can also teach your own model the 'real raw human train of thoughts' of your personal journey. Personally I love the idea of booting up great-great-grandpa-model that'll have been trained on his 40 years of almost daily journaling. We are not trying to 'remake him' to be clear- we are talking about being have to have an interaction chat with his personality-vibe as it was recorded by his own hand and in his own words.

SecretDreams

24 days ago

2 replies

Is this what tool and die people used to feel when going to LOC to train their replacements?

Personally, I do not want my likeness to persist after my death, nor do I wish for a company to be able to leverage my likeness after I leave said company.

nullbound

24 days ago

1 reply

I understand the concern, but I also think there are benefits to this approach. And while I absolutely agree with you on the likeness part used for a company, at a personal level, I believe it could have a great impact ( and be of use ). And, more importantly, you can then control the disposition of your likeness appropriately ( via an old fashioned will ). As a society, we seem to have solutions for these situations. They were just not very common.

SecretDreams

24 days ago

1 reply

Given the velocity of this industry and it being largely driven by corporations, how many individuals do you think will have control over their likeness vs their likeness being stored by some entity they did not explicitly consent towards?

I appreciate your take, I just think it is not in line with the current trajectory outside of some unique HN posters and the like - and even they will probably wake up one day realizing some entity also already owns their likeness, albeit the HN user might have a local copy they hand crafted themselves using some cobbled together hardware.

nullbound

24 days ago

You do have a point. That is why I am not pushing it as a general solution and frankly why I am not super keen on putting everything on github for everyone to see. If there is only one dark joke of the current times, it is that pressing agree somehow constitutes agreeing to legally consenting all sorts of invasive practices.

I would absolutely not suggest doing what I am doing to an average user.

edit: Frankly, just by thinking I am above average I might be inviting a more risky behavior.

djmips

24 days ago

from context I figure you meant China and/or other places that would take over American manufacturing but I'm curious what LOC means - typo?

itissid

24 days ago

1 reply

I have always wondered if I should be recording all my conversations privately — with consent —with family and friends and then train an LLM to let anyone speak to someone that sounds "like me" when I am gone.

I suppose one could order all the data over time -— decades — and then train a model incrementally every decade and imitate me better at a point in time.

I suppose one could also narrate thoughts and feelings associated with many transcripts, which would be very tedious but would make the LLM imitate not just style but some amount of internal monologue.

I suppose one level further could be an LLM learning about the variety or parts of the ego, the I, me, mine, ours. Then the Observer and the Observed parts of thought — if we can somehow tap internal thought without manually speaking — because thoughts are, metaphorically speaking, the speed of light.

Why would one do all this? I suppose a curt answer would be to "live" eternally of course — with all the limitations of the current tech — but still try.

It might make a fascinating psychoanalysis project, one that might be a better shot at explaining someone's _self_ not as a we, a stranger, might as outwardly see it: just as a series of highs and lows and nothing in between, but instead as how they lived through it.

futuraperdita

23 days ago

You've created a text-based version of a Black Mirror episode: https://en.wikipedia.org/wiki/Be_Right_Back

levmiseri

24 days ago

Fully agree on the importance of taking notes and writing in general [1], but I absolutely do not want to train a model on my texts or attempt a personal style imitation. I can't fully put my finger on why exactly other than that it feels icky and that it would hinder my long-term writing quality rather than help it.

[1] I made an app to be my lifelong companion for this: https://kraa.io/about – No AI integration.

alansaber

24 days ago

Fine-tuning on a small corpus can definitely get you good results with some care

BoredomIsFun

24 days ago

/r/localllama every once in awhile has such posts; usually very succesful, good results.

BubbleRings

24 days ago

10 replies

> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.

At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.

Just kidding, of course. The first paragraph above, from OP’s article, makes about as much sense to me as the second one, which I (hopefully fittingly in y’all’s view) had ChatGPT write. But I do want to express my appreciation for being able to “hang out in the back of the room” while you folks figure this stuff out It is fascinating, I’ve learned a lot (even got a local LLM running on a NUC), and very much fun. Thanks for letting me watch, I’ll keep my mouth shut from now on ha!

tomrod

24 days ago

2 replies

Disclaimer: working and occasionally researching in the space.

The first paragraph is clear linear algebra terminology, the second looked like deeper subfield specific jargon and I was about to ask for a citation as the words definitely are real but the claim sounded hyperspecific and unfamiliar.

I figure a person needs 12 to 18 months of linear algebra, enough to work through Horn and Johnson's "Matrix Analysis" or the more bespoke volumes from Jeffrey Humpheries to get the math behind ML. Not necessarily to use AI/ML as a tech, which really can benefit from the grind towards commodification, but to be able to parse the technical side of about 90 to 95 percent of conference papers.

jhardy54

24 days ago

1 reply

> 12 to 18 months of linear algebra

Do you mean full-time study, or something else? I’ve been using inference endpoints but have recently been trying to go deeper and struggling, but I’m not sure where to start.

For example, when selecting an ASR model I was able to understand the various architectures through high-level descriptions and metaphors, but I’d like to have a deeper understanding/intuition instead of needing to outsource that to summaries and explainers from other people.

tomrod

24 days ago

I was projecting as classes, taken across 2 to 3 semesters.

You can gloss the basics pretty quickly from things like Kahn academy and other sources.

Knowing Linalg doesn't guarantee understanding modern ML, but if you then go read seminal papers like Attention is All You Need you have a baseline to dig deeper.

danielmarkbruce

24 days ago

3 replies

One needs about 12 to 18 hours of linear algebra to work though the papers, not 12 to 18 months. The vast majority of stuff in AI/ML papers is just "we tried X and it worked!".

miki123211

24 days ago

3 replies

You can understand 95+% of current LLM / neural network tech if you know what matrices are (on the "2d array" level, not the deeper lin alg intuition level), and if you know how to multiply them (and have an intuitive understanding why a matrix is a mapping between latent spaces and how a matrix can be treated as a list of vectors). Very basic matrix / tensor calculus comes in useful, but that's not really part of lin alg.

There are places where things like eigenvectors / eigenvalues or svd come into play, but those are pretty rare and not part of modern architectures (tbh, I still don't really have a good intuition for them).

devmor

24 days ago

1 reply

I was about to respond with a similar comment. The majority of the underlying systems are the same and can be understood if you know a decent amount of vector math. That last 3-5% can get pretty mystical, though.

Honestly, where stuff gets the most confusing to me is when the authors of the newer generations of AI papers invent new terms for existing concepts, and then new terms for combining two of those concepts, then new terms for combining two of those combined concepts and removing one... etc.

Some of this redefinition is definitely useful, but it turns into word salad very quickly and I don't often feel like teaching myself a new glossary just to understand a paper I probably wont use the concepts in.

buildbot

24 days ago

This happens so much! It’s actually imo much more important to be able to let the math go and compare concepts vs. the exact algorithms. It’s much more useful to have semantic intuition than concrete analysis.

Being really good at math does let you figure out if two techniques are mathematically the same but that’s fairly rare (it happens though!)

whimsicalism

24 days ago

1 reply

> There are places where things like eigenvectors / eigenvalues or svd come into play, but those are pretty rare and not part of modern architectures (tbh, I still don't really have a good intuition for them)

This stuff is part of modern optimizers. You can often view a lot of optimizers as doing something similar to what is called mirror/'spectral descent.'

tomrod

21 days ago

Indeed. "Spectral" describes the collection of eigenvalues!

tomrod

24 days ago

Eigenvector/eigenvalues: direction and amount of stretch a matrix pushes a basis vector.

cultofmetatron

24 days ago

1 reply

for anyone looking to get into it, mathacademy has a full zero to everythign you need pathway that you can follow to mastery

https://mathacademy.com/courses/mathematics-for-machine-lear...

DenisM

23 days ago

1 reply

There is no mention of llm there?

cultofmetatron

23 days ago

1 reply

if you want to use llms, just download one and play with it. if you want to understand llms enough to push research forward, learn the underlying math

tomrod

21 days ago

100%

gpjtAuthor

24 days ago

OP here -- agreed! I tried to summarise (at least to my current level of knowledge) those 12-18 hours here: https://www.gilesthomas.com/2025/09/maths-for-llms

QuadmasterXLII

24 days ago

1 reply

The second paragraph is highly derivative of the adversarial turbo encabulator, which Schmithuber invented in the 90s. No citation of course.

BubbleRings

23 days ago

Are you saying I should have attributed, or ChatGPT should have? I suppose I would have but my spurving bearings were rusty.

empath75

24 days ago

It's a 28 part series. If you start from the beginning, everything is explained in detail.

ekropotin

24 days ago

I have no idea what you’ve just said, so here is my upvote.

jcims

24 days ago

The turbo encabulator lives on.

squigz

24 days ago

I'm glad I'm not the only one who has a Turbo Encabulator moment when this stuff is posted.

unethical_ban

24 days ago

I was reading this thinking "Holy crap, this stuff sounds straight out of Norman Rockwell... wait, Rockwell Automation. Oh, it actually is"

woadwarrior01

24 days ago

It's just a long winded way of saying "tied embeddings"[1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models and many more architectures use weight tied input/output embeddings.

[1]: https://arxiv.org/abs/1608.05859

whimsicalism

24 days ago

i consider it a bit rude to make people read AI output without flagging it immediately

miki123211

24 days ago

As somebody who understands how LLMs work pretty well, I can definitely feel your pain.

I started learning about neural networks when Whisper came out, at that point I literally knew nothing about how they worked. I started by reading the Whisper paper... which made about 0 sense to me. I was wondering whether all of those fancy terms are truly necessary. Now, I can't even imagine how I'd describe similar concepts without them.

chiengineer

24 days ago

2 replies

Off topic question since im not a regular here if its ok

Is anyone here actually using the 200$ a month subscriptions with chat gpt or the google 150$ per month ?

Is it worth it for more code generation ? Or spend my money on a couple gpus and go local

esafak

24 days ago

1 reply

To answer the last question: What kind of programming do you do? You are not going to be able to run a model competitive with the SOTA yet; use the cloud. Since you have the budget I'd suggest getting a $20 subscription of each (Claude, Gemini, ChatGPT) so you can lean on their respective strengths.

magicalhippo

24 days ago

I got a free month of the Premium tier with Google[1], YMMV. Been pleasantly surprised about Gemini 3 Pro. Got ChatGPT Business at work to compare it to.

That said, Google's VSCode integration was terrible, kept logging me out and just didn't work well.

[1]: https://one.google.com/about/plans

Taek

24 days ago

I used the $200/mo OpenAI subscription for a while, but cancelled when Gemini 3 came out. It was useful for the deep research credits until the Web search gpt got sufficiently good on it's own

spi

24 days ago

1 reply

This is a very nice, detailed post! I have a few minor comments though (maybe a few are discussed somewhere, it's a _long_ article and I can't claim 100% coverage :-) ):

Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...

The early discussion and worries about truncating strings look a bit weird. The author then realizes they're anyway not even going to use 30% of the total available data, so who cares if for each given string we're only using the first 1024 tokens? (And anyway, even if doing more epochs, he doesn't discuss the obvious solution to avoid throwing away data, i.e. not clipping always the tail but starting from a random point each epoch - maybe after a punctuation or something)

At this level of simplicity, setting up a validation loop might be an unneeded complication (for the autoregressive pretraining part, not the instruction-tuning of course). That's because anyway the model is training for < 1 epoch, so no data is seen twice (*). One might as well just track the training loss, it's slightly less "clean" because it's evaluated each time on different data, but the sheer size of it makes up for the issue. The final plot shows that the two curves are similar - train is noisier of course, but nothing a bit of rolling smoothing couldn't solve.

The choice to load all tokenized text into RAM feels odd... it works, and it's possibly slightly faster than loading on-the-fly, but only if you have enough RAM to "waste". PyTorch loads data on separate processes in a non-blocking way, so it feels like having it on disk and loaded on-the-fly would be safer and not make any hit on runtime. But well, if it fits, it's certainly easier that way (although, as the author remarks, it only works if you can store it as a numpy array or torch tensor of some internally supported dtypes like int or float; if they are any Python "object" types, they get replicated per dataloader worker, and OOM is guaranteed)

The choice to concatenate everything into a long string is a bit outdated nowadays. Because it trains with attention between different sentences that have nothing to do with each other, and could cause a bias or anyway suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654

(*) Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also such a small model would have very little risk to memorize, even if some data were replicated.*

BoxOfRain

24 days ago

1 reply

> Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...

I've always felt the natural way of referring to smaller LLMs would be Medium Language Models and Small Language Models, but I guess MLM is an inauspicious acronym.

jszymborski

24 days ago

It's also already used for language modelling:

MLM is masked language modelling, another phrase for training models on the cloze task. It's the most common way to train encoder-only models.

CLM (causal language modelling) is the other common task where you autoregressively predict the next token given the previous ones. It's the most common way to train decoder-only models.

kburman

24 days ago

1 reply

Anyone interested can also follow this amazing playlist - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSgsLAr8YCgC...

youngNed

24 days ago

These all look great, I'm very interested in hearing from anyone who has followed any of these.

How did you find it, what did you get from it?

spi

24 days ago

3 replies

A separate comment about conclusions about why they are worse than OpenAI GPT2 - which to me feel to be missing the point.

One main point is batch size - I'd agree with Gemini here. Batch size <= 5 with 1024 seq len is really tiny. Nowadays models are trained with effective batch size of millions of tokens in total. Of course, this won't fit into memory, one uses gradient accumulations to that purpose, again as mentioned by Gemini.

Training duration is definitely also a reason - models do get better over time, otherwise people wouldn't train so long wasting millions :-) just how long for optimality is unclear, but certainly < 2 days is not optimal even at this "small" scale.

The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal, it is typically both increased in the beginning ("warm up", but that's for stability, if training works without, that's not an issue) and scaled down at the end ("cool down" - that is, annealing, with cosine as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (might be useful for many epochs, likely only harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow & careful to get good results. Also, even if used, a "good" implementation of weight decay should skip some layers like embeddings and biases (GPT2 did that, and it's relatively important to do so).

On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 gradients + FP32 base weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits) and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.

gpjtAuthor

24 days ago

1 reply

OP here -- thanks! I'm in the process of doing some trains using the same code plus DDP on big Lambda Labs machines, and (within the bounds of what I can afford) will hopefully have some interesting results about all of those shortly.

gpjtAuthor

24 days ago

1 reply

OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:

  * OpenAI medium weights: 3.231
  * OpenAI small weights: 3.500
  * My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
  * My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
  * My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
  * My cloud trained model, FineWeb Chinchilla, batch size 13 \* 8 = 104: 3.674

That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.

I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).

spi

23 days ago

1 reply

Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a counter for number of steps and an "if counter % gradient_accumulation_steps:` around `optimizer.step()`, so that can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8xA100 and your pipeline parallizes well, you also get results (almost) 8 times faster, which is certainly nicer to experiment of course!

gpjtAuthor

23 days ago

1 reply

Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.

I assume the zero_grad would need to go in the same if block?

gpjtAuthor

23 days ago

1 reply

Hmm, interesting. With a batch size of 512 (8x B200s with 160 GiB each) I get worse results! Maybe there's a sweet spot somewhere in between.

spi

21 days ago

1 reply

Sorry came a bit late to this reply. Interesting, well, nobody says it's a monotonic function :-) in the limit of _very_ large batches you of course are worse off, because you take a very large amount of computation before taking a single step, so if you stop after a fixed amount of time your model just didn't have the time to learn properly. So certainly there is a sweet spot somewhere.

I suppose, the real "function" is a bit more complicated because (1) If you put 2x more data through the same GPU with large enough memory, it will take less than 2x the time to compute (but certainly not 1x). (2) At some point, empirically, increasing batch size makes it _worse_ even if you ignore the additional runtime cost (i.e. stop after n gradient update steps, and not x seconds). To my knowledge, the accepted reason for that fact is that a bit of noise helps in regularizing learning, because overly smooth learning curves end up stagnating in local loss minima more easily. In truth, I think nobody exactly understand how deep learning models work :-)

And to your other question - sorry again for the late answer. Yes, `optimizer.zero_grad()` should always be called directly after `optimizer.step()`, therefore with gradient accumulation once every `n` steps (otherwise, you'd be zeroing out the gradients, so just throwing away all the compute you did in previous steps).

gpjtAuthor

21 days ago

1 reply

Thanks re: gradient accumulation, I'm glad to hear my intuition was right!

As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. So that might be interesting.

I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead) so I won't do it.

But perhaps a shorter train would still be of value? That is, train for 300M tokens for a tenth of the cost and see where the loss landed? The problem with that would be if the impact of batch sizes varied with the length of the train, eg. if batch size 64 was better than 512 for short trains but weaker at longer ones.

spi

20 days ago

1 reply

Yes exactly, I fear that shortening the training time would skew the results. In the very short term, smaller batch size is typically better just because you need a certain amount of gradient updates to move away from the original random, hence pretty terrible, weight distribution. Larger batch size gives a steadier, but slower, convergence, so it's hard to say for sure what is better for a given compute budget.

I'm definitely _not_ encouraging you on spending more money on a side topic just for the sake of optimizing this one parameter, there will always be another parameter after that that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come to a very close score to the original GPT2 training starting from scratch!

P.S. If you want to push it a bit further, rather than optimizing parameters for this model, last week at EurIPS I heard that a current "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't investigated this exactly (I'm not working on LLM), but it might be interesting to you for a sample run. The (N)EurIPS paper that was discussed at the conference claimed that the only important change to do was to modify the hyperparameters of the Adam optimizer, setting beta1=beta2=0.95 for example (the default values are beta1=0.9 and beta2=0.999 which are apparently outdated).

gpjtAuthor

16 days ago

Awesome, thanks! I'm still doing trains on the big machines right now (hopefully will write up over xmas) but I think once I've worked out the sweet spot for memgatokens per dollar for this model, it's time to start tweaking the other controls -- LR and cosine variation of it, as you said, and also dropout, bias, weight tying, and definitely gradient clipping (which should at least get better bang for the buck from time/$ spent). I'll leave it to Google to follow up Chinchilla with a "best batch size across a thousand trained models" paper ;-)

alansaber

24 days ago

1 reply

To caveat, smaller batch sizes are generally better for model stability, but we go bigger because it substantially speeds up training

spi

23 days ago

1 reply

Mmh not really. As OP shows, speed increases with larger batch size, but only initially, until the GPU has high enough utilization; then speed improvements flatten out (although you might get OOM before that and not "really" see the flat part). Using smaller batch size increases _noise_, so quite literally decreases stability. That might be good sometimes: in the limit case, if the batch is as large as your training set, you'll end up in local minima and not be able to get out of it. But this is true for toy datasets like MNIST, here it's an entirely different beast.

With such large corpora as the ones used here, and very noisy ones at that, gradient updates are very noisy and that can harm quality. Or anyway, common lore is that one needs pretty large batch size to have the language model improve steadily.

alansaber

23 days ago

1 reply

Are you sure about the top-cap on batch size for speed? See https://arxiv.org/pdf/1904.00962

spi

21 days ago

Sorry I just opened that file now, and browsed through it very quickly, but my eye fell on the excerpt: ``` However, we did not observe any speedup by increasing the batch size from 65536 to 131072 for the first stage, thus, we restrict the batch size to 65536 for this stage. ``` which I think is more or less my point: increasing batch size essentially always helps, but the speedup reduces the more you push the batch size. Provided that your dataset is large enough, more batch size will always make you run a bit faster without sacrificing accuracy, but the speedup will be less and less as you increase the batch size, until you are anyway maxing out the power of your GPU and you can't see any measurable speedup anymore.

whimsicalism

24 days ago

> Nowadays models are trained with effective batch size of millions of tokens in total. Of course, this won't fit into memory, one uses gradient accumulations to that purpose, again as mentioned by Gemini.

I would be surprised if there is much/any gradient acc in modern large-scale pretraining runs. You can always just recruit more GPUs with DP/PP/TP rather than training for longer.

nico

24 days ago

2 replies

Has anyone done something like this but with apple silicon instead of a graphics card? Training a small LLM on an M2-M5?

muricula

24 days ago

1 reply

I've played with something similar with my M1 using Apple's MLX framework. The problem is I'm compute bound. I've never managed to get my M1 Max's GPU to process more than ~7.8k tokens per second at bf16 precision, so to train a 112M parameter model on ~20 billion tokens I'd need to run the model training for ~30 days.

One solution is to reduce the scope of the problem -- you can train on a smaller less diverse dataset such as TinyStories which is a collection of 1 billion tokens of chatGPT generated children's stories. After about 40 hours, less than one weekend, you'll have a model which can generate mostly grammatical children's stories.

If you have a newer mac and/or an ultra chip you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar, larger and more diverse dataset.

gpjtAuthor

23 days ago

OP here -- with a 112M model you should be able to get something worth playing with using 2.24B tokens. The Chinchilla heuristic is tokens = 20 x parameters. Obviously you cam get a better result by grinding through more tokens, but it will be very slow progress. It's worth noting that Andrej Karpathy is using the 20x thing for his nanochat project.

I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, and has the benefit that you can ask follow-up questions.

goosers

23 days ago

I’m experimenting with this, but using the CPU not the GPU. I’m finishing up writing the series now, but focused more on understanding the architecture than trying to build a useful model. Mine requires talking in the language of Shakespeare, and getting replies in the same, a proof of concept more than a useful tool. https://www.tag1.com/white-paper/part1-tokenization-building...

I was interested in focusing on repeatability and using text sources anyone can legally obtain. It’s been fascinating, but after much experimentation it’s clear that working with more text and more diverse text would be extremely helpful.

pwython

24 days ago

3 replies

For those that have homebrewed a base model, does your output have the same AI-isms like overusing em dashes? If so/not, what dataset did you use?

whimsicalism

24 days ago

that is a post-training artifact

itissid

24 days ago

Does yours also use the oxford comma and generally more commas?

miki123211

24 days ago

AFAIK, those are mostly a consequence of posttraining.

billylo

24 days ago

If you are curious about doing something similar with TPU, Google has an article. https://developers.googleblog.com/train-gpt2-model-with-jax-...

lepicz

24 days ago

cool, i was looking for something like this to try on my own puny hw - thanks!

nfriedly

24 days ago

The full list of articles is at https://www.gilesthomas.com/llm-from-scratch for anyone who's interested but wants to start at the beginning.

noloman

24 days ago

Great article

logicallee

24 days ago

you can train an LLM in the browser, see this demonstration:

https://taonexus.com/mini-transformer-in-js.html

It's a very simple neural network with two attention heads that runs right in the browser in pure Javascript, you can view source on this implementation.

Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.

noloman

24 days ago

Great article, thanks!

lacoolj

24 days ago

Maybe I've been missing out, but can anyone give me a yay/nay on whether this is a worth-while 28-part-series to start from scratch and spend my time watching/reading?

Is it along the same lines as https://github.com/karpathy/llm.c/discussions/677 ?

He (karpathy) has a video series that also does something similar. I found it very informative and entertaining, even at the 1 hour + length it is (there are actually multiple videos, im not sure how long the others are).

pixigenie

24 days ago

thanks for sharing

fuddle

24 days ago

This is great to see, I'm also re-reading Sebastian Raschka's amazing book.

RagnarD

24 days ago

I really like this article. I hadn't thought that an RTX 3090 would be capable of generating a sort-of decent small LLM from scratch in a reasonable time, but he shows how in detail.

roschdal

24 days ago

Now this is cool. and can be used for evil AI.

View full discussion on Hacker News

ID: 46124425Type: storyLast synced: 12/10/2025, 2:25:20 AM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN