Writing an LLM From Scratch, Part 22 – Training Our LLM
Posted 3 months ago · gilesthomas.com (Tech story)
Key topics: LLM, AI, Machine Learning, Deep Learning
The author shares their journey of building a Large Language Model (LLM) from scratch, with the latest part focusing on training the model, sparking discussions on the learning process, cost comparisons, and the value of hands-on experience.
Snapshot generated from the HN discussion
Discussion Activity
- Light discussion; first comment after 2h
- Peak period: 2 comments in the 6-8h window
- Average per period: 1.3 comments
Key moments
1. Story posted: Oct 15, 2025 at 7:42 PM EDT (3 months ago)
2. First comment: Oct 15, 2025 at 9:24 PM EDT (2h after posting)
3. Peak activity: 2 comments in the 6-8h window, the hottest stretch of the conversation
4. Latest activity: Oct 17, 2025 at 1:52 AM EDT (3 months ago)
HN story ID: 45599727
Want the full context? Read the primary article or dive into the live Hacker News thread:
[1] https://www.gilesthomas.com/2024/12/llm-from-scratch-1
I think it is a great guide, an extended tutorial if you will (at least up to this point in my reading). Also, having the code right in front of you helps a lot. For example, I was under the impression that embedding vectors were static, as in word2vec. It turns out they are learnable parameters too. I wouldn't have been able to tell for sure without the code in front of me.
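To see the point about embeddings being trainable rather than fixed word2vec-style lookups, here is a dependency-free toy sketch (a hypothetical illustration, not code from the article): the embedding table is just a matrix of parameters, and gradient descent nudges one row toward a target, the same way backprop moves it inside a real model.

```python
# Toy demonstration that embedding vectors are learnable parameters:
# a 3-word vocabulary with 2-d embeddings, where word 0's vector is
# trained by plain gradient descent. (Real LLMs do this via backprop
# through the whole network; this isolates just the embedding update.)
import random

random.seed(0)
vocab_size, dim = 3, 2
# The embedding "layer" is nothing but a matrix of parameters.
emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

target = [1.0, -1.0]  # pretend the loss wants word 0's vector here
before = list(emb[0])
lr = 0.1
for _ in range(100):
    # gradient of the MSE loss 0.5 * sum((emb - target)^2) w.r.t. emb[0]
    grad = [e - t for e, t in zip(emb[0], target)]
    emb[0] = [e - lr * g for e, g in zip(emb[0], grad)]

print("before:", before)
print("after: ", emb[0])  # has moved essentially onto the target
```

The vector starts random and ends up wherever the loss pushes it, which is exactly what makes it a parameter rather than a static lookup.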
There isn't really much intuition to begin with, and I don't really think building intuition will be useful, anyway. Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work. Heck, even implementing a Markov chain from scratch (which can be done in an afternoon with no prior knowledge) can feel magical when it starts outputting semi-legible sentences.
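The afternoon-sized Markov chain mentioned above really is this small; here is one possible version (a hypothetical sketch with a made-up corpus, not from the article): count which word follows which, then sample a random walk through those transitions.

```python
# Minimal bigram Markov chain text generator: learn transitions from a
# toy corpus, then generate text by sampling successors.
import random
from collections import defaultdict

random.seed(42)
corpus = "the cat sat on the mat the cat ran on the rug".split()

# Count which word follows which in the corpus.
transitions = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    transitions[a].append(b)

def generate(start, n):
    """Random-walk n steps from `start`, falling back to a random
    corpus word when the current word has no known successor."""
    word, out = start, [start]
    for _ in range(n):
        word = random.choice(transitions.get(word, corpus))
        out.append(word)
    return " ".join(out)

print(generate("the", 8))
```

Even this tiny model produces locally plausible word order, which is the "semi-legible sentences" effect the comment describes.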
It's like trying to build intuition when it comes to technical results like the Banach-Tarski paradox or Löb's theorem. Imo, understanding the math (which in the case of LLMs is actually quite simple) is orders of magnitude more valuable than "building intuition," whatever that might mean.
I was thinking something like "it is trying to approximate a non-linear function" (which is what it is in the case of MLPs).
Check out the Karpathy "Zero to Hero" videos, and try to follow along by building an MLP implementation in your own language of choice. He does a good job of building intuition because he doesn't skip much of anything.
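In the spirit of those videos, a from-scratch MLP with hand-written backprop fits in a page (a hypothetical sketch, not code from the series or the article): one tanh hidden layer learning XOR, the classic non-linear function that no single linear layer can represent, which makes the "approximating a non-linear function" framing concrete.

```python
# Tiny MLP trained by hand-coded backprop to learn XOR:
# 2 inputs -> H tanh hidden units -> 1 sigmoid output.
import math
import random

random.seed(1)
H = 6  # hidden units

def rand():
    return random.uniform(-1, 1)

W1 = [[rand(), rand()] for _ in range(H)]
b1 = [rand() for _ in range(H)]
W2 = [rand() for _ in range(H)]
b2 = rand()

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    h = [math.tanh(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    z = sum(W2[j] * h[j] for j in range(H)) + b2
    return h, 1 / (1 + math.exp(-z))  # hidden activations, sigmoid output

lr = 0.5
for _ in range(8000):
    for x, y in data:
        h, p = forward(x)
        dz = p - y  # gradient of cross-entropy loss w.r.t. pre-sigmoid z
        for j in range(H):
            dh = dz * W2[j] * (1 - h[j] ** 2)  # chain rule through tanh
            W2[j] -= lr * dz * h[j]
            b1[j] -= lr * dh
            W1[j][0] -= lr * dh * x[0]
            W1[j][1] -= lr * dh * x[1]
        b2 -= lr * dz

for x, y in data:
    print(x, round(forward(x)[1], 3))
```

Writing the backward pass by hand like this, rather than calling an autograd library, is exactly the exercise that builds the intuition the thread is debating.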
Feeling nostalgic about the days of building LFS (Linux From Scratch) in college.
Learning by building won't help you remember all the details, but many things make more sense after going through the process step by step. And it's fun.