Ask HN: Best foundation model for CLM fine-tuning?
Assuming you do get the data though, for a model at the sizes you’re evaluating you’re looking at weeks on a Colab A100-40GB most likely.
My recommendation would be to approach this with a smaller model and a different training method that doesn't involve a new tokenizer or new embedding layers, because that's what's causing the cost × time to balloon beyond feasibility.
A new tokenizer and embeddings will probably be required anyway, since the language is practically missing from any model worth playing with, but at that point simply training a small specialized model from scratch is perhaps a better bet than trying to bolt it onto a big off-the-shelf model?
- Tokenize your entire corpus with a few off-the-shelf multilingual tokenizers like Llama, Qwen and Gemma and calculate the ratio of letters to tokens. The higher the better, ideally in the 3-5 range (see the sketch after this list)
- Manually produce or select sentences that are similar in meaning but not in wording (embedding models also leverage graphemic overlap, not just semantic similarity), and then check whether the similar sentences show consistently higher cosine similarity than the dissimilar ones. This is for embedding models like XLM-RoBERTa rather than for LLMs, but it offers similar insight.
If both of these tests are promising then you likely don’t need custom implementations for these.
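A rough sketch of both checks, if it helps; the model names are just examples, and the similar/dissimilar sentence pairs are ones you'd hand-pick yourself:

```python
# Rough sketch of both sanity checks. Model names are examples, not recommendations.
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer, util

sample = open("corpus_sample.txt", encoding="utf-8").read()  # a few MB is plenty

# 1. Characters-per-token check: higher is better, ideally roughly 3-5.
for name in ["meta-llama/Llama-3.1-8B", "Qwen/Qwen3-0.6B-Base", "google/gemma-3-4b-pt"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sample)["input_ids"])
    print(f"{name}: {len(sample) / n_tokens:.2f} chars/token")

# 2. Embedding check: similar-meaning pairs should score consistently higher.
emb = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # XLM-R-based example
similar = [("sentence A", "same meaning, different wording")]       # fill with your own pairs
dissimilar = [("sentence A", "unrelated sentence")]
for a, b in similar + dissimilar:
    score = util.cos_sim(emb.encode(a, convert_to_tensor=True),
                         emb.encode(b, convert_to_tensor=True)).item()
    print(f"{score:.3f}  {a[:40]} || {b[:40]}")
```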
Personally, if I were you, I would just take Qwen 0.6B Base (not Instruct, since you want text completion) and continue pretraining it on the data that you have. It is very likely to work decently out of the box.
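Something along these lines is all it takes to get a first continued-pretraining run going; the checkpoint name assumes the Qwen3 0.6B base release, one document per line in train.txt, and the hyperparameters are placeholders, not recommendations:

```python
# Minimal continued-pretraining sketch. Sequence packing is omitted for brevity;
# for 2GB of data you'd want it, plus checkpointing and an eval split.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

name = "Qwen/Qwen3-0.6B-Base"  # assumed checkpoint name for "Qwen 0.6B Base"
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(name)

ds = load_dataset("text", data_files="train.txt")["train"]
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-yiddish-cpt",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1, learning_rate=1e-5,
                           bf16=True, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
)
trainer.train()
```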
Can you get more text written in the low resources language?
Are you ok to share the name of the language?
The language is Hasidic Yiddish (which is by now different enough from YIVO Yiddish to almost be considered a different language). The amount of Yiddish (of any kind) included in pretraining data is probably very little, but not nothing. Also, it's a Germanic language with Hebrew script and Hebrew roots, plus some Slavic roots and suffixes. Most concepts and structures are probably not *very* foreign to a good model.
As I wrote in another comment, I have thought about initializing the new embeddings based on equivalent tokens in the old ones (e.g. by translating a token to English and finding the closest old token), but I'm starting to rethink the feasibility.
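Roughly what I had in mind, for the record; `translate_to_english` is a stand-in for a dictionary/MT lookup I'd still have to build, and the base checkpoint is just an example:

```python
# Sketch of translation-based embedding init: each new token gets the mean of the
# old embeddings of its English translation, falling back to the overall mean.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-3-1b-pt"                         # example base checkpoint
old_tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
new_tok = AutoTokenizer.from_pretrained("./yiddish-tokenizer")  # the new tokenizer

old_emb = model.get_input_embeddings().weight.data
new_emb = old_emb.mean(dim=0).repeat(len(new_tok), 1)  # fallback init: mean of old rows

for new_id in range(len(new_tok)):
    piece = new_tok.convert_ids_to_tokens(new_id)
    english = translate_to_english(piece)              # hypothetical dictionary/MT helper
    if english:
        old_ids = old_tok(english, add_special_tokens=False)["input_ids"]
        if old_ids:
            new_emb[new_id] = old_emb[old_ids].mean(dim=0)

model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```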
I will probably get more text sometime in the future, but I have to build the first version now.
There are plenty of guides online for fine-tuning for dialects. 2GB still isn't a huge amount of data, but it seems like it would definitely be worth a concerted try (including fiddling with it a bit), given how expensive training from scratch is.
So given a Latin-script token from a model that does OK in German (bonus points if it also does Hebrew), generate several candidate Hebrew-script tokens with some regex search-and-replace, then use the resulting vocabulary to tokenize your Yiddish corpus and for each original token keep the candidate replacement that was used most often in the tokenization.
This vocabulary replacement should give you a model that does OK in German-in-Hebrew-script. I think that would be a better base for a Yiddish model than training from scratch, but of course that's just a hunch that might turn out to be wrong.
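Something like this is the shape of it; the grapheme map below is a tiny illustrative fragment rather than a real German-to-Yiddish transliteration table, and I'm simplifying the "most used in tokenization" step to whitespace-token counts:

```python
# Rough sketch of the vocabulary-replacement idea.
import re
from collections import Counter

# Illustrative fragment only; a real table needs care with ordering and final forms.
GRAPHEME_MAP = [("sch", "ש"), ("ei", "ײ"), ("oy", "ױ"), ("b", "ב"), ("d", "ד"),
                ("g", "ג"), ("k", "ק"), ("l", "ל"), ("m", "מ"), ("n", "נ"),
                ("r", "ר"), ("t", "ט"), ("e", "ע"), ("i", "י"), ("u", "ו")]

def candidates(latin_token: str) -> list[str]:
    """A real version would branch on ambiguous graphemes and return several
    candidates; this greedy single pass returns just one."""
    out = latin_token.lower()
    for pat, rep in GRAPHEME_MAP:
        out = re.sub(pat, rep, out)
    return [out]

counts = Counter(open("corpus_sample.txt", encoding="utf-8").read().split())

def best_candidate(latin_token: str) -> str:
    """Keep whichever candidate spelling the Yiddish corpus actually uses most."""
    return max(candidates(latin_token), key=lambda c: counts[c])
```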
I ran some tests and, without fine-tuning, GPT can translate medieval German, for example, considerably better than well-known scholars today.
The purpose of using a base model in the first place is to be able to reuse existing learned representations so the model only has to learn the specific task. You propose starting the run off by kicking the base model in the balls and forcing it to relearn a lot of the things that lie at its foundation. While not even doing a full fine tune. And with a dataset that's VERY small for a heavy duty tuning run. I'm not saying it can't work - but I am saying that you'll suffer trying to make it work.
Anything fancy you try during the training? Less of a minefield, but, again: keep a baseline to compare things to. 9 out of 10 fancy training ideas fail to outperform the baseline. And quite a few of those 9 underperform the baseline noticeably. For my first run, I'd maybe implement known-good basics like curriculum learning if possible but nothing fancier than that.
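For what curriculum learning can look like at its most basic, here's a sketch; the difficulty proxy (plain length) and the idea that it helps at all are assumptions to check against a shuffled baseline:

```python
# Crudest possible curriculum: shorter documents first. Note that Trainer shuffles
# by default, so you'd need to preserve the order (e.g. by splitting into staged runs).
from datasets import load_dataset

ds = load_dataset("text", data_files="train.txt")["train"]
ds = ds.map(lambda x: {"difficulty": len(x["text"])})
ds = ds.sort("difficulty")   # easy -> hard
```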
"Softened targets" with semantic similarity off a dictionary might work to improve sample efficiency early into the run, but it's the kind of thing that might hobble your performance further into the run because your dictionary assumptions are worse than what the model could learn on its own, so taper this off at least? POS-tagging might improve things, in a similar way, but only if you find a decent way to feed the known-good tags into the model, which may be as simple as "put the tags in the square bracket after the words with a "this is a POS-tagged text" next to the text, then mask". The "extra POS head" may work but it might be harder to make that work than to rotate the tags into the corpus naively?
Keep in mind that those are suggestions I make based on VIBES ONLY, and the only way to know if those vibes are on point or wildly off base is to actually try those runs, because that's how applied ML is.
So if you want to get fancy, start off with a small model that's cheap and fast to tune, make sure you can validate performance at least somewhat, and be ready to experiment with your runs a lot.
What about deleting vision layers (e.g. the "multi_modal_projector" and the "vision_tower.vision_model" layers, assuming I go with Gemma 3), since I need just language generation? Would that also be considered a "kick in the balls", or a useful trimming?
I've seen "cut off unused vision inputs" done for older multimodals, just not the newer Gemma 3.
A few thoughts:
1. You can't cut off the embedding layer or discard the tokenizer without throwing out the model you're starting with. The attention matrices are applied to and trained with the token embedding layer.
2. Basically the same thing regarding the tokenizer. If you need to add some tokens, that can be done (or you can repurpose existing tokens) if your script is unique (a problem I face periodically); see the sketch after this list. But if you are initializing weights for new tokens, that means those tokens are untrained. So if you do that for all your data, you're training a new model.
3. The Gemma model series sounds like a good fit for your use case. I'm not confident about Hebrew support, let alone Hasidic Yiddish, but it is relatively multilingual (more so than many other open models). Being multilingual means the odds are greater that it already has tokens relevant to your corpus that have been trained toward an optimal point for your dataset.
4. If you can generate synthetic data with synonyms or POS tags, then great. But this is a language model, so you need to think how you can usefully teach it natural sequences of text (not how to tag nouns or identify synonyms - I also did a bunch of classic NLP, and it's depressing how irrelevant all that work is these days). I suspect that repurposing this data will not be worth it. So, if anything, I'd recommend doing that as a second pass.
5. Take a look at unsloth notebooks for training a gemma3 model and load up your data. I reckon it'll surprise you how effective these models are...
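Re point 2, the "add a handful of tokens" route looks roughly like this; the token list is illustrative, and the new embedding rows come out randomly initialized, so they still need training:

```python
# Extend the existing tokenizer instead of replacing it, then grow the embedding matrix.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-3-1b-pt"                 # text-only example checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

added = tok.add_tokens(["װ", "ױ", "ײ"])       # e.g. Yiddish digraph ligatures, if missing
if added:
    model.resize_token_embeddings(len(tok))   # new rows are randomly initialized
```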