Tiled Hacker news on React Router

Ask HN: Best foundation model for CLM fine-tuning?

27 points - last Thursday at 11:08 AM

Hi,

I have a largish (2 GB) corpus of curated, high-quality text in some low-resource language, and I want to build a model that would provide an advanced "auto complete" service for writers.

I'm thinking of taking a decoder-only model such as Llama, Mistral or Gemma, slice off the embedding layers (which are based on unneeded languages), create new ones (perhaps initialized based on a FastText model trained on the corpus), paired with a tokenizer newly created from my corpus, then train the model on my corpus until convergence.

Additional potential details include: a custom loss function for synonym-aware training (based on a custom high-quality thesaurus), where synonyms of the "correct" word are somewhat rewarded; POS-tagging the corpus with a Language-specific POS-tagger, and add a POS-tagging head to the model as a Multi-task Learning, to force grammatical generation.

In order to be able to use a good model as the base, I will probably be forced to use PEFT (LoRA). My current setup is whatever is available on Colab Pro+, so I can probably use the 7b-12b range of models?

My main question is, which base model would be best for this task? (Again, for completion of general writing of all kinds, not programming or advanced reasoning).

Also, will the synonym and POS additions help or hurt?

Anything else I might be missing?

Thanks!

jcuenod
today at 1:06 AM
My day job involves training language models (mostly seq2seq) for low-resource languages (with substantially less data than 2GB of data).
A few thoughts:
1. You can't cut off the embedding layer or discard the tokenizer without throwing out the model you're starting with. The attention matrices are applied to and trained with the token embedding layer.
2. Basically the same thing regarding the tokenizer. If you need to add some tokens, that can be done (or you can repurpose existing tokens) if your script is unique (a problem I face periodically). But if you are initializing weights for new tokens, that means those tokens are untrained. So if you do that for all your data, you're training a new model.
3. The Gemma model series sounds like a good fit for your use case. I'm not confident about Hebrew support, let alone Hasidic Yiddish, but it is relatively multilingual (more so than many other open models). Being multilingual means that the odds are greater than they have tokens relevant to your corpus that have been trained towards an optimal point for your dataset.
4. If you can generate synthetic data with synonyms or POS tags, then great. But this is a language model, so you need to think how you can usefully teach it natural sequences of text (not how to tag nouns or identify synonyms - I also did a bunch of classic NLP, and it's depressing how irrelevant all that work is these days). I suspect that repurposing this data will not be worth it. So, if anything, I'd recommend doing that as a second pass.
5. Take a look at unsloth notebooks for training a gemma3 model and load up your data. I reckon it'll surprise you how effective these models are...
omneity
yesterday at 6:44 AM
The first thing that comes to mind when reading “custom tokenizer” and “slice off the embedding layers” is that this sounds very much like pre-training from scratch, for which 2GB is far from enough.
Assuming you do get the data though, for a model at the sizes you’re evaluating you’re looking at weeks on a Colab A100-40GB most likely.
My recommendation would be to approach this with a smaller model and with a different training method that doesn’t involve a new tokenizer or new embedding layers because that’s what’s causing the cost x time to balloon beyond feasibility.
fzimmermann89
yesterday at 7:04 AM
How foreign is the language - was it likely included in pre training to some degree? Does it use grammar, syllables, and logic similiar to one of the large languages? Your approach assumes there is an easy to learn mapping between context in your target language and concepts in a prettained llm.
Can you get more text written in the low resources language?
Are you ok to share the name of the language?
ACCount37
yesterday at 11:29 AM
Don't fuck with the architecture for no reason. Just fucking don't. If you really, really want to, ALWAYS have a baseline of "the architecture was not fucked with" with otherwise similar training at hand, so you can compare. You'll see why.
The purpose of using a base model in the first place is to be able to reuse existing learned representations so the model only has to learn the specific task. You propose starting the run off by kicking the base model in the balls and forcing it to relearn a lot of the things that lie at its foundation. While not even doing a full fine tune. And with a dataset that's VERY small for a heavy duty tuning run. I'm not saying it can't work - but I am saying that you'll suffer trying to make it work.
Anything fancy you try during the training? Less of a minefield, but, again: keep a baseline to compare things to. 9 out of 10 fancy training ideas fail to outperform the baseline. And quite a few of those 9 underperform the baseline noticeably. For my first run, I'd maybe implement known-good basics like curriculum learning if possible but nothing fancier than that.
"Softened targets" with semantic similarity off a dictionary might work to improve sample efficiency early into the run, but it's the kind of thing that might hobble your performance further into the run because your dictionary assumptions are worse than what the model could learn on its own, so taper this off at least? POS-tagging might improve things, in a similar way, but only if you find a decent way to feed the known-good tags into the model, which may be as simple as "put the tags in the square bracket after the words with a "this is a POS-tagged text" next to the text, then mask". The "extra POS head" may work but it might be harder to make that work than to rotate the tags into the corpus naively?
Keep in mind that those are suggestions I make based on VIBES ONLY, and the only way to know if those vibes are on point or wildly off base is to actually try those runs, because that's how applied ML is.
So if you want to get fancy, start off with a small model that's cheap and fast to tune, make sure you can validate performance at least somewhat, and be ready to experiment with your runs a lot.

Ask HN: Best foundation model for CLM fine-tuning?

jcuenod

omneity

GeneralMayhem

philomath868

fzimmermann89

philomath868

agentcoops

mathiaspoint

bc569a80a344f9c

philomath868

yorwba

bc569a80a344f9c

agentcoops

fzimmermann89

ACCount37

philomath868

ACCount37