SmolLM3: Smol, multilingual, long-context reasoner LLM

388 points - 07/08/2025

WhitneyLand
07/08/2025
Mostly SOTA performance at the 3B level. A notable addition to the small but truly open club of models that provide full disclosure, code, and recipes to reproduce their work.
Looks like roughly a million dollars of GPU time if you want to train one yourself (4000 GPUs for 24 days).
Very nice write-up that’s generous in sharing their learnings.
This is a solid and positive contribution.
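For anyone who wants to sanity-check the ballpark above, here is a minimal sketch of the arithmetic. It takes the comment's figures (4000 GPUs, 24 days, about $1M) at face value. The $2.00 per GPU-hour rate is purely a hypothetical placeholder, not a quoted price, and the function names are made up for illustration.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps every GPU busy around the clock."""
    return num_gpus * days * 24


def training_cost(num_gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Cost in USD at a flat hourly rate per GPU."""
    return gpu_hours(num_gpus, days) * usd_per_gpu_hour


def implied_rate(budget_usd: float, num_gpus: int, days: float) -> float:
    """Hourly rate per GPU that a given total budget would imply."""
    return budget_usd / gpu_hours(num_gpus, days)


if __name__ == "__main__":
    gpus, days = 4000, 24  # figures as stated in the comment
    print(f"GPU-hours: {gpu_hours(gpus, days):,.0f}")
    # ASSUMPTION: $2.00 per GPU-hour is a hypothetical rate, for illustration only.
    print(f"Cost at $2.00/GPU-hour: ${training_cost(gpus, days, 2.00):,.0f}")
    print(f"Rate implied by a $1M budget: ${implied_rate(1_000_000, gpus, days):.3f}/GPU-hour")
```

With these inputs the run comes to 2,304,000 GPU-hours, so a $1M total only works out at roughly $0.43 per GPU-hour. Plug in your own GPU count and rate to see how sensitive the estimate is.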
gardnr
07/08/2025
It's small (3B) and does great on benchmarks. This is a model for edge / mobile deployments so the gains over gemma3-4b are meaningful. It has dual mode reasoning / non_reasoning AND they released the full training method:
> We're releasing SmolLM3 with our engineering blueprint. It includes architecture details, exact data mixtures showing how we progressively boost performance across domains in a three-stage pretraining approach, and the methodology for building a hybrid reasoning model. Usually, achieving these results would require months of reverse engineering. Instead, we're providing the full methodology.
msgodel
07/08/2025
Wow. Close to a Qwen3 distill at 75% of the size. That's great!
I've been using the smollm base models for my own finetunes just because they're so high quality. It looks like I might be using them to drive local agents/code completion in the near future too.
Their RL algorithm looks interesting. I'm still using OpenAI's algorithm for my stuff, I've been meaning to check on the SoTA since I know my code is pretty outdated (It's crazy how fast that happens with this stuff.)
danielhanchen
07/08/2025
I fixed some chat template issues for llama.cpp and other inference engines! To run it, do:
./llama.cpp/llama-cli -hf unsloth/SmolLM3-3B-GGUF:Q4_K_XL --jinja -ngl 99
_1
07/08/2025
Which small model is good for fine-tuning on various enterprise datasets? Our business units want to run small models in the browser and on mobile devices, without dealing with RAG and cloud resources.
gdiamos
07/08/2025
Nice work anton et al.
I hope you continue the 50-100M parameter models.
I think there is a case for models that finish fast on CPUs in solve-by-LLM test cases.
nateb2022
07/08/2025
https://web.archive.org/web/20250708164705/https://huggingfa...
simonw
07/09/2025
I'm having trouble running this on my Mac - I've tried Ollama and llama.cpp llama-server so far, both using GGUFs from Hugging Face, but neither worked.
(llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'smollm3')
I've managed to run it using Python and transformers with PyTorch in device="cpu" mode but unsurprisingly that's really slow - it took 35s to respond to "say hi"!
Anyone had success with this on a Mac yet? I really want to get this running with tool calling, ideally via an OpenAI-compatible serving layer like llama-server.
tiahura
07/08/2025
Can anyone estimate how much of the 3B is necessitated by multi-language support?
BarakWidawsky
07/08/2025
It’s interesting that it looks like they didn’t apply their own RL to the model, and instead fine-tuned on reasoning traces from large datasets and on reasoning traces generated by larger models.
ivape
07/08/2025
Looks like it's the 3B models that are being shipped on-device by default. Apple's on-device LLM is 3B, and I believe Chrome Canary is shipping Google's Gemini Nano:
https://developer.chrome.com/docs/ai/rewriter-api
eachro
07/08/2025
From what I've heard, the llama3 models are fairly easy to fine-tune (please correct me if I'm wrong or if there are more amenable models here). How easy is it to finetune smollm3? I know a lot of the MoE LLMs have been quite fickle in this regard.
cess11
07/09/2025
I've tried to use gemma3:4b which comes up better in that benchmark and found it to be quite disappointing. It breaks a lot, sucks even worse than qwen2.5-coder:7b and incept5/llama3.1-claude:7b at code, needs to be tricked or threatened into saying stuff about many everyday topics. It also commonly chugs away for minutes exercising the GPU fans before responding, at which point I'm already ahead because I figured out another way to solve my problem or get at some information.
My experience with phi4-mini and granite3.3 was about the same, and they annoy me even more when I hook them into code editors and try to get them to contribute to my work. For one because they're slow, and at best they suggest adding unnecessary error handling in the style of null checks everywhere, at worst they just start mixing or hallucinating programming languages. Where they would be useful as leverage if they worked, i.e. close to the edge of where I can debug and refactor without getting stuck, they just go into straight nonsense mode, especially on terse first-pass code.
Sometimes I've tried to query these things for descriptions of recent history in foreign countries, Wikipedia trivia basically, and they're very often wrong in subtle ways. For example, a politician might have been at it for half a century or so in a troubled country and because they've been ousted in a coup once in the eighties the model is absolutely sure they can't have been in office since.
If a person acted like these things do I'd wish for them to get immediate institutional care. Maybe the problem is somehow with me, but I have a deep suspicion it's not.
lvl155
07/09/2025
This is actually good learning material for anyone getting up to speed on LLMs from scratch.
ivape
07/08/2025
I wonder if this will be cheaper than llama 3.1 8b on OpenRouter.
grrowl
07/09/2025
Great to see Hugging Face stick to their guns with CodeEval and Python tooling. Agentic turn-by-turn tool calling is fine and all, but we're underutilising their ability to write and execute code in an "agent-like" environment.
bitwize
07/08/2025
There's a British comedy skit lurking in here.
"So it's a small large language model?"
"Oh yes, very small."
"How can it be small and large at the same time?"
"Well, it's small by the standards of a large language model."
"So it's large."
"Oh yes, very large."
"Large compared to what?"
"Small language models."
"And so something like ChatGPT, what would that be exactly? A large large language model?"
"Yes, precisely. An LLLM."
