Tiled Hacker news on React Router

Training mRNA Language Models Across 25 Species for $165

108 points - last Wednesday at 8:38 PM

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.

seamossfet
yesterday at 6:49 PM
The problem with models like this is that they're built on very little training data we can trace back to verifiable protein data. The Protein Data Bank, and other sources of training data for work like this, contain a lot of broken structures and "creative liberties" taken to infer a structure from instrument data. It's a very complex process that leaves a lot open to interpretation.
On top of that, we don't have a clear understanding of how certain positions (conformations) of a structure affect underlying biological mechanisms.
Yes, these models can predict surprisingly accurate structures and sequences. Do we know if these outputs are biologically useful? Not quite.
This technology is amazing, don't get me wrong, but the average person might see this and wonder why we can't go full futurism and solve every pathology with models like these.
We've come a long way, but there's still a very very long way to go.
maziyar
last Wednesday at 8:38 PM
full article: https://huggingface.co/blog/OpenMed/training-mrna-models-25-...
colingauvin
yesterday at 8:07 PM
HN's blindspots never cease to amaze me.
I am a structural biologist working in pharmaceutical design and this type of thing could be wildly useful (if it works).
rubicon33
yesterday at 4:11 PM
Can someone explain what one might use this model for? As a developer with a casual interest in biology, it would be fun to play with, but honestly I'm not sure what I would do with it.
khalic
yesterday at 3:28 PM
> In Progress: CodonJEPA
JEPA is going to break the whole industry :D
simianwords
yesterday at 3:28 PM
What makes these domain-specific models work when we don’t have good domain models for health care, chemistry, economics, and so on?
yieldcrv
yesterday at 4:12 PM
Distributing the load on this will probably be infinitely more useful than “folding at home”
HocusLocus
yesterday at 3:15 PM
gray goo of the future
skyskys
yesterday at 6:14 PM
hmmmm seems like some fake hype.
