Training mRNA Language Models Across 25 Species for $165

108 points - last Wednesday at 8:38 PM


We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
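
For readers unfamiliar with the two headline metrics, here is a minimal sketch of how they are commonly computed. Perplexity is the exponential of the mean negative log-likelihood per token, so a value of 4.10 roughly means the model is as uncertain as a uniform choice among about four codons (out of 64 possible). CAI is the geometric mean of per-codon relative adaptiveness weights, and the Spearman figure is a rank correlation between CAI and some model-derived score. The codon weights, sequences, model scores, and function names below are made-up illustrations, not the article's code, and how the article derives its per-sequence score is an assumption here.

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

def cai(seq, weights):
    """Codon Adaptation Index: geometric mean of relative adaptiveness weights."""
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy relative-adaptiveness weights for three leucine codons (made-up values).
toy_weights = {"CTG": 1.0, "CTC": 0.5, "TTA": 0.2}
sequences = ["CTGCTGCTG", "CTGCTCTTA", "TTATTATTA", "CTCCTCCTG"]

# Hypothetical per-sequence model scores (e.g. mean log-likelihood).
model_scores = [-1.2, -1.5, -3.5, -1.8]

cai_values = [cai(s, toy_weights) for s in sequences]
rho = spearman(model_scores, cai_values)

print("perplexity of a uniform 4-way guess:", perplexity([0.25] * 4))
print("CAI per sequence:", [round(v, 3) for v in cai_values])
print("Spearman(model score, CAI):", round(rho, 3))
```

In practice a library routine such as SciPy's `spearmanr` would replace the hand-rolled rank code; it is written out here only to keep the sketch dependency-free.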

  • seamossfet

    yesterday at 6:49 PM

    The problem with models like this is that they're built on very little training data we can trace back to verified protein data. The Protein Data Bank, and other sources of training data for stuff like this, have a lot of broken structures in them, with "creative liberties" taken to infer a structure from instrument data. It's a very complex process that leaves a lot open to interpretation.

    On top of that, we don't have a clear understanding of how certain positions (conformations) of a structure affect underlying biological mechanisms.

    Yes, these models can predict surprisingly accurate structures and sequences. Do we know if these outputs are biologically useful? Not quite.

    This technology is amazing, don't get me wrong, but the average person might see this and wonder why we can't go full futurism and solve every pathology with models like these.

    We've come a long way, but there's still a very very long way to go.

      • stardust2

        yesterday at 9:17 PM

        How do we get more verifiable protein data? So even if we had better data, we don't yet understand how the structure impacts the biology?

    • maziyar

      last Wednesday at 8:38 PM

      full article: https://huggingface.co/blog/OpenMed/training-mrna-models-25-...

        • pfisherman

          yesterday at 8:11 PM

          Nice work! Here is an article you may find helpful if you have not already come across it [0]. You may also want to consider benchmarking against some non-ML methods [1].

          0. https://pubmed.ncbi.nlm.nih.gov/35318324/

          1. https://www.nature.com/articles/s41586-023-06127-z

          • xyz100

            yesterday at 3:18 PM

            What makes this dataset or problem worth solving compared to other health datasets? Would the results on this task be broadly useful to health?

              • CyberDildonics

                yesterday at 4:44 PM

                What other "datasets" are you talking about? How do you "solve a dataset"?

        • colingauvin

          yesterday at 8:07 PM

          HN's blind spots never cease to amaze me.

          I am a structural biologist working in pharmaceutical design and this type of thing could be wildly useful (if it works).

          • rubicon33

            yesterday at 4:11 PM

            Can someone explain what one might use this model for? As a developer with a casual interest in biology, I think it would be fun to play with, but honestly I'm not sure what I would do with it.

              • colechristensen

                yesterday at 4:25 PM

                You can get your feet wet with genetic engineering for surprisingly little money.

                This guy shows a lot of how it's done: https://www.youtube.com/@thethoughtemporium

                Basically you can design/edit/inject custom genes into things and see real results while spending on the scale of $100-$1000.

                  • someuser54541

                    yesterday at 4:40 PM

                    Is there something like this in text/readable format?

                    • _zoltan_

                      yesterday at 7:12 PM

                      My main concern is using fungi. If it ends up in my lungs I'm most likely screwed, right?

                        • colechristensen

                          yesterday at 8:20 PM

                          This is the classic meme: https://www.reddit.com/r/labrats/comments/mmv2ig/lab_strains...

                          Lab strains of things tend to be extremely sensitive and not human adapted. You shouldn't study and modify human-infecting organisms in your basement anyway. While you shouldn't ignore protective equipment and proper procedure... paranoia about infecting yourself with a lab leak isn't warranted.

                          • nurettin

                            yesterday at 8:02 PM

                            Yes, but most students produce their best work while infected.

                • khalic

                  yesterday at 3:28 PM

                  > In Progress: CodonJEPA

                  JEPA is going to break the whole industry :D

                    • digdugdirk

                      yesterday at 3:50 PM

                      Can you explain this? I haven't heard of JEPA, and from a quick search it seems to be vision/robotics based?

                        • khalic

                          yesterday at 4:42 PM

                          It’s a self-supervised learning architecture, and it’s pretty much universal. The loss function runs on embeddings, and there are some other smart architectural choices all over. Worth diving into for a few hours; Yann LeCun gives some interesting talks about it.
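
To make "the loss function runs on embeddings" concrete, here is a rough sketch of one JEPA-style forward pass, loosely following the image-based I-JEPA setup as I understand it: a context encoder sees the visible part of the input, a slowly updated target encoder sees the hidden part, and a predictor is scored on how closely it matches the target embedding rather than the raw input. The toy linear "encoders", dimensions, masking scheme, and all names are assumptions for illustration; nothing here reflects how CodonJEPA is built, and no gradient step is shown.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def embedding_loss(pred, target):
    """Mean squared error measured in embedding space, not input space."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def ema_update(target_w, context_w, momentum=0.99):
    """Move the target encoder slowly toward the context encoder."""
    return [
        [momentum * t + (1 - momentum) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(target_w, context_w)
    ]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
dim_in, dim_emb = 6, 3

context_enc = rand_matrix(dim_emb, dim_in)
target_enc = [row[:] for row in context_enc]   # starts as a copy of the context encoder
predictor = rand_matrix(dim_emb, dim_emb)

x = [random.uniform(-1, 1) for _ in range(dim_in)]
mask = [1, 1, 1, 0, 0, 0]                      # 1 = visible context, 0 = hidden target

context_view = [v * m for v, m in zip(x, mask)]
target_view = [v * (1 - m) for v, m in zip(x, mask)]

# Predict the embedding of the hidden part from the embedding of the visible part.
pred = linear(predictor, linear(context_enc, context_view))
# The target embedding is treated as a constant (no gradient flows through it).
target = linear(target_enc, target_view)

loss = embedding_loss(pred, target)
print("embedding-space loss:", loss)

# After an optimizer step on context_enc and predictor (omitted here),
# the target encoder would be refreshed with an exponential moving average.
target_enc = ema_update(target_enc, context_enc)
```

The point of the design is that the model never has to reconstruct raw tokens or pixels; it only has to agree with another encoder's representation of the hidden part.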

                          • lukeinator42

                            yesterday at 3:51 PM

                            https://openreview.net/pdf?id=BZ5a1r-kVsf

                      • simianwords

                        yesterday at 3:28 PM

                        What makes these domain-specific models work when we don’t have good domain models for health care, chemistry, economics and so on?

                          • colechristensen

                            yesterday at 4:26 PM

                            >we don’t have good domain models for health care, chemistry, economics and so on

                            Who says we don't?

                              • simianwords

                                yesterday at 4:38 PM

                                Examples please?

                                  • colechristensen

                                    yesterday at 5:13 PM

                                    No, it's really simple to search for domain-specific models being used "in production" all over the place.

                                      • simianwords

                                        yesterday at 5:16 PM

                                        I didn’t find a single one that outperforms a general model.

                                          • colechristensen

                                            yesterday at 5:56 PM

                                            OK, AlphaFold.

                                              • simianwords

                                                yesterday at 6:01 PM

                                                It’s not a large language model

                          • yieldcrv

                            yesterday at 4:12 PM

                            Distributing the load on this will probably be infinitely more useful than “folding at home”

                            • HocusLocus

                              yesterday at 3:15 PM

                              gray goo of the future

                              • skyskys

                                yesterday at 6:14 PM

                                hmmmm seems like some fake hype.