SimpleFold: Folding Proteins Is Simpler Than You Think
Key topics: Protein Folding, AI in Biology, Machine Learning
Apple researchers released a paper and code for SimpleFold, a simplified protein folding model, sparking discussion on its implications, comparison to AlphaFold, and the role of large language models in biology.
Snapshot generated from the HN discussion
Discussion activity: very active; 132 comments, averaging 16.5 per period. [Comment distribution chart: first comment after 19m; peak of 96 comments in the 0-6h window.]
Key moments
- Story posted: Sep 26, 2025 at 2:01 PM EDT
- First comment: Sep 26, 2025 at 2:20 PM EDT (19m after posting)
- Peak activity: 96 comments in the first 6 hours
- Latest activity: Sep 28, 2025 at 8:17 PM EDT
Then why do we need customized LLM models, two of which seemed to require the resources of two of the wealthiest companies on earth (this and Google's AlphaFold) to build?
This doesn't seem like particularly wasteful overinvestment.
Granted, I'm more excited about the research coming out of Arc
It's indeed a large model. But if you know the history of the field, it's a massive improvement. It has progressed from an almost "NP" problem only barely approachable with distributed cluster compute, to something that can run on a single server with some pricey hardware. The smallest model here is only 100M parameters and the largest is 3B parameters; that's very approachable to run locally with the right hardware, and easily within range for a small biotech lab (compared to the cost of other biotech equipment).
It's also (I'd argue) one of the only truly economically and socially valuable AI technologies we've found over the past few years. Every simulated protein fold saves a biotech company weeks of work for highly skilled biotech engineers and very expensive chemicals (in a way that truly only supplements rather than replaces the work). Any progress in the field is a huge win for society.
1: https://www.researchgate.net/publication/361238549_Consumer_...
Like 1980s Sony, they are the top-of-the-line consumer electronics giant of their time. The iPhone is even more successful than the Walkman or Trinitron TVs.
They also sell the most popular laptops, to consumers as well as corporate. Like Sony's VAIO, but again more popular.
Can anyone recommend a good book or article about this?
Frankly, it's a great idea. If you are a small pharma company, being able to do quick local inference removes lots of barriers and gatekeeping. You can even afford to do some Bayesian optimization or RL with lab feedback on some generated sequences.
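A rough sketch of that lab-in-the-loop idea (surrogate-filtered search; a full Bayesian-optimization setup would add an uncertainty-aware acquisition function). `fold_score` and `lab_assay` are hypothetical placeholders here, not any real API:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq: str) -> str:
    """Propose a single-point mutant of a protein sequence."""
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]

def fold_score(seq: str) -> float:
    """Stand-in for a cheap local folding model's confidence score."""
    return random.random()

def lab_assay(seq: str) -> float:
    """Stand-in for expensive wet-lab feedback on a candidate."""
    return random.random()

def optimize(seed: str, rounds: int = 5, batch: int = 32, lab_budget: int = 4) -> str:
    best = seed
    for _ in range(rounds):
        # Cheap in-silico screen: fold many mutants locally...
        ranked = sorted((mutate(best) for _ in range(batch)),
                        key=fold_score, reverse=True)
        # ...then spend the scarce lab budget only on the top few.
        best = max(ranked[:lab_budget] + [best], key=lab_assay)
    return best
```

The point of cheap local inference is exactly the first step: screening dozens of candidates costs nothing, so the expensive assay is reserved for the shortlist.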
In comparison, running AlphaFold requires significant resources. And IMHO, its use of multiple sequence alignments is a bit hacky: it makes performance worse on proteins without close homologs, and requires tons of preprocessing.
A few years back, ESM from Meta already demonstrated that alignment-free approaches are possible and perform well. AlphaFold has no secret sauce, it's just a seq2seq problem, and many different approaches work well, including attention-free SSMs.
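To make the "just a seq2seq problem" framing concrete, here is a toy shape-level sketch (ours, not ESM's or AlphaFold's architecture): amino-acid tokens in, one (x, y, z) per residue out. A real model would add positional information, geometric invariances, and far more capacity.

```python
import torch
import torch.nn as nn

class ToyFolder(nn.Module):
    def __init__(self, vocab: int = 21, dim: int = 64, layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)            # one token per residue
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(enc, num_layers=layers)
        self.to_xyz = nn.Linear(dim, 3)                  # one (x, y, z) per residue

    def forward(self, tokens):                           # tokens: (batch, length)
        return self.to_xyz(self.trunk(self.embed(tokens)))  # (batch, length, 3)

coords = ToyFolder()(torch.randint(0, 21, (1, 128)))     # -> (1, 128, 3)
```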
Maybe these are just projects they use to test and polish their AI chips? Not sure.
https://machinelearning.apple.com/
https://arxiv.org/abs/2509.18480
https://github.com/apple/ml-simplefold
I am not trying to defend Apple or Siri by any means. I think the product absolutely should (and will) improve. I am just curious to explore why there is such negativity being directed specifically at Apple's AI assistant.
We, the consumer, have received inferior products because of the vague promise that the company might one day be able to make it cheaper if they invest now.
1. It seems to be actively getting worse. On a daily basis, I see it responding to queries nonsensically, like when I say "play (song) by (artist)" (I have Apple Music) and it opens my Sirius app and puts on a random thing that isn't even that artist. Other trivial commands are frequently just met with apologies or searching the web.
2. Over a year ago, Apple made a flashy announcement full of promises about how Siri would not only do the things that it's been marketed as being able to do for the last decade, but also things that no one has seen an assistant do. Many people believe that announcement was based on fantasy thinking, and those people are looking more and more correct every day that Apple ships no actual improvements to Siri.
3. Apple also shipped a visual overhaul of how Siri looks, which gives the impression that work has been done, leading people to be even more disappointed when Siri continues to be a pile of trash.
4. The only competitor that makes sense to compare is Google, since no one else has access to do useful things on your device with your data. At least Google has a clear path to an LLM-based assistant, since they've built an LLM. It seems believable that Android users will have access to a Gemini-based assistant, whereas it appears to most of us that Apple's internal dysfunction has rendered them unable to ship something of that caliber.
If I could buy a phone without an assistant I would see that as a desirable feature.
And now that we have ChatGPT with voice mode, Gemini Live, etc which have incredible speech recognition and reasoning comparatively, it's harder to argue that "every voice assistant is bad" still.
Meanwhile, people expect perfection from Siri. At this point a new version of Siri will never live up to people’s expectations. Had they released something on-par with ChatGPT, people would hate it and probably file a class action lawsuit against Apple over it.
The entire company isn’t going to work on Siri. In a large company there are a lot of priorities, and some things that happen on the side as well. For all we know this was one person’s weekend project to help learn something new that will later be applied to the priorities.
I've made plenty of hobby projects related to work that weren't important or priorities, but what I learned along the way proved extremely valuable to key deliverables down the road.
It seems like the Folding@Home project is still around!
Compared to SETI or Folding@Home, this would be glacially slow for AI models.
https://www.distributed.net/RC5
https://en.wikipedia.org/wiki/RSA_Secret-Key_Challenge
I wonder what kind of performance I would get on an M1 computer today... haha
EDIT: people are still participating in rc5-72...?? https://stats.distributed.net/projects.php?project_id=8
In other words, it’s a different approach that trades off versatility for speed, but that trade off is significant enough to make it viable to generate protein folds for really any protein you’re interested in - it moves folding from something that’s almost computationally infeasible for most projects to something that you can just do for any protein as part of a normal workflow.
2. The biggest difference between folding@home and alphafold is that folding@home tries to generate the full folding trajectory while alphafold is just protein structure prediction; only looking to match the folded crystal structure. Folding@home can do things like look into how a mutation may make a protein take longer to fold or be more or less stable in its folded state. Alphafold doesn’t try to do that.
I actually really like Alphafold because of that - the core recognition that an amino acid string’s relationship to the structure and function of the protein was akin to the cross-interactions of words in a paragraph to the overall meaning of the excerpt is one of those beautiful revelations that come along only so often and are typically marked by leaps like what Alphafold was for the field. The technique has a lot of limitations, but it’s the kind of field cross-pollination that always generates the most interesting new developments.
Are there any benchmarks for, say, a $3,000 RTX card, etc., vs a nice cluster of M4 Mac Minis?
https://foldingathome.org/papers-results/?lng=en
[1] https://foldingathome.org/2024/05/02/alphafold-opens-new-opp...
and now I'm even more curious why they thought "light aqua" vs "deep teal" would be a good choice
The different colours are for the predicted and 'real' (ground truth) models. The fact that it is hard to distinguish is partly the - as you point out - weird colour choice, but also because they are so close together. An inaccurate prediction would have parts that stand out more as they would not align well in 3D space.
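The "align well in 3D space" check has a standard quantitative counterpart: superpose the predicted structure onto the ground truth (the Kabsch algorithm) and report the RMSD. A minimal sketch, assuming Nx3 numpy arrays of alpha-carbon coordinates:

```python
import numpy as np

def kabsch_rmsd(pred: np.ndarray, true: np.ndarray) -> float:
    # Center both structures on their centroids.
    p = pred - pred.mean(axis=0)
    q = true - true.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ q)
    d = np.sign(np.linalg.det(u @ vt))        # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    # Low RMSD after superposition = the two ribbons overlap closely.
    return float(np.sqrt(((p @ rot - q) ** 2).sum() / len(p)))
```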
Doing too many things at once makes methods hard to adopt and makes conclusions harder to draw. So we try to find simple methods that show measurable gains, so we can adapt them to future approaches.
It's a cycle between complexity and simplicity. When a new simple and scalable approach beats the previous state of the art, that just means we discovered a new local-maximum hill to climb.
People often like to say that we just need one or two more algorithmic breakthroughs for AGI. But in reality it's the dataset and environment-based learning that matter. Almost any model would do if you collected the data. It's not in the model; it's outside the model where we need to work.
https://genomely.substack.com/p/simplefold-and-the-future-of...
But as with anything in research, it will take months and years to see what the actual implications are. Predictions of future directions can only go so far!
It’s not like we can throw away all the inductive biases and MSA machinery, someone upstream still had to build and run those models to create the training corpus.
On the other hand, validating a predicted protein structure to a good level of accuracy is much easier (solvent accessibility, mutagenesis, etc.). So having a complex model that can be trained on a small dataset drastically expands the set of accurate protein structure samples available to future models, both through direct predictions and validated protein structures.
So technically yes, this dataset could have been collected solely experimentally, but in practice, AlphaFold is now part of the experimental process. Without it, the world would have less protein structure data, in terms of both directly predicted and experimentally verified protein structures.
It's a research paper. That's not how you communicate to a general audience. Just because the paper is accessible in terms of literal access doesn't mean you're the intended audience. Papers are how scientists communicate to other scientists. More specifically, it is how communication happens between peers. They shouldn't even be writing for just other scientists. They shouldn't be writing for even the full set of machine learning researchers nor the full set of biologists. Their intended audience is people researching computational systems that solve protein folding problems.
I'm sorry, but where do you want scientists to be able to talk directly to their peers? Behind closed doors? I just honestly don't understand these types of arguments.
Besides, anyone conflating "Simpler than You Think" with "Simple" is far from qualified to read such a paper. They'll misread whatever the authors say. Conflating those two is something we'd expect from an elementary-school-level reader who is unable to process comparative statements.
I don't think we should be making that the bar...
I’m not trying to knock the work, I think it’s genuinely cool and a great engineering result. I just wanted to flag that nuance for readers who might not have the time or background to spot it, and I get that part of the "simple/simpler" messaging is also about attracting attention which clearly worked!
Like you suggest, simple can mean many things. I think it's clear that in this context they mean "simple" (not from an absolute sense) in terms of the architectural design. I think the abstract is more than sufficient to convey this.
As an ML researcher who does a lot of work on architecture and efficiency, I think they are. Consider the end of the abstract: to me they are clearly stating that their goal isn't to get the top score on a benchmark. Their appendix shows that the 100M-param model is apples to apples with AlphaFold2 by size but not by compute. Even their 3B model uses less compute than AlphaFold2. So, being someone in a neighboring niche, I don't understand your claim. There's no easy way to make your comparisons "apples to apples" because we shouldn't be evaluating on a single metric. Sure, AlphaFold2 gives better results on the benchmarks, but does that mean people wouldn't sacrifice performance for a 450x reduction in compute? (20x for their largest model. But note that's compute, not memory.)
Yeah, this is an unfortunate thing and I'm incredibly frustrated with it in academia and especially in ML. But it's also why I'm pushing back against you. The problem stems from needing to get people to read your paper. There's a perverse incentive because you could have a paper that is groundbreaking but ends up having little to no impact because it didn't get read. A common occurrence is that less innovative papers get magnitudes more citations by using similar methods but scaling up and beating benchmarks. So unfortunately, as long as we use citation metrics as a significant measure of research impact, marketing will be necessary. A catchy title is a good way to get more eyeballs. But I think you're being too nitpicky here, and there are far more egregious/problematic examples. I'm not going to pick a fight with a title when the abstract is sufficiently clear. Could it be clearer? Certainly. But if the title is all that's wrong, then it's a pretty petty problem, especially if it's only confusing people who are significantly outside the target audience. Seriously, what's the alternative? That researchers write for the general public? For the general technical public? I'm sorry, I don't think that's a good solution. It's already difficult to communicate with people in the same domain (but not niche) within the page limit. It's hard to get them to read everything as it is. I'd rather papers be written strongly for niche peers, with enough generalization that domain experts can get through them with effort. For the general public, that's what science communicators are for.
This model could also have existed from natural data if we had access to enough of it.
Only if you are willing to call a billion years of evolutionary selection a "simple ruleset"
That means whatever evolution created, whether it's wings or brains, however complex it looks now, must be fundamentally simple enough it could be reached by iterating in tiny steps that were useful in isolation. It constrains the space of designs reachable by evolution considerably.
Not true. Learn some genomics before trying to explain evolution.
Does the time matter? A ruleset doesn't change with time.
If you're still unconvinced, get a degree in physics. I'm not sure how you could get through that and still not believe that complexity arises from simplicity, and that you get drops in that complexity, which we call emergence, before things become more complex than before.
But you really do seem to be trying hard to miss the point entirely. Life actually has nothing to do with what I said, did it? And I can assure you, by nature of being one, that physicists are certain that nature follows simple rules, even if we don't know them.
We are also absolutely confident that complexity arises out of simplicity. Go look at anything like fractals, chaos theory, or perturbation theory; you should have run into at least bifurcation diagrams in your differential equations course. If you haven't taken diff eq, then well... perhaps the problem is that your confidence in your result is stronger than your expertise. If not, well... make a real argument, because I'm not going to hold your hand through this any longer.
The thing is, biology is anything but simple.
I just don't understand how someone could even get through an undergraduate degree in physics without seeing this complexity arise in E&M. You get four rules that describe everything. Each rule can be written on one short line and contains only a handful of symbols. In other words: those rules are simple. That doesn't mean they're very useful in that form, but you can derive the rest from them. That is exactly what I'm talking about with the Game of Life (sketched below).
How the fuck did you get through differential equations without seeing how complexity arises from simplicity, let alone Jackson or Goldstein!?
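For readers who haven't seen it, the Game of Life invoked above is the textbook demo of complexity from simple rules: one update rule, a handful of symbols, yet gliders and self-sustaining patterns emerge. A minimal sketch (ours, purely illustrative):

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One generation of Conway's Game of Life on a wrapping grid."""
    # Count the eight neighbors of every cell by summing shifted copies.
    neighbors = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3. Everything else dies.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

# A glider: five live cells whose pattern crawls across the grid forever.
grid = np.zeros((16, 16), dtype=int)
for y, x in [(1, 2), (2, 3), (3, 1), (3, 2), (3, 3)]:
    grid[y, x] = 1
for _ in range(8):
    grid = life_step(grid)
```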
Idk man, either you're lying or being disingenuous. You're the only one who said biology is simple. No one even implied that! If you're not lying about your degree you're willfully misinterpreting the comments. Why? For what purpose?
You can use molecular dynamics. Maybe, if you are lucky and have the computational resources to do so.
You might want to relate molecular dynamics to "simple rules", but you would be deluding yourself. Molecular dynamics typically uses classical force fields parameterized on data and some quantum simulations. It is not based on first principles.
Proteins fold in patterns generated over millennia of natural selection. It is not simple.
My rough understanding of the field is that a "rough" generative model makes a bunch of decent guesses, and more formal "verifiers" ensure they abide by the laws of physics and geometry. The AI reduces the unfathomably large search space so the expensive simulation doesn't need to do so much wasted work on dead ends. If the guessing network improves, then the whole process speeds up.
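A sketch of that division of labor, with hypothetical `guess` and `verify` callables standing in for the learned proposer and the physics/geometry check (neither is a real API):

```python
from typing import Callable, List

def propose_and_verify(sequence: str,
                       guess: Callable[[str], object],
                       verify: Callable[[object], float],
                       n_samples: int = 64,
                       threshold: float = 0.9) -> List[object]:
    """Sample many cheap guesses; keep only those the expensive verifier accepts."""
    survivors = []
    for _ in range(n_samples):
        candidate = guess(sequence)         # fast learned proposal (cheap)
        if verify(candidate) >= threshold:  # physics/geometry check (expensive)
            survivors.append(candidate)
    return survivors
```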
- I'm recalling the increasingly complex transfer functions in recurrent networks,
- The deep pre-processing chains before skip connections,
- The complex normalization schemes before ReLU,
- The convoluted multi-objective GAN networks before diffusion,
- The complex multi-pass models before fully convolutional networks.
So basically, i'm very excited by this. Not because this itself is an optimal architecture, but precisely because it isn't!
Using MSAs might be a local optimum. ESM showed good performance on some protein problems without MSAs. MSAs offer a nice inductive bias and better average performance. However, the cost is doing poorly on proteins where MSAs are not accurate. These include B and T cell receptors, which are clinically very relevant.
Isomorphic Labs, Oxford, MRC, and others have started the OpenBind Consortium (https://openbind.uk) to generate large-scale structure and affinity data. I believe that once more data is available, MSAs will be less relevant as model inputs. They are "too linear".
> We largely adopt the data pipeline implemented in Boltz-1 [1] (Wohlwend et al., 2024), which is an open-source replication of AlphaFold3
[1] https://github.com/jwohlwend/boltz
I believe the story here is largely that they simplified the architecture and scaled it to 3B parameters while maintaining leading results.
However, it seems like anyone can download the parameters for AlphaFold V2: https://github.com/google-deepmind/alphafold?tab=readme-ov-f...
Predicting the end-result from the sequence of protein directly is prone to miss any new phenomenon and would just regurgitate/interpolate the training datasets.
I would much prefer an approach based on first principles.
In theory folding is easy: just run a simulation of your protein surrounded by some water molecules for the same number of nanoseconds nature does.
The problem is that this usually takes a long time, because evolving the system requires computing its energy as a function of the positions of the atoms, which is a complex problem involving quantum mechanics. It's mostly due to the behavior of the electrons; because they are much lighter, they operate on a faster timescale. You typically don't care about them, only about the effect they have on your atoms.
In the past, you would use various Lennard-Jones potentials for pairs of atoms when the atoms are unbonded, and other potentials when they are bonded, and it would get very complex very quickly. But now there are deep-learning-based approaches to computing the energy of the system with a neural network (see (GROMACS) Neural Network Potentials: https://rowansci.com/publications/introduction-to-nnps). You train these networks to learn the local interactions between atoms from trajectories generated with ab-initio theories. This gives you a faster simulator that approximates the more complex physics. In a sense, it just tabulates, in a neural network, the effect the electrons would have in a specific atomic arrangement, according to the theory you have chosen.
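A minimal sketch of that idea: plug a learned potential into a classical integrator. `nnp_energy` is a hypothetical stand-in for a trained neural network potential; forces here come from finite differences, where a real NNP would supply analytic gradients via autodiff.

```python
import numpy as np

def numerical_forces(nnp_energy, pos, eps=1e-4):
    """Force = -dE/dx, estimated by central finite differences."""
    forces = np.zeros_like(pos)
    for idx in np.ndindex(*pos.shape):
        step = np.zeros_like(pos)
        step[idx] = eps
        forces[idx] = -(nnp_energy(pos + step) - nnp_energy(pos - step)) / (2 * eps)
    return forces

def velocity_verlet(nnp_energy, pos, vel, mass, dt, n_steps):
    """Integrate Newton's equations with the learned potential supplying forces."""
    forces = numerical_forces(nnp_energy, pos)
    for _ in range(n_steps):
        vel = vel + 0.5 * dt * forces / mass
        pos = pos + dt * vel
        forces = numerical_forces(nnp_energy, pos)
        vel = vel + 0.5 * dt * forces / mass
    return pos, vel

# Toy stand-in energy: harmonic wells pulling every atom toward the origin.
toy_energy = lambda p: 0.5 * float((p ** 2).sum())
pos, vel = velocity_verlet(toy_energy, np.random.randn(10, 3),
                           np.zeros((10, 3)), mass=1.0, dt=0.01, n_steps=100)
```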
At any time, if you have some doubt, you can always run the slower simulator in a small local neighborhood to check that the effective-field neural network approximation holds.
Only then, once you have a simulator that is able to fold, can you generate a dataset of ("protein sequence", "end of trajectory") pairs to learn the shortcut, like AlphaFold/SimpleFold do. And when in doubt, you can go back to the slower, more precise method.
If you had enough data and could perfectly train a model with sufficient representational power, you could theoretically infer the correct physics just from the correspondence between initial and final arrangements. But if you don't have enough data, it will just learn some shortcut, and you accept that it will sometimes be wrong.
No, the environment is important. Also, some proteins fold while they are still being synthesized.
Folding can also take minutes in some cases, which is the real problem.
> which is a complex problem involving Quantum Mechanics
Most MD simulations use classical approximations, and I don't see why folding is any different.
Speeding up the folding is not the real problem; knowing what happens is. One way to speed up the process is just to minimize the free energy of the configuration (or some other quantity you derive from the neural network potential). (That's what the game Foldit was about: minimizing the Rosetta energy function.) Another way would be to use a generative method, like a diffusion model, to generate a plausible full trajectory (but you need some training dataset to bootstrap the process). Or work with key configuration frames: the simulation can take a long time, but it goes through specific arrangements (the transitions between energy plateaus), and you learn these key points.
The simulator can also be much faster because it doesn't have to consider all pairs of atoms: naive O(n^2) behavior becomes O(n), with n the number of atoms (and the bigger constant of running the neural network hidden inside the O notation).
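The usual trick behind that O(n) claim is a cell list: bin atoms into cells the size of the interaction cutoff, so each atom only checks the 27 nearby cells rather than all n-1 partners. A rough sketch (ours, not from the thread):

```python
from collections import defaultdict
import numpy as np

def neighbor_pairs(pos: np.ndarray, cutoff: float):
    """All atom pairs within `cutoff`, via cell binning: O(n) for bounded density."""
    cells = defaultdict(list)
    for i, p in enumerate(pos):
        cells[tuple((p // cutoff).astype(int))].append(i)
    pairs = []
    for (cx, cy, cz), atoms in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                        for i in atoms:
                            # i < j visits each unordered pair exactly once.
                            if i < j and np.linalg.norm(pos[i] - pos[j]) < cutoff:
                                pairs.append((i, j))
    return pairs

atoms = np.random.rand(1000, 3) * 10.0   # 1000 atoms in a 10x10x10 box
print(len(neighbor_pairs(atoms, cutoff=1.0)))
```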
The simulations are classical, but fundamentally they rely on the shape of the electron clouds. The electron density can deform (that's what bonding is), providing additional degrees of freedom and allowing the atomic configuration to slide more easily against itself and avoid getting stuck in local optima. Fortunately, all this mess is nicely encapsulated inside the neural network potential, and we can work without worrying about the electrons, their shape being implicitly defined by the current positions of the atoms (the implicit function theorem makes abstracting their behaviour sound, because of the faster timescales).
Potential != free energy. Entropy is a driving force behind folding.
> The simulations are classical but fundamentally they rely on the shape of the electron clouds.
This is not what is meant by classical.
https://www.youtube.com/watch?v=P_fHJIYENdI