Parsing Chemistry
Posted 3 months ago. Active 2 months ago.
Source: re.factorcode.org
Key topics: Chemistry, Parsing, Molecular Representations
The post discusses parsing chemistry notations, specifically SMILES and other molecular representations, and the discussion revolves around the complexities and challenges of representing molecular structures.
Snapshot generated from the HN discussion
Key moments
- Story posted: Oct 24, 2025 at 2:07 PM EDT
- First comment: Nov 5, 2025 at 8:57 AM EST (12 days after posting)
- Peak activity: 13 comments in days 11-12
- Latest activity: Nov 12, 2025 at 10:45 AM EST
ID: 45697414 | Type: story | Last synced: 11/20/2025, 8:42:02 PM
Yes. Here is the yacc grammar for the SMILES parser in the RDKit. https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Smi...
There's also one from OpenSMILES at http://opensmiles.org/opensmiles.html#_grammar . As I recall, it has a shift/reduce conflict that I was not competent enough to fix.
I prefer to parse almost completely in the lexer, with a small amount of lexer state to handle balanced parens, bracket atoms, and matching ring closures. See https://hg.sr.ht/~dalke/opensmiles-ragel and more specifically https://hg.sr.ht/~dalke/opensmiles-ragel/browse/opensmiles.r... .
The lexer: https://hg.sr.ht/~dalke/smiview/browse/smiview.py?rev=tip#L3...
The lexer state transitions: https://hg.sr.ht/~dalke/smiview/browse/smiview.py?rev=tip#L3...
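To make the "parse in the lexer" idea concrete, here is a minimal Python sketch of that style of checker. It is not the code from smiview, opensmiles-ragel, or the RDKit: the token set is deliberately simplified (bracket atoms are treated as opaque tokens, and only a small organic subset is recognized), and the names `TOKEN_RE` and `is_plausible_smiles` are hypothetical. It only checks syntax, using one state variable, a paren depth counter, and a set of open ring closures.

```python
import re

# Simplified token set (an assumption of this sketch, not the full OpenSMILES grammar).
TOKEN_RE = re.compile(
    r"(?P<atom>\[[^\[\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*)"
    r"|(?P<bond>[-=#$:/\\])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<open>\()"
    r"|(?P<close>\))"
    r"|(?P<dot>\.)"
)


def is_plausible_smiles(smiles: str) -> bool:
    """Return True if the string passes this sketch's syntax checks."""
    # Lexer states: need_atom | after_atom | after_bond | after_open
    state = "need_atom"
    depth = 0            # number of unclosed "(" branches
    open_rings = set()   # ring-closure numbers seen an odd number of times
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            return False  # unknown character
        kind = match.lastgroup
        pos = match.end()
        if kind == "atom":
            state = "after_atom"
        elif kind == "bond":
            if state not in ("after_atom", "after_open"):
                return False
            state = "after_bond"
        elif kind == "ring":
            if state not in ("after_atom", "after_bond"):
                return False
            # Toggle: the first occurrence opens the ring, the second closes it.
            open_rings ^= {int(match.group().lstrip("%"))}
            state = "after_atom"
        elif kind == "open":
            if state != "after_atom":
                return False
            depth += 1
            state = "after_open"
        elif kind == "close":
            if state != "after_atom" or depth == 0:
                return False
            depth -= 1
        else:  # dot
            if state not in ("after_atom", "after_open"):
                return False
            state = "need_atom"
    return state == "after_atom" and depth == 0 and not open_rings
```

Because a dot is only accepted right after an atom or an opening paren, a bond followed by a dot is rejected, as are unbalanced parens and unmatched ring closures.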
You also define Chain as:
I believe this means your grammar allows the invalid SMILES C=.N
Given a SMILES string, you can do things like look up similar molecules using PubChem's API.
I believe most molecule editors can load and save SMILES.
SELFIES are for genAI. If you ask a VAE to generate SMILES, it will spit out some strings that are invalid. That can't happen with SELFIES; that is the one application where they are robust.
The VAE is trained with a very large number of valid SMILES strings, typically tokenized at the character level (so "C" is a token, and "Br" is "B" then "r"). I and others have observed that VAEs trained like this produce a large number of embedding vectors that do not decode to valid SMILES strings: they have syntax errors or perform chemical alchemy. Personally, I saw that the training set had Br (bromine) and Ca (calcium), and the output molecules sometimes contained Ba (barium) even though that's not in the original dataset at all.
There are other reasons why the tokenizer produces bad results; only about 1-10% of vectors decode to valid molecules. Invalid SMILES are mostly useless because they don't correspond to actual structures.
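The Br/Ca/Ba effect can be illustrated with a small Python sketch. It compares character-level tokenization with one possible element-aware tokenizer. The regex, the function names, and the two-string "training set" are assumptions made for illustration, not taken from the papers or packages mentioned in this thread.

```python
import re

# Element-aware pattern (an assumption of this sketch): bracket atoms and
# two-letter halogens are kept whole; everything else is one character.
ELEMENT_AWARE_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d\d|.")


def char_tokens(smiles: str) -> list[str]:
    """Character-level tokenization: 'Br' becomes 'B' then 'r'."""
    return list(smiles)


def element_tokens(smiles: str) -> list[str]:
    """Element-aware tokenization: 'Br' and '[Ca+2]' stay single tokens."""
    return ELEMENT_AWARE_RE.findall(smiles)


# Hypothetical training set containing bromine and calcium, but no barium.
training = ["CBr", "[Ca+2]"]

char_vocab = {tok for s in training for tok in char_tokens(s)}
element_vocab = {tok for s in training for tok in element_tokens(s)}

# Every character of "[Ba]" is in the character vocabulary, so a decoder
# sampling characters is able to emit it. The element-aware vocabulary
# has no token from which barium could be assembled.
barium_reachable_by_chars = set("[Ba]") <= char_vocab
barium_in_element_vocab = "[Ba]" in element_vocab
```

Treating a whole bracket atom as one token is a design choice of this sketch; it trades a larger vocabulary for the guarantee that element symbols are never split.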
To respond to this, the SELFIES format makes a few changes so that it is effectively impossible to produce invalid SELFIES strings when decoding a VAE. Among other things, tokenization matches the actual elements and so the model will only ever output valid elements.
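As a rough illustration of the tokenization point: a SELFIES string is a sequence of bracketed symbols such as `[C][Br]`, so splitting on brackets yields tokens in which an element symbol is never cut in half. The sketch below does this with a plain regex and deliberately avoids the selfies package; `split_symbols` and `build_vocab` are hypothetical helper names, and the validity guarantees of SELFIES come from its decoding rules, which this sketch does not implement.

```python
import re

# One SELFIES symbol: a bracket pair with no nested brackets inside.
SYMBOL_RE = re.compile(r"\[[^\[\]]+\]")


def split_symbols(selfies_string: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    symbols = SYMBOL_RE.findall(selfies_string)
    # Reject input that has anything outside the bracketed symbols.
    if "".join(symbols) != selfies_string:
        raise ValueError(f"not a sequence of bracketed symbols: {selfies_string!r}")
    return symbols


def build_vocab(strings: list[str]) -> list[str]:
    """Collect the sorted set of symbols seen in the given strings."""
    return sorted({sym for s in strings for sym in split_symbols(s)})


# Bromomethane and ethanol written as SELFIES strings.
vocab = build_vocab(["[C][Br]", "[C][C][O]"])
```

A model that emits one vocabulary entry per step can only produce whole symbols, which is the property the comment above describes.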
I believe this is the SMILES paper that my own experiments were based on: https://arxiv.org/pdf/1610.02415 (see https://github.com/maxhodak/keras-molecules for an open source attempt at implementation)
And this is the paper introducing SELFIES: https://arxiv.org/abs/1905.13741 (open source packages for working with SELFIES, and some example training scripts: https://github.com/aspuru-guzik-group/selfies ; see "Validity of Latent Space in VAE SMILES vs. SELFIES" for more detail on the robustness).
BTW, as a side note: even though we put a bunch of effort into duplicating the original SMILES VAE, it was extremely slow to train and not very useful. Now you can just ask Gemini to write a full SELFIES VAE and train it in less than a day on a conventional GPU (thanks pytorch transformers!) to get a decent basic set of embeddings useful for exploring chemical space.
Was thinking of InChI[0] but on Googling SMILES and SELFIES I found this[1] talk, this[2] paper and my goodness I've been down a few rabbit holes since...
[0] https://en.wikipedia.org/wiki/International_Chemical_Identif...
[1] https://www.inchi-trust.org/wp/wp-content/uploads/2019/12/18...
[2] https://pubs.rsc.org/en/content/articlehtml/2022/dd/d1dd0001...