Spacy: Industrial-Strength Natural Language Processing (nlp) in Python
Posted5 months agoActive4 months ago
github.comTechstoryHigh profile
supportivepositive
Debate
20/100
NlpSpacyLlmsAI
Key topics
Nlp
Spacy
Llms
AI
The HN community is praising SpaCy, an industrial-strength NLP library in Python, for its well-designed API, active maintenance, and continued relevance in the age of LLMs.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussionFirst comment
4d
Peak period
30
Day 4
Avg / period
12.8
Comment distribution64 data points
Loading chart...
Based on 64 loaded comments
Key moments
- 01Story posted
Aug 23, 2025 at 5:07 AM EDT
5 months ago
Step 01 - 02First comment
Aug 26, 2025 at 7:14 PM EDT
4d after posting
Step 02 - 03Peak activity
30 comments in Day 4
Hottest window of the conversation
Step 03 - 04Latest activity
Sep 1, 2025 at 10:25 AM EDT
4 months ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
ID: 44994535Type: storyLast synced: 11/20/2025, 2:49:46 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
Also: https://explosion.ai/blog/back-to-our-roots-company-update
(Interesting tidbit: I got hired by Explosion after a HN comment on model distillation :))
The SpaCy API is just so nice. I love the ease of iterating over sentences, spans, and tokens and having the enrichment right there. Pipelines are super easy, and patterns are fantastic. It’s just a different use case than BERT.
The API is one of the best ever, and really set the bar high for language tooling.
I’m glad it’s still around and getting updates. I had a bit of trouble integrating it with uv, but nothing too bad.
Thanks to the explosion team for making such an amazing project and keeping it going all these years.
To the new “AI” people in the room: checkout SpaCy, and see how well it works and how fast it chews through text. You might find yourself in a situation where you don’t need to send your data to OpenAI for some small things.
Edit: I almost forgot to add this little nugget of history: one of Huggingfaces first projects was a SpaCy extension for conference resolution. Built before their breakthrough with transformers https://github.com/huggingface/neuralcoref
I’m writing a small library at work for some NLP tasks and I haven’t got a whole lot of experience in writing libraries for NLP, so I’m interested in what would make my library the best for the user.
These days NLP is quite different, because we look for outcomes rather than iterating over tokens.
What does your NLP library need to do? The way I design APIs is I write the calling code that I want to exist, and then I write the API to make it work. Here’s an example I’ve worked on for LLM integration. I just wanted to be able to get simple answers from an LLM and cast the answer to a type: https://www.npmjs.com/package/llm-primitives
API surface is designed well and it's still actively maintained almost 10 years after it initially went public.
I have a search background so learning to rank is always top of mind for me, but there other places like sentiment analysis, intent detection, and topic classification where it's great too.
This is one of the under-discussed areas of LLMs imho.
For anything that would have have required either word2vec embeddings of a tf-idf representation (classification tasks, sentiment analysis, etc) there are rare exceptions where it wouldn't just be better to start with a semantic embedding from an LLM.
For NER and similar data extraction tasks, the only advantage of traditional approaches is going to be speed, but my experience in practice is that accuracy is often much more important than speed. Again, I'm not sure why not start with an LLM in these cases.
There are still a few remaining use cases (PoS tagging comes to mind), but honestly, if I have a traditional NLP task today, I'm pretty sure I'm going to start with an LLM as my baseline.
I have stopped trying to use LLMs for this project and switched to discriminative models (Logistic Regression with TFIDF or Embeddings), which are both more computationally efficient and more debuggable. I'm not entirely sure why, but for anything with many possible answers, or to which there is some subjectivity, I have not had success with LLMs simply due to inconsistency of responses.
For VERY obvious tasks like: "is this store a restaurant or not?" I have definitely had success, so YMMV.
I've found encoder only models to be vastly better for anything that doesn't require natural language responses and the majority of them are small enough that _pretraining_ a model for each task costs a few hundred dollars.
In my original comment this is what I was referring to: using the embeddings produced by these models, not using something like GPT to classify text (that's wildly inefficient and in my experience gets subpar results).
To answer your question: you simply use the embedding vector as the features in whatever model you're trying to train. I've found this to get significantly superior results with significantly less examples than any traditional NLP approach to vector representation.
> What are you pre-training on, and for what task?
My experience has been that you don't need to pretrain at all. The embeddings are more information rich than anything you could attempt to achieve with other vector representations you might come up with using the set of data you have. This might not be true at extreme scales, but for nearly all traditional nlp classification tasks I've found this to be so much easier to implement and so much better performing there's really not a good reason to start with a "simpler" approach.
re: inconsistencies in output, OpenAI provide a seed and system_fingerprint options to (mostly) produce deterministic output.
For example:
If contains cafe and not internet/cyber/etc -> restaurant
No -> (tfidf) -> yes, no, unsure
unsure -> embeddings -> yes, no, unsure
unsure -> llm -> yes, no, unsure
unsure -> human queue ->...
One big benefit is that it uses the cheapest and most understandable approaches for the majority of cases, and scales up quite nicely. It has a neat place for very custom issues to be fixed too.
There will always be some things that simple approaches think are clear but aren’t, which is awkward but then all pipelines end up with that somewhere.
Edit - you can also deploy things earlier if you start from the beginning of the chain. Moving from big deploy to iteration on the remaining issues is often a win just in deployment issues.
Others have had success with SetFit as the training framework and Ettin as the base model.
I have also considered training a small language model for synthetic data generation.
Especially as LLMs continue to be better tuned to follow instructions that are intentionally colocated and intermingled with data in user messages, it becomes difficult to build systems that can provide real guarantees that "we'll follow your prompt, but not prompts that are in the data you provided."
But no amount of text appended to an input document, no matter how persuasive, can cause an NLP pipeline to change how it interprets the remainder of the document, or to leak its own system instructions, or anything of that nature. "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.
There's an even broader discussion to be had about the relative reliability of NLP pipelines, outside of a security perspective. As always, it's important to pick the right tools for the job, and the SpaCy article linked in the parent puts this quite well.
Text added to a document can absolutely change how an NLP pipeline interprets the document.
> "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.
And simple repeated words can absolutely make that kind of change for many NLP systems.
Have you actually worked with doing more traditional NLP systems? They're really not smart.
That's not what prompt injection is.
And NLP stands for natural language processing. If the result didn't change after you've made changes to the input... It'd be a bug?
> And NLP stands for natural language processing. If the result didn't change after you've made changes to the input... It'd be a bug?
No, I’d want my classifier to be unchanged by garbage words added. It likely will be, but that impact is a bug not a feature.
Adding words to the text to break the algorithm which does the NLP is more along the lines of providing 1 in a boolean field to break the system. And that's generally something you can mitigate to some degree via heuristics and sanity checking. Doing the same for LLMs is essentially impossible, because it's an effective black box, so you cannot determine the error scenarios and add some mitigations
No instruct tuning means prompt injection is curbed. Classification heads means you get results off a single forward pass.
If a traditional NLP solution can run under your control, and tackle the task at hand, it can be plainly much cheaper at scale.
* https://github.com/jftuga/deidentification
* https://pypi.org/project/text-deidentification/
Its annotation tooling was so far ahead. It is still crazy to me that so much of the value in the data annotation space went to Scale AI vs tools like SpaCy that enabled annotation at scale in the enterprise.
Used it again recently and the dev experience is 1000x that of wrangling LLMs.