Potential Issues in Curl Found Using AI Assisted Tools
Posted 3 months ago · Active 3 months ago
Key topics
AI-Assisted Security Tools
Code Analysis
Software Development
The story discusses how AI-assisted tools helped identify potential issues in the curl library, and the discussion revolves around the effectiveness and limitations of such tools in software development.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1h after posting
Peak period: 91 comments in 0-12h
Average per period: 16 comments
Comment distribution: 160 data points
Key moments
- Story posted: Oct 2, 2025 at 9:29 AM EDT (3 months ago)
- First comment: Oct 2, 2025 at 10:45 AM EDT (1h after posting)
- Peak activity: 91 comments in 0-12h, the hottest window of the conversation
- Latest activity: Oct 10, 2025 at 6:47 PM EDT (3 months ago)
ID: 45449348 · Type: story · Last synced: 11/22/2025, 11:00:32 PM
This is notable given Daniel Stenberg's reports of being bombarded by total slop AI-generated false security issues in the past: https://www.linkedin.com/posts/danielstenberg_hackerone-curl...
Concerning HackerOne: "We now ban every reporter INSTANTLY who submits reports we deem AI slop. A threshold has been reached. We are effectively being DDoSed. If we could, we would charge them for this waste of our time"
Also this from January 2024: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
It's primarily from people just throwing source code at an LLM, asking it to find a vulnerability, and reporting the output as-is, without any actual understanding of whether it is or isn't a vulnerability.
The difference in this particular case is it's someone who is: 1) Using tools specifically designed for security audits and investigations. 2) Taking the time to read and understand each reported vulnerability, and verifying that it is actually a vulnerability before reporting it.
Point 2 is the bar that people are woefully failing to meet, wasting a terrific amount of his time. The one that got shared from a couple of weeks ago https://hackerone.com/reports/3340109 didn't even call curl. It was straight up hallucination.
A few of these PRs are dependabot PRs which match on "sarif", I am guessing because the string shows up somewhere in the project's dependency list. "Joshua sarif data" returns a more specific set of closed PRs. https://github.com/curl/curl/pulls?q=is%3Apr+Joshua+sarif+da...
Don't write or fix the code for me (thanks but I can manage that on my own with much less hassle), but instead tell me which places in the code look suspicious and where I need to have a closer look.
When I ask Claude to find bugs in my 20 kloc C library, it more or less just splits the file(s) into smaller chunks, greps for specific code patterns, and in the end gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
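As a rough illustration of that last point, here is a minimal Python stand-in for such a script (the file extensions and marker names are assumptions):

```python
#!/usr/bin/env python3
"""Rough stand-in for the 'simple script' described above: walk a source tree
and list every FIXME/TODO-style marker with its file and line number."""
import re
import sys
from pathlib import Path

MARKER = re.compile(r"\b(FIXME|TODO|XXX|HACK)\b:?\s*(.*)")
EXTENSIONS = {".c", ".h"}  # assumption: a C library like the one mentioned above

def scan(root: str) -> int:
    hits = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if (m := MARKER.search(line)):
                hits += 1
                print(f"{path}:{lineno}: {m.group(1)}: {m.group(2).strip()}")
    return hits

if __name__ == "__main__":
    found = scan(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{found} marker(s) found", file=sys.stderr)
```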
ChatGPT is even less useful, since it basically just spends a lot of time telling me 'everything looking great yay good job high-five!'.
So far, traditional static code analysis has been much more helpful in finding actual bugs, but static analysis being clean doesn't mean there are no logic bugs, and this is exactly where LLMs should be able to shine.
If getting more useful potential-bugs-information from LLMs requires an extensively customized setup then the whole idea is getting much less useful - it's a similar situation to how static code analysis isn't used if it requires extensive setup or manual build-system integration instead of just being a button or menu item in the IDE or enabled by default for each build.
I've been developing with LLMs on my side for months/about a year now, and it feels like it's allowing me to be more creative, not less. But I'm not doing any "vibe-coding", maybe that's why?
The creative parts (for me) is coming up with the actual design of the software, and how it all fits together, what it should do and how, and I get to do that more than ever now.
The creative part for me includes both the implementation and the design, because the implementation also matters. The bots get in the way.
Maybe I would be faster if I paid for Claude Code. It's too expensive to evaluate.
If you like your expensive AI autocomplete, fine. But I have not seen any demonstrable and maintainable productivity gains from it, and I find that understanding my whole implementation is faster, more fun, and produces better software.
Maybe that will change, but people told me three years ago that we would be at the point today where I could not outdo the bot;
with all due respect, I am John Henry and I am still swinging my hammer. The steam pile driving machine is still too unpredictable!
The implementations LLMs end up writing are predictable, because my design locks down what it needs to do. I basically know exactly what they'll end up doing, and how, but it types faster than I do; that's why I hand it off while I go on to think about the next design iteration.
I currently send every single prompt to Claude, Codex, Qwen and Gemini (looks something like this: https://i.imgur.com/YewIjGu.png), and while they all succeed most of the time, doing it like this makes it clear that they're following what I imagined they'd do during the design phase, as they all end up with more or less the same solutions.
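A fan-out like that can be a few lines of glue; the sketch below is only a rough approximation, and the CLI names and flags are hypothetical stand-ins for whichever agent CLIs are actually installed:

```python
"""Sketch: send one prompt to several coding agents in parallel and collect
their answers side by side. Command names/flags are hypothetical placeholders."""
import concurrent.futures
import subprocess

AGENTS = {
    # hypothetical invocations; substitute the real CLI commands you use
    "claude": ["claude", "-p"],
    "codex": ["codex", "exec"],
    "qwen": ["qwen", "-p"],
    "gemini": ["gemini", "-p"],
}

def run_agent(name: str, base_cmd: list[str], prompt: str) -> tuple[str, str]:
    # Run one agent CLI with the prompt appended and capture its stdout.
    result = subprocess.run(base_cmd + [prompt], capture_output=True, text=True, timeout=600)
    return name, result.stdout.strip()

def fan_out(prompt: str) -> dict[str, str]:
    # Launch all agents concurrently and gather their answers.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = [pool.submit(run_agent, n, cmd, prompt) for n, cmd in AGENTS.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))

if __name__ == "__main__":
    for agent, answer in fan_out("Implement the parser changes described in DESIGN.md").items():
        print(f"=== {agent} ===\n{answer}\n")
```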
> If you like your expensive AI autocomplete
I don't know if you mean that in jest, but what I'm doing isn't "expensive AI autocomplete". I come up with what has to be done, the design for achieving so, then hand off the work. I don't actually write much code at all, just small adjustments when needed.
> and I find understanding my whole implementation faster
Yeah, I guess that's the difference between "vibe-coding" and what I (and others) are doing, as we're not giving up any understanding or control of the architecture and design, but instead focus mostly on those two things while handing off other work.
I've made great use of AI by keeping my boundaries clear and my requirements tight, and by rigorously ensuring I understand _every_ line of code I commit
I believe software development will transition to a role closer to director/reviewer/editor, where knowledge of programming paradigms are just as important as now, but also where _communication_ skills separate the good devs from the _great_ devs
The difference between a 1x dev and a 10x dev in future will be the latter knows how to clearly and concisely describe a problem or a requirement to laymen, peers, and LLMs alike. Something I've seen many devs struggle with today (myself included)
I think it has been that way since forever. If you look at all the great projects, it's rare for the guy at the helm to not be a good communicator. And at a corporate job, you spend a good chunk of the year writing stuff to people. Even the code you're writing, you think about the next person who's going to read it.
It's 20 bucks a month
Getting wild ideas badly implemented on a silver plate is a slot machine, it leads nowhere but in circles.
And yes, if you're just using it as a slot machine, I understand it doesn't feel useful. But I don't think that's how most people use it, at least that's not how I use it.
At this point AI is best at the first thing and less good at the second. I like stacking blocks together. If I build a beautiful UI I don't enjoy writing the individual css code for every button but rather composing the big picture.
Not saying either is better or worse. But I can imagine that the people who love to build the individual blocks like AI less, because it takes away something they enjoy. For me it just takes away a step I had to go through to get to composing the big picture.
After that, it all became routine work, as easy as drinking water. You explain the problem and I can quickly find the solution. Using AI at this point would be like herding cats. I already know what code to write; having a handful of suggestions thrown at me is distracting. Like feeling a tune, and someone playing a melody other than the one you know.
You can't successfully build the big picture on the sort of rotten foundation that AI produces though
I don't care how much you enjoy assembling building blocks over building the low level stuff, if you offload part of the building onto AI you're building garbage
It's fundamentally different from how a machine or some code makes a task actually go away or at least become smaller.
Also, to the people claiming cleaning isn't "creative" or "fun": Steam has a whole genre of games simulating cleaning stuff because the act of cleaning is extremely fun and creative to a lot of people, with https://store.steampowered.com/app/246900/Viscera_Cleanup_De... being a great example
Actually I do NOT want my robot to do my laundry for me! And because I'm garbage at painting and comparatively better at laundry, I DO want it to paint for me.
Someone making a game about an activity doesn't mean that the activity is fun or desirable in real life at all.
I mean yes there are people that find comfort in cleaning but they are not the target audience of cleaning simulators at all.
Washer-dryer combos are good too. Folding laundry is the biggest pain of my life so far. Also unloading the dishwasher, but the fix is easy here - get a double one.
Lol, nope.
Dishwashers solve at best some 50% of the hassle - the easy-to-wash table dishes - while being completely unable to clean oven ones. Floor cleaners solve a 5-minute task in a couple-of-days-long house upkeep. Coffee makers... don't really automate anything, why did you list them here? And there's no automation available for heating and cooling food. And the part about drilling and turning screws also isn't automation at all.
The only thing on your list that is close to solved is clothes cleaning. And there's the entire ironing thing that is incredibly resistant to solving. But yeah, that puts it way beyond 90% solved.
I think my answer would be "Does it matter?"
If it brings joy to you or others, who cares about the semantics of creation
Some people like creative coding, others like being creative with apps and features without much care to how it's implemented under the hood.
I like both, but IMO there is a much larger crowd for higher level creativity, and in those cases AIs don't automate the creativity away, they enable it!
This is the complete opposite of my experiences with using AI Coding tools heavily
Such state management messes use up a lot of resources to copy around.
As an EE working in QA on future chips, with a goal of compressing away developer syntax art to preserve the least amount of state management possible and achieve maximum utility: sorry, self-selecting biology of SWEs, but also not sorry.
Above all this is capitalism not honorific obligationism. If hardware engineers can claim more of the tech economy for our shareholders, we must.
There are plenty of other creative outlets that are much less resource intensive. Rich first world programmers are a small subset of the population and can branch out then and explore life rather than believe everyone else has an obligation to conserve the personal story of a generation of future dead.
The writing has been on the wall with so-called hallucinations, where LLMs just make stuff up: the hype was way out over its skis. Stories like the examples of lawyers being fined for unchecked LLM outputs presented as fact will continue to take the shine off, and hopefully some of the raw gung-ho nature will slow down a bit.
https://www.bbc.com/travel/article/20250926-the-perils-of-le...
I'm mildly bearish on humanity's capacity to learn from its mistakes, and have a feeling in my gut that we've taken a massive step backwards as a civilization.
From the layman's perspective, they did. That's the whole problem.
no-one wants stochastic computers
With 8 billion people on the planet, you could write a "man bites dog" story about any invention popular enough
"You never read about a plane that did not crash"
Where I still need to extend this is to introduce function calling into the flow: when "it has doubts" during reasoning would be the right time to call a tool that expands the context it's working with (pulling in other files, etc.).
Yeah, don't listen to the "wisdom of the crowd" when it comes to LLM models; there seems to be a ton of FUD going on, especially on subreddits.
GPT-OSS was piled on for being dumb in the first week of release, yet none of the software properly supported it at launch. As soon as it was working properly in llama.cpp, it was clear how strong the model was, but by that point the popular sentiment seems to have spread and solidified.
Here's a technique that often works well for me: When you get unexpectedly poor results, ask the LLM what it thinks an effective prompt would look like, e.g. "How would you prompt Claude Code to create a plan to effectively review code for logic bugs, ignoring things like FIXME and TODO comments?"
The resulting prompt is too long to quote, but you can see the raw result here: https://gist.github.com/CharlesWiltgen/ef21b97fd4ffc2f08560f...
From there, you can make any needed improvements, turn it into an agent, etc.
I asked ChatGPT to analyze its weaknesses and give me a pre-prompt to best help mitigate them and it gave me this: https://pastebin.com/raw/yU87FCKp
I've found it very useful to avoid sycophancy and increase skepticism / precision in the replies it gives me
I explicitly asked it to read all the code (within Cline) and it did so, gave me a dozen action items by the end of it, on a Django project. Most were a bit nitpicky, but two or three issues were more serious. I found it pretty useful!
Even very simple prompts can yield very useful outputs.
“Report each bug you spot in this code with a markdown formatted report.” worked better than I expected.
It costs just a couple of dollars to scan through an entire codebase with something like Gemini Flash.
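A minimal sketch of that kind of whole-codebase pass, assuming the OpenAI Python client; the model name, prompt wrapping, and file glob are placeholders rather than anyone's actual pipeline:

```python
"""Sketch: run the simple 'report each bug' prompt over every source file in a
directory. Assumes the OpenAI Python client and an API key in the environment."""
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "Report each bug you spot in this code with a markdown formatted report."

def review_file(path: Path, model: str = "gpt-4o-mini") -> str:
    # Send one file per request so each report stays focused and cheap.
    code = path.read_text(errors="ignore")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{path.name}:\n\n{code}"}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    for src in sorted(Path("src").glob("*.c")):  # placeholder path and extension
        print(f"## {src}\n\n{review_file(src)}\n")
```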
But I've been limiting it to a lot less than 20k LoC; I'm sticking with stuff I can just paste into the chat window.
I get it though, non-programmers or weak programmers don't scrutinise the results and are more likely to be happy to pay. Still, a bit of a shame.
Maybe these tools exist, but at least to me, they don't surface among all the noise.
I often use Claude/GPT-5/etc to analyze existing repositories while deliberately omitting the tests and documentation folders because I don't want them to influence the answers I'm getting about the code - because if I'm asking a question it's likely the documentation has failed to answer it already!
It still needs guidance, but it quashed bugs yesterday that I've previously spent many days on without finding a solution for.
It can be tricky, but they definitely can be significant aid for even very complex bugs.
Definitely optimistic for this way to use AI
GPT 5 has been disappointing with thinking and without.
Some history: https://hn.algolia.com/?q=curl+AI
Hard-to-compute, easy-to-verify things should be exactly the cases where AI excels. So why do so many AI users insist on skipping the verify step?
The issue I keep seeing with curl and other projects is that people are using AI tools to generate bug reports and submitting them without understanding (that's the vetting) the report. Because it's so easy to do this and it takes time to filter out bug report slop from analyzed and verified reports, it's pissing people off. There's a significant asymmetry involved.
Until all AI used to generate security reports on other peoples' projects is able to do it with vanishingly small wasted time, it's pretty assholeish to do it without vetting.
They're using it correctly. It's a system of tools, not an autopilot.
The set seems to be:
https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
So he likes ZeroPath. Does that get us any further? No, the regular subscription costs $200 and the free one-time version looks extremely limited and requires yet another login.
Also of course, all low hanging fruit that these tools detect will be found quickly in open source (provided that someone can afford a subscription), similar to the fact that oss-fuzz has diminishing returns.
You can see the fixes that resulted from this in the PRs that mention "sarif" in the curl repository: https://github.com/curl/curl/pulls?q=is%3Apr+sarif+is%3Aclos...
https://joshua.hu/llm-engineer-review-sast-security-ai-tools... ("Hacking with AI SASTs: An overview of 'AI Security Engineers' / 'LLM Security Scanners' for Penetration Testers and Security Teams")
Tools included ZeroPath, Corgea and Almanax.
If you read Corgea's (one of the products used) "whitepaper", it seems that AI is not the main show:
> BLAST addresses this problem by using its AI engine to filter out irrelevant findings based on the context of the application.
It seems that AI is being used to post-process the findings of traditional analyzers. It reduces the number of false positives, increasing the yield quality of the more traditional analyzers that actually did the scanning.
ZeroPath seems to use similar wording, like "AI-Enabled Triage" and expressions like "combining Large Language Models with AST analysis". It also highlights that it achieves fewer false positives.
I would expect someone who developed this kind of thing to set up a feedback loop in which the AI output is somehow used to improve the static analysis tool (writing new rules, tweaking existing ones, ...). It seems like the logical next step. This might be going on in these products as well (lots of in-house rule extensions for more traditional static analysis tools, written or discovered with the help of AI, hence the "build with AI" headline in some of them).
Don't get me wrong, this is cool. Getting an AI to triage a verbose static analysis report makes sense. However, it does not mean that AI found the bugs. In this model, the capabilities of finding relevant stuff are still capped at the static analyzer tools.
I wonder if we need to pay for it. I mean, now that I know it is possible (at least in my head), it seems tempting to get open source tools, set them to max verbosity, and find which prompts they are using on (likely vanilla) coding models to get them to triage the stuff.
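As a rough sketch of that idea (not any vendor's actual pipeline), assuming findings arrive as plain "file:line: message" lines from a traditional analyzer and are handed one at a time to a vanilla model for triage:

```python
"""Sketch of the 'AI triage' idea: take raw findings from a traditional static
analyzer (one 'file:line: message' per line) and ask a vanilla model whether each
one looks real. Prompt, model name, and report format are assumptions."""
import sys
from pathlib import Path
from openai import OpenAI

client = OpenAI()

TRIAGE_PROMPT = (
    "You are triaging static-analysis findings. For the finding below, look at the "
    "surrounding code and answer 'REAL' or 'FALSE POSITIVE' plus one sentence of reasoning.\n\n"
    "Finding: {finding}\n\nCode context:\n{context}\n"
)

def code_context(finding: str, lines_around: int = 20) -> str:
    # Pull a window of source lines around the reported location.
    path, line, *_ = finding.split(":", 2)
    src = Path(path).read_text(errors="ignore").splitlines()
    lo = max(0, int(line) - lines_around)
    return "\n".join(src[lo:int(line) + lines_around])

def triage(finding: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TRIAGE_PROMPT.format(
            finding=finding, context=code_context(finding))}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # Usage: python triage.py findings.txt (one finding per line)
    for finding in Path(sys.argv[1]).read_text().splitlines():
        if finding.strip():
            print(finding, "->", triage(finding))
```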
That's an editorialized headline (so it may get fixed by dang and co) - if you click through to what Daniel Stenberg said he was more clear:
> Joshua Rogers sent us a massive list of potential issues in #curl that he found using his set of AI assisted tools.
AI-assisted tools seems right to me here.
Also, think about it: of course I read Joshua's report. Otherwise, how could I have known the names of the products he used?
All comments that want to know more are at the bottom.
How would you have worded it?
Daniel Stenberg on 22 curl bugs reported using AI-assisted security scanners
I will spend longer considering my title next time.
Cheers!
We do not use traditional static analyzers; our engine was built from the ground up to use LLMs as a primitive. The issues ZeroPath identified in Joshua's post were indeed surfaced and triaged by AI.
If you're interested in how it works under the hood, some of the techniques are outlined here: https://zeropath.com/blog/how-zeropath-works
Joshua describes it as follows: "ZeroPath takes these rules, and applies (or at least the debug output indicates as such) the rules to every .. function in the codebase. It then uses LLM’s ability to reason about whether the issue is real or not."
Would you say that is a fair assessment of the LLM role in the solution?
Even Joshua's blog post does not clearly state which parts and how much is "AI". Neither does the pdf.
Something we learned along the way is that when it comes to this specific field of security, what we call low-level security (memory safety, etc.), validation and debugging have become more important than vulnerability discovery itself, because of hallucinations.
From our trial and error (trying validator architectures and security research methodologies, e.g. reverse taint propagation), it seems like the only way out of this problem is to design an LLM-native interactive environment, where LLMs validate their own findings through interactions with the environment or the component.
The reason web-security-oriented companies like XBOW are doing very well is how easy it is to validate. I saw XBOW's LLM trace at Black Hat this year; all the tooling they used, and pretty much need, is curl. For web security, the abstraction of the backend is limited to a certain level: you send a request, and it either works or you easily know why it didn't (XSS, SQLi, IDOR).
But for low-level security (memory safety), the entropy of dealing with UAFs and OOBs is at another level. There are certain things you just can't tell by looking at the source; you need to look at a particular program state (heap allocation, which depends on the glibc version, stack structure, register states...), and this ReAct-ing process with debuggers to construct a PoC/exploit is what has been a pain in the ass. (LLMs and tool calling are specifically bad at these strategic, stateful tasks; see DeepMind's Tree-of-Thoughts paper discussing this issue.) The way I've seen Google Project Zero & DeepMind's Big Sleep mitigate this is through GDB scripts, but that is limited to a certain complexity of program state.
When I was working on our integration with GGML, spending around two weeks on context and tool engineering already led us to very impressive findings (OOBs); but the hallucination problem scales with how many "runs" of our agentic framework we do. Because we're monitoring llama.cpp's main branch, every commit triggers an internal multi-agent run on our end, and each usually takes around an hour and hundreds of agent recursions. Sometimes at the end of the day we would have 30 really convincing, in-depth reports on OOBs and UAFs. But because of how costly it is to validate even one (from understanding to debugging to PoC writing...) and because of hallucinations (and each run is really expensive), we had to pause the project for a bit and focus on solving the agentic validation problem first.
I think as the environment gets more and more complex, interactions with the environment, and learning from those interactions, will matter more and more.
Thanks for sharing your experience! It correlates with this recent interview with Sutton [1]: that real intelligence is learning from feedback in a complex and ever-changing environment, whereas what an LLM does is train on a snapshot of what has been said about that environment and operate only on that snapshot.
[1] https://www.dwarkesh.com/p/richard-sutton
The key word is "potential", though. They're still wildly unpredictable and unreliable, which is why an expert human is required to validate their output.
The big problem is the people overhyping the technology, selling it as "AI", and the millions deluded by the marketing. Amidst the false advertising, uncertainty, and confusion, people are forced to speculate about the positive and negative impacts, with wild claims at both extremes. As usual, the reality is somewhere in the middle.
I guess the mastodon link is simply a confirmation that the bugs were indeed bugs, even with wrong code snippets?
https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
You did this with an AI and you do not understand what you're doing here: https://news.ycombinator.com/item?id=45330378
I'll be doing a retrospective in a few weeks when the dust has settled, as well as new tools I've been made aware of.
Seems like ZeroPath might be worth looking into if the price is reasonable
* Clearly useful to people who are already competent developers and security researchers
* Utterly useless to people who have no clue what they're doing
But the latter group's incompetency does not make AI useless in the same way that a fighter jet is not useless because a toddler cannot pilot it.
https://news.ycombinator.com/item?id=38845878
https://news.ycombinator.com/item?id=43907376
https://media.ccc.de/v/froscon2025-3407-ai_slop_attacks_on_t...
I disagree.
I'm making a board game of 6 colors of hexes, and I wanted to be able to easily edit the board. The first time around, I used a screenshot of a bunch of hexagons and used paint to color them (tedious, ugly, not transparent, poor quality). This time, I asked ChatGPT to make an SVG of the board and then make a JS script so that clicking on a hex could cycle through the colors. Easier, way higher quality, adjustable size, clean, transparent.
It would've taken me hours to learn and set that up for myself, but ChatGPT did it in 10min with some back and forth. I've made one SVG in my life before this, and never written any DOM-based JS scripts.
Yes, it's a toy example, but you don't have to know what you're doing to get useful things from AI.
You might be underestimating the expertise you applied in these 10 minutes. I know I often do.
> it's a toy example
This technology does exceptionally well on toy examples, I think because there are much fewer constraints on acceptable output than ‘real’ examples.
> you don't have to know what you're doing to get useful things from AI
You do need to know what is useful though, which can be a surprisingly high bar.
You're someone who knows the difference between a PNG and an SVG, knows enough Javascript to know that "DOM-based" JS is a thing, and has presumably previously worked in software/IT.
You're smart enough to know things, and you're also smart enough to know there's a lot that you don't know.
That's a far cry from the way a lot of laypeople, college kids, and fully nontechnical people try to use LLMs.
> Utterly useless to people who have no clue what they're doing
> the same way that a fighter jet is not useless
AI is currently like a bicycle, while we were all running hills before.
There's a skill barrier, and it's getting less complicated each week.
The marketing goal is to say "Push the pedal and it goes!" like it was a car on a highway, but it is a bicycle, you have to keep pedaling.
The effect on the skilled-in-something-else folks is where this is making a difference.
If you were thinking of running, the goal was to strengthen your tendons to handle the pavement. And a 2hr marathon pace is almost impossible to do.
Like a bicycle makes a <2hr marathon distance "easy" for someone who does competitive rowing, while remaining impossible for those who have been training to do foot races forever.
Because the bicycle moves the problem from unsprung weights and energy recovery into a VO2 max problem, also into a novel aerodynamics problem.
And if you need to walk a rock garden, now you need to lug the bike too with you. It is not without its costs.
This AI thing is a bicycle for the mind, but a lot of people go only downhill and with no brakes.
I'm a reasonable developer with 30+ years of experience. Recently I worked on an API design project and had to generate a mock implementation based on a full OpenAPI spec. Exactly what Copilot would be good at. No amount of prompting could make it generate a fully functional Spring Boot project that both implements the mock API and presents the spec at a URL at the same time. Yet it did a very neat job at just the mock for a simpler version of the same API a few weeks prior. Go figure.
Imagine what your doctors will be like two generations down the road.
You can read about my experience here: https://codepathfinder.dev/blog/introducing-secureflow-cli-t...
Old post: https://shivasurya.me/security-reviews/sast/2024/06/27/autom...
https://mastodon.social/@icing@chaos.social/1152440641434357...
>tldr
>The code was correct, the naming was wrong.
Red borders around every slide and very flashy images
Sounds like it was a lot more than 22, assuming most are valid.
Many people advocate for the use of AI technology for SAST. There are even people and companies that deliver SAST scanners based on AI technology. However, most are just far from good enough.
In the best case scenario, you’ll only be disappointed. But the risk of a false sense of security is enormous.
Some strong arguments against AI scanners can be found on https://nocomplexity.com/ai-sast-scanners/
It's like police facial recognition: it can help the police, but there is no way it is "replacing police".
That makes its results unpredictable.
So don’t have AI create your bugs.
Instead have your AI look for problems - then have it create deterministic tools and let the tools catch the issues in a repeatable, understandable, auditable way. Have it build short, easy-to-understand scripts you can commit to your repo, with files and line numbers and zero/nonzero exit codes.
It’s that key step of transforming AI insights into detection tools that transforms your outcomes from probabilistic to deterministic. Ask it to optimize the tools so they run in seconds. You can leave them in the codebase forever as linters, integrate them in your CI, and never have that same bug again.
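A toy example of what such a committed checker might look like; the specific rule here (flagging sprintf in favor of snprintf) is invented purely for illustration, not one of the actual findings:

```python
#!/usr/bin/env python3
"""Toy 'AI insight turned into a deterministic tool': a tiny, auditable checker that
flags one specific pattern with file and line numbers and exits nonzero so CI fails."""
import re
import sys
from pathlib import Path

# Illustrative rule; replace with whatever pattern the AI actually surfaced.
PATTERN = re.compile(r"\bsprintf\s*\(")

def main(root: str) -> int:
    problems = 0
    for path in Path(root).rglob("*.c"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PATTERN.search(line):
                problems += 1
                print(f"{path}:{lineno}: use snprintf instead of sprintf")
    return 1 if problems else 0  # nonzero exit code makes CI fail on any hit

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "src"))
```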
29 more comments available on Hacker News