Launch HN: Webhound (YC S23) – Research agent that builds datasets from the web
Great decision to make it usable without a login so people can test it.
Here is what I liked:
- The agent told me exactly what was happening, which sources it was checking, and what the schema looked like.
- The agent correctly identified where to look and how to obtain the data.
- Managing expectations: "Webhound is extracting data. Extraction can take multiple hours. We'll send you an email when it's complete."
Minor point:
- There is no pricing page on the main domain, just on the HN one: https://hn.webhound.ai/pricing
Good luck!
We were heavily inspired by tools like Cursor - basically tried to prioritize user control and visibility above everything else.
What we discovered during iteration was that our users are usually domain experts who know exactly what they want. The more we showed them what was happening under the hood and gave them control over the process, the better their results got.
Instead of just search query → final result (though you can do that too), you can step in and guide it. Tell it exactly where to look, what sources to check, how to dig deeper, how to use its notepad.
We've found this gets you way better results that actually match what you're looking for, as well as being a more satisfying user experience for people who already know how they would do the job themselves. Plus it lets you tap into niche datasets that wouldn't show up with just generic search queries.
This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported here on HN (and experienced myself).
Does Webhound respect robots.txt directives, and do you disclose the identity of your crawlers via the User-Agent header?
This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.
Appreciate you bringing this up.
If it isn't doing that in your session, you can usually just step in and tell it to and it will follow your instructions.
> It uses a text-based browser we built
Can you tell us more about this. How does it work?
A few design decisions we made that turned out pretty interesting:
1. We gave it an "analyze results" function. When the agent is on a search results page, instead of visiting each page one by one, it can just ask "What are the pricing models?" and get answers from all search results in parallel.
2. Long web pages get broken into chunks with navigation hints so the agent always knows where it is and can jump around without overloading its context ("continue reading", "jump to middle", etc.); there's a rough sketch of this after the list.
3. For sites that are commonly visited but have messy layouts or spread out information, we built custom tool calls that let the agent request specific info that might be scattered on different pages and consolidates it all into one clean text response.
4. We're adding DOM interaction via text in the next couple of days, so the agent can click buttons, fill forms, enter keys, but everything still comes back as structured text instead of screenshots.
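To make point 2 a bit more concrete, here's a rough sketch of the general shape. Everything here (names, chunk size, hint wording) is invented for illustration and isn't our actual code:

    # Illustrative only: split a page into chunks and wrap each chunk in
    # position markers plus navigation hints for the agent.
    CHUNK_CHARS = 4000  # rough per-chunk budget; tune to the model's context window

    def chunk_page(text: str, chunk_chars: int = CHUNK_CHARS) -> list[str]:
        """Split page text into chunks, preferring paragraph boundaries."""
        chunks, current = [], ""
        for para in text.split("\n\n"):
            if current and len(current) + len(para) > chunk_chars:
                chunks.append(current.rstrip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(current.rstrip())
        return chunks

    def render_chunk(chunks: list[str], i: int) -> str:
        """Render one chunk so the agent knows where it is and how to move."""
        total = len(chunks)
        hints = []
        if i + 1 < total:
            hints.append('"continue reading" -> next chunk')
            hints.append(f'"jump to middle" -> chunk {total // 2 + 1}')
        if i > 0:
            hints.append('"go back" -> previous chunk')
        nav = "[" + "; ".join(hints) + "]" if hints else "[end of page]"
        return "\n".join([f"[chunk {i + 1} of {total}]", chunks[i], nav])

The important part is the rendered framing, not the splitting: the agent always sees its position and a fixed vocabulary of moves, so it never has to hold the whole page in context.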
My original interpretation was that you had built a full-blown browser, something akin to a Chromium/Firefox fork.
I am concerned about your pricing, as "unlimited" anything seems to be fading away from most LLM providers. Also, I don't think it makes sense for B2B clients who have no problem paying per usage. You are going to find customers that want to use this to poll for updates daily, for example.
Are you using proxies for your text-based browser? I am curious how you are circumventing web crawling blocking.
We've been having similar thoughts about pricing and offering unlimited, but since it's feasible for us in the short term thanks to credits, we enjoy offering that option to early users, even if it may be a bit naive.
Having said that, we are currently working on a pilot with a company we are offering live updates to, and they are paying per usage since they don't want to set it up themselves, so we can definitely see the demand there. We also offer an API for companies that want to reliably query the same thing at a preset cadence, which is also usage-based.
For crawling we use Firecrawl. They handle most of the blocking issues and proxies.
It does say that extraction can take hours, but I was expecting it would be more of an 80/20 kind of thing, with a lot of data found quickly, then a long tail of searching to fill in gaps. Is my expectation wrong?
I worry for two related reasons. First, inefficient gathering of data is going to churn and burn more resources than necessary, both on your systems and on the sites being hit. Second, although this free opportunity is an amazing way to show off your tool, I fear the pricing of an actual run is going to be high.
We use Gemini 2.5 Flash, which is already pretty cheap, so inference costs are actually not as high as they might seem given the number of steps. Our architecture allows small models like that to perform well enough, and we think those kinds of models will only get cheaper.
Having said all that, we are working on improving latency and allowing for more parallelization wherever possible, and we hope to include that in future versions, especially for enrichment. We do think one of the product's weaknesses is mass collection: it's better at finding medium-sized datasets from siloed sources and less good at getting large, comprehensive datasets. We're also considering approaches that incorporate more traditional scraping tactics for finding those large datasets.
It's probably the best research agent that uses live search. You're using Firecrawl, I assume?
We're soon launching a similar tool (CatchALL by NewsCatcher) that does the same thing but on a much larger scale, because we already index and pre-process millions of pages daily (news, corporate, government files). We're seeing much better results than parallel.ai for queries like "find all new funding announcements for any kind of public transit in California State, US that took place in the past two weeks".
However, our tool will not perform live searches, so I think we're complementary.
I'd love to chat.
We’re optimising for large enterprises and government customers that we serve, not consumers.
Even the most motivated people, such as OSINT or KYC analysts, can only skim through tens, maybe hundreds of web pages. Our tool goes through 10,000+ pages per minute.
An LLM that has to open each web page to process the context isn’t much better than a human.
A perfect web search experience for an LLM would be to get just the answer, i.e. the valid tokens that can be fully loaded into context, with citations.
Many enterprises should leverage AI workflows, not AI agents.
Nice-to-have vs. must-have: existing AI implementations are failing because it's hard to rely on their results; therefore, they're used for nice-to-haves.
Most business departments know precisely what real-world events can impact their operations. Therefore, search is unnecessary; businesses would love to get notifications.
The best search is no search at all. We're building monitors: a solution that transforms your CatchALL query into a real-time updating feed.
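Mechanically, a monitor is little more than a stored query re-run on a cadence with deduplication. A toy sketch (the function names and cadence are illustrative, not our actual system):

    import time

    def run_query(query: str) -> list[dict]:
        # Placeholder for the indexed search backend: return records matching `query`.
        return []

    def monitor(query: str, interval_s: int = 3600, notify=print):
        """Re-run a stored query on a cadence, emitting only unseen records."""
        seen: set[str] = set()
        while True:
            for record in run_query(query):
                key = record.get("url") or repr(sorted(record.items()))  # dedupe key
                if key not in seen:
                    seen.add(key)
                    notify(record)  # e.g. push to a feed, webhook, or email digest
            time.sleep(interval_s)

The hard part is run_query over a pre-processed index, not the loop itself; that's what makes the feed feel real-time.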
I'll give a few examples of how they use the tool.
Example 1 -- a real estate PE firm that invests in multi-family residential buildings. Let's say they operate in Texas and want to get notifications about many different events. For example, they need to know about any new public transport infrastructure that will make a specific area more accessible -> prices will go up.
There are hundreds of valid records each month. However, to derive those records, we usually have to sift through tens of thousands of hyper-local news articles.
Example 2 -- Logistics & Supply Chain at an F100: tracking all third-party providers, any kind of instability in the main regions, disruptions at air and marine ports, political discussions around regulation that might affect them, etc. There are like 20-50 event types, all of them multilingual and at global scale.
Thousands of valid records each week, millions of web pages to derive them from.
Quickly hit your limits, but on a complex dataset requiring looking at a lot of unstructured data on a lot of different web pages, it seems to do really well! https://hn.webhound.ai/dataset/c6ca527e-1754-4171-9326-11cc8...
Working on better task classification upfront to route simple requests more directly.
As an aside, we are about to launch something similar at rtrvr.ai, but with AI web agents that navigate pages, fill forms, and retrieve data. We are able to get our costs down to negligible by using headless, serverless browsers and our own ground-up DOM construction/actuation (so no Firecrawl costs). https://www.youtube.com/watch?v=gIU3K4E8pyw
> I noticed you mentioned that "MCP stands for model context protocol." My current understanding, based on the initial problem description and the articles I've been reviewing, is that MCP refers to "Managed Care Plan." This is important because the entire schema and extraction plan are built around "Managed Care Plans."
Session ID: fcd1edb8-7b3c-480e-a352-ed6528556a63
I have to ask, how's that going? Genuinely curious to know!
Seems like y'all are doing well with it!
1. List building – finding targeted job titles with their contact information.
2. List research – finding contact and company details of given people.
3. List verification – manually checking if the data is correct, sometimes even calling the contact person to confirm.
Apollo is a big competitor to their 'B2B leads' (i.e. dataset) business because it is much cheaper. A tool like this could have a huge impact on their business.
Curious: Have you compared it with manual research? How accurate is it?
Interestingly, we're working with B2B clients right now where we use Webhound to curate and then act as the "validation" layer ourselves. The agent lets us offer these datasets way cheaper with live updates, but still with human oversight.
I asked for the school board website for every public school district in the Bay Area (like BoardDocs, etc.), and it mostly returned useless links to the pages listing the board members.
I asked ChatGPT5-thinking to do the same, and it completed the request correctly and output a CSV with a better schema in a couple of minutes.
We're working on better query interpretation, but in the meantime you could try being more specific, like "find BoardDocs or meeting document websites for each district", to guide it better. Also, you can usually figure out how it interpreted your request by looking at the entity criteria; those are all the criteria a piece of data needs to meet to make it into the set.
On the Data tab it says "no schema defined yet."
The Schema tab doesn’t seem to have a way to create a schema.
Most of the other tabs (except for Sources) looked blank.
I did see the chat on the right and the "51 items" counter at the top, but I couldn’t find any obvious way to view the results in a grid or table.
That's really strange; it sounds like Webhound for some reason deleted the schema after extraction ended, so although your data should still be tied to the session, it just isn't being displayed. Definitely not the expected behavior.
First, you're using Firecrawl as your crawling infrastructure, but Firecrawl explicitly blocks Reddit. Yet one of your examples mentions "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site..."
How are you accomplishing this? The comment about whether it's legal to crawl Reddit remains unanswered in this thread.
Second, you're accepting credit cards without providing any Terms of Service. This seems like a significant oversight for a YC company.
Third, as another commenter mentioned, GPT-5 can already do this faster and more effectively, and Claude has similar capabilities. I'm struggling to see the value proposition here beyond a thin wrapper around existing LLM capabilities with some agent orchestration. We're beyond assuming prompts are useful IP nowadays, or am I wrong?
Perhaps most concerning is the lack of basic account management features - there's no way to delete an account after creation. I'd say I'd like clarification, but honestly I could just code this up with Codex to run locally and do it myself (with a local crawler that can actually crawl Reddit, even).
Regarding Reddit, we have our own custom handler for Reddit URLs that uses the Reddit API; we are billed when we exceed the free limits.
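For anyone curious, the general shape of that kind of handler is just URL dispatch. A minimal sketch, assuming the public JSON view of reddit.com URLs; the names and dispatch shape are invented for the example, and production use would go through authenticated OAuth access (e.g. via PRAW) with proper rate limiting:

    from urllib.parse import urlparse

    import requests

    def fetch_page(url: str) -> str:
        """Route reddit.com URLs through the API path, everything else to a crawler."""
        host = urlparse(url).netloc
        if host == "reddit.com" or host.endswith(".reddit.com"):
            return fetch_reddit(url)
        return fetch_with_crawler(url)

    def fetch_reddit(url: str) -> str:
        # Most reddit.com listing/thread URLs expose a JSON view at <url>.json;
        # Reddit expects an identifying User-Agent on every request.
        resp = requests.get(
            url.rstrip("/") + ".json",
            headers={"User-Agent": "example-agent/0.1 (contact: ops@example.com)"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text

    def fetch_with_crawler(url: str) -> str:
        raise NotImplementedError("generic crawler path, e.g. a Firecrawl call")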
For Terms of Service, you're right, that is definitely an oversight on our part. We just published both our Terms of Service and Privacy Policy on the website.
When it comes to comparing with GPT-5 and Claude, we do believe that our prompting, agent orchestration, and other core parts of the product, such as parallel search results analysis and parallel agents, are improvements over plain GPT-5 and Claude, while also letting Webhound run much more cheaply on significantly smaller models. Our v1, which we built months ago, was essentially the same as what GPT-5 Thinking with web search currently does, and we've since made the explicit choice to focus on data quality, user controllability, and cost efficiency over latency. So while yes, GPT-5 might give faster results and work better for smaller datasets, both we and our users have found Webhound to work better for siloed sources and larger datasets.
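To give a flavor of what we mean by parallel search results analysis, the shape is roughly a fan-out like this (a sketch with stubbed fetch and model calls, not our actual implementation):

    import asyncio

    async def fetch_text(url: str) -> str:
        await asyncio.sleep(0.1)  # stand-in for fetching + converting the page to text
        return f"plain-text contents of {url}"

    async def ask_model(question: str, context: str) -> str:
        await asyncio.sleep(0.1)  # stand-in for one cheap small-model call
        return f"answer derived from {len(context)} chars of context"

    async def analyze_results(question: str, urls: list[str]) -> dict[str, str]:
        """Ask one question against every search result concurrently."""
        async def one(url: str) -> tuple[str, str]:
            return url, await ask_model(question, await fetch_text(url))
        return dict(await asyncio.gather(*(one(u) for u in urls)))

    # asyncio.run(analyze_results("What are the pricing models?", result_urls))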
Regarding account deletion, that is also a fair point. So far we've had people email us when they want their account deleted, but we will add account deletion ASAP.
Criticism like this helps us continue to hold ourselves to a high standard, so thanks for taking the time to write it up.
Current experience: https://imgur.com/a/2BB1mAA
Can't seem to upgrade, though? Stripe seems unhappy.
How’s it different from Parallel Web Systems?
I was actually building a version of this using NonBioS.ai, but this is already pretty well done, so will just use this instead.