Launch HN: Webhound (YC S23) – Research agent that builds datasets from the web

3 months ago

Thanks, glad to hear you had a good experience.

We were heavily inspired by tools like Cursor - basically tried to prioritize user control and visibility above everything else.

What we discovered during iteration was that our users are usually domain experts who know exactly what they want. The more we showed them what was happening under the hood and gave them control over the process, the better their results got.

3 months ago

1 reply

Pretty cool, how does it compare to Parallel?

3 months ago

Thanks! Unlike a lot of our competitors who use search-inspired UX, we went with an agentic approach inspired by tools like Cursor - basically iterative user control.

Instead of just search query → final result (though you can do that too), you can step in and guide it. Tell it exactly where to look, what sources to check, how to dig deeper, how to use its notepad.

We've found this gets you way better results that actually match what you're looking for, as well as being a more satisfying user experience for people who already know how they would do the job themselves. Plus it lets you tap into niche datasets that wouldn't show up with just generic search queries.

jackienotchan

3 months ago

1 reply

AI crawlers have lead to a big surge in scraping activity, and most of these bots don't respect any scraping best practices that the industry has developed over the past two decades (robots.txt, rate limits, user agents, etc.).

This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported here on HN (and experienced myself).

Does Webhound respect robots.txt directives and do you disclose the identity of your crawlers via user-agent header?

3 months ago

1 reply

We currently use Firecrawl for our crawling infrastructure. Looking at their documentation, they claim to respect robots.txt, but based on user reports in their GitHub issues, the implementation seems inconsistent - particularly for one-off scrapes vs full crawls.

This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.

Appreciate you bringing this up.

nextworddev

3 months ago

Firecrawl is egregiously expensive

Poomba

3 months ago

1 reply

When it finds a page that has the results I'm looking for in a particular website, does it paginate through all of them? When I searched for "stores that are on the Faire marketplace" it seems to return just the first page of results without paginating through all of them.

3 months ago

Right now it can do that via URL params if that is how the website handles pagination, although we are pushing a feature in the next couple of days which allows it to take action on the DOM.

If it isn't doing that in your session, you can usually just step in and tell it to and it will follow your instructions.

whinvik

3 months ago

1 reply

Nice, this looks interesting.

> It uses a text-based browser we built

Can you tell us more about this. How does it work?

3 months ago

1 reply

We maintain a constant browser state that gets fed into the system prompt which shows the most recent results, current page, where you are in the content, what actions are available, etc. It's markdown by default but can switch to HTML if needed (for pagination or CSS selectors). The agent always has full context of its browsing session.

A few design decisions we made that turned out pretty interesting:

1. We gave it an analyze results function. When the agent is on a search results page, instead of visiting each page one by one, it can just ask "What are the pricing models?" and get answers from all search results in parallel.

2. Long web pages get broken into chunks with navigation hints so the agent always knows where it is and can jump around without overloading its context ("continue reading", "jump to middle", etc.).

3. For sites that are commonly visited but have messy layouts or spread out information, we built custom tool calls that let the agent request specific info that might be scattered on different pages and consolidates it all into one clean text response.

4. We're adding DOM interaction via text in the next couple of days, so the agent can click buttons, fill forms, enter keys, but everything still comes back as structured text instead of screenshots.

whinvik

3 months ago

1 reply

Thanks. If I am interpreting this correctly, what you have is not a browser but a translation layer. You are still using something that scrapes the data and then you translate it to be in the format that works best for your agent.

My original interpretation was that you had built a full blown browser, something akin to a Chromium/Firefox fork

3 months ago

Yes, translation layer would probably be better terminology.

lubujackson

3 months ago

1 reply

Great job on the launch. This is fills an exact need my company has, and your UI is fantastic. One nit, it would be nice to be able to manually edit the schema instead of only interacting with the LLM.

I am concerned about your pricing, as "unlimited" anything seems to be fading away from most LLM providers. Also, I don't think it makes sense for B2B clients who have no problem paying per usage. You are going to find customers that want to use this to poll for updates daily, for example.

Are you using proxies for your text-based browser? I am curious how you are circumventing web crawling blocking.

3 months ago

Thanks a lot regarding UI and good point on the schema editing.

We've been having similar thoughts about pricing and offering unlimited, but since it is feasible for us in the short term due to credits we enjoy offering that option to early users, even if it may be a bit naive.

Having said that, we are currently working on a pilot with a company whom we are offering live updates, and they are paying per usage since they don't want to have to set it up themselves, so we can definitely see the demand there. We also offer an API for companies that want to reliably query the same thing at a preset cadence, which is also usage based.

For crawling we use Firecrawl. They handle most of the blocking issues and proxies.

admn2

3 months ago

1 reply

Can I give it a list of products and have it provide data enrichment for each of them?

3 months ago

Yep, you can paste the list as text, or we also accept file uploads. Then, you can prompt it to enrich with certain attributes and it will do that for you.

codingdave

3 months ago

1 reply

It looks promising overall, and I can see where this will save a ton of time for some types of data gathering. But I'm also seeing quite a lot of queries and processing going on, and it has only found about 3% of the data that I know is publicly available, after 20 minutes.

It does say that extraction can take hours, but I was expecting it would be more of an 80/20 kind of thing, with a lot of data found quickly, then a long tail of searching to fill in gaps. Is my expectation wrong?

I worry for two related reasons. One, inefficient gathering of data is going to churn and burn more resources than necessary, both on your systems and on the sites being hit. Secondly, although this free opportunity is an amazing way to show off your tool, I fear the pricing of an actual run is going to be high.

3 months ago

Thanks for the feedback! From what we've seen it's actually the other way around - once it gets a sense of where this information lives the latter stages of data collection go quicker, especially since it's able to deploy search agents in parallel to get information and doesn't need to do the manual work as much anymore. Having said that, it does sometimes forget to do that, and although we've added the critic agent to remind it to do that it can be inconsistent but usually if you step in and ask it to deploy agents in parallel that fixes it.

We use Gemini 2.5 Flash which is already pretty cheap, so inference costs are actually not as high as they would seem given the number of steps. Our architecture allows for small models like that to operate well enough, and we think those kinds of models will only get cheaper.

Having said all that, we are working on improving latency and allowing for more parallelization wherever possible and hope to include that in future versions, especially for enrichment. We do think that one of the weaknesses of the product is for mass collection - it's better at finding medium sized datasets from siloed sources and less good at getting large comprehensive datasets, but we're also considering approaches that incorporate more traditional scraping tactics for finding these large datasets.

3 months ago

2 replies

Congrats on the HN Launch!

It's probably the best research agent that uses live search. Are you using Firecrawl, I assume?

We're soon launching a similar tool (CatchALL by NewsCatcher) that does the same thing but on a much larger scale because we already index and pre-process millions of pages daily (news, corporate, government files). We're seeing so much better results compared to parallel.ai for queries like "find all new funding announcements for any kind of public transit in California State, US that took place in the past two weeks"

However, our tool will not perform live searches, so I think we're complementary.

i'd love to chat.

Poomba

3 months ago

1 reply

I like this approach better TBH - more reliable and robust. It probably satisfies 80% of most customer queries too as most want to query against the same sources

3 months ago

1 reply

Oh, I totally see your point.

We’re optimising for large enterprises and government customers that we serve, not consumers.

Even the most motivated people, such as OSINT or KYC analysts, can only skim through tens, maybe hundreds of web pages. Our tool goes through 10,000+ pages per minute.

An LLM that has to open each web page to process the context isn’t much better than a human.

A perfect web search experience for LLM would be to get just the answer, aka the valid tokens that can be fully loaded into context with citations.

Many enterprises should leverage AI workflows, not AI agents.

Nice to have // must have. Existing AI implementations are failing because it’s hard to rely on results; therefore, they’re used for nice-to-haves.

Most business departments know precisely what real-world events can impact their operations. Therefore, search is unnecessary; businesses would love to get notifications.

The best search is no search at all. We’re building monitors – a solution that transforms your catchALL query into a real-time updating feed.

Poomba

3 months ago

1 reply

So your customers just want to use this for their own internal data, not external data from the web. Is that correct?

3 months ago

no no, they want to use it on external data, we do not do any internal data.

I'll give a few examples of how they use the tool.

Example 1 -- real estate PE that invests in multi-family residential buildings. Let's say they operate in Texas and want to get notifications about many different events. For example, they need to know about any new public transport infrastructure that will make specific area more accessible -> prices wil go up.

There are hundreds of valid records each month. However, to derive those records, we usually have to sift through tens of thousands of hyper-local news articles.

Example 2 -- Logistics & Supply Chain at F100 Tracking of all the 3rd party providers, any kind of instability in the main regions, disruptions at air and marine ports, political discussions around the regulation that might affect them, etc. There are like 20-50 events, and all of them are multi-lingual at global scale.

thousands of valid records each week, millions of web pages to derive those from.

3 months ago

1 reply

Hey, would be happy to chat. Shoot us an email at team@webhound.ai and we can set up a time.

3 months ago

done

caltonji

3 months ago

1 reply

On a simple dataset (that can be answered by a single table in a single Wikipedia page) it seems to overthink https://www.webhound.ai/dataset/57a5c745-909e-466d-bcac-1dd7....

Quickly hit your limits but on a complex dataset requiring looking at a lot of unstructured data on a lot of different web page, it seems to do really well!https://hn.webhound.ai/dataset/c6ca527e-1754-4171-9326-11cc8...

3 months ago

Yeah, we've noticed it overthinks simple tasks that could be solved with a single table scrape. The agent architecture is built for complex, multi-source problems so it overengineers straightforward queries.

Working on better task classification upfront to route simple requests more directly.

mips_avatar

3 months ago

1 reply

This is really cool, I could imagine a number of ways I'd want to incorporate this into my data pipeline. Your agent is actually doing a great job of finding useful sources, though I wish there was a way it could really dig into some of these sources and pull hard from them. For my query it actually found some cool sources I didn't know existed, but it only pulled a single query from them. I'm now thinking I should go and write a custom scraper for those sources to actually get the full data I want.

3 months ago

Thanks, we have noticed that it can tend to "give up" early on certain sources. Ideally the critic agent would guide it back to the correct path of continuing to go deeper, but if that doesn't work usually just adding something to the prompt or sending it a message later on telling it to go deep on these sources would work.

arjunchint

3 months ago

1 reply

Since it looks like you are built on FireCrawl, and FireCrawl itself has similar products like FireEnrich, how do you see yourself maintaining a differentiation and compete with FireCrawl directly, if they just decide to copy you?

As an aside, we are about to launch something like similar at rtrvr.ai but having AI Web Agents navigate pages, fill forms and retrieve data. We are able to get our costs down to negligible by doing headless, serverless browsers and our own grounds up DOM construction/actuation (so no FireCrawl costs). https://www.youtube.com/watch?v=gIU3K4E8pyw

3 months ago

Good point. Our main differentiation is the shared workspace - users can step in and guide the agent mid-task, kind of like Cursor vs Claude (which can technically generate the same code that Cursor does). Firecrawl (or any crawler we may use) is only part of the process, we want to make the collaborative process for user <> agent as robust and user controllable as possible.

_false

3 months ago

1 reply

I found the ability to stop and clarify a task in "one-shot" mode impressive. In my original prompt it misunderstood MCP to stand for Medical Care Plan. I was worried I wasted a generation but being able to stop and clarify fixed it.

_false

3 months ago

1 reply

Oh, nevermind. It became confused and was unable to complete the task:

> I noticed you mentioned that "MCP stands for model context protocol." My current understanding, based on the initial problem description and the articles I've been reviewing, is that MCP refers to "Managed Care Plan." This is important because the entire schema and extraction plan are built around "Managed Care Plans."

Session ID: fcd1edb8-7b3c-480e-a352-ed6528556a63

3 months ago

Sorry about that. If you tell it to restructure the schema and search plan around MCP as model context protocol it should work. The agent can get stuck on its initial interpretation sometimes.

rick1290

3 months ago

1 reply

Are you able to share some of the tech stack choices behind the scenes? React, fastapi, django, go, pydantic ai, ag-ui, firecrawl, vertex ai, etc. Would love to see the tools and frameworks others are deploying with.

3 months ago

1 reply

Yep: NextJS frontend, NodeJS backend, Gemini 2.5 Flash LLM, Firecrawl for crawling, self-hosted SearXNG for web search, and fly.io for hosting. Beyond that everything else is built internally, we don’t use many frameworks.

rick1290

3 months ago

1 reply

What about for the db? Prisma? Postgres?

3 months ago

Supabase

cco

3 months ago

1 reply

As someone that works in the auth space (Stytch), this is the first time I've _ever_ seen someone's landing page be a login screen.

I have to ask, how's that going? Genuinely curious to know!

3 months ago

1 reply

Fair point, most of our users have come from referrals/word of mouth so it hasn't really been an issue for us, but you're probably right that we should have more information on the landing page

cco

3 months ago

Oh definitely not trying to make a point, I'm really just curious about how it is working out in case it is something I should update in my recommendations to customers.

Seems like y'all are doing well with it!

piecerough

3 months ago

1 reply

Have you tried 2.5 Flash Lite to cut costs further?

3 months ago

We have, 2.5 Flash is about as small as we've been able to go while still delivering consistent results.

3 months ago

1 reply

You can fix your column headers like this so they don't break: https://imgur.com/a/EV4piJ0

3 months ago

Thanks, fixed!

ukuina

3 months ago

1 reply

Awesome platform.

I wanted to upgrade!

But your "upgrade to Pro" button on the Account page gets stuck on "Processing..."

3 months ago

Did this resolve itself? If not shoot us an email at team@webhound.ai and we can get it figured out.

bey0nder

3 months ago

2 replies

There is a company in India that does something similar, but they do it manually. They call their people “research analysts” and pay them around $130–$150 per month. They mainly use Google Techniques and LinkedIn/Sales navigation to find information about companies and the people who work there. Their services include:

1. List building – finding targeted job titles with their contact information.

2. List research – finding contact and company details of given people.

3. List verification – manually checking if the data is correct, sometimes even calling the contact person to confirm.

Apollo is a big competitor for their 'B2B leads(*dataset)' business because it is much cheaper. A tool like this could have a huge impact on their business.

Curious: Have you compared it with manual research? How accurate is it?

3 months ago

1 reply

Accuracy-wise we think it's almost there but probably still a few iterations away from being perfect. It's great at eliminating a lot of the collection time though.

Interestingly, we're working with B2B clients right now where we use Webhound to curate and then act as the "validation" layer ourselves. The agent lets us offer these datasets way cheaper with live updates, but still with human oversight.

bey0nder

3 months ago

The indian company I mentioned earlier mainly had exhibitions and events as clients. These clients usually need huge datasets rather than just a few leads, which makes them a good target market for a tool like yours.

ta12653421

3 months ago

"there is no AI that can do something more cheaply than 1.5 billion indians" :-D

gpt5

3 months ago

1 reply

I like the idea, but it failed for me (sharing in case it's helpful).

I ask for the school board website for every public school district in the bay area (like boarddocs, etc.), and it mostly returned useless links to the page listing the board members.

I asked ChatGPT5-thinking to do the same and it completed the request correctly, and outputted a CSV with a better schema in a couple of minutes.

3 months ago

Thanks for testing it! That's definitely a miss, sounds like it got confused about what you were looking for and went after board member pages instead of the actual meeting/document sites.

We're working on better query interpretation, but in the meantime you could try being more specific like "find BoardDocs or meeting document websites for each district" to guide it better. Also, you can usually figure out how it interpreted your request by looking at the entity criteria, those are all the criteria a piece of data needs to meet to make it in the set.

plasma

3 months ago

1 reply

I gave the demo a try and was able to run a search that showed "51 results" - great start! A few things I noticed though:

On the Data tab it says "no schema defined yet."

The Schema tab doesn’t seem to have a way to create a schema.

Most of the other tabs (except for Sources) looked blank.

I did see the chat on the right and the "51 items" counter at the top, but I couldn’t find any obvious way to view the results in a grid or table.

3 months ago

Could you share the session url via the feedback form if you still have access to it?

That's really strange, it sounds like Webhound for some reason deleted the schema after extraction ended, so although your data should still be tied to the session it just isn't being displayed. Definitely not the expected behavior.

kordlessagain

3 months ago

1 reply

I looked at this product and found some concerning issues.

First, you're using Firecrawl as your crawling infrastructure, but Firecrawl explicitly blocks Reddit. Yet one of your examples mentions "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site..."

How are you accomplishing this? The comment about whether it's legal to crawl Reddit remains unanswered in this thread.

Second, you're accepting credit cards without providing any Terms of Service. This seems like a significant oversight for a YC company.

Third, as another commenter mentioned, GPT-5 can already do this faster and more effectively, and Claude has similar capabilities. I'm struggling to see the value proposition here beyond a thin wrapper around existing LLM capabilities with some agent orchestration. We're beyond assuming prompts are useful IP nowadays, or am I wrong?

Perhaps most concerning is the lack of basic account management features - there's no way to delete an account after creation. I'd say I'd like clarification, but there's no way I couldn't just code this up with Codex to run locally and do it myself (with a local crawler that can actually crawl Reddit even).

3 months ago

Hey, appreciate the feedback. Will address all your points.

Regarding Reddit, we have our own custom handler for Reddit URLs which uses the Reddit API, which we are billed for when we exceed free limits.

For Terms of Service, you're right, that is definitely an oversight on our part. We just published both our Terms of Service and Privacy Policy on the website.

When it comes to comparing with GPT-5 and Claude, we do believe that our prompting, agent orchestration, and other core parts of the product such as parallel search results analysis and parallel agents are improvements on just GPT-5 and Claude, while also allowing it to run at much cheaper costs on significantly smaller models. Our v1 which we built months ago was essentially the same as what GPT-5 thinking with web search currently does, and we've since made the explicit choice to focus on data quality, user controllability, and cost efficiency over latency. So while yes, it might give faster results and work better for smaller datasets, both we and our users have found Webhound to work better for siloed sources and larger datasets.

Regarding account deletion, that is also a fair point. So far we've had people email us when they want their account deleted, but we will add account deletion ASAP.

Criticism like this helps us continue to hold ourselves to a high standard, so thanks for taking the time to write it up.

3 months ago

1 reply

The agents need to be able to work on a cellular level. Then you can add N number of enrichments per subscription tier.

3 months ago

and it's annoying that i have to rerun the entire dataset task just for adding a single column

3 months ago

I need to be able to refine sources by hardcoded sources. I am getting failure in a column yet I know multiple bonafide information sources. In the sources tab I would expect to see the sources broken down by schema name. For example, I have a schema element called "Recent News". I it was possible for me to do so, I would define pr newswire and business newswire as a source. But instead the agents are failing here.

selinkocalar

3 months ago

We've tried to automate similar processes and always end up needing human review because websites are inconsistent and messy.

3 months ago

When there are more than one link in a cell result you can easily fix the formatting. Just add this one class https://imgur.com/a/SoRKnZQ

current experience: https://imgur.com/a/2BB1mAA

4ndrewl

3 months ago

Congratulations on the launch. I'm currently using it to run a very specific task that I was thinking about just earlier today. Will let you know how it gets on.

Uptrenda

3 months ago

Nice tool, man. It found 240 STUN servers in my test run. I hope you don't mind but I was curious to see if I could trick it into giving me results for TURN servers (with user + pass.) I wasn't able to trick it. Not even if the TURN servers were part of my grandmas famous cookie recipe did it give in. Oh well, no cookies for me I guess.

papacostas

3 months ago

ok this looks amazing, its the product i've been wanting for a long time and lowkey considered building. been trying accomplish similar things with Dia and the Strawberry Browser but this is straight to the point.

cant seem to upgrade though? stripe seems unhappy

vira28

3 months ago

I am on phone. Sorry if it’s already discussed.

How’s it different from Parallel Web Systems?

giancarlostoro

3 months ago

This is how I use Perplexity, I will have to give this a try, I am always on the lookout for newer tools.

3 months ago

VC will want this tool for creating a competitive landscape. Startups should do it to but it will replace hundreds of hours of junior analyst time. For better or worse. THis is going to be much better research than what a junior analyst could do. Or maybe jr analyst will pay for it.

eichi-pikachu

3 months ago

Plz hire the best engineers and scientists in the area. It's a competitive market, but the potential is undeniable.

https://imgur.com/a/A7qrBvi

3 months ago

Your column sorting only looks at the first character and if the cell value is a number it's all wrong because numbers are converted from text strings. For example, I have a column title "number of employees". What happens is that this column is not sortable numerically. What should happen is that I should be able to sort numerical data from greatest to least or vice versa

raymond_goo

3 months ago

Emergency Override - Force ending extraction

ashivkum

3 months ago

this is very polished and intuitive

3 months ago

If I were you guys I would prioritize adding a multi-user instance otherwise you get people who don't want to make their dataset public, they will download and leave the platform. From a acquisition strategy perspective you want people to become more familiar with being on your site vs exporting to their favorite editor. You probably have it on your roadmap but a main competitor for you in gonna be clay. People will export from your tool and import into clay.

nextworddev

3 months ago

Is there an open source alternative to building this type of UI / job scheduling backend that works for grids?

suninsight

3 months ago

Very cool and nicely executed ! Definitely see a lot of value in this.

I was actually building a version of this using NonBioS.ai, but this is already pretty well done, so will just use this instead.

bananamansion

3 months ago

“I see you’re dynamically generating schemas per dataset. Have you explored using a domain ontology or building a reusable ontology layer so the agent’s outputs can interoperate more consistently across projects

jquaint

3 months ago

Used this in a "vibe coding hackathon" the other day. Works really well!