Guarding My Git Forge Against AI Scrapers
Key topics
The cat-and-mouse game between Git forge maintainers and AI scrapers just got more intense, with one developer sharing their clever tactics to deter these data-hungry bots. While some commenters cheered on the author's efforts, others debated the ethics and potential consequences of "poisoning" the data used to train large language models (LLMs). The discussion took a geopolitical turn when Russia's alleged involvement in "LLM grooming" was brought up, sparking a heated exchange about the country's intentions and reputation. Amidst the controversy, some commenters pointed out the article's claims were backed by multiple sources, while others remained skeptical.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 56m after posting
- Peak period: 90 comments in 0-12h
- Avg / period: 12.6 comments
Based on 139 loaded comments
Key moments
- Story posted: Dec 12, 2025 at 2:51 AM EST (22 days ago)
- First comment: Dec 12, 2025 at 3:47 AM EST, 56m after posting
- Peak activity: 90 comments in 0-12h, the hottest window of the conversation
- Latest activity: Dec 18, 2025 at 9:23 PM EST (15 days ago)
I'm also left wondering what other things you could do. For example, I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories into your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
it's called "LLM grooming"
https://thebulletin.org/2025/03/russian-networks-flood-the-i...
> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.
Right, because Russia is such a cartoonish villain that it has no interest in pursuing its own development and good relations with any other country; all it cares about is annoying the democratic countries with propaganda about their own messed-up politics.
When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim, and worse, claims that can be easily dismissed as obviously false by quickly looking at their policies and diplomatic interactions with other countries?!
That's actually pretty much spot on.
The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.
It's funny how trolls like you talk about the importance of listening to what Russians say, but flat out refuse to see and listen to what Russians who are not members of Putin's gang are saying.
Oh my god, you're just ignorant.
NATO itself said back in 2008 that Ukraine would eventually join the alliance.
> At the 2008 Bucharest summit, NATO declined to offer Ukraine a Membership Action Plan, but said that Ukraine would eventually join the alliance.
Source: https://en.wikipedia.org/wiki/Ukraine%E2%80%93NATO_relations
Also mentioned even in NATO's own website:
https://www.nato.int/en/what-we-do/partnerships-and-cooperat...
You can read the full statement from 2008 here: https://www.nato.int/en/about-us/official-texts-and-resource...
This was the trigger for the Russian invasion of Georgia, which was also mentioned in the above statement.
Do you think NATO is spreading Russian propaganda on its website??
I do agree with you that there are many news sites spreading misinformation, but I think that most of it is not coming from governments... and while governments are also doing this, most, I would think, do it with good intentions (they do believe the information is true and barely verify that when it favours their preconceived points of view). When propaganda spreads information you like, you tend to call it just news.
The way Western media currently dismisses anything at all that comes from Russian sources as lies and propaganda, however, is way overblown in my opinion. That's causing a huge blind spot in the public discourse, which just makes the fake news sources seem even more attractive, since they seem to be whistleblowers fighting against a campaign of silence from the mainstream media - which is not completely incorrect.
[1] https://www.newsguardtech.com/special-reports/john-mark-doug...
[2] https://medium.com/@amithnmbr/why-its-important-to-know-how-...
Wasn't there a study a while back showing that a small sample of data is enough to poison an LLM? So I'd say it's certainly possible.
That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
Failing that, I would use Chrome / PhantomJS or similar to browse the page in a real headless browser.
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process the cloned data, mark it for re-indexing, and then keep re-indexing only the parts of the site that aren't included in the repo itself - like issues and pull request messages.
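A crawler that wanted to behave this way wouldn't need much. Here is a rough Go sketch of the idea; the forge-detection heuristic, the example URL, and the shallow-clone choice are illustrative assumptions, not a description of what any real scraper actually does:

```go
package main

import (
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// looksLikeGitRepo probes the "smart HTTP" endpoint that git servers
// (including Gitea/Forgejo web UIs) expose for fetches. Purely a heuristic.
func looksLikeGitRepo(repoURL string) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(repoURL + "/info/refs?service=git-upload-pack")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK &&
		resp.Header.Get("Content-Type") == "application/x-git-upload-pack-advertisement"
}

func main() {
	repo := "https://git.example.org/alice/project" // hypothetical repo URL
	if looksLikeGitRepo(repo) {
		// One shallow clone replaces thousands of page loads; the web UI
		// then only needs crawling for what isn't in the repo (issues, PRs).
		out, err := exec.Command("git", "clone", "--depth", "1", repo).CombinedOutput()
		fmt.Println(string(out), err)
	}
}
```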
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, web scraping generally proves to be sufficient, so any push to write specific optimisations for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If you mean scrapers in terms of the people writing them, then the fact that plain web scraping is sufficient, as mentioned above, is likely the significant factor.
> why the scrapers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick, lazy gain for themselves and don't care (or even have the foresight to realise) how much it might inconvenience¹ anyone else.
----
[0] the fact this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise they might cause an inconvenience usually take the view that it is only a small one, and how could the inconvenience that little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same thing. Or they think that if other people are doing it, what is the harm in just one more? Or they simply take the view “why should I care if getting what I want inconveniences anyone else?”.
I highly encourage folks to put stuff out there! Put your stuff on the internet! Even if you don't need it, even if you don't think you'll necessarily benefit: leave the door open to possibility!
I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.
Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind?
REQUIRE_SIGNIN_VIEW=true means signin is required for all pages - that's great and definitely stops AI bots. The signin page is very cheap for Gitea to render. However, it is a barrier for the regular human visitors to your site.
'expensive' is a middle ground that lets normal visitors browse and explore repos, view the README, and download release binaries. Signin is only required for "expensive" pageloads, such as viewing file content at specific commits or browsing git history.
From Gitea's docs I was under the impression that it went further than "true", so I didn't understand why, because "true" was enough for me not to be bothered by bots.
But in your case you want a middle-ground, which is provided by "expensive"!
> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.
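For concreteness, this is roughly what the setting looks like in Gitea's `app.ini`. A minimal sketch, assuming a recent Gitea release where the experimental "expensive" value quoted above is available:

```ini
; app.ini
[service]
; "true"      = sign-in required for every page (stops anonymous bots entirely)
; "expensive" = anonymous users may still browse cheap pages, but
;               resource-heavy ones (file content at specific commits,
;               git history, ...) require sign-in
REQUIRE_SIGNIN_VIEW = expensive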
Forgejo doesn't seem to have copied that feature yet
do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?
So much real human traffic that it brings their site down?
I mean yes it's a problem, but it's a good problem.
b) They have a complete lack of respect for robots.txt
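For reference, asking nicely looks like this: a typical robots.txt aimed at AI crawlers (the user-agent tokens are examples of commonly published crawler names; the complaint above is precisely that many scrapers never read the file at all):

```
# robots.txt - purely advisory; compliant crawlers honor it, abusive ones don't
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

User-agent: *
Crawl-delay: 10
```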
I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…
When scraping was mainly used to build things like search indexes, which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.
But, for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly and merely return a mangled copy of its content back to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.
But for some reason corporations don't want that; I guess they want to be allowed to just take from the commons and give nothing in return :/
In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
It's funny observing their tactics though. On the whole, spammers have moved from the bare domain to various subdomains like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.
It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@ and about a month after I blocked that chat1@. I mostly block *@domain though, so I'm less aware of these trends.
I got accidentally locked out of my server when I connected over Starlink, whose IP geolocates to the US even though I was physically in Greece.
As practical advice, I would use a blocklist for commerce websites and an allowlist for infra/personal.
In the end I found another online store, paid $74, and got the device. So the better store lost the sale due to blocking non-US orders.
I don't know how much of a corner case this is.
I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.
Every country has (at the very least) a few bad actors, it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.
But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server went from about 1,500 bot scans per day down to 0.
The tradeoff is just too big to ignore.
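If you already serve HTTP from Go, a country allowlist at the application layer is only a few lines. A rough sketch using a local GeoLite2 database via github.com/oschwald/geoip2-golang; the country list and database path are assumptions for illustration, and dropping packets at the firewall, as other comments describe, avoids even the TLS handshake:

```go
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/oschwald/geoip2-golang"
)

// allowed is an illustrative allowlist; pick whatever fits your audience.
var allowed = map[string]bool{"GR": true, "DE": true, "FR": true, "NL": true}

func geoFence(db *geoip2.Reader, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			host = r.RemoteAddr
		}
		ip := net.ParseIP(host)
		if ip == nil {
			http.Error(w, "bad address", http.StatusBadRequest)
			return
		}
		rec, err := db.Country(ip)
		if err != nil || !allowed[rec.Country.IsoCode] {
			http.Error(w, "not available in your region", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	db, err := geoip2.Open("GeoLite2-Country.mmdb") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", geoFence(db, mux)))
}
```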
Because of that, everyone has to come up with their own clever solution.
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.
No client compromise required; it's a network abuse that gives you a good reputation if you use mobile data.
But yes, selling botnets made of compromised devices is also a thing.
Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.
If you're just blocking their IP, you could explain it, e.g. "we've had to block your access because your device, or a device on your network, appears to be sending abusive traffic to us. This sometimes happens because an app on your device is doing this without your knowledge".
If you've had to block a whole AS, you could name and shame them, e.g. "we've had to block your access because your internet provider, <name>, appears to be sending abusive traffic to us. consider letting them know, or moving providers".
Of course, this would almost certainly be commercially stupid. It would also only really work if everyone decided to do this, which realistically will never happen.
Hitting the backing git implementation directly within the request/response loop seems like a good way to burn CPU cycles and create unnecessary disk reads from .git folders, possibly killing your drives prematurely. Then stick a memcache in front and call it a day, no?
In the age of cheap and reliable SSDs (approaching memory read speeds), you should just be batch-rendering file pages from git commit hooks. Leverage external workers for rendering the largely static content. Web-hosted git code is more often read than written in these scenarios, so why hit the underlying git implementation or DB directly at all? Do that for POSTs, sure, but that's not what we're talking about (I think?)
Is this viable?
no
for many reasons
Anyway, test some scrapers and bots here [1] and let me know if they get through.
[1] - https://mirror.newsdump.org/bot_test.txt
My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)
Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs, and give it the old college try to correlate bots vs legit users with traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas in additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools are using older protocols and may look a lot like bots. If one could get people to clone/commit with SSH, there are additional protections that can be utilized at that layer.
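For the capture side, something as small as this is enough to get per-source SYN counts you can later correlate with access logs. A sketch using gopacket; the interface name, port, and reporting threshold are assumptions, and plain tcpdump plus a script works just as well:

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	// "eth0" is an assumption; use the interface your service listens on.
	handle, err := pcap.OpenLive("eth0", 1600, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Only capture initial SYNs (SYN set, ACK clear): new inbound connection attempts.
	filter := "tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn and dst port 443"
	if err := handle.SetBPFFilter(filter); err != nil {
		log.Fatal(err)
	}

	counts := map[string]int{}
	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		nl := packet.NetworkLayer()
		if nl == nil {
			continue
		}
		srcIP := nl.NetworkFlow().Src().String()
		counts[srcIP]++
		// Crude signal: lots of fresh connections from one source is a
		// pattern worth cross-checking against your access logs.
		if counts[srcIP]%100 == 0 {
			fmt.Printf("%s has opened %d connections\n", srcIP, counts[srcIP])
		}
	}
}
```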
And what is the effect?
Our public web dataset goes back to 2008, and is widely used by academia and startups.
- How often is that updated?
- How current is it at any point in time?
- Does it have historical / temporal access, i.e. can you check the history of a page à la the Internet Archive?
- it's a historical archive, the concept of "current" is hard to turn into a metric
- not only is our archive historical, it is included in the Internet Archive's wayback machine.
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.
Sell it to someone else.
>Maybe you should grow the f*ck up?
Not very "personal" when this form of psychopathy oozes through on every step and happens to be on full public display. But that's the intent, for both of youse. And silent capitulation is the minimal subscription.
But that's not happening. Eat sh*t.
You actively had to go looking for something about this blog poster to be angry about. It must be painful to live that way, needing to find fault in people so desperately. Therapy might do you some good my friend.
We’re talking about a furry, not a war criminal. “Silent capitulation” lol. You’re hilarious.
passively confirming
And the b*stiality-adjacent fetish defender continues to cry out in pain as he tries to hit me. As if there was any doubt about -this- being part and parcel of your ideology.
VNPT is a residential / mobile ISP, but they also run datacentres (e.g. [1]) and offer VPS, dedicated server rentals, etc.
And Bunny Communications (AS5065) is a pretty obvious 'residential' VPN / proxy provider trying to trick IP geolocation / reputation providers. Just look at the website [2], it's very low effort. They have a page literally called 'Sample page' up and the 'Blog' is all placeholder text, e.g. 'The Art of Drawing Readers In: Your attractive post title goes here'.
Another hint is that some of their upstreams are server-hosting companies rather than transit providers that a consumer ISP would use [3].
[1] https://vnpt.vn/doanh-nghiep/tu-van/vnpt-idc-data-center-gia... [2] https://bunnycommunications.com/ [3] https://bgp.tools/as/5065#upstreams
I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in the info per query space.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/ddos service providers, especially those with ambitions of becoming the ultimate internet middleman.
My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions of them. And there are still some now, weeks later...
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
> In August 2024, one of my roommates and partners messaged the apartment group chat, saying she noticed the internet was slow again at our place
Also, have you considered Captchas for first contact/rate-limit?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from headless one.
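On the rate-limit suggestion: for a small self-hosted forge, a per-IP token bucket in front of the expensive routes is cheap to add. A minimal sketch with golang.org/x/time/rate; the limits are illustrative, and as noted above it mostly deters lazy scrapers, not residential-proxy fleets:

```go
package main

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// visitors maps client IP -> token bucket. A real deployment would also
// evict idle entries; omitted here to keep the sketch short.
var (
	mu       sync.Mutex
	visitors = map[string]*rate.Limiter{}
)

func limiterFor(ip string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := visitors[ip]
	if !ok {
		// Illustrative numbers: 1 request/second sustained, bursts of 10.
		l = rate.NewLimiter(1, 10)
		visitors[ip] = l
	}
	return l
}

func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}
		if !limiterFor(ip).Allow() {
			http.Error(w, "slow down", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", rateLimit(mux))
}
```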
I have the same problem, but I decided to maintain ASN lists of known spammers [1] and combine that with my eBPF-based firewall that just drops their connections before they reach the kernel [2].
So my websites, wikis and other things are protected by the same firewall architecture, for which I can deploy a unified "blockmap", so to speak. Probably gonna open source the dashboard for maintaining that over the holidays, too, as I'm trying to make everything combinable in a plug-and-play sense for Go backends, similar to my markdown editor UI [3].
I also open sourced my LPM hashset map library, which can process large quantities of prefixes because it's way faster than LPM tries (read as: it takes less than 100ms to process all RIR and WHOIS data, compared to around an hour with LPM tries) [4].
[1] https://github.com/cookiengineer/antispam
[2] https://github.com/tholian-network/firewall
[3] https://github.com/cookiengineer/golocron
[4] https://github.com/cookiengineer/golpm