I got as far as

    uint8_t *data = (void *)(long)ctx->data;

before I stopped reading. I had to go look up struct xdp_md [1]; it is declared like this:

    struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 data_meta;
        /* ... further fields elided ... */
    };

So clearly the `data` member is already an integer. The sane way to cast it would be to cast to the actual desired destination type, rather than first to some other random integer and then to a `void` pointer. Like so:

    uint8_t * const data = (uint8_t *) ctx->data;

I added the `const` since the pointer value is not supposed to change, given that we got it from the incoming structure. Note that the `const` does not mean we can't write through `data` if we feel like it; it means the base pointer itself can't change, i.e. we can't "re-point" the pointer. This is often a nice property, of course.

[1]: https://elixir.bootlin.com/linux/v6.17/source/include/uapi/l...
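To make the distinction concrete, here is a tiny standalone sketch (mine, not from the post): `uint8_t * const` freezes the pointer itself but still lets you write through it, whereas `const uint8_t *` is the other way around.

    #include <stdint.h>

    void example(uint8_t *buf)
    {
        uint8_t * const data = buf; /* const pointer to mutable bytes */
        const uint8_t  *ro   = buf; /* mutable pointer to const bytes */

        data[0] = 0xff;   /* ok: the pointed-to bytes may be modified         */
        /* data = buf + 1;   error: the pointer itself cannot be re-pointed   */

        ro++;             /* ok: this pointer may move                        */
        /* ro[0] = 0xff;     error: writes through it are rejected            */
    }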
The code segment containing that cast looks like a no-op to me.
The rest of the post seems sane and well informed, so my theory is that this is a C / packet-filtering idiom I'm not aware of, since I work far from that field.
Otherwise I'm already freaked out by treating a 32 bit field as a pointer… even if you extend it to 64 bits first.
The cast from a 32 bit field to a 64 bit pointer is in fact an eBPF oddity. What's happening here is that the virtual machine gives us a fake memory address to use inside the program, and when the read actually needs to happen the kernel rewrites the virtual addresses to the real ones. I'm assuming this is just a byproduct of the memory separation that eBPF does to prevent filters from accidentally reading kernel memory.
Also, yes, the double cast is just to keep the compiler from throwing a warning.
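For context, this is roughly the standard XDP boilerplate that the `(void *)(long)` dance comes from (a minimal sketch, not the author's actual program); the verifier also insists on a bounds check against `data_end` before any packet byte may be dereferenced:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_inspect(struct xdp_md *ctx)
    {
        /* The 32-bit ctx fields are widened into pointers; the kernel
         * rewrites these placeholder "addresses" when the program runs. */
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        unsigned char *byte = data;

        /* The verifier rejects the program unless every access is
         * provably inside [data, data_end). */
        if ((void *)(byte + 1) > data_end)
            return XDP_PASS;

        /* Toy rule, just to exercise the now-permitted read. */
        return byte[0] == 0 ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";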
To the contrary - if someone "bypasses" Anubis by setting the user agent to Googlebot (or curl), it means it's effective. Every Anubis installation I've been involved with so far explicitly allowed curl. If you think it's counterproductive, you probably just don't understand why it's there in the first place.
(yes, there are also people who use it as an anti-AI statement, but that's not the reason why it's used on the most high-profile installations out there)
Very interesting, so we're about to come full circle?
Can't wait to have to mask myself as a (paying?) AI scraper to bypass annoying captchas when accessing "bot protected" websites...
In a month or two, I can be annoyed when I see some vibe-coded AI startup's script making five million requests a day to my work's website with this.
They'll have been ignoring the error responses:
{"All data is public and available for free download": "https://example.edu/very-large-001.zip"}
— a message we also write in the first line of every HTML page source.Then I will spend more time fighting this shit, and less time improving the public data system.
Having an open source version allows regular people to do scraping, not just those rich in capital.
Many of the best data services on the internet started with scraping; the README lists many of them.
This reminds me of how Stripe does user tracking for fraud detection (https://mtlynch.io/stripe-update/). I wonder if thermoptic could handle that.
Unless headless browsers become cheap enough for that base cost to go to effectively zero too, of course, but I trust web bloat to continue pushing out that intersection point for a bit more.
    $ md5 How\ I\ Block\ All\ 26\ Million\ Of\ Your\ Curl\ Requests.html
    MD5 (How I Block All 26 Million Of Your Curl Requests.html) = e114898baa410d15f0ff7f9f85cbcd9d

(downloaded with Safari)

    $ curl https://foxmoss.com/blog/packet-filtering/ | md5sum
    e114898baa410d15f0ff7f9f85cbcd9d  -
I'm aware of curl-impersonate https://github.com/lwthiker/curl-impersonate which works around these kinds of things (and makes working with Cloudflare much nicer), but serious scrapers use Chrome plus a USB keyboard/mouse gadget that you can ssh into, so there's literally no evidence of mechanical means.

Also: if you serve some Anubis code without actually running the Anubis script in the page, you'll get some answers back, so there's at least one Anubis simulator running on the Internet that doesn't bother to actually run the JavaScript it's given.
Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?
That doesn't matter, does it? Those 26 million requests could be going to actual users instead and 300 requests per second is non-trivial if the requests require backend activity. Before you know it you're spending most of your infra money on keeping other people's bots alive.
The number in the title is basically fantasy. (Not based on the author's real-life experience.) So is saying a DDoS is well distributed over 24 hours.
Keystroke dynamics and mouse movement analysis are pretty fun ways to tackle more advanced bots: https://research.roundtable.ai/proof-of-human/
But of course, it is a game of cat and mouse and there are ways to simulate it.
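A toy illustration of the timing side of this (nothing to do with any particular vendor's model; the timestamps and interpretation below are made up for the example): one of the simplest keystroke-dynamics signals is how regular the inter-key intervals are, since scripted input tends to have far less jitter than human typing.

    #include <math.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Coefficient of variation of the inter-keystroke intervals:
     * ~0 means suspiciously metronomic timing, larger means human-like jitter. */
    static double interval_cv(const double *key_times_ms, size_t n)
    {
        if (n < 3)
            return -1.0; /* not enough events to judge */

        size_t m = n - 1; /* number of intervals */
        double mean = 0.0, var = 0.0;

        for (size_t i = 1; i < n; i++)
            mean += key_times_ms[i] - key_times_ms[i - 1];
        mean /= m;

        for (size_t i = 1; i < n; i++) {
            double d = (key_times_ms[i] - key_times_ms[i - 1]) - mean;
            var += d * d;
        }
        var /= m;

        return sqrt(var) / mean;
    }

    int main(void)
    {
        /* Hypothetical timestamps: a script firing a key every 50 ms
         * versus a person typing the same number of keys. */
        double bot[]   = { 0, 50, 100, 150, 200, 250 };
        double human[] = { 0, 130, 210, 390, 470, 640 };

        printf("bot   cv=%.3f\n", interval_cv(bot, 6));   /* ~0.000         */
        printf("human cv=%.3f\n", interval_cv(human, 6)); /* noticeably > 0 */
        return 0;
    }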
It sucks so bad. If I solve the captchas by moving the mouse too quickly, Google asks me to try again. If I'm deliberately slow and erratic with my movements as I click on pictures, it almost always lets me through on the first click. Been manually A/B testing this for years and remains true today.
The simple tools still work remarkably well.
There are very very effective services for bot detection that still rely heavily on keyboard and mouse behavior.
There are many tools; see the links below.
Personally I think that running Selenium can be a bottleneck: it does not play nice, processes sometimes break, sometimes the system even needs a restart because of things getting blocked, it can be a memory hog, etc. That is my experience.
To be able to scale I think you have to have your own implementation. Serious scrapers complain about people using Selenium or its derivatives as noobs who will come back asking why page X does not work in their scraping setup.
https://github.com/lexiforest/curl_cffi
https://github.com/encode/httpx
Granted it wasn't a whole lot of money spent, but why waste money and resources so "claude" can scrape the same cgit repo over and over again?
>(1) root@gentoo-server ~ # grep 'claude' /var/log/lighttpd/access.log | wc -l
>1099323

All of it is fairly easy to fake. JavaScript is the only thing that poses any challenge, and what challenge it poses is in how you want to do it with minimal performance impact. The simple truth is that a motivated adversary can interrogate and match every single minor behavior of the browser to be bit-perfect, and there is nothing anyone can do about it - except for TPM attestations, which also require a fully jailed OS environment in order to control the data flow to the TPM.
Even the attestation pathway can probably be defeated, either through the mandated(?) accessibility controls or by going for more extreme measures and putting the devices to work in a farm.
If that means blocking foreign access, the problem is solved anyway.
If laws appear, the entire planet - all nations - must agree and actually prosecute under that law. I cannot imagine that happening; it hasn't happened with anything compute-related yet.
So it'll just move off shore, and people will buy the resulting data.
Also is your nick and response sarcasm?
Isn't this how we get the EU's digital ID nonsense? Otherwise, how do you hold an anon user behind 5 proxies accountable? What if it's from a foreign country?
Imo use JA3/JA4 as a signal and block on src IP. Don't show your cards. The JA4 extensions that use network vs. HTTP/TLS latency are also pretty elite for identifying folks who are proxying.
Blocking on source IP is tricky, because that frequently means blocking or rate-limiting thousands of IPs. If you're fine with just blocking entire subnets or all of AWS, I'd agree that it's probably better.
It really depends on who your audience is and who the bad actors are. For many of us the bad actors are AI companies, and they don't seem to randomize their TLS extensions. Frankly many of them aren't that clever when it comes to building scrapers, which is exactly the problem.
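For anyone unfamiliar with what the signal actually is: a JA3 fingerprint is just an MD5 of a canonical string built from ClientHello fields, which is why unrandomized clients like plain curl are so easy to pick out. A rough sketch of assembling that string (the helper and sample values here are made up for illustration; a real matcher parses the hello off the wire, skips GREASE values, and hashes the result):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Append a dash-separated decimal list, the JA3 convention for
     * ciphers, extensions, curves and point formats. */
    static void append_u16_list(char *out, size_t cap,
                                const uint16_t *vals, size_t n)
    {
        char buf[10];
        for (size_t i = 0; i < n; i++) {
            snprintf(buf, sizeof buf, "%s%u", i ? "-" : "", (unsigned)vals[i]);
            strncat(out, buf, cap - strlen(out) - 1);
        }
    }

    int main(void)
    {
        /* Example values loosely resembling an OpenSSL-based client. */
        uint16_t version      = 771;                    /* TLS 1.2 = 0x0303 */
        uint16_t ciphers[]    = { 4865, 4866, 4867 };   /* TLS 1.3 suites   */
        uint16_t extensions[] = { 0, 11, 10, 35, 16, 23 };
        uint16_t curves[]     = { 29, 23, 24 };
        uint16_t point_fmts[] = { 0 };

        /* JA3 string layout: "Version,Ciphers,Extensions,Curves,PointFormats" */
        char ja3[512];
        snprintf(ja3, sizeof ja3, "%u,", (unsigned)version);
        append_u16_list(ja3, sizeof ja3, ciphers, 3);
        strncat(ja3, ",", sizeof ja3 - strlen(ja3) - 1);
        append_u16_list(ja3, sizeof ja3, extensions, 6);
        strncat(ja3, ",", sizeof ja3 - strlen(ja3) - 1);
        append_u16_list(ja3, sizeof ja3, curves, 3);
        strncat(ja3, ",", sizeof ja3 - strlen(ja3) - 1);
        append_u16_list(ja3, sizeof ja3, point_fmts, 1);

        /* md5(ja3) is what actually gets compared against a blocklist. */
        printf("%s\n", ja3);
        return 0;
    }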
    ~$ curl https://foxmoss.com/.git/config
    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
    [remote "origin"]
        url = https://github.com/FoxMoss/PersonalWebsite
        fetch = +refs/heads/*:refs/remotes/origin/*
    [branch "master"]
        remote = origin
        merge = refs/heads/master
The author is probably using git to push the content to the hosting server as an rsync alternative, but there does not seem to be much leaked information, apart from the url of the private repository.
You can wget the whole .git folder and look through the commit history, so if at any point something was pushed that should not have been, it's available.
I wouldn't call it "much harder". All you need to bypass the signature is to choose random ciphers (list at https://curl.se/docs/ssl-ciphers.html) and mash them up in a random order, separated by colons, in curl's --ciphers option. If you pick 15 different ciphers in a random order, there are over a trillion possible signatures, which he couldn't block. For example, this works:

    $ curl --ciphers AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA:... https://foxmoss.com/blog/packet-filtering/

But, yes, most bots don't bother randomizing ciphers, so most will be blocked.

Most of the abusive scraping is much lower-hanging fruit. It is easy to identify the bots and relate that back to ASNs. You can then block all of Huawei Cloud and the other usual suspects. Many networks aren't worth allowing at this point.
For the rest, the standard advice about performant sites applies.
Edit: Nevermind, I see part of the default config is allowing Googlebot, so this is literally intended. Seems like people who criticize Anubis often don't understand what the opinionated default config is supposed to accomplish (only punish bots/scrapers pretending to be real browsers).