It Seems That OpenAI Is Scraping [Certificate Transparency] Logs

https://www.merklemap.com/search?query=ycombinator.com&page=...

18 days ago

3 replies

Shameless plug :)

chuckadams

18 days ago

1 reply

[delayed]

18 days ago

2 replies

Most of merklemap is stored on ZeroFS [0], which lets it scale IO resources quite crazily :)

[0] https://github.com/Barre/ZeroFS

rendaw

18 days ago

1 reply

How does ZeroFS handle consistency with writes?

17 days ago

1 reply

If you use 9P or NBD, it handles fsync as expected. With NFS, it's time-based: https://github.com/Barre/ZeroFS#9p-recommended-for-better-pe...

rendaw

17 days ago

Oh awesome! I was searching for consistency, but I guess durability is the word used for filesystems. Thanks!

jddj

18 days ago

> Watch Ubuntu boot from ZeroFS

Love it

nerdsniper

18 days ago

1 reply

Thank you!!! Needed exactly this at work.

18 days ago

Glad it was helpful!

rendaw

18 days ago

1 reply

The first page of results doesn't include ycombinator.com. I get `app.baby-ycombinator.com`, `ycombinator.comchat.com`, everything in between.

Substring doesn't seem like what I'd want in a subdomain search.

https://www.merklemap.com/search?query=*.ycombinator.com&pag...

17 days ago

> The first page of results doesn't include ycombinator.com. I get `app.baby-ycombinator.com`, `ycombinator.comchat.com`, everything in between.

That's the whole point.

> Substring doesn't seem like what I'd want in a subdomain search.

Well, if you want only subdomains search for *.ycombinator.com.
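To make the difference between the two queries concrete, here is a minimal Python sketch contrasting a plain substring match with a `*.ycombinator.com`-style subdomain match. The helper names and the `news.ycombinator.com` sample are hypothetical, and this is not a claim about how merklemap implements search; it only illustrates why the two queries return different results.

```python
def substring_match(query: str, hostname: str) -> bool:
    """Plain substring search: the query may appear anywhere in the hostname."""
    return query.lower() in hostname.lower()


def subdomain_match(apex: str, hostname: str) -> bool:
    """Match only names that sit under `apex`, like a `*.apex` query."""
    return hostname.lower().endswith("." + apex.lower())


hostnames = [
    "news.ycombinator.com",      # hypothetical sample, a real subdomain
    "app.baby-ycombinator.com",  # from the comment above
    "ycombinator.comchat.com",   # from the comment above
]

substring_hits = [h for h in hostnames if substring_match("ycombinator.com", h)]
subdomain_hits = [h for h in hostnames if subdomain_match("ycombinator.com", h)]
```

The substring query returns all three names, while the subdomain query keeps only the one that actually ends in `.ycombinator.com`.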

1vuio0pswjnm7

18 days ago

1 reply

Considering how it must be getting hammered what with the "AI" nonsense, it's interesting how crt.sh continues to remain usable, particularly the (limited) direct PostgreSQL db access

To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet

crt.sh seems to be more accessible at certain times of the day. I can remember when it had no such accessibility issues

miki123211

18 days ago

It is not usable.

It's the only website I know of where queries can just randomly fail for no reason, and they don't even have an automatic retry mechanism. Even the worst enterprise nightmares I've seen weren't this user unfriendly.

pavel_lishin (Author)

18 days ago

3 replies

What's the yawn for?

18 days ago

2 replies

It implies that this is boring and not article/post-worthy (which I agree with).

pavel_lishin (Author)

18 days ago

2 replies

> It implies that this is boring and not article/post-worthy (which I agree with).

It's certainly news to me, and presumably some others, that this exists.

18 days ago

1 reply

Which part is news?

If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could have been submitted in its place than "A public log intended for consumption exists, and a company is consuming that log".

If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.

dylan604

18 days ago

> I feel like there are significantly more interesting articles

if this is the article that introduces someone to the concept of certificate transparency, then there's nothing wrong with that. graciously, you followed through with links to what you consider more interesting. that is not something a lot of commenters do and just leave it as a snarky comment for someone being one of the lucky 10000 for the day.

18 days ago

1 reply

Yeah, this is the unspoken part about HTTPS: you enable it, you also announce to the entire world you're serving stuff from specific DNS names :).

(Which is why I hate that it's so hard to test things locally without having to get a domain and a certificate. I don't want to buy domain names and announce them publicly for the sake of a random script that needs to offer an HTTP endpoint.)

Modern security is introducing a lot of unexpected couplings into software systems, including coupling to political, social and physical reality, which is surprising if you think in terms of programs you write, which most likely shouldn't have any such relationships.

My favorite example of such unexpected coupling, whose failures are still regularly experienced by users, is wall clock time. If your program touches anything related to certificates, even indirectly, suddenly it's coupled to the actual real clock and your users better make sure their system time is in sync with the rest of the world, or else things will stop working.
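The clock coupling comes from the validity window every certificate carries. A minimal Python sketch of that check, assuming a hypothetical 90-day window and a hypothetical helper name (real TLS libraries do this as one step of a much larger validation):

```python
from datetime import datetime, timedelta, timezone


def cert_is_valid_at(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """A certificate is only acceptable while `now` is inside its validity window."""
    return not_before <= now <= not_after


# Hypothetical validity window, for illustration only.
not_before = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

correct_clock = datetime(2025, 2, 1, tzinfo=timezone.utc)
skewed_clock = datetime(2020, 2, 1, tzinfo=timezone.utc)  # e.g. a reset hardware clock

ok_with_correct_clock = cert_is_valid_at(not_before, not_after, correct_clock)
ok_with_skewed_clock = cert_is_valid_at(not_before, not_after, skewed_clock)
```

With the skewed clock the same, perfectly good certificate looks "not yet valid", which is the failure users end up seeing.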

imtringued

17 days ago

1 reply

You do know that /etc/hosts is a file you can edit, right? You hopefully also know that you can create your own certificate authority or self signed certificates and add them to your CA store.

16 days ago

> You do know that /etc/hosts is a file you can edit, right?

Yes. What does it have to do with HTTPS?

> You hopefully also know that you can create your own certificate authority or self signed certificates and add them to your CA store.

Sorta, kinda. Does it actually work with third-party apps? Does it work with mobile systems? If not, then it's not a valid solution, because it doesn't allow me to run my stuff in my own networks without interfacing with the global Internet and social and political systems backing its cryptographic infrastructure.

JumpCrisscross

18 days ago

1 reply

> Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting

Oh, I read this as indicating OpenAI may make a move into the security space.

prettyblocks

18 days ago

1 reply

Even if it's just for their internal security initiatives it would make sense given how massive they are. Threat hunting via cert monitoring is very effective.

ozim

18 days ago

But it isn’t. The guy posted the fact that they sent a bot for scraping.

That’s not the intended use for CT logs.

moralestapia

18 days ago

Because it's hardly news in its context.

xpe

18 days ago

Presumably this is well-known among people that already know about this. Some people are born ready. The rest of us should have already known this and not wasted the time of Those Who Already Know. But (yawn), alas, everybody should have already known this meta-commentary as well.

irishcoffee

18 days ago

4 replies

Everyone does it, it’s no big deal. “Yes officer I was speeding, so was everyone else!”

Gross.

edvinbesic

18 days ago

You are implying that a law is being broken, but isn't this the equivalent of going to city hall to pull public land records?

formerly_proven

18 days ago

The whole point of CT logs is to make issuance of certificates in the public WebPKI… public.

tsimionescu

18 days ago

The whole point of the CT logs is to be a public list of all domains which have TLS certs issued by the Web PKI. People are reading this list. I really don't see what is either surprising or in any way problematic in doing so.

18 days ago

The intended purpose of certificate transparency logs is to be viewed by others!

Perhaps you should save your "gross" judgement for when you better understand what's happening?

ekr____

18 days ago

3 replies

With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.

827a

18 days ago

1 reply

These exist for apex domains; the real use-case is subdomains.

ekr____

18 days ago

Sure, but the subdomains will be duplicated for the same reasons.

agwa

18 days ago

There's an extension to static-ct-api, currently implemented by Sunlight logs, that provides a feed of just SANs and CNs: https://github.com/FiloSottile/sunlight/blob/main/names-tile...

For example:

  curl https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip

(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)
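Since the feed does not deduplicate, a consumer that only wants new names can do that on its side. A minimal Python sketch, assuming the names have already been fetched and decoded into plain strings (the names-tile encoding itself is not handled here, and `new_names` is a hypothetical helper):

```python
from typing import Iterable, Iterator, Set


def new_names(names: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield each DNS name the first time it is observed, recording it in `seen`."""
    for name in names:
        # Normalize case and a trailing dot so trivial variants collapse.
        key = name.strip().lower().rstrip(".")
        if key and key not in seen:
            seen.add(key)
            yield key


seen: Set[str] = set()
batch_1 = list(new_names(["example.com", "www.example.com", "Example.com"], seen))
batch_2 = list(new_names(["www.example.com", "api.example.com"], seen))
```

An in-memory set is only workable for small experiments; at CT-log scale the `seen` state would need to live somewhere more compact or persistent.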

18 days ago

Merklemap offers that: https://www.merklemap.com/documentation/live-tail

raldi

18 days ago

1 reply

What reason?

electroly

18 days ago

The CT log tells you about new websites as soon as they come online. Good if you're intending to scrape the web.

1vuio0pswjnm7

18 days ago

2 replies

"... for exacty this reason."

Needs clarification. What reason

18 days ago

1 reply

Knowing what DNS names are actually used.

thesuitonym

18 days ago

1 reply

I don't really see how this is a flip-side. If you're putting something on the web, presumably you want it to be accessed by others, so this is actually a benefit.

If you didn't want others to access your service, maybe consider putting it in a private space.

aziaziazi

18 days ago

1 reply

There are usages of HTTPS that don’t overlap with "the (public) web".

tbrownaw

18 days ago

1 reply

All of the internal stuff at $employer uses a private CA. I suspect this is fairly universal at places that aren't super tiny.

https://github.com/joohoi/acme-dns

16 days ago

Problem is a lack of solutions that work at places that are tiny, such as a small company, or a household. This is yet another area of the computing ecosystem that forgets there are other use cases for computers than commerce.

1vuio0pswjnm7

18 days ago

s/exacty/exactly

"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"

The reason presented by the blog post is "for what I assume are things to scrape from"

Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs

After OpenAI "scrapes" then what does OpenAI do with the data (readers can guess)

But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties

Imagine in an age before the internet, telephone subscriber X sets up a new telephone line, the number is listed in a local telephone directory ("the phone book") and X immediately receives a phone call from telephone subscriber Z^2

X then writes an op-ed that suggests Z is using the phone book "for who to call"

This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling

Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.

Why does someone use these public resources. Online commenter: "To look up names and numbers"

Correct. But that alone is not very interesting. Why are they looking up the names and numbers

We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions

With the telephone, X could ask "Why are you calling?"

With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often to make conclusions without evidence

No one knows _everything_ that OpenAI does with the data it collects except OpenAI employees. The public only knows about what OpenAI chooses to share

Similarly no one knows what OpenAI will do with the data in the future

One could speculate that it's naive to think that, in the long term, data collected by "AI" companies will only be used for "AI"

2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion

3. Hence, for example, people who do port scans of the IPv4 address space will try to prevent the public from accessing them by restricting access to "researchers", etc. Getting access always involves contacting the people with the scans and explaining what the requester will do with the data. In other words, removing speculation

tech234a

18 days ago

The Web Archive also uses the Certificate Transparency logs, some websites that aren't linked anywhere end up in the Wayback Machine this way: https://archive.org/details/certificate-transparency?tab=abo...

kccqzy

18 days ago

Certificate transparency log is a Google project. They don’t need to scrape it. They host all the data.

jcims

18 days ago

4 replies

Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?

>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;

deathanatos

18 days ago

… the UA is malformed, even.

Makes me want to reconfigure my servers to just drop such traffic. If you can't be arsed to send a well-formed UA, I have doubts that you'll obey other conventions like robots.txt.

snowwrestler

18 days ago

Right. Crawler user agent strings in general tend to include all sorts of legacy stuff for compatibility.

This actually is a well-behaved crawler user agent because it identifies itself at the end.
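For anyone filtering logs, the self-identifying token can be pulled out of the string quoted above. A minimal Python sketch with a hypothetical helper name; because the header is trivially forged, a match only says what the client claims to be:

```python
import re
from typing import Optional

# User agent string as quoted upthread.
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; "
    "OAI-SearchBot/1.3; robots.txt;"
)


def searchbot_version(user_agent: str) -> Optional[str]:
    """Return the claimed OAI-SearchBot version, or None if the token is absent."""
    match = re.search(r"\bOAI-SearchBot/(\d+(?:\.\d+)*)", user_agent)
    return match.group(1) if match else None


claimed_version = searchbot_version(UA)
```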

Hrun0

18 days ago

Yes, it is very common to change your useragent for web scraping. Mainly because there are websites which will block you just based on that alone

benjojo12

18 days ago

The IP address this comes from is in an OpenAI search bot range:

> "ipv4Prefix": "74.7.175.128/25"

from https://openai.com/searchbot.json
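Checking an address against that prefix is straightforward with Python's standard `ipaddress` module. A minimal sketch: only the single prefix quoted above is hardcoded, the sample addresses are made up, and a real check would load every prefix from the published JSON, whose full structure is not shown here.

```python
import ipaddress

# The one prefix quoted in the comment above.
SEARCHBOT_PREFIX = ipaddress.ip_network("74.7.175.128/25")


def in_searchbot_range(addr: str) -> bool:
    """True if `addr` falls inside the quoted OpenAI search bot prefix."""
    return ipaddress.ip_address(addr) in SEARCHBOT_PREFIX


inside = in_searchbot_range("74.7.175.130")  # hypothetical client address
outside = in_searchbot_range("74.7.175.10")  # same /24, but below the /25
```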

throwaway613745

18 days ago

2 replies

OpenAI is scraping everything that is publicly accessible. Everything.

Aachen

18 days ago

1 reply

Yet they provide the user agents and IP address ranges which they scrape from, and say they respect robots.txt

I run a web server and so see a lot of scrapers, but OpenAI is one of the ones that appear to respect limits that you set. A lot of (if not most) others don't even have that ethics standard so I'd not say that "OpenAI scrapes everything they can access. Everything" without qualification, as that doesn't seem to be true, at least not until someone puts a file behind a robots deny page and finds that chatgpt (or another of openai's products) has knowledge of it
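What "respecting limits" means can be checked locally with Python's standard `urllib.robotparser`. A minimal sketch, assuming a hypothetical robots.txt and hypothetical example.com URLs; it shows what a compliant crawler should conclude, not what any particular crawler actually does:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off one directory for OAI-SearchBot.
robots_txt = """\
User-agent: OAI-SearchBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

can_fetch_private = parser.can_fetch(
    "OAI-SearchBot", "https://example.com/private/report.html"
)
can_fetch_public = parser.can_fetch(
    "OAI-SearchBot", "https://example.com/index.html"
)
```

The test described above (a file behind a deny rule that later shows up in a product) would be evidence that a crawler ignored the first answer.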

immibis

18 days ago

There's no evidence the barrage of residentially-proxied bot accesses hitting every public website have anything to do with OpenAI, but then again, there's also no evidence they don't.

warkdarrior

18 days ago

So do Google, Microsoft/Bing, Yandex, etc. How else would they make sure their search/chatbot/q&a products are up to date?

bombcar

18 days ago

3 replies

If you want to avoid this somewhat, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...).

Then all they know is the main domain, and you can somewhat hide in obscurity.

lysace

18 days ago

4 replies

Unfortunately they are a bit extra bothersome to automate (depending on your DNS provider/setup) because of the DNS CNAME-method validation requirement.

jsheard

18 days ago

2 replies

Yep, but next year they intend to launch a new challenge type which doesn't require write access to your DNS records every time it renews. Just add a public key to your DNS once and you're done.

8cvor6j844qw_d6

18 days ago

Great to hear, one less API key needed for the DNS records.

Ajedi32

18 days ago

Oh, sweet! I didn't know about this. I have no need of wildcard certs, but this will greatly simplify the process of issuing certificates for internal services behind my local firewall. No need to maintain an acme-dns server; just configure the ACME client, set the DNS record and you're done? Very nice.

Reventlov

18 days ago

1 reply

also you can use https://github.com/krtab/agnos if you don't have any api access

Ajedi32

18 days ago

I hadn't heard of Agnos before, interesting alternative to ACME-DNS.

Looking at the README, is the idea that the certificates get generated on the DNS server, not by the ACME client on each machine that needs a certificate? That seems like a confusing design choice to me.

cortesoft

18 days ago

If you are using a non-standard DNS provider that doesn’t have integration with certbot or cert-manager or whatever you are using, it is pretty easy to set up an acme-dns server to handle it

ls612

18 days ago

When I set up a wildcard cert for my homelab services it was easy to have Cloudflare give me an API token to do the DNS validation for LE.

vault

18 days ago

1 reply

Correct, that's what I did with caddy, which is now periodically renewing my wildcard certificate through a DNS-01 challenge.

8cvor6j844qw_d6

18 days ago

May I ask whether Caddy updates automatically with apt if you built a custom Caddy binary for the DNS provider plugin?

Also, may I ask which DNS provider you went with? The GitHub issues pages for some of the DNS provider plugins seem to suggest that some are more frequently maintained than others.

bityard

18 days ago

2 replies

Yep, but this comes with a tradeoff: all of your services now have a valid key/cert for your whole domain, significantly increasing the blast radius if one service is compromised.
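The blast radius follows from how wildcard names match. A simplified Python sketch of that rule, with a hypothetical helper name and hypothetical example.com hosts; real TLS libraries apply additional restrictions that are not modeled here:

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified check: `*` stands for exactly one leftmost DNS label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard label must be non-empty; the remaining labels must be equal.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


# Hypothetical services sharing one wildcard certificate.
services = ["git.example.com", "wiki.example.com", "mail.example.com"]
covered = [s for s in services if wildcard_matches("*.example.com", s)]
```

Every service in the list is covered, so a key leaked from any one of them can impersonate all the others.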

silverwind

18 days ago

1 reply

Not a problem if you have the cert on a shared load balancer, not on the services directly.

0127

18 days ago

This is what we do for development containers/hosts: put them behind *.dev.example.com, which lets us hide most testing instances behind a shared load balancer. And with a single wildcard CNAME, no info is leaked in CT logs or DNS. Said LB is firewalled, but why pay for extra traffic that's just going to be blocked?

nh2

18 days ago

1 reply

Is it technically possible to obtain a wildcard cert from LetsEncrypt, but then use OpenSSL / X.509 tooling to derive a restricted cert/key to be deployed on servers, which only works for specific domains under the wildcard?

alphager

16 days ago

xpe

18 days ago

1 reply

Meta: People are quite excellent at jumping to conclusions or assuming their POV is the only one.

Consider this simplified scenario.

    - X happened
    - Person P says "Ah, *X* happened."
    - Person Q *interprets* this in a particular way
      and says "Stop saying X is BAD!"
    - Person R, who already knows about X...
      (and indifferent to what others notice
       or might know or be interested in)
      ...says "(yawn)".

This failure mode is incredibly common. And preventable. What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions

47282847

18 days ago

1 reply

I agree with your analysis but try not to agree with your conclusion, purely for my own mental hygiene: I believe one can retrain the pattern matching of one’s brain for happier outcomes. If I let my brain judge this as a “failure” (judgment “it is wrong”), I will either get sad about it (judgment “… and I can’t change it”) or angry (judgment “… and I can do something about it”). In cases such as this I prefer to accept it as is, so I try to rewrite my brain rule to consider it a necessary part of life (judgment “true/good/correct”).

xpe

18 days ago

Ah, my conclusion isn't to blame the individuals. My assessment is to seek out better communication patterns, which is partly about "technology" and partly about culture (expectations). People could indeed learn not to act this way with a bit of subtle nudging, feedback, and mechanism design.

matt3210

18 days ago

3 replies

Your content is stolen for training the moment you put it up