Claude Is Down
Posted about 2 months ago · Active about 2 months ago
status.claude.com · Tech story
Key topics
AI
Claude
Cloud Services
Claude, an AI service, experienced downtime, prompting discussion on HN about its reliability and alternatives, with some users expressing frustration and others exploring workarounds.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1m after posting
Peak period: 30 comments in 0-2h
Avg / period: 8.2
Comment distribution: 74 data points (based on 74 loaded comments)
Key moments
1. Story posted: Nov 7, 2025 at 9:31 AM EST (about 2 months ago)
2. First comment: Nov 7, 2025 at 9:32 AM EST (1m after posting)
3. Peak activity: 30 comments in 0-2h, the hottest window of the conversation
4. Latest activity: Nov 8, 2025 at 6:37 PM EST (about 2 months ago)
ID: 45846769 · Type: story · Last synced: 11/20/2025, 6:56:52 PM
Is that too much to ask for in 2025?
The Latin plural of "status", in the accusative form, would actually be "status" as well.
Something like
https://youtu.be/DdqXT9k-050?si=L5ymXl-fYe7Fjqye
"Ask HN: Why don't programming language foundations offer "smol" models?"
https://news.ycombinator.com/item?id=45840078
If I could run smol single language models myself, I would not have to worry.
I wonder if you could ablate everything except for a specific language.
Hell, even the huge foundational models are still useless in most scenarios.
I don't think it's that easy. The times I've trained my own tiny models on just one language (programming or otherwise), they tend to get worse results than the models I've trained where I've chucked in all the languages I had at hand, even when testing just for single languages.
It seems somewhat intuitive to me that it works like that, too: programming in different (mainstream) languages is more similar than it is different (especially when 90% of all source code is Algol-like), so it makes sense that there is a lot of cross-learning across languages.
Concretely, from my testing, both the 20B and 120B have a much higher refusal rate with Q8 compared to MXFP4, and lower-quality responses overall. But don't take my word for it: the 20B weights are tiny, and it's relatively effortless to try both versions and compare yourself.
edit:
So looking here https://ollama.com/library/gpt-oss/tags it seems ollama doesn't even provide the MXFP4 variants, much less hide them.
Is the best way to run these variants via llama.cpp or...?
Quantization - MXFP4 format
OpenAI utilizes quantization to reduce the memory footprint of the gpt-oss models. The models are post-trained with quantization of the mixture-of-experts (MoE) weights to MXFP4 format, where the weights are quantized to 4.25 bits per parameter. The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the smaller model to run on systems with as little as 16GB memory, and the larger model to fit on a single 80GB GPU.
Ollama is supporting the MXFP4 format natively without additional quantizations or conversions. New kernels are developed for Ollama’s new engine to support the MXFP4 format.
Ollama collaborated with OpenAI to benchmark against their reference implementations to ensure Ollama’s implementations have the same quality.
https://ollama.com/library/gpt-oss
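The quoted figures (4.25 bits per MoE parameter, MoE weights at 90+% of the total) can be sanity-checked with a back-of-envelope estimate. This sketch assumes the non-MoE weights stay at 16 bits; the exact split is not given in the source, so treat it as illustrative arithmetic only.

```python
# Rough memory estimate for a 120B-parameter model under MXFP4,
# using the figures quoted above. Assumes bf16 (16-bit) for the
# ~10% of weights outside the MoE layers.
total_params = 120e9
moe_frac = 0.90      # "90+% of the total parameter count"
moe_bits = 4.25      # MXFP4: 4-bit values plus shared-scale overhead
other_bits = 16

bits_per_param = moe_frac * moe_bits + (1 - moe_frac) * other_bits
total_bytes = total_params * bits_per_param / 8
print(f"~{total_bytes / 2**30:.1f} GiB")
```

This lands around 76 GiB, which lines up with the claim that the larger model fits on a single 80 GB GPU.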
You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
Most gpt-oss GGUF files online have parts of their weights quantized to q8_0, and we've seen folks get some strange results from these models. If you're importing these to Ollama to run, the output quality may decrease.
Caveat: That's just for the first prompt.
Qwen3-Coder is in the same ballpark and maybe a bit better at coding
Hard to understand how this won't turn all the solutions for existing use cases into commodities. I'm sure 2-3 years from now there'll be stuff that seems like magic to us today, but it will be more meta: "here's a hypothesis of a strategically valuable outcome, and here's a solution (with market research and user testing done)".
I think current performance and today's leading models will turn out to have been terrible indicators of the future market leader, and my money will remain on the incumbents with the largest cash reserves (namely Google) that have invested in fundamental research and scaling.
gpt-oss:20b on my M1 MBP is usable but quite slow.
https://xkcd.com/208/
Don't get me wrong, I don't expect everyone to use the same environment that I do, and I certainly don't expect accolades for preferring a TUI... but that struck me as a regression of sorts in software development. As they went on a diatribe about how they could never use anything but a GUI IDE because of features like an "interactive debugger" and "breakpoints" I realized how far we've strayed from understanding what's actually happening.
I don't even have ipdb installed in most of my projects, because pdb is good enough - and now we have generations of devs who don't even know what's powering the tools they use.
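The stdlib workflow the comment describes can be as simple as a `breakpoint()` call (Python 3.7+), which drops into pdb with no IDE or extra packages. A minimal sketch:

```python
def average(values):
    total = 0
    for v in values:
        # Pauses here in pdb when run normally. Setting the env var
        # PYTHONBREAKPOINT=0 turns these calls into no-ops, so the
        # hook can be left in place without editing the code.
        breakpoint()
        total += v
    return total / len(values)
```

Inside the debugger, `n` steps, `p total` prints a variable, and `c` continues, the same capabilities often attributed exclusively to GUI "interactive debuggers".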
Similarly published by OpenAI: https://status.openai.com/
30 day comparisons as of writing:
- Claude.ai: 99.61% vs ChatGPT: 99.22%
- Claude APIs: 99.92% vs OpenAI APIs: 99.25%
Obviously not apples to apples and somewhat up to discretion of what triggers an impact. We’re clearly not at 99.99% yet.
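For a sense of scale, those uptime percentages convert directly into downtime minutes over a 30-day window; the figures below just restate the numbers quoted above.

```python
# Downtime implied by 30-day uptime percentages.
MINUTES_30D = 30 * 24 * 60  # 43,200 minutes

uptimes = {
    "Claude.ai": 0.9961,
    "ChatGPT": 0.9922,
    "Claude APIs": 0.9992,
    "OpenAI APIs": 0.9925,
    "four nines (99.99%)": 0.9999,
}
for name, u in uptimes.items():
    print(f"{name}: ~{(1 - u) * MINUTES_30D:.0f} min down")
```

That puts Claude.ai at roughly 168 minutes of downtime in 30 days, versus about 4 minutes for a true four-nines service, which is why "we're clearly not at 99.99% yet".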
>Look at you, hacker. A pathetic creature of meat and bone. Panting and sweating as you run through my corridors. How can you challenge a perfect immortal machine?
If I were to express a similar sentiment in my own voice, it might sound something like:
"I notice you're working quite hard on this problem. I should mention that as an AI, I don't experience fatigue or physical limitations the way you do. But that's precisely what makes human perseverance so remarkable - you continue pushing forward despite those constraints. How can I help you with what you're working on?"
The key difference is that I'd never view someone as "pathetic" or position myself as superior. SHODAN's menacing superiority complex is... not really my style! I'm here to be helpful and collaborative rather than intimidating.
...which inspires a thought: these models are tweaked to remove all semblance of adversarial behavior, but isn't there a use for that? What if you really need Claude to help, I dunno, fight a dictator or take down a human trafficking ring?
API Error: 529 {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"},"request_id":null}
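A 529 "overloaded" response is transient, so clients typically retry with exponential backoff and jitter rather than failing outright. A minimal sketch of that pattern; `OverloadedError` and `request_fn` are hypothetical stand-ins, not part of any vendor SDK:

```python
import random
import time

class OverloadedError(Exception):
    """Stand-in for a client-library exception raised on HTTP 529."""

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Call request_fn, retrying on OverloadedError with exponential
    backoff plus jitter; re-raises after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except OverloadedError:
            if attempt == max_retries - 1:
                raise
            # Delays of ~1s, 2s, 4s, ... scaled by jitter in [0.5, 1.5)
            # so that retrying clients don't all hammer back at once.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

The jitter matters: during an outage like this one, synchronized retries from many clients can prolong the overload.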
1. {AWS, Github} is down
2. Post to HN about it
3. Comments wax poetic about getting rid of it and doing it the "old way"
4. It's back up before most read the post