Apple Silicon GPU Support in Mojo
Posted 3 months ago · Active 3 months ago
forum.modular.com · Tech story · High profile
Tone: calm, mixed · Debate: 70/100
Key topics
Mojo Programming Language
Apple Silicon GPU Support
Machine Learning
Python Ecosystem
The Mojo programming language has added support for Apple Silicon GPUs, sparking discussion about its potential to replace CUDA and its compatibility with the Python ecosystem.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 46m
Peak period: 38 comments (0-6h)
Avg / period: 9.9
Comment distribution: 69 data points
Based on 69 loaded comments
Key moments
- 01 Story posted: Sep 21, 2025 at 4:35 PM EDT (3 months ago)
- 02 First comment: Sep 21, 2025 at 5:21 PM EDT (46m after posting)
- 03 Peak activity: 38 comments in 0-6h, the hottest window of the conversation
- 04 Latest activity: Sep 25, 2025 at 6:08 AM EDT (3 months ago)
ID: 45326388 · Type: story · Last synced: 11/20/2025, 4:44:33 PM
But I just think Python is not the right language to try to turn into this super-optimized parallel processing system they are trying to build.
But their target market is Python programmers, I guess. So I'm not sure what a better option would be.
It would be interesting for them to develop their own language and make it all work. But "yet another programming language" is a tough sell.
Octave has a very nice syntax (it extends Matlab's syntax to provide the good parts of numpy broadcasting), and I assume Julia uses something very similar. I have wanted to work with Julia, but it's so frustrating to have to rebuild so much of the non-interesting stuff that already exists in Python. And back when I looked into it, there didn't seem to be an easy way to just plug Julia into Python things and incrementally move over. You couldn't swap out the numerics and keep the matplotlib code you already had; you had to go learn Julia's way of plotting and doing everything. It would have been nice if there were an incremental approach.
One thing I am on the fence about is indexing with '()' vs '[]'. In Matlab both function calls and indexing use '()', which is the Fortran style; the ambiguity lets you swap a function in for a matrix to reduce memory use, though that's all possible with '[]' in Python too, and it can sometimes be nice. Anyway, with something like Mojo you're working directly with indices again, and I haven't done that in a long time.
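(A minimal sketch of those two points in plain numpy; the LazyOuter class is just an illustration, not anything from the thread.)

    import numpy as np

    # Broadcasting: a (3, 1) column against a (4,) row expands to (3, 4)
    # without explicit loops or tiling.
    col = np.arange(3).reshape(3, 1)   # shape (3, 1)
    row = np.arange(4)                 # shape (4,)
    grid = col * 10 + row              # shape (3, 4)

    # Swapping a "function" in for a matrix while keeping [] indexing:
    # any object with __getitem__ can stand in for a stored array and
    # compute entries lazily instead of materializing them.
    class LazyOuter:
        def __getitem__(self, idx):
            i, j = idx
            return i * 10 + j

    lazy = LazyOuter()
    assert grid[2, 3] == lazy[2, 3] == 23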
Ultimately I don't think anyone would care if Mojo and Python just played nicely together with minimal friction. (Think: "hey, run this Mojo code on these numpy blobs".) If I can build GUIs, interact with the OS, parse files, and talk to the web in Python to prep data while simultaneously crunching in Mojo, that seems wonderful.
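(A sketch of that hand-off as it works today with a C shared library and ctypes; Mojo's Python interop aims to make this less clunky. The library name libkernel.so and the function scale_inplace are hypothetical.)

    import ctypes
    import numpy as np

    # Hypothetical compiled kernel, built from C (or, eventually, Mojo):
    #   void scale_inplace(double *buf, size_t n, double factor);
    lib = ctypes.CDLL("./libkernel.so")
    lib.scale_inplace.argtypes = [ctypes.POINTER(ctypes.c_double),
                                  ctypes.c_size_t, ctypes.c_double]
    lib.scale_inplace.restype = None

    # Python side preps the data with the usual ecosystem tools...
    data = np.ascontiguousarray(np.loadtxt("measurements.csv", delimiter=","),
                                dtype=np.float64)

    # ...then hands the raw buffer to the compiled code, no copy needed.
    lib.scale_inplace(data.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                      data.size, 2.0)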
I just hate that Julia requires immediately learning all the dumb crap that doesn't matter to me. Although LLMs seem very good at the dumb crap, so some sort of LLM translation for it could be another option.
In summary: all Mojo actually needs is to be better than numba- and cython-type things, with performance that at least matches C++, Fortran, and the GPU libraries. Once that happens, things like a Mojo version of pandas will be developed (and will replace things like polars).
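(For reference, the numba-style workflow that sentence measures Mojo against looks roughly like this; rolling_mean is just an illustrative kernel.)

    import numpy as np
    from numba import njit

    @njit  # compiles the loop to machine code on first call
    def rolling_mean(x, window):
        out = np.empty(x.size - window + 1)
        acc = x[:window].sum()
        out[0] = acc / window
        for i in range(1, out.size):
            acc += x[i + window - 1] - x[i - 1]
            out[i] = acc / window
        return out

    x = np.random.rand(1_000_000)
    smoothed = rolling_mean(x, 64)   # runs at C-like speed after the first call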
Guess why it wasn't a success, or why Julia is having adoption issues among the same community.
Or why Zig, which is basically Modula-2's type system, is getting more hype than Modula-2 ever has since 1978 (Modula-2 is even part of GCC nowadays).
Syntax and familiarity matters.
Even with JAX, PyTorch, HF Transformers, whatever you want to throw at it, the DX for cross-platform GPU programming that is compatible with the requirements of large language models specifically is extremely bad.
I think this may end up being the most important thing that Lattner has worked on in his life (and yes, I am aware of his other projects!).
I say the ship sailed in 2012 because that was around when it was decided to build Tensorflow around legacy data infrastructure at Google rather than developing something new, and the rest of the industry was hamstrung by that decision (along with the baffling declarative syntax of Tensorflow, and the requirement to use Blaze to build it precluding meaningful development outside of Google).
The industry was so desperate to get away from it that they collectively decided that downloading a single giant library with every model definition under the sun baked into it was the de facto solution to loading Torch models for serving, and today I would bet you that easily 90% of deep learning models in production revolve around either TensorRT, or a model being plucked from Huggingface’s giant library.
The decision to halfass machine learning was made a long time ago. A tool like Mojo might work at a place like Apple that works in a vacuum (and is lightyears behind the curve in ML as a result), but it just doesn’t work on Earth.
If there’s anyone that can do it, it’s Lattner, but I don’t think it can be done, because there’s no appetite for it nor is the talent out there. It’s enough of a struggle to get big boy ML engineers at Mag 7 companies to even use Python instead of letting Copilot write them a 500 line bash script. The quality of slop in libraries like sglang and verl is a testament to the futility of trying to reintroduce high quality software back into deep learning.
Are you talking about NVIDIA Hopper or any of the rest of the accelerators people care about these days? :). We're talking about a lot more performance and TCO at stake than traditional CPU compilers.
On the flipside, far from figuring out GPU efficiency, most people with huge jobs are network bottlenecked. And that’s where the problem arises: solutions for collective comms optimization tend to explode in complexity because, among other reasons, you now have to package entire orchestrators in your library somehow, which may fight with the orchestrators that actually launch the job.
Doing my best to keep it concise, but Hopper is like a good case study. I want to use Megatron! Suddenly you need FP8, which means the CXX11 ABI, which means recompiling Torch along with all those nifty toys like flash attention, flashinfer, vllm, whatever. Ray, jsonschema, Kafka and a dozen other things also need to match the same glibc and glibc++ versions. So using that as an example, suddenly my company needs C++ CICD pipelines, dependency management etc when we didn’t before. And I just spent three commas on these GPUs. And most likely, I haven’t made a dime on my LLMs, or autonomous vehicles, or weird cyborg slavebots.
So what all that boils down to is just that there’s a ton of inertia against moving to something new and better. And in this field in particular, it’s a very ugly, half-assed, messy inertia. It’s one thing to replace well-designed, well-maintained Java infra with Golang or something, but it’s quite another to try to replace some pile of shit deep learning library that your customers had to build a pile of shit on top of just to make it work, and all the while fifty college kids are working 16 hours a day to add even more in the next dev release, which will of course be wholly backwards and forwards incompatible.
But I really hope I’m wrong :)
I don't think it's gonna happen instantly, but it will happen, and Mojo/Modular are really the only language platform I see taking a coherent approach to it right now.
It's just such a massive, uphill, ugly moving target to try to run down. And I sit here thinking the same as many of these comments: on the one hand, I can't imagine we're still using Python 3 in 2035? 2050?? But on the other hand, I can't envision a path out of the mess that makes money, or at least keeps up the pretense that it will start to soon.
Nope. There's certainly room for another alternative that's more performant and portable than the rest, without the hacks needed to get there.
Maybe you caught the wrong ship, but Mojo is a speedboat.
> Mojo is never going to be anything but a vanity project.
Will come back in 10 years and we'll see if your comment needs to be studied like the one done for Dropbox.
I don't know if this is a language that will catch on, but I guarantee there will be another deep learning focused language that catches on in the future.
Metal.jl can be used to write GPU kernels in Julia to target an Apple Silicon GPU. Or you can use KernelAbstractions.jl to write once in a high-level CUDA-like language to target NVIDIA/AMD/Apple/Intel GPUs. For best performance, you'll want to take advantage of vendor-specific hardware, like Tensor Cores in CUDA or Unified Memory on Mac.
You also get an ever-expanding set of Julia GPU libraries. In my experience, these are more focused on the numerical side rather than ML.
If you want to compile an executable for an end user, that functionality was added in Julia 1.12, which hasn't been released yet. Early tests with the release candidate suggest that it works, but I would advise waiting to get a better developer experience.
Also, since samples in one channel need to be processed sequentially, does that mean mono audio processing won't benefit much from GPU programming? Or maybe you are dealing with spectral signal processing?
You need to find parallelism somewhere to make it worth it. This can be multiple independent channels/voices, one large simulation, one high quality simulation, a large neural network, solving PDEs, voxel simulation (https://www.youtube.com/watch?v=1bS7sHyfi58), additive synthesis, a multitude of FFTs...
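(A small numpy sketch of that "find parallelism somewhere" point, assuming a 64-channel spectral-processing workload; the same batched shape is what GPU FFT libraries such as CuPy's drop-in cupy.fft are built for.)

    import numpy as np

    rate, channels, frame = 48_000, 64, 1024

    # 64 channels of audio, 10 seconds each, chopped into 1024-sample frames.
    audio = np.random.randn(channels, 10 * rate).astype(np.float32)
    usable = (audio.shape[1] // frame) * frame
    frames = audio[:, :usable].reshape(channels, -1, frame)

    # One batched FFT over every channel and frame at once: a single
    # data-parallel operation instead of a sequential per-sample loop.
    spectra = np.fft.rfft(frames * np.hanning(frame), axis=-1)
    magnitude = np.abs(spectra)      # shape (channels, n_frames, frame // 2 + 1)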
The closest to that is Mojo, which borrows many of Rust's ideas and has built-in type safety, with the aim of being compatible with the existing Python ecosystem, which is great.
I've never heard a sound argument against Mojo and continue to see the weakest arguments that go along the lines of:
"I don't want to learn another language"
"It will never take off because we don't need another deep learning DSL"
"It's bad that a single company owns the language just like Google and Golang, Microsoft and C# and Apple and Swift".
Well I prefer tools that are extremely fast, save time and make lots of money, instead of spinning up hundreds of costly VMs as the solution. If Mojo excels in performance and reduces cost then I'm all for that, even better if it achieves Python compatibility.
By itself, that's not so bad. Plenty of "buy, don't build" choices out there.
However, every other would-be Mojo user also knows that. And they don't want to build on top of an ecosystem that's not fully open.
Why don't Mathematica/MATLAB have pytorch-style DL ecosystems? Because nobody in their right mind would contribute for free to a platform owned by Wolfram Research or Mathworks.
I'm hopeful that Modular can navigate this by opening up their stack.
You realize that CUDA isn't open source or planned to be open source in the future, right?
Meanwhile parts of Mojo are already open source with the rest expected to be opened up next year.
Mojo is planned to be both free and open source by the end of next year and it's not vendor locked to extremely expensive hardware.
Also, as of today anything CUDA works out of the box on Windows; Mojo might eventually work outside WSL, some day.
There is no disadvantage vs CUDA.
Not an expert, but while I wouldn't be surprised if Mojo ends up being a better language than Rust for the use case we're discussing, I'm not confident it will ever catch up to Rust's ecosystem and escape velocity as a sane general-purpose compiled systems language. It really does feel like Rust has replaced C++ for net-new buildouts that would previously have needed its power.
If one language were used for iOS apps and GPU programming, with some compatibility with Python, it would be pretty neat.
I do not think that is the same as VC-backed. Google/Microsoft/Apple need those languages for their ecosystem/infrastructure. The danger there is "just" vendor lock-in. With a VC-backed language there is also the possibility of enshittification.
If Mojo focuses on systems software ( and gets rid of exceptions - Chris, please <3 ) it will be a serious competitor to Rust and Go. It has all the performance and safety of Rust with a significantly easier learning curve.
We have a public roadmap and are charging hard on improving the language; check out https://docs.modular.com/mojo/roadmap/ to learn more.
-Chris
Some example motivations:
- Strange synchronization/coherency requirements
- Working with new hardware / new strategies that Nvidia&co haven't fine-tuned yet
- Just wanting to squeeze out some extra performance
Just the notion of replacing the parts of LLVM that force it to remain single threaded would be a major sea change for developer productivity.