TinyTinyTPU
(github.com)
I suppose, in the tradition of Bayesian inference, VAEs and the like are still common though.
A lot of silicon on a GPU is dedicated to upscaling and matrix multiply.
Ultimately a GPU's main use is multimedia- and graphics-focused.
See all the miners that used to do GPU-based mining... or the other niche markets where eventually the cost of a custom ASIC becomes too attractive to ignore, even if you as a consumer have to handle a few years of growing pains.
Really excited about the Rubin CPX / Feynman generations; let's see what the LPU does to the inference stack.
import torch
import torch.nn as nn

from tpu_compiler import TPUCompiler, TPURuntime

# Define a small PyTorch model.
class Custom(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 2, bias=False)

    def forward(self, x):
        x = self.layer1(x)
        x = torch.relu(x)
        x = self.layer2(x)
        return x

# Train your model as usual (train_model and your_data are placeholders).
model = train_model(your_data)

# Compile to the tiny tiny TPU format.
compiler = TPUCompiler()
compiled = compiler.compile(model)

# Run and enjoy :) -- `tpu` here is the handle to your connected tiny tiny TPU device.
runtime = TPURuntime(tpu)
result = runtime.inference(compiled, input_data)
Will update soon with some better documentation, but hopefully this will get you started!
- Alan and Abiral
I suspect GPU inference will come to an end soon, as it will likely be wildly inefficient compared to purpose-built transformer chips. All those Nvidia GPU-based servers may become obsolete should transformer ASICs become mainstream. GPU bitcoin mining is just an absolute waste of money (cost of electricity) now. I believe the same will be true for GPU-based inference soon. The hundreds of billions of dollars being invested in GPU-based inference seem like an extremely risky bet that transformer ASICs won't happen, although Google has already widely deployed its own TPUs.
When people say things like this I always wonder if they really think they're smarter than all of the people at Nvidia lolol
> Google Gemini already uses their own TPU chips
Google has been using TPUs in prod for like a decade.
Is there any public info about % inference on custom vs GPU, for these companies?
https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...
Yeah. Even for Bitcoin mining GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly. 2) GPUs at the time had poor binary math support, which hampered their performance; whereas an FPGA is just one giant binary math machine.
Nice username btw.
You would need to re-implement a general-purpose CPU to beat it; at least, that was the idea behind RandomX.
I mean, I have worked with FPGAs that outperformed H200s on Llama3-class models, and that was a while and a half ago.
B200 can do 10 peta ops at fp8, theoretically.
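For scale (rough numbers of mine, not the parent's): at that peak, a dense 70B model would be compute-limited to roughly 70k tokens/s, assuming the usual ~2 FLOPs per parameter per generated token and ignoring memory bandwidth, which is what actually dominates decode in practice. A quick sketch of the arithmetic:

# Back-of-envelope, illustrative only: compute-bound token rate for dense
# decoding of a 70B-parameter model on a chip with 10 PFLOP/s of FP8.
# Assumes ~2 FLOPs per parameter per token and ignores memory bandwidth.
peak_flops = 10e15              # 10 peta-ops/s at FP8, theoretical peak
params = 70e9                   # 70B-parameter model
flops_per_token = 2 * params    # standard forward-pass approximation
print(f"compute-bound ceiling: ~{peak_flops / flops_per_token:,.0f} tokens/s")  # ~71,000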
> I have worked with FPGAs that outperform H200s in Llama3-class models a while and a half ago
You say FPGAs won't get dedicated logic for ML, then you say they did.
Why does it matter whether the matrix multiplication units inside the AI Engine are a systolic array or not? The multipliers support 512 bit inputs which means 4x8 times 8x4 for bfloat16 with one multiplication per cycle and bigger multiplications with smaller data types. Since it is a VLIW processor, it is much easier to achieve full utilisation of the matrix multiplication units, because you can run loads, stores and process tiles all simultaneously in the same cycle.
The only thing that might be a challenge is arranging the communication between the AI Engines, but even that should be blatantly obvious. If you are doing matrix multiplication, you should be using the entire array in exactly the pattern you think they should be using internally.
Who knows, maybe there is a way to implement flash attention like that too.
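To make the tile shapes concrete, here is a minimal NumPy sketch (my illustration, not AMD's API): the full matmul is decomposed into 4x4 output tiles, each accumulated from a stream of 4x8 by 8x4 tile products, the per-cycle bfloat16 shape described above.

import numpy as np

# Illustrative only: decompose C = A @ B into 4x4 output tiles, each built by
# accumulating 4x8 @ 8x4 tile products -- the per-cycle bfloat16 multiplier
# shape attributed to the AI Engine in the comment above.
M, K, N = 16, 32, 16                 # toy sizes, multiples of the tile shape
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

TM, TK, TN = 4, 8, 4                 # 4x8 times 8x4 -> one 4x4 result per "cycle"
for i in range(0, M, TM):
    for j in range(0, N, TN):
        acc = np.zeros((TM, TN), dtype=np.float32)
        for k in range(0, K, TK):
            acc += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]   # one tile product
        C[i:i+TM, j:j+TN] = acc

assert np.allclose(C, A @ B, atol=1e-4)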
That being said, Versal AIE for ML has been a terrible failure. The reasons why are complicated. One reason is that the memory hierarchy for SRAM is not a unified pool: it's partitioned into tiles and can't be accessed by all cores. Additionally, this SRAM is only accessible via DMA engines, not directly from the cores. Thirdly, the datapaths feeding the VLIW cores are statically set and require a software reconfiguration to change at runtime, which is slow. Programming this thing makes the Cell processor look like a cakewalk. You have to program DMA engines, program hundreds of VLIW cores, and explicitly set up the on-chip network fabric. I could go on.
Anyway, my point is FPGAs aren't getting ML slices. Some FPGAs do have a completely separate thing that can do ML, but what is shipped is terrible. Hopefully that makes sense.
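A toy contrast of the two memory models (purely hypothetical Python, no relation to AMD's real toolchain; every class and method name here is made up): with a unified scratchpad any consumer reads a buffer directly, whereas with partitioned, DMA-only tile SRAM the same data has to be explicitly staged into every tile that needs it, and that choreography is the programmer's problem.

# Hypothetical toy model, nothing to do with AMD's actual APIs.
class UnifiedSRAM:
    def __init__(self):
        self.buffers = {}
    def write(self, name, data):
        self.buffers[name] = data
    def read(self, name):
        return self.buffers[name]          # any core reads it directly

class Tile:
    """One AIE-style tile: private SRAM, reachable only through a DMA engine."""
    def __init__(self):
        self.local_sram = {}
    def dma_in(self, name, data):
        # Explicit, software-programmed transfer; the core cannot simply
        # load from a neighbour's memory.
        self.local_sram[name] = list(data)
    def compute(self, name):
        return sum(self.local_sram[name])  # the core only sees its local SRAM

# Unified pool: one write, every consumer reads it.
pool = UnifiedSRAM()
pool.write("acts", [1, 2, 3, 4])
print(pool.read("acts"))

# Partitioned tiles: the same activations must be DMA'd into every tile that needs them.
tiles = [Tile() for _ in range(4)]
for t in tiles:
    t.dma_in("acts", [1, 2, 3, 4])
print([t.compute("acts") for t in tiles])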
I think you spoke too soon about their failure; soon they will be much easier to program [1].
Interestingly, NVIDIA GPUs are now also moving to a tile-based programming model that targets portability for NVIDIA Tensor Cores [2]. There have recently been discussions on the topic at HN [3].
[1] Developing a BLAS Library for the AMD AI Engine [pdf]:
https://uni.tlaan.nl/thesis/msc_thesis_tristan_laan_aieblas....
[2] NVIDIA CUDA Tile:
https://developer.nvidia.com/cuda/tile
[3] CUDA Tile Open Sourced (103 comments):
ASIC transformers won't happen (defined as: a chip with a single instruction to do SDPA, from anything that is not broadly marketed as a GPU, won't have annualized sales of more than $3B). Mark my words. I am happy to take a bet on longbets.org with anyone on this for $1000, and my share will go to the PSF.
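For readers keeping score on the bet: SDPA here is scaled dot-product attention, i.e. the fused op such a single hardware instruction would have to execute. A minimal PyTorch sketch of what that computes, softmax(QK^T / sqrt(d_k)) V:

import math
import torch

def sdpa(q, k, v):
    # Scaled dot-product attention: softmax(Q @ K^T / sqrt(d_k)) @ V.
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = sdpa(q, k, v)

# Matches PyTorch's built-in, which today dispatches to fused kernels,
# not a single hardware instruction.
ref = torch.nn.functional.scaled_dot_product_attention(q, k, v)
assert torch.allclose(out, ref, atol=1e-5)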
A special-purpose transformer inference ASIC would be like Etched's Sohu chip.
> A TPU is an application-specific integrated circuit (ASIC) designed by Google for neural networks.