@hasanunlu9
After 8+ years on the Tesla Autopilot team and 3 years at Intel, I started @apexcompute to design a new architecture for efficient AI inference. For the past 9 months, we’ve been building our custom inference accelerator. Today we’re releasing Unified Engine v1. Last June we raised our seed round with @maxitechinc , DeepFin Research, @Soma_Capital and an incredible group of angel investors. In less than 9 months, we completed our RTL architecture and brought our first pre-silicon prototype to life on FPGA. Our architecture combines systolic array and vector processing in a single compute engine with multiple architectural optimizations, achieving very high FLOPs utilization. A single engine is super lean and it uses less than 90K LUTs and 1 MB Block RAM. It may also be one of the smallest logic-footprint compute engines developed so far. Our Unified Engine v1 supports: -matrix-matrix multiplication (~95% FLOPs utilization) -softmax (~90% FLOPs utilization) -broadcast and element-wise operations -RMSNorm / LayerNorm -block quantization/dequantization (fp4, int4) -multi-engine synchronization and many other operations. We even implemented memory-efficient attention similar to FlashAttention, reaching ~90% FLOP utilization. Full benchmarks and the software stack are available on our GitHub: https://t.co/KqTKbB2Inl We have basic compiler written in Python and it supports PyTorch tensors directly to easily test and transfer tensors between the accelerator and host using bf16, fp4 and int4 formats. Our FPGA prototype can already run LLM inference and outperform NVIDIA Jetson Orin Nano, even on a mid-tier FPGA setup (6.4x lower memory bandwidth, 18% slower clock speed at 4.5 Watts). Check the side-by-side comparison video below. Our GitHub includes low-level operator implementations, examples for tiled matrix multiplication, operation chaining, tensor parallelism, attention kernel and a full Gemma 3 1B model implementation. Many more models(Vision Transformers and VLA) are coming soon. Our accelerator IP is AXI-ready for deployment on any AMD(Xilinx) FPGA platform today. Even better, our two-engine prototype runs on an entry-level AMD(Xilinx) FPGA as a PCIe accelerator card. You can purchase it here https://t.co/8B9NOcueVu for $50 to experiment our pre-silicon prototype on your desktop PC or Raspberry Pi 5. We will be releasing hardware bitstream updates as the architecture gets new features. More to come soon! We are expanding our team and looking for compiler engineers and floating-point hardware design engineers. If you're interested, please send me a DM.