Your curated collection of saved posts and media

Showing 24 posts · last 30 days · by score
sudoingX (@sudoingX) · Mar 07, 2026 · 2d ago · 🆔 74402115

spent the entire day testing Qwopus (Claude 4.6 Opus distilled into Qwen 3.5 27B) on a single RTX 3090 through Claude Code. this is my new favourite to host locally. no jinja crashes. thinking mode works natively. 29-35 tok/s. 16.5 GB. the harness matches the distillation source and you can feel it. the model doesn't fight the agent.

my flags: llama-server -m Qwopus-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0

if you want raw speed, base Qwen 3.5 MoE still wins at 112 tok/s. but for autonomous coding where the model needs to think, wait for tool outputs, and self-correct without stalling, Qwopus on Claude Code is the cleanest setup i've found on this card.

i want to see what everyone else is running. drop your GPU, model, harness, flags, and tok/s below. doesn't matter if it's a 3060 or a 4090, nvidia or amd. configs help everyone. let's push these cards to their ceilings. let's make this thread the reference.

🖼️ 1 media attachment
omarsar0 (@omarsar0) · Mar 08, 2026 · 19h ago · 🆔 05872435

Pay attention to this one if you are building terminal-based coding agents.

OpenDev is an 81-page paper covering scaffolding, harness design, context engineering, and hard-won lessons from building CLI coding agents. It introduces a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction.

The industry is shifting from IDE plugins to terminal-native agents. Claude Code, Codex CLI, and others have proven the model works. This paper formalizes the design patterns that make these systems reliable, covering topics like event-driven system reminders to counteract instruction fade-out, automated memory across sessions, and strict safety controls for autonomous operation.

Paper: https://t.co/tpAZFaSnog Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

🖼️ 1 media attachment
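The workload-specialized routing pattern the paper describes can be sketched in a few lines. Everything below (the categories, the model names, the keyword heuristic) is illustrative on my part, not taken from the paper:

```python
# Sketch of workload-specialized model routing: classify the request
# cheaply, then dispatch to a model tier. Categories, model names, and the
# keyword heuristic are all illustrative, not from the paper.

ROUTES = {
    "plan": "large-reasoning-model",   # planning/design: strongest model
    "search": "small-fast-model",      # grep/glob-style lookups: cheap model
    "edit": "small-fast-model",        # mechanical edits: cheap model
}

def classify(request: str) -> str:
    # Toy heuristic; a real router might use a small classifier model here.
    text = request.lower()
    if any(w in text for w in ("plan", "design", "architecture")):
        return "plan"
    if any(w in text for w in ("find", "where", "search")):
        return "search"
    return "edit"

def route(request: str) -> str:
    return ROUTES[classify(request)]
```

The point of the pattern is that most agent turns are mechanical and don't need the expensive model, so a cheap upfront classification pays for itself.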
classiclarryd (@classiclarryd) · Mar 08, 2026 · 1d ago · 🆔 18908884

New NanoGPT Speedrun WR at 86.1 (-0.7s), by replacing partitioned hyperconnections with a simple idea: feed the exact same context vector into the last 3 attn layers, so late stage attn doesn't get polluted by prediction MLPs. Opinion: AI research agents are handicapped until they have a mech-interp toolkit. Many sub-3min architecture improvements came from analyzing weights. https://t.co/o9WeUF7PHl

🖼️ 2 media attachments
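The change described above reduces to a small tweak in the forward pass; a toy sketch with dummy layer callables (not the actual speedrun code):

```python
# Toy sketch of the tweak described above: the last `shared_last` attention
# layers all read the same frozen context vector, so late-stage attention
# is not polluted by the outputs of the prediction MLPs in between.
# Dummy callables stand in for real attention/MLP layers; this is not the
# actual speedrun code.

def forward(x, attn_layers, mlp_layers, shared_last=3):
    n = len(attn_layers)
    frozen = None
    for i in range(n):
        if i == n - shared_last:
            frozen = x                          # snapshot the context once
        attn_in = frozen if i >= n - shared_last else x
        x = x + attn_layers[i](attn_in)         # late attn sees frozen context
        x = x + mlp_layers[i](x)                # MLPs still update the live stream
    return x
```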
dair_ai (@dair_ai) · Mar 07, 2026 · 1d ago · 🆔 60889466

New research: FlashAttention-4

FlashAttention-4 co-designs algorithms and kernel pipelines for Blackwell GPUs, where tensor core throughput doubles but memory bandwidth and exponential units scale more slowly. The techniques include fully asynchronous MMA operations, software-emulated exponential rescaling, and leveraging tensor memory to reduce shared memory traffic.

It achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs with BF16, reaching 1613 TFLOPs/s at 71% utilization. It is implemented entirely in Python via CuTe-DSL, with 20-30x faster compile times compared to C++ templates.

Paper: https://t.co/wBiS51m8Bm Learn to build effective AI agents in our academy: https://t.co/LRnpZN7deE

🖼️ 2 media attachments
karpathy (@karpathy) · Mar 06, 2026 · 3d ago · 🆔 22254014 · ⭐ 0.42

@BrownCoyoteStd The code to train a GPT is only ~1,000 lines of code. In the case of GPT training the success criteria is quite simple: reach the lowest possible loss (meaning that your GPT is predicting the next token well), but don't regress running time, keep memory in check, and keep a sense of simplicity/aesthetics (don't bloat the code too much to get a small gain).

Because 1) the criteria is objective and 2) because AI agents can now write code quite well, instead of having a human think up experiment ideas and try them out one by one (e.g. my entire PhD basically), you just get the AI to do the whole thing. My prompt ("AI source code") in this example is just ~120 lines of markdown document explaining the thing to the AI.

The AI of today is very good at implementing ideas, but a lot less good at coming up with creative ones. So honestly, it's a lot closer to hyperparameter tuning right now than coming up with new/novel research, but 1) i didn't super tune the prompts yet, maybe you can just try to ask and 2) it's clear what the trajectory of this is as the AI capability improves - it's AI improving the next version of itself autonomously, maybe with human researchers throwing some ideas into the mix once in a while.

tri_dao (@tri_dao) · Mar 05, 2026 · 4d ago · 🆔 51263082 · ⭐ 0.42

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so crazy fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax to avoid 90% of softmax rescaling, and 2-CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
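To make the "exponential emulation with polynomials" idea concrete, here is a scalar range-reduction sketch of 2^x. The degree-3 coefficients below are a rough fit of my own for illustration, not FA4's actual constants:

```python
import math

# Illustrative range-reduced polynomial emulation of 2^x, the kind of trick
# used to sidestep a slow hardware exponential unit: split x into integer
# and fractional parts, approximate 2^f on [0, 1) with a polynomial, then
# scale by 2^n exactly. Coefficients are a rough fit, not FA4's constants.
C0, C1, C2, C3 = 1.0, 0.6951, 0.2262, 0.0789

def exp2_poly(x: float) -> float:
    n = math.floor(x)                        # integer part: exact scaling
    f = x - n                                # fractional part in [0, 1)
    p = C0 + f * (C1 + f * (C2 + f * C3))    # polynomial approximates 2^f
    return math.ldexp(p, int(n))             # multiply by 2^n exactly
```

The range reduction keeps the polynomial's job easy (only [0, 1) must be fit), which is why a low degree already lands within a fraction of a percent of the true value.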

HuggingPapers (@HuggingPapers) · Mar 04, 2026 · 5d ago · 🆔 28876865

SWE-rebench V2

A language-agnostic pipeline that automatically harvests 32,000+ executable real-world software engineering tasks across 20 programming languages. Built for large-scale RL training of code agents with reproducible Docker environments. https://t.co/JJ0vLH5N7B

🖼️ 1 media attachment
πŸ”_akhaliq retweeted
H
DailyPapers
@HuggingPapers
πŸ“…
Mar 04, 2026
5d ago
πŸ†”28876865
⭐0.34

SWE-rebench V2 A language-agnostic pipeline that automatically harvests 32,000+ executable real-world software engineering tasks across 20 programming languages. Built for large-scale RL training of code agents with reproducible Docker environments. https://t.co/JJ0vLH5N7B

❀️53
likes
πŸ”6
retweets
rasbt (@rasbt) · Mar 03, 2026 · 5d ago · 🆔 72425941

A small Qwen3.5 from-scratch reimplementation for edu purposes: https://t.co/OnupgeE55l (probably the best "small" LLM today for on-device tinkering) https://t.co/LwyF8x6sle

🖼️ 2 media attachments
karpathy (@karpathy) · Mar 05, 2026 · 3d ago · 🆔 47630069

nanochat now trains a GPT-2 capability model in just 2 hours on a single 8xH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to NVIDIA ClimbMix (nice work NVIDIA!). I had tried Olmo, FineWeb, DCLM, which all led to regressions; ClimbMix worked really well out of the box (to the point that I am slightly suspicious about goodharting, though reading the paper it seems ~ok).

In other news, after trying a few approaches for how to set things up, I now have AI agents iterating on nanochat automatically, so I'll just leave this running for a while, go relax a bit and enjoy the feeling of post-agi :). Visualized here as an example: 110 changes made over the last ~12 hours, bringing the validation loss so far from 0.862415 down to 0.858039 for a d12 model, at no cost to wall clock time. The agent works on a feature branch, tries out ideas, merges them when they work, and iterates. Amusingly, over the last ~2 weeks I almost feel like I've iterated more on the "meta-setup", where I optimize and tune the agent flows, than on the nanochat repo directly.

🖼️ 1 media attachment
karpathy (@karpathy) · Feb 26, 2026 · 10d ago · 🆔 73286087 · ⭐ 0.34

@industriaalist love it! :) nanogpt/nanochat were explicitly designed to be the most forkable repo, i love the different directions people take them in!

orhundev (@orhundev) · Mar 01, 2026 · 7d ago · 🆔 83973153

New TUI dropped for managing LLM traffic and GPU resources 🔥
🌀 ollamaMQ - Async message queue proxy for Ollama
💯 Per-user queues, fair-share scheduling, OpenAI-compatible endpoints, streaming
🦀 Written in Rust & built with @ratatui_rs
⭐ GitHub: https://t.co/0UthA7KPIg
#rustlang #ratatui #tui #gpu #llm #ollama #backend #proxy #terminal

🖼️ 2 media attachments
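The "per-user queues, fair-share scheduling" feature boils down to round-robin service across user FIFOs. A minimal sketch in Python (ollamaMQ itself is written in Rust; this is not its code, just the idea):

```python
from collections import OrderedDict, deque

# Minimal fair-share sketch: one FIFO per user, served round-robin so a
# single heavy user cannot monopolize the GPU. Illustrative only.
class FairShareQueue:
    def __init__(self):
        self.queues = OrderedDict()   # user -> deque of requests

    def submit(self, user, request):
        self.queues.setdefault(user, deque()).append(request)

    def next(self):
        # Serve the first user in line, then rotate them to the back.
        for user in list(self.queues):
            q = self.queues[user]
            if q:
                req = q.popleft()
                self.queues.move_to_end(user)
                return user, req
        return None                   # all queues empty
```

Even if one user enqueues many requests at once, other users' single requests are interleaved rather than stuck behind the backlog.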
Reza_Zadeh (@Reza_Zadeh) · Feb 02, 2026 · 34d ago · 🆔 57798463 · ⭐ 0.42

Tesla FSD deep dive, by VP of AI at Tesla, Ashok Elluswamy @aelluswamy. One big neural network consumes video, sound, telemetry, & basic maps, and outputs steering & speed at 36 Hz. All runs on a custom chip in the car. Outstanding fleet data management, and way more! cc @elonmusk https://t.co/rvboZRl7Fw

Reza_Zadeh (@Reza_Zadeh) · Feb 27, 2026 · 9d ago · 🆔 00378846 · ⭐ 0.30

@Jason @aelluswamy Deep dive into this system. Well done Ashok and team!

Ofirlin (@Ofirlin) · Mar 08, 2026 · 1d ago · 🆔 00482056 · ⭐ 0.36

Can we have an optimizer as fast as Muon but with a reduced memory footprint? In our recent NeurIPS paper, we show it's possible and introduce SUMO 🎉 Muon's speed comes from fast moment orthogonalization using Newton-Schulz (NS). But its NS approximation breaks down when gradients are projected into a low-dim subspace. SUMO's fix: exact SVD orthogonalization inside a low-rank subspace, giving us Muon-level geometry awareness at a fraction of the memory cost. (1/5)
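The contrast the thread draws (exact SVD orthogonalization inside a low-rank subspace, instead of Newton-Schulz in the full space) can be sketched with NumPy. The random-projection choice and shapes below are my own guesses for illustration, not the paper's algorithm:

```python
import numpy as np

# Illustrative SUMO-style step: project the moment into a low-rank subspace,
# orthogonalize it there exactly via SVD (cheap, since the projected matrix
# is small), and map the result back to the full shape.

def low_rank_orthogonalize(M, r, seed=0):
    m, n = M.shape
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((n, r)))   # (n, r) orthonormal basis
    S = M @ P                                          # project: (m, r)
    U, _, Vt = np.linalg.svd(S, full_matrices=False)   # exact SVD of a small matrix
    O = U @ Vt                                         # polar factor: singular values -> 1
    return O @ P.T                                     # back to full (m, n) shape
```

Only the r-dimensional factors need storing, which is where the memory saving over a full-space orthogonalization comes from.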

πŸ”Scobleizer retweeted
J
JoΓ«l Niklaus
@joelniklaus
πŸ“…
Mar 08, 2026
1d ago
πŸ†”85585544
⭐0.34

Introducing the Synthetic Data Playbook: We generated over 1T tokens in 90 experiments with 100k+ GPUh to figure out what makes good synthetic data and how to generate it at scale https://t.co/iaHuodWVAa https://t.co/48gBUYE6R2

❤️ 687 likes · 🔁 106 retweets
omarsar0 (@omarsar0) · Mar 07, 2026 · 1d ago · 🆔 88604376

New research from Yann LeCun and collaborators at NYU. It's a really good read for anyone working on efficient Transformer inference.

The paper dissects two recurring phenomena in Transformer language models: massive activations (where a few tokens exhibit extreme outlier values) and attention sinks (where certain tokens attract disproportionate attention regardless of semantic relevance). They show the co-occurrence is largely an architectural artifact of pre-norm design, not a fundamental property. Massive activations function as implicit model parameters. Attention sinks modulate outputs locally.

Why does it matter? These phenomena directly impact quantization, pruning, and KV-cache management. Understanding their root cause could enable better engineering decisions for efficient inference at scale.

Paper: https://t.co/wfzeDpfu4x

🖼️ 1 media attachment
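To make "massive activations" concrete: a toy detector that flags hidden-state entries far above the typical magnitude. The 100x-median threshold follows the spirit of the literature; exact criteria vary from paper to paper:

```python
import numpy as np

# Toy detector for "massive activations": hidden-state entries orders of
# magnitude above the typical value. The 100x threshold is illustrative.

def find_massive_activations(hidden, ratio=100.0):
    # hidden: (tokens, dim) activations from one layer
    typical = np.median(np.abs(hidden))
    rows, cols = np.where(np.abs(hidden) > ratio * typical)
    return list(zip(rows.tolist(), cols.tolist()))
```

Such outliers are exactly what blows up naive per-tensor quantization: one 500x entry forces the quantization range to cover it, crushing the resolution available for everything else.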
πŸ”omarsar0 retweeted
O
elvis
@omarsar0
πŸ“…
Mar 07, 2026
1d ago
πŸ†”88604376
⭐0.36

New research from Yann LeCun and collaborators at NYU. It's a really good read for anyone working on efficient Transformer inference. The paper dissects two recurring phenomena in Transformer language models: massive activations (where a few tokens exhibit extreme outlier values, and attention sinks (where certain tokens attract disproportionate attention regardless of semantic relevance). They show the co-occurrence is largely an architectural artifact of pre-norm design, not a fundamental property. Massive activations function as implicit model parameters. Attention sinks modulate outputs locally. Why does it matter? These phenomena directly impact quantization, pruning, and KV-cache management. Understanding their root cause could enable better engineering decisions for efficient inference at scale. Paper: https://t.co/wfzeDpfu4x

❀️243
likes
πŸ”37
retweets
rasbt (@rasbt) · Mar 07, 2026 · 2d ago · 🆔 87037906

While waiting for DeepSeek V4, we got two very strong open-weight LLMs from India yesterday. There are two size flavors, Sarvam 30B and Sarvam 105B (both reasoning models).

Interestingly, the smaller 30B model uses "classic" Grouped Query Attention (GQA), whereas the larger 105B variant switched to DeepSeek-style Multi-Head Latent Attention (MLA). As I wrote in my analyses before, both are popular attention variants to reduce KV cache size (the longer the context, the more you save compared to regular attention). MLA is more complicated to implement, but it can give you better modeling performance if we go by the ablation studies in the 2024 DeepSeek V2 paper (as far as I know, still the most recent apples-to-apples comparison).

Speaking of modeling performance, the 105B model is on par with LLMs of similar size: gpt-oss 120B and Qwen3-Next (80B). Sarvam is better on some tasks and worse on others, but roughly the same on average. It's not the strongest coder in SWE-Bench Verified terms, but it is surprisingly good at agentic reasoning and task completion (Tau2). It's even better than DeepSeek R1 0528.

Considering the smaller Sarvam 30B, perhaps the most comparable model is Nemotron 3 Nano 30B, which is slightly ahead in coding per SWE-Bench Verified and agentic reasoning (Tau2) but slightly worse in some other respects (LiveCodeBench v6, BrowseComp). Unfortunately, Qwen3-30B-A3B is missing from the benchmarks, which is, as far as I know, the most popular model of that size class. Interestingly, though, the Sarvam team compared their 30B model to Qwen3-30B-A3B in a computational performance analysis, where they found that Sarvam gets 20-40% more tokens/sec throughput than Qwen3 due to code and kernel optimizations.

Anyway, one thing not captured by the benchmarks above is Sarvam's good performance on Indian languages. According to a judge model, the Sarvam team found that their model is preferred 90% of the time over others on Indian texts. (Since they built and trained the tokenizer from scratch as well, Sarvam also comes with a 4x higher token efficiency on Indian languages.)

🖼️ 1 media attachment
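The KV-cache saving mentioned above is easy to put numbers on. A back-of-the-envelope comparison with hypothetical round-number dimensions (not Sarvam's actual configs):

```python
# Back-of-the-envelope KV-cache cost per token per layer. The dimensions in
# the example are hypothetical round numbers, not Sarvam's actual configs.

def kv_bytes_gqa(n_kv_heads, head_dim, dtype_bytes=2):
    return 2 * n_kv_heads * head_dim * dtype_bytes   # a K and a V vector per token

def kv_bytes_mla(latent_dim, dtype_bytes=2):
    return latent_dim * dtype_bytes                  # one compressed latent per token

# e.g. 8 KV heads x 128 dims vs. a 512-dim latent:
# GQA stores 4096 bytes/token/layer, MLA 1024 -> a 4x smaller cache,
# and the gap grows linearly with context length.
```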
vllm_project (@vllm_project) · Mar 07, 2026 · 2d ago · 🆔 12671148

🚀 vLLM v0.17.0 is here! 699 commits from 272 contributors (48 new!) This is a big one. Highlights:
⚡ FlashAttention 4 integration
🧠 Qwen3.5 model family with GDN (Gated Delta Networks)
🏗️ Model Runner V2 maturation: Pipeline Parallel, Decode Context Parallel, Eagle3 + CUDA graphs
🎛️ New --performance-mode flag: balanced / interactivity / throughput
💾 Weight Offloading V2 with prefetching
🔀 Elastic Expert Parallelism Milestone 2
🔧 Quantized LoRA adapters (QLoRA) now loadable directly

🖼️ 1 media attachment
avnermay (@avnermay) · Mar 04, 2026 · 5d ago · 🆔 34041232 · ⭐ 0.40

Excited to announce our new LLM inference algorithm, speculative speculative decoding (SSD)! It is fast 🚀: up to 2x faster than state-of-the-art inference engines (vLLM, SGLang). Working on this with @tanishqkumar07 and @tri_dao was a blast. Details in thread:

tanishqkumar07 (@tanishqkumar07) · Mar 04, 2026 · 5d ago · 🆔 96631872

I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.

πŸ–ΌοΈ Media
tedzadouri (@tedzadouri) · Mar 05, 2026 · 4d ago · 🆔 06841236

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

🖼️ 1 media attachment
MayankMish98 (@MayankMish98) · Mar 05, 2026 · 3d ago · 🆔 79317378

FA4 now available in lm-engine: https://t.co/n47TEinAfG
13.4% end-to-end speedup for Llama 8B training on 4x GB200s (1 node) 🚀🚀🚀
1005.55 TFLOPs for SDPA vs 1140.73 for FA4 (BF16 precision)
@tedzadouri @ultraproduct @__tensorcore__ @tri_dao cooked
Thanks to @bharatrunwal2 for running the experiment!

🖼️ 1 media attachment
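As a quick sanity check, the quoted 13.4% speedup lines up with the ratio of the two throughput numbers in the post:

```python
# Sanity check: the quoted end-to-end speedup matches the ratio of the two
# kernel throughput numbers given in the post.
sdpa_tflops = 1005.55
fa4_tflops = 1140.73
speedup_pct = (fa4_tflops / sdpa_tflops - 1) * 100   # about 13.4
```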