Together Research has produced FlashAttention, ATLAS, ThunderKittens, and more. This week at AI Native Conf: seven more releases, all coming to production soon. Thread ⬇️ #ainativeconf #ainativecloud https://t.co/XXIXMRRiLe
Recover more than 70% of the accuracy lost to 4-bit quantization using TorchAO's (https://t.co/Jr0qtnIAgZ) Quantization-Aware Training (QAT), now available through fine-tuning in Unsloth and Axolotl! Following the previous TorchAO QAT blog (https://t.co/kXAGBfOSMZ), the PyTorch team at @Meta extended the TorchAO QAT flow to support an end-to-end GPU server flow, targeting fast CUDA kernels for inference in @vllm_project, and integrated this flow into popular fine-tuning frameworks like Unsloth and Axolotl. Read our latest blog: https://t.co/nFx4MYHoRj #PyTorch #vLLM #OpenSourceAI #TorchAO
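For context on what QAT does under the hood: the model simulates low-bit quantization in the forward pass during fine-tuning so the weights learn to tolerate it. A minimal conceptual sketch in plain PyTorch (a straight-through-estimator fake-quant, not TorchAO's actual API):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate low-bit quantization in the forward pass while keeping
    full-precision gradients via the straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for int4
    scale = w.abs().amax() / qmax                 # per-tensor scale (real flows use per-channel)
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward sees w_q, backward sees identity.
    return w + (w_q - w).detach()
```

TorchAO's QAT flow wraps this idea into prepare/convert steps around a standard fine-tuning loop; the sketch above only shows the core trick.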

Today we are introducing GPT-5.4 in Codex. It's more token-efficient and better at tool calling, computer use, and frontend development. We are also introducing /fast to get a faster version of Codex. Enjoy ❤️ https://t.co/uTOlQsK7hE
If the engine is strong enough, you should be able to build real products on top of it. That's the whole point of LTX-2.3. Introducing LTX Desktop. A fully local, open-source video editor running directly on the LTX engine, optimized for NVIDIA GPUs and compatible hardware. https://t.co/aApm06E6RZ
@pamelafox I mean I am just gonna say do evals ☢️
Impressive if true. The agent harness is powered by recursive and parallel planning. Clever planning is a big deal. Everyone should be trying to build their own harness. Trust me, you really want to be exploring higher levels of orchestration for your agents right now.
When you build AI agents, don't treat prompts like config strings. Treat them like executable business logic, because that's what they really are.

@arshdilbagi's blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation. Stop treating evals like unit tests. That works for deterministic software. For LLM products, it creates false confidence because real-world usage changes over time.

Example: an insurance prompt passed 20 eval cases. The team shipped. In production, a new class of requests showed up and failed quietly. No crash, no alert, just wrong answers at scale.

The fix is not "write more eval cases," which is what many teams do. It is building evals as a living feedback loop: start with a small set, ship, watch what breaks in production, add those failures back, and re-run on every prompt or model change.

What eval failure caught your team off guard?

Blog: https://t.co/HCVhcow5rA
Stanford CS 224G lecture: https://t.co/q667gGwckt
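The "living feedback loop" fits in a few lines of code. A hypothetical sketch (all names illustrative): production failures get folded back into the suite, and the whole suite re-runs on every prompt or model change.

```python
import json

EVAL_FILE = "evals.jsonl"  # one {"input": ..., "expected": ...} case per line

def load_cases(path: str = EVAL_FILE) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def add_production_failure(case: dict, path: str = EVAL_FILE) -> None:
    """Fold a real-world failure back into the suite so it can never regress silently."""
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")

def run_suite(generate, grade) -> float:
    """Re-run on every prompt/model change. `generate` calls the model;
    `grade` scores an output against the expectation (exact match, rubric, judge...)."""
    cases = load_cases()
    passed = sum(grade(generate(c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)
```

The point is the append step: the suite grows from observed failures, not from cases imagined up front.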

huggingface_hub v1.5.0 just dropped! The highlight: Buckets. Think S3, but native to the Hub. No git history. Just fast, chunk-deduplicated object storage.

hf buckets sync ./outputs hf://buckets/me/my-checkpoints

And that's it. Currently in beta preview. DM me if interested!
🔥 Learn how to build your own tool-calling agent with @huggingface TRL + @Alibaba_Qwen Qwen3.5 on @Azure Machine Learning!
- @NousResearch hermes-function-calling-v1, 500 single-turn samples
- SFT with TRL on Qwen3.5 2B (released today!) on a single NVIDIA H100
- Everything on Azure, from Container Registry to Machine Learning!
Step-by-step in the thread 🧵
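The SFT step has roughly this shape, assuming standard TRL APIs; the model id is inferred from the thread and the dataset split/config may need adjusting for the actual recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Tool-calling data from the thread; config/split names may differ on the Hub.
dataset = load_dataset("NousResearch/hermes-function-calling-v1", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3.5-2B",  # hypothetical hub id for the 2B base
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen3.5-2b-tool-calling", num_train_epochs=3),
)
trainer.train()
```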
There is no best VLM OCR model - rankings can flip completely by document type. I built ocr-bench: run open OCR models on YOUR documents, get a per-collection leaderboard. VLM-as-judge with Bradley-Terry Elo scoring, all running on @huggingface. No local GPU needed. https://t.co/qZOwI0Wbes
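For anyone unfamiliar with the ranking scheme: pairwise judge verdicts can be turned into ratings with Elo-style online updates of a Bradley-Terry model. A minimal sketch (ocr-bench's exact scheme may differ):

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Update two model ratings after a judge picks a winner on one document."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # Bradley-Terry win probability
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * (expected_a - score_a)
    return r_a, r_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_wins=True
)
```

Because ratings are fit per document collection, two collections with different judge outcomes naturally produce different leaderboards, which is exactly the "rankings flip by document type" effect.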
@BarathAnandan7 hard to say, but I think it's data as usual. I think the architecture change with linear attention is more for efficiency purposes (see Qwen3-Next)
What's actually nice about Gated DeltaNet modules is that they don't grow the KV cache size. So with that 3:1 ratio, Qwen3.5 is much more memory friendly than the previous Qwen3 models. https://t.co/AhegasegIB
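A back-of-envelope illustration of why the 3:1 ratio matters (all numbers below are made up for the example, not Qwen3.5's real config): Gated DeltaNet layers keep a fixed-size recurrent state, so only the full-attention quarter of the layers grows its cache with sequence length.

```python
# Illustrative KV-cache comparison for an all-attention vs 3:1 hybrid stack.
layers, kv_heads, head_dim, seq_len, bytes_per = 48, 8, 128, 32_768, 2  # fp16

full = layers * 2 * kv_heads * head_dim * seq_len * bytes_per           # every layer caches K and V
hybrid = (layers // 4) * 2 * kv_heads * head_dim * seq_len * bytes_per  # only the full-attention quarter

print(f"all-attention KV cache: {full / 1e9:.1f} GB")    # ~6.4 GB
print(f"3:1 hybrid KV cache:    {hybrid / 1e9:.1f} GB")  # ~1.6 GB, 4x smaller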
Wrote a blog post about my journey here. Has some scalability limitations & will fix them soon. Appreciate any pointers/feedback! https://t.co/javKm9ebYa
We're publishing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. We find that GPT-5.4 Thinking shows low ability to obscure its reasoning, suggesting CoT monitoring remains a useful safety tool. https://t.co/isZkNkPXZm
An NVIDIA-powered farming machine uses AI vision and precision lasers to eliminate weeds in milliseconds without herbicides and without harming crops, a potential shift toward chemical-free agriculture https://t.co/aIbDWseMjD
We don't expect LLMs to multiply numbers or sort lists directly within their output token stream. Instead, we ask them to emit code and execute it in a separate runtime. Why expect the opposite when simulating interactive worlds? https://t.co/b2QNOBTWjN
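The pattern in miniature, as a hedged sketch (`llm` is a stand-in for any completion call, not a real API):

```python
import subprocess
import sys

def solve_with_code(llm, task: str) -> str:
    """Ask the model for code, then run it in a separate runtime and return the result."""
    code = llm(f"Write a Python script that prints the answer to: {task}")
    result = subprocess.run(
        [sys.executable, "-c", code],  # separate process, not the token stream
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()
```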
> 385ms average tool selection.
> 67 tools across 13 MCP servers.
> 14.5GB memory footprint.
> Zero network calls.

LocalCowork is an AI agent that runs on a MacBook. Open source. 🧵 https://t.co/bnXupspSXc
already on mlx :) https://t.co/NXxd7hAWMh
Meet GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill: a distilled powerhouse that brings elite reasoning to local machines. This GGUF model delivers Claude-level intelligence in a compact package, perfect for developers wanting high-performance AI without cloud costs. https://t.co/Q0HCPTI2oe
Claude Code just launched Voice Mode. You speak. The AI agent codes. "/voice" to activate. Rolling out to 5% of users now, expanding over the coming weeks. Today: KREA AI Voice on iPad. Claude Code Voice in the terminal. The era of voice-driven programming has arrived. https://t.co/9adiksDX0r
Comprehensive Python API for Google NotebookLM. Full programmatic access to NotebookLM's features, including capabilities the web UI doesn't expose, from Python or the command line. https://t.co/5YQhAKiGuD
🚀 Introducing the Qwen 3.5 Small Model Series

Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B

✨ More intelligence, less compute. These small models are built on the same Qwen3.5 foundation: native multimodal, improved architecture, scaled RL.

• 0.8B / 2B: tiny, fast, great for edge devices
• 4B: a surprisingly strong multimodal base for lightweight agents
• 9B: compact, but already closing the gap with much larger models

And yes, we're also releasing the Base models. We hope this better supports research, experimentation, and real-world industrial innovation.

Hugging Face: https://t.co/wFMdX5pDjU
ModelScope: https://t.co/9NGXcIdCWI
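Trying one of these should follow the usual transformers loading flow; the hub id below is inferred from the announcement and may not match the final repo names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-2B"  # hypothetical id for the 2B variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The key idea behind small multimodal models is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```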

A trillion-parameter model just made half its brain disappear. It got smarter.

Yuan3.0 Ultra is a new open-source multimodal MoE model from Yuan Lab. 1010B total parameters, only 68.8B active at inference. It beat GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on RAG benchmarks by wide margins. 67.4% on Docmatix vs GPT-4o's 56.8%.

Here's what it unlocks:
> Enterprise RAG with 68.2% avg accuracy across 10 retrieval tasks
> Complex table understanding at 62.3% on MMTab
> Text-to-SQL generation scoring 83.9% on Spider 1.0
> Multimodal doc analysis with a 64K context window

The key innovation: Layer-Adaptive Expert Pruning (LAEP). During pretraining, expert token loads become wildly imbalanced; some experts get 500x more tokens than others. LAEP prunes the underused ones layer by layer, cutting 33% of parameters while boosting training efficiency by 49%.

They also refined "fast-thinking" RL: correct answers with fewer reasoning steps get rewarded more. This cut output tokens by 14.38% while improving accuracy by 16.33%.

The bigger signal here: MoE models are learning to self-compress during training, not after. If pruning becomes part of pretraining, the cost curve for trillion-scale models shifts dramatically.
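A toy sketch of the LAEP idea as described above: track per-expert token load during training, then drop the most underused experts in each layer. The thresholds and bookkeeping here are illustrative, not Yuan Lab's code.

```python
import torch

def prune_underused_experts(token_counts: torch.Tensor, keep_frac: float = 0.67) -> torch.Tensor:
    """token_counts: [num_layers, num_experts] tokens routed to each expert.
    Returns a boolean keep-mask, dropping ~33% of experts per layer."""
    num_keep = int(token_counts.shape[1] * keep_frac)
    keep = torch.zeros_like(token_counts, dtype=torch.bool)
    for layer in range(token_counts.shape[0]):
        top = token_counts[layer].topk(num_keep).indices  # busiest experts survive
        keep[layer, top] = True
    return keep

counts = torch.randint(1, 500, (4, 64))  # 4 layers x 64 experts, fake routing loads
mask = prune_underused_experts(counts)
print(mask.sum(dim=1))                    # ~42 of 64 experts kept per layer
```

The "layer-adaptive" part presumably varies how much each layer is pruned based on its own load imbalance; the fixed keep fraction above is the simplest possible stand-in.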