Your curated collection of saved posts and media
Talking to a voice AI LLM over ham radio (on UHF 420.69 megahertz, of course!) (Note: cool experiment, but be careful: FCC regs require a licensed control operator to be present at the control point the entire time the LLM is operating.) https://t.co/S2WcCrkp83
70 hackers joined us in SF for the first-ever World Labs Hackathon. In just 3.5 hours, 32 teams used Marble for projects ranging from robotics sims and agents to AR/VR interfaces, games, art experiences, and real estate tools. Check out what they built → https://t.co/cX0bAlvhh1

New research on evaluating coding agents via continuous integration. Coding agents are moving beyond isolated bug fixes. If they're going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance. Most coding agent benchmarks today test whether an agent can fix a single issue. But real software engineering involves maintaining entire codebases over time. SWE-CI evaluates agent capabilities through continuous integration workflows: running test suites, catching regressions, and maintaining code quality across multiple changes. Paper: https://t.co/p8bOTJ9QPX Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
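The regression-catching idea behind a CI-style eval can be made concrete: run the test suite before and after the agent's change, and accept the change only if it fixes something without breaking anything. A minimal sketch; the function names and accept rule are illustrative, not taken from the SWE-CI paper.

```python
def detect_regressions(before: dict, after: dict) -> list:
    """Tests that passed before the agent's change but fail after it."""
    return sorted(t for t, passed in before.items()
                  if passed and not after.get(t, False))

def ci_verdict(before: dict, after: dict) -> dict:
    """Accept a change only if it fixes at least one test and breaks none."""
    regressions = detect_regressions(before, after)
    newly_passing = sorted(t for t, p in after.items()
                           if p and not before.get(t, False))
    return {"regressions": regressions,
            "newly_passing": newly_passing,
            "accept": not regressions and bool(newly_passing)}
```

For example, a patch that makes `test_b` pass but breaks `test_a` would be rejected, because maintaining a codebase means the whole suite stays green, not just the targeted test.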

PDFs are the bane of every AI agent's existence: here's why parsing them is so much harder than you think. Every developer building document agents eventually hits the same wall: PDFs weren't designed to be machine-readable. They're drawing instructions from 1982, not structured data.
- PDF text isn't stored as characters: it's glyph shapes positioned at coordinates with no semantic meaning
- Tables don't exist as objects: they're just lines and text that happen to look tabular when rendered
- Reading order is pure guesswork: content streams have zero relationship to visual flow
- Seventy years of OCR evolution led us to combine text extraction with vision models for optimal results
We built LlamaParse using this hybrid approach: fast text extraction for standard content, vision models for complex layouts. It's how we're solving document processing at scale. Read the full breakdown of why PDFs are so challenging and how we're tackling it: https://t.co/K8bQmgq7xN
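The "tables are just lines" point can be sketched in code: given only the x coordinates of vertical rules and y coordinates of horizontal rules, a parser has to derive the cell grid itself and then decide which cell each glyph lands in. A toy illustration with hypothetical names, not LlamaParse internals:

```python
from bisect import bisect_right

def infer_cells(v_lines, h_lines):
    """Cell rectangles (x0, y0, x1, y1) implied by rule coordinates."""
    xs, ys = sorted(v_lines), sorted(h_lines)
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(len(ys) - 1) for i in range(len(xs) - 1)]

def assign_to_cell(x, y, v_lines, h_lines):
    """(row, col) of the cell containing a glyph at (x, y), or None."""
    xs, ys = sorted(v_lines), sorted(h_lines)
    col, row = bisect_right(xs, x) - 1, bisect_right(ys, y) - 1
    if 0 <= col < len(xs) - 1 and 0 <= row < len(ys) - 1:
        return row, col
    return None
```

Real PDFs make even this harder: rules can be partial, merged cells break the grid, and many tables have no rules at all, which is where the vision models earn their keep.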

Parsing PDFs is insanely hard. This is completely unintuitive at first glance, considering PDFs are the most commonly used container of unstructured data in the world. I wrote a blog post digging into the PDF representation itself, why it's impossible to "simply" read the page into plaintext, and what the modern parsing techniques are. The crux of the issue is that PDFs are designed to display text on a screen, not to represent what a word means.
1. PDF text is represented as glyph shapes positioned at absolute x,y coordinates. Sometimes there's no mapping from character codes back to a Unicode representation.
2. Most PDFs have no concept of a table. Tables are described as grid lines drawn with coordinates. A traditional parser has to find intersections between lines to infer cell boundaries, then associate the text inside each cell algorithmically.
3. The order of operators has no relationship with reading order. You need clustering techniques to piece the text back together into a coherent logical format.
That's why everyone today is excited about using VLMs to parse text. Which, to be clear, has a ton of benefits, but still has limitations in terms of accuracy and cost. At @llama_index we're building hybrid pipelines that interleave text extraction and VLMs to give extremely accurate parsing at the cheapest price points. Blog: https://t.co/iLJpIr7cbH LlamaParse: https://t.co/TqP6OT5U5O
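Point 3 above, recovering reading order, can be sketched with a simple heuristic: cluster glyph fragments into lines by y proximity, then read each line left to right. A toy version assuming y increases downward; real parsers need much more than this (columns, rotated text, footnotes), and this is not LlamaParse's actual algorithm:

```python
def recover_reading_order(fragments, line_tol=2.0):
    """fragments: list of (x, y, text) tuples at absolute coordinates.
    Groups fragments into lines when their y values are within line_tol,
    then sorts each line by x to approximate left-to-right reading."""
    lines = []  # each entry: [anchor_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(y - lines[-1][0]) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(frags)) for _, frags in lines]
```

Even this toy fails on two-column layouts, where fragments at the same y belong to different logical paragraphs; that's exactly the kind of case where a VLM pass pays for itself.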

Something we've been thinking about: planning in the age of capable coding agents. Agents can now build entire requirements end-to-end. They code longer, handle more complexity, and break work down on their own. Granular task breakdown? That's the agent's job now. Requirements are what matter. We shipped a new Build experience in @BrainGridAI that reflects this. No more breaking down tasks upfront. Specify your requirement, pick your agent or paste one command. The agent creates tasks as it works, so you have a record and can resume any session without losing progress.
Read the full write-up: https://t.co/yeMb7gtnai
Beyond Language Modeling: An Exploration of Multimodal Pretraining. Paper: https://t.co/GmtPAQDo8T
Yann LeCun x Saining Xie, an insane crossover of the two biggest visual representation researchers in the AI field: "Beyond Language Modeling: An Exploration of Multimodal Pretraining". Right now, most multimodal models are basically a language model with a vision adapter bolted on, so they can describe images, but they don't really think in images or video. This paper shows what happens when you do it the hard way: train one model from scratch on text, images, and video with a unified setup. The key idea is that if you give the model a good visual internal format, it can use vision for both understanding and generating. Additionally, multimodal data can improve language instead of distracting from it, and mixture-of-experts lets you scale vision's huge data intake without bloating everything else. This paves the way toward changing the vision paradigm from a "captioning add-on" model to a native multimodal foundation model.
In both LeetCode's Weekly Contests (Weekly Contests 489-491) and the HMMT February 2026 (Harvard-MIT Mathematics Tournament), Nanbeige4.1-3B's performance not only significantly outperformed that of Qwen3.5-4B but also surpassed Qwen3.5-9B. https://t.co/2guwzB3yNa
Attack of the asynchronous machines. We've seen this a lot in GPU kernels. This time the same principle applies in speculative decoding.
New paper: Mamba-Transformer hybrid VLMs can go fast without forgetting. We introduce stateful token reduction for long-video VLMs.
- Only 25% of visual tokens
- 3.8-4.2x faster prefilling (TTFT)
- Near-baseline accuracy (can exceed baseline with light finetuning)
https://t.co/CJaCktyWCt
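The headline numbers imply a simple budget: keep roughly the top 25% of visual tokens by some importance score and drop the rest. A toy score-based selection that preserves temporal order; the paper's stateful method is more involved, and the scoring function here is purely illustrative:

```python
def reduce_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the top keep_ratio fraction of tokens by score,
    returned in their original (temporal) order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]
```

Keeping only a quarter of the tokens is where the 3.8-4.2x prefill speedup would come from: attention cost drops roughly with the square of the kept sequence length.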
Together Research has produced FlashAttention, ATLAS, ThunderKittens and more. This week at AI Native Conf: seven more releases, all coming to production soon. Thread → #ainativeconf #ainativecloud https://t.co/XXIXMRRiLe
Recover more than 70% of the accuracy degradation from 4-bit quantization using TorchAO's (https://t.co/Jr0qtnIAgZ) Quantization-Aware Training (QAT), now available through fine-tuning in Unsloth and Axolotl! Following the previous TorchAO QAT blog (https://t.co/kXAGBfOSMZ), the PyTorch team at @Meta extended the TorchAO QAT flow to support an end-to-end GPU server flow, targeting fast CUDA kernels for fast inference in @vllm_project, and integrated this flow into popular fine-tuning frameworks like Unsloth and Axolotl. Read our latest blog: https://t.co/nFx4MYHoRj #PyTorch #vLLM #OpenSourceAI #TorchAO
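The core trick in QAT is fake quantization: during training, the forward pass rounds weights onto the low-bit grid so the network learns to tolerate the error, while gradients flow through unchanged (a straight-through estimator). A toy 4-bit symmetric per-tensor version in plain Python, purely to illustrate the idea; this is not the TorchAO API:

```python
def fake_quantize(weights, bits=4):
    """Round each weight to a symmetric signed 2^(bits-1)-level grid.
    scale maps the largest magnitude onto the top quantized level."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```

Because rounding happens inside the training loop, the optimizer nudges weights toward values that survive quantization, which is why QAT recovers accuracy that post-training quantization loses.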

Today we are introducing GPT-5.4 in codex. It's more token efficient and better at tool calling, computer use, and frontend development. We are also introducing /fast to get a faster version of Codex. Enjoy β€οΈ https://t.co/uTOlQsK7hE
If the engine is strong enough, you should be able to build real products on top of it. That's the whole point of LTX-2.3. Introducing LTX Desktop. A fully local, open-source video editor running directly on the LTX engine, optimized for NVIDIA GPUs and compatible hardware. https://t.co/aApm06E6RZ
@pamelafox I mean I am just gonna say do evals
Impressive if true. The agent harness is powered by recursive and parallel planning. Clever planning is a big deal. Everyone should be trying to build their own harness. Trust me, you really want to be exploring higher levels of orchestration for your agents right now.
When you build AI agents, don't treat prompts like config strings. Treat them like executable business logic. Because that's what they really are. @arshdilbagi's blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation. Stop treating evals like unit tests. That works for deterministic software. For LLM products, it creates false confidence because real-world usage changes over time. Example: an insurance prompt passed 20 eval cases. The team shipped. In production, a new class of requests showed up and failed quietly. No crash, no alert, just wrong answers at scale. The fix is not "write more eval cases," which is what many teams do. It is building evals as a living feedback loop. Start with a small set, ship, watch what breaks in production, add those failures back, and re-run on every prompt or model change. What eval failure caught your team off guard? Blog: https://t.co/HCVhcow5rA Stanford CS 224G lecture: https://t.co/q667gGwckt
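The "living feedback loop" described above can be sketched: run the eval suite, fold production failures back into it, and re-run the whole set on every prompt or model change. All names are illustrative:

```python
def run_evals(eval_set, model):
    """model: callable mapping an input string to an answer.
    Returns the eval cases the model currently gets wrong."""
    return [case for case in eval_set
            if model(case["input"]) != case["expected"]]

def feedback_loop(eval_set, production_failures, model):
    """Fold newly observed production failures into the eval set
    (skipping inputs already covered), then re-run everything."""
    known = {c["input"] for c in eval_set}
    grown = eval_set + [c for c in production_failures
                        if c["input"] not in known]
    return grown, run_evals(grown, model)
```

The key property is that the eval set only grows: the insurance-prompt failure class from the example would be captured once and then re-checked on every subsequent change, instead of failing quietly again.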

huggingface_hub v1.5.0 just dropped! The highlight: Buckets. Think S3, but native to the Hub. No git history. Just fast, chunk-deduplicated object storage. hf buckets sync ./outputs hf://buckets/me/my-checkpoints And that's it. Currently in beta preview. DM me if interested!
Learn how to build your own tool-calling agent with @huggingface TRL + @Alibaba_Qwen Qwen3.5 on @Azure Machine Learning!
- @NousResearch hermes-function-calling-v1, 500 single-turn samples
- SFT with TRL on Qwen3.5 2B (released today!) on a single NVIDIA H100
- Everything on Azure, from Container Registry to Machine Learning!
Step-by-step in the thread.