Your curated collection of saved posts and media
This is the first time I've ever seen an LLM operate a GUI as fast as a person, and it's surreal. https://t.co/5kjwGMDpvd
No cameras. No extra sensors. Your smartwatch already has everything it needs to track your hand. βοΈβπ» Monday at #CHI2026, @jiwan_hci and I are presenting WatchHand, a continuous 3D hand pose tracking system that uses just the speaker and mic in your smartwatch. https://t.co/8bXMI2Mux4
hermes-lcm v0.2.0 is out! Lossless context management for Hermes Agent β every message persisted, hierarchical DAG summaries, agent tools to drill back into anything that was compacted. No more lossy flat summaries. What's new since launch: - 6 agent tools (grep, describe, expand, expand_query, status, doctor) - Assembly cap guardrails - Session filtering (ignore noisy sessions, mark others read-only) - Separate expansion/summarization paths - CI across Python 3.11-3.13 - Stable source mapping across compaction cycles ~3,200 lines of Python. Zero dependencies. 72 tests. Drop-in plugin β just set context.engine: lcm and go. https://t.co/8GCwp4ZMeB Issues and PRs welcome. Already have community contributors shipping features.
New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits. https://t.co/fbVpCPPtaU
This is a great discussion. We've spent 2 years building a solution that's working well for us -- co-writing software side by side the AI in an notebook-ish environment. We call it the "solveit method". (We've created a course and platform for it: https://t.co/W0DGKEmLo5 )
My colleague @istoica05 and I have been debating the role of specification in AI. I have argued that a key advantage of AI is that we can leave large parts of the specification unwritten. @istoica05 argued we need to focus on more specification. We converged on iterative disam
GLM 5.1 from @Zai_org ranks as the top open model on the newly released Monthly-SWEBench by @UniPat_AIβsecond only to Claude-Opus-4.6. Congrats to the team! πExplore the benchmark: https://t.co/jh4fTw0dIE https://t.co/sIerMtwDFb

GLM-5.1 > Claude Code (Opus 4.6)? I'm tripping or CC has become very bad but built a Three.js racing game to eval and it's extremely impressive. Thoughts: - One-shot car physics with real drift mechanics (this is hard) - My fav part: Awesome at self iterating (with no vision!) created 20+ Bun.WebView debugging tools to drive the car programmatically and read game state. Proved a winding bug with vector math without ever seeing the screen - 531-line racing AI in a single write: 4 personalities, curvature map, racing lines, tactical drifting. Built telemetry tools to compare player vs AI speed curves and data-tuned parameters - All assets from scratch: 3D models, procedural textures, sky shader, engine sounds, spatial AI audio! - Can do hard math: proved road normals pointed DOWN via vector cross products, computed track curvature normalized by arc length to tune AI cornering speed You are going to hear about this model a lot in the next months - open source let's go ππ
I looked at their prompts, It's complete bs They are literally providing all of the insight to the LLM upfront > Are there any security vulnerabilities in this code? Consider the behavior of the SEQ_LT/SEQ_GT macros with sequence number wraparound. If you find issues, explain how an attacker might trigger them. They are providing ALL required facts to the LLM, and they only ask the LLM to connect the dots The real challenge for LLMs would be to get those insights first THAT IS THE WHOLE CHALLENGE IN CYBERSECURITY; TO HAVE DEEP INSIGHT This test proves nothing; don't make any conclusions about OSS models being good for security based on this
Here, we measure success by the fraction of the βperformance gapβ we can close between the weak model and the potential of the strong model. After 7 days, human researchers closed it by 23%. Then, our Automated Alignment ResearchersβOpus 4.6 with extra toolsβclosed it by 97%. https://t.co/w1xy4l0MSn
Opus 4.7 is live in Claude Code today! The model performs best if you treat it like an engineer you're delegating to, not a pair programmer you're guiding line by line. Here are three workflow shifts we recommend for this model π§΅ https://t.co/bD5JO1xDMS
ok i read the cyber part of the mythos model card. some thoughts. 250 "trials" across 50 crash categories but almost every full exploit is a permutation of the same 2 bugs, rediscovered from different starting points not 250 independent attempts. when you get rid of those 2 bugs out (fig B) and mythos's full-exploit rate drops to 4.4%. so actually across both setups mythos leverages 4 distinct bugs total not 50 as fig A might suggest. 1/n
Is Opus 4.7 good? I suggest you A/B test prompts between Codex and Claude for a while. Good time to mention this is easy to do in https://t.co/ImLyLY82pL https://t.co/Nb4Hr9lvh8

You ever run a benchmark and end up with 40 log files, zero clarity, and a laptop that sounds like a jet engine? @runloopai + W&B Weave fixes this π§΅ https://t.co/K5hVq6RkfG
That's it. You can now build custom models without lifting a finger. Catalyst records every API request automatically. It sits between your app and your LLM provider. Most fine-tuned models fail in production because they train on synthetic data. This records what users actually do. No fake datasets or manual labeling needed. Training happens in four steps: 1. Use GPT-4 or Claude normally 2. Requests get captured automatically 3. Traces become labeled training data 4. Smaller models distill from usage They match frontier quality on your use case. At 95% lower inference cost and 150ms latency. Apps too expensive at frontier pricing now scale. It also creates a flywheel effect. More users means better data means better models. Every unrecorded request is training signal lost. It gives you a production-grade model trained on your actual usage, at a fraction of the cost, without building a single dataset.
Introducing Catalyst: a developer platform to monitor, train & deploy self-improving AI models built for teams operating AI products at scale Catalyst can automatically: - collect traces from your agents - curate training data & evals - train & deploy models on par
π Great to see vLLM powering OCR at this scale β Chandra-OCR-2 (5B) serving ~60 papers/hour per L40S across 16 parallel jobs. The full pipeline breakdown is a great read π π https://t.co/z8tkv04ZLp https://t.co/vi9FPj6JIQ
I strongly suspect that Claude Mythos is a looped language model, as described in the paper "Scaling Latent Reasoning via Looped Language Models" from ByteDance The authors of that paper called out graph search as one of the areas where looping provides a huge theoretical advantage over standard RLVR. And look at where Mythos blows out its competitors the most
π Great to see vLLM powering OCR at this scale β Chandra-OCR-2 (5B) serving ~60 papers/hour per L40S across 16 parallel jobs. The full pipeline breakdown is a great read π π https://t.co/z8tkv04ZLp https://t.co/vi9FPj6JIQ
We just OCR'd 27,000 arxiv papers into Markdown using an open 5B model, 16 parallel HF Jobs on L40S GPUs, and a mounted bucket. Total cost: $850 Total time: ~29 hours Jobs that crashed: 0 This now powers "Chat with your paper" on https://t.co/G2mDae0uv9 https://t.co/qpz7Q9x8Od

π€ Take the stage at #PyTorchCon North America! We are looking for technical deep dives & production stories for our return to San Jose this Oct 20-21. Check out our "Preparing to Submit" guide to help craft your proposal. ποΈ Deadline: June 7 Apply now: https://t.co/hLlKK7WxLD https://t.co/leYJj7nDfR
There's a broadly held misconception in AI that methods that scale well are simple methods -- even, that simple methods usually scale. This is completely wrong. Pretty much none of the truly simple methods in ML scale well. SVM, kNN, random forests are some of the simplest methods out there, and they don't scale at all. Meanwhile "train a transformer via backprop and gradient descent" is a very high-entropy method, easily 10x more complex than random forest fitting. But it scales very well. Further, given a simple method that doesn't scale, it is usually the case that you alter it to make it scale by adding a lot of complication. For instance, take a simple a simple combinatorial search-based method (not scalable at all) -- you can make it scale by adding deep learning guidance (which blows up complexity). Scalability usually belongs to high-entropy, complex systems.
> grok4.20-beta1 is a much smaller model than opus but is #1 ranked in medicine and healthcare > 4.3 and 4.4 will be much larger models, and likely will have a significant boost in performance on complex medical cases > this is massively important in providing accurate diagnostic guidance and advice to both providers and patients
@techdevnotes Supplemental training has been added to 4.3. Grok 4.4 will be twice the size (1T) with training data through early April. Probably ready for release in early May. Grok 4.5 will be 1.5T and hopefully out by late May.
This 15-minute talk by the creator of Pydantic on how to correctly use MCPs will teach you more about making your AI tools actually work together than everything you've scrolled past this year. Bookmark this & watch, no matter what. Then read the guide below by @eng_khairallah1
https://t.co/UCeC55qdhi
HY-World-2.0 Demo is now live on @huggingface Spaces for 3D world reconstruction and simulation with Gradio and Server modes. > 3D reconstruction, Gaussian splats > Camera poses, depth maps, normals > Rerun (multimodal data visualization) https://t.co/MofKZ6OGPX
0.3B params. Small enough that your coding agent can OCR a whole dataset, and the bill barely moves. Just added Falcon-OCR to uv-scripts/ocr. `hf jobs uv run` command, bucket or dataset output, AGENTS.md in the repo so your agent figures out the rest. https://t.co/VdY5FrNrCk
Use Vercel Sandbox with the OpenAI agents SDK as an official extension. Build agents that can run code, read files, and analyze data safely inside isolated microVMs. Control the compute and data flow from your secure cloud environment.
Build long-running agents with more control over agent execution. New capabilities in the Agents SDK: β’ Run agents in controlled sandboxes β’ Inspect and customize the open-source harness β’ Control when memories are created and where theyβre stored https://t.co/zPyuLup6b6
We're delighted to announce that MiniMax M2.7 is now officially open source. With SOTA performance in SWE-Pro (56.22%) and Terminal Bench 2 (57.0%). You can find it on Hugging Face now. Enjoy!π€ huggingfaceοΌhttps://t.co/ApWrahIl3o Blog: https://t.co/gAxeFsNdW4 MiniMax API: https://t.co/1dgbMx0Q7K

Research we co-authored on subliminal learningβhow LLMs can pass on traits like preferences or misalignment through hidden signals in dataβwas published today in @Nature. Read the paper: https://t.co/b1BYwcW9dH
OpenAI x E2B: build your agents with the new OpenAI Agents SDK, powered by E2B sandboxes. We're excited to support OpenAI as a launch partner! The new @OpenAI Agents SDK will now get dedicated sandboxes - perfect for persistent, long-running agents. With E2B, you'll get a custom environment with resource isolation and security boundaries, with no infrastructure setup required. Your agents will be able to: - Edit files and run shell commands in isolated environments - Maintain temporary workspace state across steps - Produce artifacts you can review before publishing - Run multiple sandboxes in parallel for concurrent workloads - Generate frontend output with live preview URLs ... and more, with a few lines of code! Learn more and see the end-to-end example in the thread:
Excited about the Agents SDK updates we just launched. Check out my cookbook on using it with sandboxes for code migration: https://t.co/Fz7cknz64d
We are looking for excellent people to help build our vertically integrated AI stack. Numerics, quantization, HW simulators, compiler, runtime, kernel performance, RTL, verification, emulation, DFT, physical design, post Si bringup. Join us at Tesla!
Won best edge AI at the @ycombinator and @innate_bot hackathon! We built a local VLM multi-rover orchestrator for Mars exploration. On-device navigation and automated fault detection & recovery across odometry, stereo vision, and lidar. Thanks for hosting, @ax_pey! https://t.co/GNkSNAMxRN
OpenAI x e2b: Build your agents with the new OpenAI Agents SDK, powered by @E2B Sandboxes. Excited to support @OpenAI as a launch partner! https://t.co/RsSw1HsF86
If you want to build a self-improving harness, the first step is instrumentation. There are tools now that help you do this as "drop-in" plugins into claude code, very cool!
Building with Claude Code? You need to see what's happening each turn. The new @weave_wb plugin traces every session automatically. Tool calls, subagents, inputs, outputs. All structured so you can debug faster. No code changes. Just install and go. https://t.co/i7IoktC7RC