Your curated collection of saved posts and media
@_Suresh2 The full dataset is 375 problems and the 50-problem subset is monotonic in model ranking, for the models that are tested on the full dataset. It's all public on GitHub so it's easy to download and run more problems.
@CharuruCha14310 Something really funny happened with Grok 4.3. GPT-5.5 did drop performance in a few tasks like this one, but overall it gained substantially over GPT-5.4. With Grok, in pretty much every single reasoning task I tried (out of many), I find Grok 4 > Grok 4.1 fast > Grok 4.20 > Grok 4.3. As a result, last summer Grok 4 was a contender for top reasoning model. This summer, Grok 4.3 is lowest (usually by far) among all AI providers.
Sakana AI is heading to #ICML2026 in Seoul (July 6–11)! 🇰🇷 Our team will present 11 papers spanning multi-agent coordination, sparse and efficient LLMs, test-time scaling, long-term memory, and agent benchmarks. A thread of everything we're presenting: https://t.co/w3XJOHHu3l
we distilled 2.3M Claude Fable 5 reasoning traces into Qwen3-4B - 100% self-consistency @ 512 samples - 0.00 bits output entropy - zero hallucination variance. turns out the student is not bounded by the teacher. it also converged on one universal truth. we open-sourced the model weights
We just released a new version of Diffusers! This includes many new image and video pipelines (Ideogram4, MotifVideo, etc.). But it also includes the recently popular DiffusionGemma. Check out the notes for full details. https://t.co/49lDK8Vnnk
I'm happy to see @PhysicalAI included in the Fast Company world models map. The physical world is already speaking to us, and now AI models can help operators interpret sensor data more efficiently. Newton, Archetype AI's world model, turns streams of sensor readings into a single understanding of what's happening and what's coming next.
Text-to-animation definitely still has a long way to go, but you can now iteratively prompt with a model like @AnthropicAI Fable 5 to get the animation you want. The experiment was only for the freestyle swimming part. Summary: - Task: Freestyle swimming with head turning to get some air every 4th stroke (ended up with every 2nd stroke, which is fine also) - 3% of weekly Fable allowance was used - Estimated cost: $1.20 - 7 iterations because Fable's vision is so terrible (please fix it @AnthropicAI). I had to keep taking screenshots myself and then describe every painstaking detail of what was wrong with it, but it did get there eventually. And yes, all the shaders, water mechanics, tree, grass, etc. you see here will be open sourced soon (1-2 weeks). This is being developed as a mini-games engine based on @threejs for @callmesenseieng (Open Beta available soon).
@PanopticalDream @arstoryels This is a 2007 blog post by a Google ML researcher explaining how they did spellcheck. LLMs are very obviously a direct outgrowth of this line of research https://t.co/4aUg85mtMG
Interesting new optional skill for Hermes Agent called unbroker. I made a quick video showing how to install and set it up, and what results you might see when you run it. So what is unbroker? Simply, it finds where data brokers have your personal info exposed online and files the removal requests for you. We all know our data gets stored and sold. A lot of brokers are legally required to delete it if you ask, but doing that across dozens of sites by hand is miserable. Hermes Agent has it as a built-in security skill, so I just let my agent run the whole thing. How it went: - Set up browser automation (used Browserbase, just an API key + project ID in hermes tools). - Pointed Hermes at the GitHub, said "install this skill." Done. - Gave explicit consent, which it requires before doing anything, plus a quick intake: legal name, past names, cities, emails. - "Use the unbroker skill to remove my data." It spun up sub-agent swarms and scanned 51 broker sites. Real results are in the video. The best part is it's built to loop. It drafts the opt-out emails, or if you connect your email it sends them hands-off, then schedules rechecks and logs everything. Set it once and it keeps your data clean over time. Kind of wild that one skill and a couple prompts can check 50+ broker sites for you. Underrated use of agents. Let me know your thoughts!
i'm open sourcing UNBROKER: a tool that finds where your personal info is exposed by data brokers and files the removals for you it runs as a skill in Hermes Agent _________ your data is everywhere; hundreds of brokers publish your name, current and old addresses, phone, email,
NEW paper worth reading. (bookmark it) The basic idea is to pair a compressive recurrent state with a small exact memory, which helps to recover long-range recall without giving up the efficiency of linear attention. More on it below: Linear-attention and state-space models compress the whole prefix into a fixed-size state. That buys O(1) memory, but when many key-value associations compete, earlier facts get overwritten and needle recall degrades. HOLA gives linear attention a hippocampal complement. It keeps the usual delta-rule state as compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory. The state models linearly compressible structure while the cache stores associations that should not be forced through it. The cache writes without a learned eviction module, keeping only tokens whose prediction residual was actually committed to the state. At 340M parameters on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92, below a full-attention Transformer++ at 26.88, and stays robust on RULER needle recall out to 32k tokens, 16x its training length. Paper: https://t.co/z1Jzp7qQ6B Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
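A toy sketch of the idea in the post above, not the paper's implementation: a fixed-size delta-rule state where overlapping keys overwrite earlier facts, plus a small bounded exact cache that answers from stored pairs when it can. The `HybridMemory` class, the `pin` flag, and all numbers are hypothetical; in particular, which tokens get cached is left to the caller here, and HOLA's residual-based write rule is not reproduced.

```python
import math


def matvec(S, k):
    """Read the state: returns S @ k for a list-of-lists matrix S."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def delta_write(S, k, v, beta=1.0):
    """Delta-rule update S <- S + beta * (v - S k) k^T, done in place.

    Returns the prediction residual (v - S k) measured before the update.
    """
    residual = [vi - pi for vi, pi in zip(v, matvec(S, k))]
    for i, r in enumerate(residual):
        for j, kj in enumerate(k):
            S[i][j] += beta * r * kj
    return residual


class HybridMemory:
    """Fixed-size delta-rule state plus a bounded exact key-value cache."""

    def __init__(self, d_k, d_v, cache_size):
        self.S = [[0.0] * d_k for _ in range(d_v)]
        self.cache = {}  # exact associations, at most cache_size entries
        self.cache_size = cache_size

    def write(self, k, v, pin=False):
        delta_write(self.S, k, v)
        # Assumption: the caller chooses what to pin (not HOLA's write rule).
        if pin and len(self.cache) < self.cache_size:
            self.cache[tuple(k)] = list(v)

    def read(self, q):
        # An exact cache hit wins; otherwise fall back to the compressed state.
        return self.cache.get(tuple(q), matvec(self.S, q))


k1, k2 = [1.0, 0.0], [0.0, 1.0]
k3 = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # overlaps with both earlier keys

plain = HybridMemory(d_k=2, d_v=1, cache_size=0)
hybrid = HybridMemory(d_k=2, d_v=1, cache_size=1)
for mem in (plain, hybrid):
    mem.write(k1, [1.0], pin=True)  # only `hybrid` has room to pin it
    mem.write(k2, [2.0])
    mem.write(k3, [5.0])

print(plain.read(k1))   # drifts away from 1.0 after the third write
print(hybrid.read(k1))  # exact value served from the cache
```

The third key is not orthogonal to the first two, so writing it changes what the state returns for `k1`. The cache costs a bounded amount of extra memory and keeps that one association exact.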
@gabriberton JEPA is literally the only new idea in the SSL space in the past 3 years... what other SSL approaches are there? DINO? iBOT? MAE? these are all older than 3 years!
Multimodal prompting is clearly the future. I love experimenting with new ways to interact with agents. As a researcher and engineer, I've found that the richer the inputs to the agent and the richer the outputs I consume, the better the overall results of the collaboration. In this little walkthrough, I go over what I mean by a multimodal prompt and when you might find it useful. It's more than simple text prompting, so I call it a "task" for lack of a better word. It helps me record my voice, annotate the screen, capture click/mouse actions, and more. Then all of that is preprocessed and passed to the agent to complete the task more efficiently. The agent has the high-level prompt, but it also has the raw transcriptions if needed. So naturally, I am also using this to build out multimodal skills that I reuse in workflows where agents tend to struggle. This has saved me hours of work. And even the older models are pretty great at understanding the tasks more clearly. Some noise is introduced in the process, but it doesn't seem to hurt the performance. I've also found that this new way of prompting has reduced the number of frustrating interactions I have with agents. This is something I have been thinking about for some time now because we are going to move more into multimodal AI models. And so the interactions are going to evolve with models being able to handle a variety of modalities natively. Currently, I process all the recorded tasks with another model in the background, but it's not crazy to think that all of it will just be naturally consumed by an omnimodel in the future. All of these recorded tasks (which you can also think of as rich annotated datasets) are things I mine and recursively improve over time and, in some cases, package as reusable workflows/patterns/skills. This process has really elevated how I use coding agents for all kinds of work.
I use multimodal prompting in things like web development, designing, artifact creation, prototyping, researching, reading, simulations, AI-assisted writing, and much more. So it's not just about prompting. It's going deeper into understanding and exploring the right level of detail that the agents need to make the right decision and to push/maximize their capabilities.
Had a blast talking at @aiDotEngineer. So happy to see the whole room wanting to learn about self-improving agents. Thank you! https://t.co/tAksmhk3PV
We shipped ARIA, and it's been the most fun thing to show at @aiDotEngineer all week. One line in your W&B workspace and it reads your runs, finds what's working, and launches what to try next. The stuff you'd normally chase through dashboards, handled for you. https://t.co/6ZF3nQeUrQ
If you use Codex, is there any reason you still use ChatGPT? what do you use it for? how has it been better or critical for you?
Today's Hermes Agent Masterclass video is all about profiles and the kanban board! Key elements to create a true team of agents, capable of taking on many tasks. Check it out! Hermes Agent Masterclass: 9. Profiles & Kanban https://t.co/n3AbS3VqQ9
@SigsNYC It seems clear to me that the distinction between coding and chat is going away completely, and that these are trending towards super apps. Codex is already like this and many non-technical people are using Claude Code.
And I must add that this is strongly connected with something I have been sharing lately related to judges and verifiers. These verifiers require you to look at AI outputs and steer them towards the correct goals. It's not just prompts; it's deeper levels of understanding the inner workings of these models through richer interactions. This is also another indicator of the importance of developing expertise/knowledge in the specific domain/area where you are using AI agents.
@gaborsoter Why can't they build the proper infra? It's completely attainable if you're using the AI coding tools and you have the right growth mindset, right?
Highly-recommended read from MIT on the part of RL with verifiable rewards that everyone keeps hitting. RLVR only optimizes what you can objectively score, so style, structure, and diversity quietly collapse and reward hacking creeps in. The fix here adds an adversarial discriminator trained on human demonstrations, which acts as a learned proxy for the human output distribution. The generator maximizes both task accuracy and the discriminator's human-likeness signal, so verifiable rewards and imitation of humans get optimized together. Why does it matter? Across bug fixing, story generation, and a reward-hacking benchmark, this preserves RLVR's accuracy gains while restoring the fuzzy properties it usually destroys. Bug fixes come out with much lower edit distance, stories score higher win rates and stay diverse, and misbehavior nearly disappears. Paper: https://t.co/kBZA66WGyC Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
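To make the "optimized together" part concrete, here is a minimal sketch of a combined reward: a binary verifiable reward plus a weighted human-likeness score from a discriminator. The additive form, the `weight` value, the candidate names, and every score are assumptions for illustration; the paper's actual objective may differ.

```python
def combined_reward(passed, human_likeness, weight=0.5):
    """Verifiable reward (0 or 1) plus a weighted discriminator score in [0, 1]."""
    if not 0.0 <= human_likeness <= 1.0:
        raise ValueError("human_likeness must be a probability")
    return float(passed) + weight * human_likeness


# Hypothetical bug-fix candidates: three pass the tests, one does not.
candidates = [
    {"name": "minimal_fix", "passed": True, "human_likeness": 0.9},
    {"name": "rewrite_whole_file", "passed": True, "human_likeness": 0.2},
    {"name": "special_case_the_test", "passed": True, "human_likeness": 0.05},
    {"name": "wrong_fix", "passed": False, "human_likeness": 0.8},
]

ranked = sorted(
    candidates,
    key=lambda c: combined_reward(c["passed"], c["human_likeness"]),
    reverse=True,
)
print([c["name"] for c in ranked])
```

With the verifier alone, the three passing candidates tie at 1.0. The discriminator term breaks the tie in favor of the fix that looks like something a human would write, while a human-looking but failing fix still ranks last.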
The most interesting Fable tip I've heard so far is to let the model use its own judgement as much as possible. I told it "For all coding tasks use your judgement to decide an appropriate lower power model and run that in a subagent" and it seems to be saving a lot of tokens.
10x faster. That's how much Lightwheel compressed Geely's humanoid training cycle. That's from months to train a task down to weeks. Geely's self-developed humanoid robots are now running on the Auto production line, sorting and sequencing parts alongside human operators. Getting there meant solving the two problems that stall most humanoid deployments: collecting enough real-world data without slowing the line, and closing the sim-to-real gap. Read our latest customer success story on how we pulled this off with Geely! Article: https://t.co/x19TRbWIaL
Program-as-Weights A Programming Paradigm for Fuzzy Functions https://t.co/tqxX5G7dat
The team at @vercel recently released the Eve agent framework, so we built a template that integrates LiteParse with it. The template provides a set of read-only filesystem tools that let Eve resolve paths, list directories, and read text-based files. We then pair those with LiteParse, which parses files from their source and returns clean, structured Markdown. Finally, we equipped the agent with detailed instructions on when and how to combine these tools effectively, giving it a reliable workflow for navigating and understanding document collections out of the box. The result is a solid starting point that you can extend with your own channels, tools, and skills. Check it out: https://t.co/CjuXouQ3E0
The LPU team at @nvidia is building the future of low latency inference. We're going all in on Rust. If you're a cracked Rust engineer and love making hardware do things, we want you. https://t.co/EEPoYOWWmh
Kimi K2.7 Code is the first open-weight model you can select in the GitHub Copilot model picker. What does that mean for you? @burkeholland explains how this low-cost, high-performance model gives you more choice and flexibility in your workflow. ▶️ https://t.co/rxkmT2cABP
@Bitwarden @claudeai @paper @YouTube Would genuinely love the @bitwarden design team's take. The redesign tackles the "Get started free" → pricing-wall moment, defers email verification, and holds the extension ask until after import delivers value. Tear it apart.
I redesigned @bitwarden's onboarding by hand. Then I gave the exact same brief to @claudeai working inside @paper, and let it redesign everything too. The result surprised me in both directions. Full breakdown (9 min) on @YouTube: https://t.co/LrMLKDhuor
Ever wondered why a PyTorch CI test failure name doesn't exactly match your source file? Because PyTorch tests are generated dynamically at import time across various devices and dtypes, CI failures often display specific names that differ from the original template. Understanding how device-generic tests, OpInfos, and CI sharding fit together can significantly speed up your development and contribution workflow. Read our latest blog which provides a contributor's perspective on how to get started with testing in PyTorch. Link in comments.
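The naming effect described above can be imitated with plain `unittest`. This is a sketch of the general pattern only, not PyTorch's own machinery: `AddOpsTemplate`, `instantiate`, and the device and dtype lists are hypothetical names made up for illustration.

```python
import unittest

DEVICES = ["cpu", "cuda"]      # hypothetical device list
DTYPES = ["float32", "int64"]  # hypothetical dtype list


class AddOpsTemplate:
    """Template written once in the source file; never run directly."""

    def test_add(self, device, dtype):
        self.assertEqual(1 + 1, 2)  # placeholder body


def instantiate(template):
    """Build one TestCase class per device and one test method per dtype."""
    generated = {}
    base = template.__name__.replace("Template", "")
    for device in DEVICES:
        methods = {}
        for name in dir(template):
            if not name.startswith("test_"):
                continue
            fn = getattr(template, name)
            for dtype in DTYPES:
                # Default arguments bind the current loop values to each test.
                def test(self, fn=fn, device=device, dtype=dtype):
                    fn(self, device, dtype)

                methods[f"{name}_{device}_{dtype}"] = test
        cls_name = f"Test{base}{device.upper()}"
        generated[cls_name] = type(cls_name, (unittest.TestCase,), methods)
    return generated


generated = instantiate(AddOpsTemplate)
globals().update(generated)  # expose the classes so test discovery finds them

for cls_name, cls in generated.items():
    for method in unittest.TestLoader().getTestCaseNames(cls):
        print(f"{cls_name}.{method}")
```

The source file only contains `AddOpsTemplate.test_add`, but a failure would be reported under a generated name such as `TestAddOpsCUDA.test_add_cuda_int64`, which is why searching the file for the reported name finds nothing.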
Humanoids should take on the heavy lifting jobs for humans. But can full-size humanoids handle heavy-payload teleoperation from noisy VR inputs? Excited to introduce our work, HEFT: Heavy-Payload Full-size Humanoid Teleoperation. HEFT tracks human intent from raw, noisy VR signals and enables real-world teleoperation with payloads up to 24 kg on L7, a 175 cm, 65 kg full-size humanoid. Website & more demos: L7 heavy-payload teleop + G1/L7 high-dynamic tracking https://t.co/fFgSWgpA7V G1 & L7 training code/checkpoints: https://t.co/uGimX29xyU
3 years ago I gave a talk at the first @aiDotEngineer conference on "Advanced RAG" techniques in order to work around the limitations of naive RAG. It's insane how much the world has changed since then, and the field has evolved into standardized, higher-level abstractions around agent harnesses and context. Some general patterns: 1. Retrieval complexity can be encoded at the agent layer. This means that you can give relatively simple but performant search tools to an agent (e.g. really fast bm25, vector search), and let the agent's reasoning come up with the right queries to find the right results. 2. To some extent this is still evolving, but I do think we will increasingly care less about "hacking" the context window and more about deciding what business context is relevant in the first place. 3. The way we build agents has fundamentally changed from defining code, to defining runbooks, to defining goals. Big congrats to @swyx and the entire AI Engineer team for continuing to put out awesome conferences every year.
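As an illustration of point 1, a "relatively simple" search tool can be as small as the BM25 index below, which an agent could call repeatedly with its own reformulated queries. This is a standard-library sketch using the common Okapi BM25 formula; the `BM25Index` class, the whitespace tokenizer, and the sample documents are assumptions for illustration, not something from the talk.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption, kept simple)."""
    return text.lower().split()


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(docs)
        # Document frequency: in how many documents each term appears.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log((len(self.docs) - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, i):
        dl = len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i][term]
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=3):
        """Return up to top_k (score, document) pairs, best first."""
        scored = [(self.score(query, i), d) for i, d in enumerate(self.docs)]
        scored = [s for s in scored if s[0] > 0]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]


docs = [
    "refund policy for enterprise contracts",
    "how to rotate api keys",
    "enterprise onboarding checklist",
]
index = BM25Index(docs)
print(index.search("rotate api keys"))
print(index.search("enterprise refund"))
```

The tool itself stays dumb on purpose. The agent supplies the intelligence by trying a query, reading the results, and rewriting the query when they miss.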
Accepted to #ECCV2026! We've also released the code, it should work like a charm. If it doesn't, feel free to poke @roodiiiiiiiii https://t.co/t5M0J7S1GR