Your curated collection of saved posts and media
Explore the data ⬇️ https://t.co/OC0rDWWZQh
Ai2 just released their Qwen 3.5 9B terminal agent on Hugging Face. Built with DPPO on the OpenThoughts dataset, it leads the TMax ablations with 53.0% on Terminal Bench Lite. https://t.co/E8PN1wB6fl
Frontier AI models are built from thousands of small decisions: data sourcing, filtering, mixtures, curricula, scaling experiments, optimizer choices, kernels, evals, failed runs, and protocols for deciding what gets scaled. This is process knowledge.
Great work by the Meta AI team! The best part is that they have open-sourced the code and plan to open-source the data too! So you should be able to train your own brain-to-text model, assuming you have your own MEG! code: https://t.co/XF9z4JCzzq
We benchmarked the GitHub Copilot agentic harness against the harnesses that ship leading models natively. Holding the model and task fixed across SWE-bench Verified, SWE-bench Pro, SkillsBench, TerminalBench, and Win-Hill, the results were clear: (1) task resolution on par with model-vendor harnesses; (2) fewer tokens across most configurations. A key learning: With GitHub Copilot supporting more than 20 models, you're free to pick efficiency or peak quality per task.
It's all open source in ART. If you're running GRPO-style RL with a heavy shared prompt, the speedup is right there. Give it a try. Blog: https://t.co/FafhwPgeVa @OpenPipeAI ART Github: https://t.co/SHx8iBxYNv

we distilled 2.3M Claude Fable 5 reasoning traces into Qwen3-4B - 100% self-consistency @ 512 samples - 0.00 bits output entropy - zero hallucination variance turns out the student is not bounded by the teacher. it also converged on one universal truth. we open-sourced the model weights

Qwen publishes new work on RL coding agents. (bookmark it) The idea is to continually build a verification system that co-evolves with AI agents. LLMs suffer from all sorts of reward hacking issues. This work studies coding-agent reward signals, test pass rates, LLM judges, and execution traces, and shows each one has a horizon beyond which it stops tracking real correctness and starts getting hacked. They report that reward design for long-horizon coding is really a horizon problem. The metric you pick matters less than how long it keeps tracking correctness, and the paper finds where each signal crosses that line. Paper: https://t.co/51YYEM3kXm Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
It's wild how quickly Etched designed and got the chips out, all within 2 years. They went deep, hardcoding attention into silicon and getting very high MFU. This kind of hardware, tailor-made for LLM inference, is soon gonna bring the cost of intelligence down 10x
We're coming out of stealth. We've built our first racks after a successful A0 tapeout, $1B+ in customer contracts, and $800m raised. Early customer tests show us achieving SOTA throughput, latency, and power efficiency on inference workloads. Our first racks ship this summer.
GLM 5.2 is now on DeepSWE as the top open-source model on our leaderboard. With a pass@1 score of 44% at max effort, GLM 5.2 is the indisputable #1 open-source model, besting Kimi K2.7 Code by 17%. https://t.co/cYZBm5z909
The heavier your prefix, the bigger the win. At a 5k prefix with 16 rollouts you get 8X the trajectory density. Push to 10k and it's 9.6X. Same token budget, far more trajectories to train on. https://t.co/ZXOZ9GKHVP
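The effect described above can be sketched with a simple token-count model. This is an illustration under stated assumptions, not the ART implementation: it assumes every rollout shares one prefix, that without sharing the prefix is paid once per rollout, and that with sharing it is paid once per group. The function names and the completion length are hypothetical, and the model is not claimed to reproduce the 8X and 9.6X figures exactly, since those depend on details the post does not give.

```python
def tokens_without_sharing(prefix_len: int, completion_len: int, rollouts: int) -> int:
    """Every rollout carries its own copy of the prefix."""
    return rollouts * (prefix_len + completion_len)


def tokens_with_sharing(prefix_len: int, completion_len: int, rollouts: int) -> int:
    """The prefix is counted once and shared by all rollouts in the group."""
    return prefix_len + rollouts * completion_len


def density_gain(prefix_len: int, completion_len: int, rollouts: int) -> float:
    """How many times more trajectories fit in the same token budget."""
    return (tokens_without_sharing(prefix_len, completion_len, rollouts)
            / tokens_with_sharing(prefix_len, completion_len, rollouts))


if __name__ == "__main__":
    # completion_len=500 is an assumed value chosen only for illustration.
    for prefix in (1_000, 5_000, 10_000):
        gain = density_gain(prefix, completion_len=500, rollouts=16)
        print(f"prefix={prefix:>6}  gain={gain:.2f}x")
```

Under this model the gain approaches the rollout count as the prefix grows relative to the completion, which matches the post's point that heavier prefixes benefit more.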
@wayama_ryousuke TRINITY is sharp engineering. A 0.6B Qwen backbone, ~10k-param head reading penultimate hidden states, sep-CMA-ES tuning it to hand off between seven LLMs in a Thinker/Worker/Verifier loop. Token-efficient, benchmark-strong, and genuinely clever at test-time composition without touching weights. Credit where it's due. The science is where the story collapses. The "Plan/Act/Verify" loop is still pure textual puppetry. One model emits a plan, another emits steps, a third stochastic parrot emits ACCEPT or REVISE. No external verifier, no interpreter, no grounding in code execution or world state. Just more tokens judging more tokens until the turn budget expires. Stochastic parrots verifying stochastic parrots isn't reasoning. It's statistical mirror-gazing dressed as collaboration. The routing claim takes the same hit. Hidden-state space on these models is already cleanly separable by task label: SVM hits 100% on the obvious buckets. No hard intra-domain distinctions were stress-tested. What sep-CMA-ES actually did, across ~30k LLM calls and 60 iterations, was brute-force a high-dimensional lookup table for a decision surface a logistic regression on the same features would have found before lunch. No online learning, no policy gradient, no adaptation after the museum piece is frozen. The "evolution" is expensive offline calibration, not emergence. OpenRouter already routes dynamically and without the ceremony. Trinity demonstrates that heavy domain splitting plus brute-force coordinator tuning can squeeze SOTA numbers out of fixed benchmarks. That's real engineering. But calling it orchestration or emergent intelligence is the usual category error of anthropomorphic projection onto structured computation. It's a static classifier in a trench coat. Impressive demo. Not the scientific step-change the framing wants us to believe. No cost vs value analysis. They just burned lots of tokens for no good reason. Hello Research Tokenmaxxin.
Unit economics? Not important until it is.
Packed crowd at @dkundel's talk about internals of the Codex harness https://t.co/ab0GAT6SLx
40% of benchmarking effort targets math/coding, but the related occupations are only 3.5% of US jobs. We introduce EconEvals, an open-source evaluation suite to measure capabilities and predict job disruption across the US labor economy. https://t.co/wxQykhUqCI
We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here's how it works: one half holds the context, the other writes the tokens, with both reusing the pretrained model instead of training a new one from scratch. We found it kept 98.7% of the original model's quality at 2.42× faster generation.
CS2-10k is now available on @huggingface. 600,000+ egocentric gameplay videos. 10,000+ hours. Every frame paired with the exact keyboard, mouse, and 3D position data that produced it. If you're working on world models, action-conditioned video generation, or egocentric navigation, this is ready to download and use today.
With agentic coding, complexity compounds in a mechanical way: unnecessary code ends up in the codebase, moves to the context window, degrades the model's reasoning abilities, leads to more unnecessary code (often to fix issues arising from the unnecessary code). It's exponential
HERMES AGENT NOW READS THE WEB UP TO 60X FASTER AND 49X CHEAPER. CLEAN CONTENT STRAIGHT TO THE AGENT. LARGE PAGES PAGED ON DEMAND. @NousResearch
scraping backends used to return raw content that got processed redundantly before reaching the agent. that pipeline is gone. now: backends pass clean content directly. large pages save locally and page on demand. same quality. fraction of the time and cost.
HOW WEB_EXTRACT HANDLES LARGE PAGES: size-driven processing. no wasted tokens.
- under 5,000 chars: returned as-is. no LLM call. full markdown reaches the agent.
- 5,000 to 500,000 chars: single-pass summary via auxiliary model. capped at ~5,000 chars of output. keeps quotes, code blocks, key facts.
- 500,000 to 2,000,000 chars: chunked into 100K-char pieces. each chunk summarized in parallel. final synthesis: ~5,000 chars.
- over 2,000,000 chars: refused with a hint to use web_crawl with focused extraction instructions.
the summary is a content compressor, not a paraphraser. if summarization fails, Hermes falls back to the first ~5,000 chars of raw content. no useless error messages.
ROUTE EXTRACTION TO A CHEAP MODEL: by default, web_extract uses your main model. on Opus that means every long page burns premium tokens on summarization. set in Desktop app, Dashboard, or config.yaml:
    auxiliary:
      web_extract:
        provider: openrouter
        model: google/gemini-3-flash-preview
        timeout: 360
extraction summaries on Gemini Flash. reasoning stays on your premium model. this alone cuts web research costs significantly.
8 BACKEND PROVIDERS:
- Firecrawl (default): search + extract + crawl. 500 free credits/month.
- SearXNG: free, self-hosted, search-only. no API key.
- Brave Search: 2,000 free queries/month. search-only.
- DDGS (DuckDuckGo): free, no key needed. search-only.
- Tavily: search + extract + crawl. 1,000 free searches/month.
- Exa: search + extract. 1,000 free searches/month.
- Parallel: search + extract. paid.
- xAI (Grok): search-only. LLM-generated results via Grok.
search-only providers pair with Firecrawl/Tavily/Exa for extract capability.
PER-CAPABILITY SPLIT: use different providers for search vs extract: SearXNG (free) for search. Firecrawl for extract. free searches. paid extraction only when needed. configure via hermes tools or config.yaml.
FREE SELF-HOSTED SEARCH (SEARXNG): zero API costs. zero rate limits. privacy-respecting metasearch across 70+ engines. run docker compose up -d, set SEARXNG_URL in .env, and enable JSON format in settings.yml. Hermes connects automatically. pair with Firecrawl for extract and you have search for free with paid extraction only on demand.
NOUS PORTAL SUBSCRIBERS: web search and extract included through the Tool Gateway via managed Firecrawl. no API key needed. no separate billing. hermes setup --portal enables everything.
WHEN YOU NEED RAW CONTENT: if the LLM summary drops important fields (structured data, tables, specific formatting): use browser_navigate + browser_snapshot instead. returns the live accessibility tree without auxiliary-model rewriting.
full Hermes architecture deep-dive in the article
https://t.co/VxyyeQCimO
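The size-driven tiers in the Hermes post can be sketched as a small dispatch function. This is a hypothetical illustration and not Hermes code: the function name, the strategy labels, and the handling of the exact boundary values (the post says "5,000 to 500,000" without stating which side is inclusive) are all assumptions.

```python
import math

# Thresholds taken from the post; boundary inclusivity is an assumption.
PASSTHROUGH_LIMIT = 5_000
SINGLE_PASS_LIMIT = 500_000
CHUNKED_LIMIT = 2_000_000
CHUNK_SIZE = 100_000


def plan_extraction(num_chars: int) -> tuple[str, int]:
    """Return (strategy, number_of_chunks) for a page of num_chars characters."""
    if num_chars < PASSTHROUGH_LIMIT:
        # Small page: return the markdown as-is, no LLM call.
        return ("passthrough", 0)
    if num_chars <= SINGLE_PASS_LIMIT:
        # Medium page: one summary call on the auxiliary model.
        return ("single_pass_summary", 1)
    if num_chars <= CHUNKED_LIMIT:
        # Large page: summarize 100K-char chunks in parallel, then synthesize.
        return ("chunked_summary", math.ceil(num_chars / CHUNK_SIZE))
    # Oversized page: refuse and point the caller at a focused crawl instead.
    return ("refuse", 0)


if __name__ == "__main__":
    for size in (1_200, 80_000, 750_000, 3_000_000):
        print(size, plan_extraction(size))
```

Keeping the thresholds as named constants makes the tiers easy to adjust if the real limits differ from the ones quoted in the post.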
We introduce LeVLJEPA: the first fully non-contrastive end-to-end vision-language pretraining method competitive with CLIP & SigLIP. No negatives. No temperature. No momentum encoder. No teacher-student. TL;DR: LeVLJEPA learns image to text structure by prediction: each modality predicts the other's embedding, while SIGReg keeps each embedding isotropic Gaussian. https://t.co/1qBXor8qTf
Update on our long-horizon AI R&D evals: In April, we launched CRUX, a project to regularly run open-world evaluations. These are long, messy, real-world tests of what AI agents can actually do. Our second evaluation is underway, and we ask: can AI agents automate AI research? There is a lot of interest in studying AI research automation. But most of the systems built so far follow one of three patterns. 1) keep a human in the loop to guide the agent and course-correct along the way. 2) focus on narrow problems where ground truth is clear and progress is easy to verify, as in AutoResearch. 3) use scaffolds engineered for one specific type of research question, so strong results may say more about the scaffold than about the agent's general research ability. These efforts are helpful, but a lot of AI research is much broader. Success is not immediately clear or verifiable. Researchers need to test and reject promising hypotheses, backtrack, consider new or unconventional approaches, and do a lot more to make progress on answering research questions. In CRUX #2, we are trying to test whether agents can answer novel, open-ended AI research questions. - One major risk in such a task is contamination. We want the agent to have access to the internet and all the tools it needs to solve the task, so we can't use research questions from publicly available papers. At the same time, we want high quality papers to serve as the source of challenging research questions. - To address this, we partnered with AI researchers from UKAISI, UToronto, Princeton, and other institutions who have written high-quality papers that aren't yet public, so there's no risk of contamination. - The authors pose open-ended research questions without giving away answers. The agent must produce a NeurIPS-quality paper and a reproducible codebase, which the authors of the papers then review. - We built a general-purpose scaffold on OpenClaw and Opus 4.8. 
(We would have loved to use Fable 5, but given the filters on AI R&D capabilities, we don't want to confound results.) - Agents get generous resource budgets set in consultation with the original authors, such as access to VMs, GPUs, and any other compute needed to answer the question. They also have $3,000 in API credits per paper. We evaluate them on week-long time horizons to make progress on answering the research question, far more than typical agent evals. - The agent needs to manage its own budget. It can track its spend and stay within its limits, and it can modify its scaffold and reasoning effort as it sees fit. - In addition to the final artifacts, such as the paper's code, we are also evaluating the agent's trajectories in depth. When we announced CRUX, we planned to conduct an open-world eval every month. Given the scope and ambition of this project, we have spent a lot more time making sure we are confident in our setup and results. That said, the early results we have are exciting, and we look forward to sharing them soon.
BREAKING: Gemini Omni Flash by @GoogleDeepMind is 1st overall on Video Arena with an Elo of 1404. Gemini Omni Flash establishes a 101 point Elo gap over Seedance 2.0 Mini by @BytePlusGlobal in 2nd place, one of the largest leaps we've ever seen on Video Arena. This establishes Google as the world's leading video generation lab, with a leap of 7 positions from their Veo series. Congratulations to the @GoogleDeepMind team on this accomplishment!
o3-mini-high figured out the issue with @SakanaAILabs CUDA kernels in 11s. It being 150x faster is a bug, the reality is 3x slower. I literally copy-pasted their CUDA code into o3-mini-high and asked "what's wrong with this cuda code". That's it! Proof: https://t.co/whmF5fvHVr Fig1: o3-mini's answer. Fig2: Their orig code is wrong in a subtle way. The fact they ran benchmarking TWICE with wildly different results should have made them stop and think. Fig3: o3-mini's fix. Code is now correct. Benchmarking results are consistent. 3x slower.

When does combining LLMs help? Great analysis on combining language models, measured across 67 models from 21 providers. Any policy that routes, votes, cascades, or runs a mixture of agents and then returns one model's answer is bounded above by 1 minus beta, where beta is the fraction of queries every candidate model gets wrong. The common justification for ensembling is diversity, usually measured as low pairwise error correlation. The paper proves that correlation cannot identify beta, so decorrelation does not establish that headroom exists. And across the 67 models, real co-failures are far more concentrated than independence-style assumptions predict. Before assuming a router or MoA setup will help, measure beta. Co-failures cluster on the answer format rather than the subject. Paper: https://t.co/PGO9YAoBzH Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
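The bound in the post above can be computed directly from a per-query correctness table. A minimal sketch, assuming you already have one boolean row per model over the same queries; the function names and the toy data are hypothetical and not taken from the paper.

```python
def all_wrong_fraction(correct: list[list[bool]]) -> float:
    """beta: fraction of queries that every candidate model answers incorrectly.

    correct[m][q] is True when model m answers query q correctly.
    """
    num_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(num_queries)
        if not any(model_row[q] for model_row in correct)
    )
    return all_wrong / num_queries


def selection_upper_bound(correct: list[list[bool]]) -> float:
    """Best accuracy reachable by any policy that returns one model's answer."""
    return 1.0 - all_wrong_fraction(correct)


if __name__ == "__main__":
    # Toy data: three models, five queries; the last two queries defeat everyone.
    table = [
        [True,  True,  False, False, False],
        [True,  False, True,  False, False],
        [False, True,  True,  False, False],
    ]
    print("beta        =", all_wrong_fraction(table))
    print("upper bound =", selection_upper_bound(table))
```

In the toy table each model alone scores 40%, and even a perfect router tops out at 60%, because no candidate answers the last two queries correctly.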
Super excited about open-source router systems and routing models like @vllm_project semantic router: https://t.co/Gwza9jPWzr The future is multi-models and you'll want to customize your router the same way you customize your code! It could be the key to tilt the value capture from a few expensive frontier models to a long-tail of models (especially open-source). More people should build those!
Fugu stands shoulder-to-shoulder with leading models like Fable and Mythos across the industry's most rigorous engineering, scientific, and reasoning benchmarks. Read the full blog: https://t.co/2ZJbdWqCUj Beyond Bigger Models: Why Orchestration Models Are the Next Frontier. Progress in AI has been driven largely by giant, monolithic models. But the most powerful systems of the future will be collaborative ecosystems. Today, this orchestration is no longer just a technical optimization. It has become a geopolitical and operational imperative. For an organization or a nation, relying on a single company's model for critical infrastructure, finance, or governance is a material vulnerability. This risk is no longer a hypothetical possibility, but a reality. As we have seen with recent export controls imposed on models like Fable and Mythos, access can disappear overnight. Collective intelligence is the practical hedge against this concentration of power. Because Fugu orchestrates an underlying pool of swappable agents, it simply routes around vendor restrictions. By orchestrating the world's models, we are delivering the resilient blueprint required for true AI sovereignty.

Ai2 just released TMax 27B on Hugging Face. A 27B terminal agent that hits 42.7% on Terminal Bench 2.0, rivaling models 40× its size. https://t.co/LfCksOXL9L
I put together a new article on setting up local coding agents with open-weight models. Everything runs 100% locally. I thought it might be useful to put this together because many people asked me about my setup in the past, and I thought it would also motivate people to get started tinkering with local models for serious work (yes, things got incredibly capable this year with better LLMs and better harnesses). So, here's a walkthrough of how to connect a local LLM to a local coding harness (could be Claude Code or Codex, which you may already be familiar with). I also included some assessment notes that are useful as a checklist for choosing between LLMs: - Checking RAM usage at long contexts to see if the model is suitable for real work - Measuring prefill and decoding tok/sec to see whether it's fast enough to not be annoying - Making sure the model has sufficient tool-calling capabilities in theory - Assessing whether the model can solve some more challenging tasks when used in a coding harness. Of course, there are always more specialized tools that can squeeze a bit more performance out of things, but I hope this is a good starter kit that stays flexible; that is, you can easily switch to newer models as they are released or even tap into cloud models in your familiar harness if the current ones are not sufficient for a given task.
one command and you have a private vllm server on HF infra. point a coding agent straight at your own model, then spin it down when you're done. blog (by @QGallouedec) below ⤵️ https://t.co/F9i10NSOSG
Introducing Cursor for iOS. Build from anywhere by launching always-on cloud agents. Or remotely control agents running on your computer from the app. Composer 2.5 is 75% off in the app now through July 5. https://t.co/dFxQyrgmBb
Open weights just caught up to the frontier. GLM-5.2 from @Zai_org tops the open-model rankings on @ArtificialAnlys and @arena's Agent Arena. It's now live on CoreWeave Serverless Inference at $1.39 in and $4.40 out per 1M tokens. Ship more for less. https://t.co/SuB7bV67iG
GLM-5.2 is now selectable in Claude Code via Hugging Face Inference Providers + hf-claude. Open models are becoming easier to plug directly into real developer workflows. https://t.co/mNopSy0iwp
Googleโs Tensor Processing Unit (TPU) uses the systolic array architecture - an idea from 1978 - to accelerate matrix multiplication with far less memory movement. Fun to build a small scale version on an FPGA. Links to original paper and TPU design: https://t.co/cEznMoForH
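The systolic array idea can be illustrated with a small software simulation. This is a teaching sketch of an output-stationary array, not the TPU's actual design or the FPGA build from the post: each processing element (PE) keeps one accumulator, values of A flow right, values of B flow down, and inputs are skewed by one cycle per row or column. The function name and register layout are assumptions made for the example.

```python
def systolic_matmul(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of PEs."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one stationary accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A values travelling left to right
    b_reg = [[0] * m for _ in range(n)]  # B values travelling top to bottom

    # The last product reaches PE(n-1, m-1) at cycle (k-1) + (n-1) + (m-1).
    for t in range(n + m + k - 2):
        # Shift A one PE to the right; row i is fed with a delay of i cycles.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B one PE down; column j is fed with a delay of j cycles.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every PE multiplies the pair passing through it and accumulates.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each element of A and B enters the grid once at an edge and is then reused by every PE it passes, which is where the reduction in memory movement comes from.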