Your curated collection of saved posts and media
Here is the key table showing the 30-day effects (which they misreport in the paper, and which apparently use a completely different test), but the ChatGPT group remains ahead. The errors shouldn't fill you with confidence about the study, though. https://t.co/hqocnAsF38
The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them. Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later. I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: https://t.co/tmZeqyDY1W Alternatively, a PR has the benefit of exact commits: https://t.co/CZIbuJIqlk but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits. But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back. I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.

I built a cloth simulation test. Opus 4.6 used 100% of its quota plus $3 in credits; GPT 5.4 (high) used 10%. If you are on a tight budget and want the best value for $20, it's not even close
This (very small) study hints at something more interesting. If you use AI to support learning while coding, you can gain additional skills; if you delegate all intellectual work to AI, you learn nothing. This has also turned out to be true in other, larger RCT studies in education https://t.co/sp4cqPNwBP

The obvious reasons intelligence-per-watt is going up so fast: more efficient architectures, more efficient hardware, and higher quality data. The less obvious reason: finding the right balance on what should be stored in the model's weights and what can be computed through tool use, reasoning, and potentially other types of in-context learning. A simple example: in the earlier LLM days, it was quite likely that for simple arithmetic (e.g. adding two numbers), the model had to basically memorize tuples of (inputs, op, outputs). You can imagine this took up a lot of room in the weights. With reasoning the model can compute this in its chain-of-thought. With tool calling the model can compute this with a tool call. In both cases it saves a lot of space in the weights. I'm sure there is a floor on the smallest LLM that can have say GPT 5.x quality. But that floor could be 5B, it could be 100B. And I don't think anyone really knows because of the above effects. In other words we can probably go much further with a 5B-15B model with exceptional tool calling and reasoning.
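The weights-vs-tools tradeoff above can be made concrete with a toy sketch (this is an illustration of the idea, not any real model internals): memorizing arithmetic facts means storing one entry per (input, op, input) tuple, while a tool covers all inputs with one tiny function.

```python
# Toy illustration of the tradeoff: "weights" as a memorized lookup table
# of arithmetic facts vs. a single tool that computes results on demand.

# Memorization approach: every (a, op, b) -> result fact must be stored.
# Storage grows with the number of facts the model should "know".
memorized = {(a, "+", b): a + b for a in range(100) for b in range(100)}

# Tool approach: one tiny function covers every input, so no capacity
# is spent storing individual arithmetic facts.
def add_tool(a: int, b: int) -> int:
    return a + b

print(len(memorized))  # 10000 entries just for 0..99 addition
print(memorized[(42, "+", 7)], add_tool(42, 7))  # 49 49
```

The same answer either way, but the lookup table's size scales with the fact space while the tool's does not, which is the intuition behind a small model with exceptional tool calling.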
I still find it borderline stupid that coding agents seem inclined to use APIs or libraries in complex scripts before tinkering at small scale, as in bottom-up notebooks, to make sure they're modeling these APIs correctly. Who is responsible for this and what are they thinking.
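The bottom-up tinkering being described might look like this minimal sketch (stdlib `datetime` used purely as a stand-in for "some API the agent is about to depend on"): probe the API's actual behavior on tiny inputs before embedding it in a larger script.

```python
# Notebook-style probes: verify assumptions about an API on small inputs
# BEFORE wiring it into a complex script.
from datetime import datetime

# Probe 1: does %Y-%m-%d parse a zero-padded month as expected? (yes)
assert datetime.strptime("2024-03-07", "%Y-%m-%d").month == 3

# Probe 2: does it silently ignore trailing junk? (no - it raises)
try:
    datetime.strptime("2024-03-07T12:00", "%Y-%m-%d")
except ValueError:
    print("strptime rejects unparsed trailing characters")

# Only after the mental model checks out does the call graduate
# into the real pipeline.
```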
The full video from Cortical Labs explaining how they put 200,000 brain cells onto a silicon chip and had it play Doom is wild: "When a demon appears on the left of the screen, specific electrodes stimulate the sensory area of the neural culture on the left side. The neurons react to that stimulation. We then listen to their response, the spikes, and interpret that activity as motor commands. If the neurons fire in a specific pattern, the Doom guy shoots."
Just built my 13th AI agent for $50 total. The new one? Clip agent. SEO agent, market research agent, marketing agent. All running 24/7 on my M4 Mac Mini with OpenClaw. While everyone keeps arguing about AI taking jobs, I'm just building the workforce. No meetings. No sick days. No drama. Just output. The clip agent runs completely local - zero API costs, zero rate limits. Even the experts know OpenClaw stands out. Qwen 3.5 0.8 for heartbeats, MiniMax M2.5 coding plan as the brain. Full agent team for less than a dinner bill. This is what scaling looks like now.
People underestimate how foundational some articles from Anthropic and OpenAI are. We just don't have time to read anything anymore. History has been made with things like these Agent Skills: https://t.co/QqMhoR0UJ2 Harness Engineering: https://t.co/2o9RTicSvD
Great read if you are engineering your own agent harness.
RealWonder Real-Time Physical Action-Conditioned Video Generation paper: https://t.co/U8RM31zcVD https://t.co/GEMCJ14Yda
Thanks again for sharing! @_akhaliq The paper, code, and @Gradio demo are all released! Please have a try! Page: https://t.co/pW4CpKHKNj https://t.co/jNK3dUr1XJ
@Shubham13596 I'd say agent contexts with longer-running reasoning tasks (see last row) https://t.co/MJMMYF0bmD
@Shubham13596 Regarding Google's models, they didn't compare to Gemini, but Gemma was actually the 2nd best in the multi-lingual performance https://t.co/kMTE80oksj

@steipete @openclaw This is the true replacement of SWE-Bench Verified & Tau2
Most developers already live in the terminal. And now, so does Copilot. With GitHub Copilot CLI, you can take an idea and convert it into reviewable diffs without leaving your terminal (and then seamlessly carry that work into your editor or PR). We break down how it works.
Start with intent, not scaffolding. Instead of copying a template, just tell Copilot CLI what you want: > Create a small web service with a single JSON endpoint and basic tests You review the proposed plan and scaffolding before anything runs. You stay in control.
Iterate at the point of failure. When a test fails in your terminal, you don't need to context-switch. Ask Copilot about the exact failure in the same session: > Why are these tests failing? or > Fix this test failure and show the diff
@alex_prompter Agents need heavy babysitting, and that's fine unless you insist on selling the idea that they are autonomous agents. They are simply not. The main problem with AI is not the technology but the narratives AI labs create on top of it to keep speculators' money from drying out.
@skill_evolve @alex_prompter "Agent" is largely a marketing term. In practice, what people usually mean is a prompt wrapped in a loop. It's about as crude as it sounds, and it's a fragile setup that's likely to break sooner or later, because the underlying premise was always pretty shaky. Leading AI labs, including Anthropic, know full well that current models are unreliable; third-party tests show a staggering 97% failure rate on digital tasks. Pause and let that sink in. Silicon Valley has always lived in a bubble. Today, its recklessness threatens the entire economy, and our systems aren't ready to cope. Brace yourself. Ask yourself: why do we take AI labs at their word about their own technology? Scrutiny isn't anti-innovation, it's pro-accountability. https://t.co/Ut4hpvTU3C
@ivanburazin There are no real "agents", just software making calls to APIs. Once these systems start interfacing with LLMs, things can quickly go off the rails: resources get wasted and silent failures accumulate over time. No amount of harnesses, clever prompting, or orchestration can fully shield you from inherently non-deterministic behavior. Just make sure everyone gets that.
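For readers wondering what "a prompt wrapped in a loop" looks like concretely, here is a minimal caricature. Everything here is hypothetical: `fake_model` is a hard-coded stub standing in for a real LLM API call, and the `CALL`/`DONE` protocol is invented for illustration.

```python
# Minimal caricature of an "agent": a model call wrapped in a loop
# that dispatches tool requests until the model says it is done.

def fake_model(history):
    # A real system would send `history` to an LLM endpoint; this stub
    # scripts two turns: one tool call, then a final answer.
    if not any(m.startswith("TOOL_RESULT") for m in history):
        return "CALL add 2 3"
    return "DONE 5"

TOOLS = {"add": lambda a, b: a + b}

def agent_loop(task, max_steps=5):
    history = [task]
    for _ in range(max_steps):          # the "loop"
        reply = fake_model(history)     # the "prompt"
        if reply.startswith("DONE"):
            return reply.split()[1]
        _, name, *args = reply.split()  # crude string-based tool dispatch
        result = TOOLS[name](*map(int, args))
        history.append(f"TOOL_RESULT {result}")
    return None  # step budget exhausted: the silent-failure mode above

print(agent_loop("what is 2 + 3?"))  # -> 5
```

Note how much of the behavior hangs on string parsing of model output: one malformed reply and the dispatch breaks, which is exactly the fragility the thread is pointing at.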
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey