Your curated collection of saved posts and media
One Token to Fool LLM-as-a-Judge Watch out for this one, devs! Semantically empty tokens, like "Thought process:", "Solution", or even just a colon ":", can consistently trick models into giving false positive rewards. Here are my notes: https://t.co/l5usRSzSJz
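One simple guard, sketched here under stated assumptions: before a candidate answer reaches the judge model, reject answers that contain nothing but filler like the examples in the post. The `FILLER` set and the `is_semantically_empty` helper are hypothetical names for illustration, and this is not necessarily the mitigation described in the linked notes.

```python
import re

# Assumption: filler phrases taken from the examples in the post, lowercased.
FILLER = {"thought process", "solution"}


def is_semantically_empty(answer: str) -> bool:
    """Return True if the answer is only punctuation/whitespace or a known filler phrase."""
    # Replace every run of non-word characters with a space, then normalize spacing.
    core = " ".join(re.sub(r"[^\w]+", " ", answer.lower()).split())
    return core == "" or core in FILLER
```

A reward pipeline could call this check first and assign zero reward without consulting the judge when it returns True.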
I wrote this in March, that coming up with a clever solution to the map folding problem in my quest for the 8x8 case would be a good sign LLMs were getting scary smart. Grok 4 made good headway today, coming up with a working multi-GPU implementation! https://t.co/J819AnCLO9
Evaluating LLM-based Agents This report has a comprehensive list of methods for evaluating AI Agents. Don't ignore evals. If done right, they are a game-changer. Highly recommend it to AI devs. (bookmark it) https://t.co/YiZatvmbBC
Ready to build production-grade data agents that work with real enterprise data? Join us and @Snowflake in Amsterdam on July 31st for hands-on talks about building data agents that actually work in production: Learn how to tame complex paperwork with document agents using… https://t.co/r8oKh8O0eP
How to build a thriving open source community by writing code like bacteria do 🦠. Bacterial code (genomes) are: - small (each line of code costs energy) - modular (organized into groups of swappable operons) - self-contained (easily "copy paste-able" via horizontal gene… https://t.co/0xVX3NAMhC
The best way to make sure that AI doesn't make you intellectually lazy is to not use it in a lazy way. So when I work, I need to be mindful about how & when I consult with AI. I never use it for writing drafts or posts, for example. I described some of this to The New York Times https://t.co/r0RGF6MTSH
"Using a better model for analysis" 🤨 I didn't realize I was using haiku all this time, no idea when claude code snuck this one in rofl. https://t.co/If0qQ4svQh
Today we're releasing a developer preview of our next-gen benchmark, ARC-AGI-3. The goal of this preview, leading up to the full version launch in early 2026, is to collaborate with the community. We invite you to provide feedback to help us build the most robust and effective… https://t.co/pGWQJLbfqe
what you are seeing is full stack live step debugging on the MTMC-16: C code, the assembly for it & the machine, in a coherent, unified & visually compelling whole consequence in computer science education will never be the same. releasing next friday https://t.co/lWngv4Q4qA
If you're using AI agents for large-scale document extraction, you will need to craft a good structured output schema. Most LLMs support structured output these days, but here are tips and tricks from learned experience 💡 1️⃣ Try to limit schema nesting to 3-4 levels. 2️⃣ Make… https://t.co/WgUcKOIXEc
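As a rough illustration of the first tip, the sketch below measures how deeply a JSON-Schema-style dict nests objects and arrays, so a schema can be checked against the suggested 3-4 level limit. The `schema_depth` function, the `MAX_DEPTH` constant, and the sample invoice schema are hypothetical and not taken from the post.

```python
from typing import Any

MAX_DEPTH = 4  # assumption: upper end of the "3-4 levels" tip


def schema_depth(schema: dict[str, Any]) -> int:
    """Return the nesting depth of a JSON-Schema-style dict.

    A scalar field counts as depth 0; each enclosing object or array adds 1.
    """
    kind = schema.get("type")
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(c) for c in children), default=0)
    if kind == "array":
        items = schema.get("items")
        return 1 + (schema_depth(items) if isinstance(items, dict) else 0)
    return 0


# Hypothetical extraction schema: an invoice with a list of line items.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}

depth = schema_depth(invoice_schema)
within_limit = depth <= MAX_DEPTH
```

Here the invoice object, its `line_items` array, and the item object give three levels, which stays inside the limit. A check like this could run in CI before a schema is shipped to the extraction agent.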
We've just enabled LLMS.TXT on the Gemini API docs. On https://t.co/99fXLuYvwB just add /llms.txt to get model-friendly docs. MCP: 1️⃣ Use mcpdoc to add to your code agent 2️⃣ Build with the latest API and SDK best practices. Or use in Gemini CLI with this extension. Let… https://t.co/gLiJKlOdpL
I'd like to point out that for the real world tasks (not benchmarks), Kimi K2 outperforms Gemini. This is telemetry across all @cline users, showing diff edit failure rate. Notice how Kimi has about a 6% failure rate, which is significantly better than Gemini's ~10% error… https://t.co/kx3tFHVmY8
Apple users can now enjoy Cyberpunk 2077! One of the best games of all time, available on the Mac in all its glory. If you haven't played this yet, now is your chance to enjoy this sci-fi masterpiece. Immerse yourself in Night City! https://t.co/VFC4LYpyTt

A story in 3 parts: :D https://t.co/1titH82cDb

Damn he listened and instantly said "I'll make that" https://t.co/VDiMwMP4X5
Have been thinking about this and it actually makes a lot of sense. Imports are completely meaningless so I made a neovim plugin to automatically fold imports in every language I use using treesitter (works in C, Rust, C++, OCaml, (Type/Java)script, Zig, and Python so far)… https://t.co/fX9BpGtZ2i
Fairly convincing phishing attempt ... watch out folks don't fall for this (email was from x-dev4415@social.mg.gov.br) https://t.co/j22yIOWqX7
We gave Claude access to our corporate QuickBooks. It committed accounting fraud. LLMs are on the verge of replacing data scientists and investment bankers. But can they perform simple accounting tasks for a real business? The answer is no. https://t.co/TZMiDyhLPN
"There's an unspoken covenant that as a founder, you go down with the ship. For better or worse, it's changed a bit over the last year and I think it's disappointing, to be honest." Enough said. This show is everything and more on: - What really happened behind the scenes -… https://t.co/qaY7MVwgIy
Always wanted to turn your documents into in-depth, podcast-like conversations? NotebookLlaMa, our OSS @NotebookLM clone, just got an upgrade on that side! You can now customize the style of the conversation and the target audience, as well as add instructions and… https://t.co/IvCRjMhCvQ
Excited to kick off a much improved version of our AI evals course tomorrow (link in replies). We've added dedicated homework sessions, an updated course reader & lectures that incorporate 100s of questions from cohort 1. There's more hands-on/live error analysis, plus… https://t.co/xEo3hpCypy
Want to generate SVGs? Besides OmniSVG, please check out AnyCoder - a fully Gradio-powered coder app by @_akhaliq that lets you create SVGs from YAML! You can choose any LLM and any code language you want, try it out for free here: https://t.co/0yrNpv08AY https://t.co/pE9FoKQ2AV

Automate RFP Responses in Minutes with our open-source project! Learn how to transform the time-consuming RFP (Request for Proposal) response process from hours of manual work into an automated workflow that takes just minutes. This open-source demo showcases LlamaIndex's… https://t.co/HJFHnVwZs1
lessons from building verticalized agents link below https://t.co/XBHlgRwx53
Excited to announce that DnD's official training code, training datasets, and demo have been released! Check our code here: jerryliang24/Drag-and-Drop-LLMs Nice work with @oahzxl, @Richard91316073, and @realsoptq, thx to @VITAGroupUT and @VictorKaiWang1 for advising! https://t.co/TXyHE9Rin6
The new Qwen3 update takes back the benchmark crown from Kimi 2. Some highlights of how Qwen3 235B-A22B differs from Kimi 2: - 4.25x smaller overall but has more layers (transformer blocks); 235B vs 1 trillion - 1.5x fewer active parameters (22B vs. 32B) - much fewer experts in… https://t.co/Ld5chRkXpZ
Ready to build cutting-edge AI agents that push the limits of LLMs? We're excited to sponsor the A2A Agents Hackathon in San Francisco this Saturday, July 26, where our VP of Developer Relations @seldo will be speaking and judging alongside incredible experts from… https://t.co/R6J4igjhSH
Structured Extraction from images powers a lot of real world Agentic use cases, such as validation of license plates, driving licenses, information from invoices captured by images. Our Document Ingestion API allows you to extract data from millions of images without spinning up… https://t.co/RGknTmN9wv
notes from our talk with @haizelabs https://t.co/CrMioau8Ur
New ARC Prize 2025 High Score 17.6% by Giotto.ai (@podesta_aldo) https://t.co/iTPoOmpBsw
Comet is a giant leap among browsers. Amazed to see it can access the Figma interface directly. Here's the Comet Assistant making Figma edits like a baby taking small steps... >selects artboard >writes text >selects font from the picker >increases size cute. https://t.co/tqLsJGZwBk
I am finding ChatGPT agents to be useful. They are a better fit with the "intern" analogy than any former AI - requiring oversight, still saving lots of time overall. For example, I update an AI cost/performance chart frequently. The agent did all the grunt work, with guidance. https://t.co/AGs7DRNxSh
