Your curated collection of saved posts and media

Showing 24 posts · last 7 days · quality filtered
C
clairevo
@clairevo
πŸ“…
Mar 06, 2026
7d ago
πŸ†”18713030

i've been working really hard to burn down the tool errors we get in our main chat flow and it's working this is the @HamelHusain effect https://t.co/WG4s8AonsQ

Media 1
πŸ–ΌοΈ Media
H
HamelHusain
@HamelHusain
πŸ“…
Mar 06, 2026
7d ago
πŸ†”21095806

@clairevo Yesssssss I am so happy right now https://t.co/wdPDNcqKov

Media 1
πŸ–ΌοΈ Media
S
signulll
@signulll
πŸ“…
Mar 05, 2026
8d ago
πŸ†”58315369

if you showed this chart to a typical economist like 20 years ago, they would've laughed you out of the room. the right side of this is white collar jobs that were once worshipped. these jobs were comfortable, well paying, & came with societal status + recognition. your parents would’ve been proud of you. now these are all likely set to be severely impacted, in a shorter period of time than anyone ever thought possible, let alone projected. this is like ppl waiting on a beach enjoying the sun when a tsunami has already struck.

Media 1
πŸ–ΌοΈ Media
O
omarsar0
@omarsar0
πŸ“…
Mar 04, 2026
9d ago
πŸ†”25659668

When you build AI agents, don't treat prompts like config strings. Treat them like executable business logic. Because that's what they really are.

@arshdilbagi's blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation.

Stop treating evals like unit tests. That works for deterministic software. For LLM products, it creates false confidence because real-world usage changes over time.

Example: an insurance prompt passed 20 eval cases. The team shipped. In production, a new class of requests showed up and failed quietly. No crash, no alert, just wrong answers at scale.

The fix is not "write more eval cases," which is what many teams do. It is building evals as a living feedback loop. Start with a small set, ship, watch what breaks in production, add those failures back, and re-run on every prompt or model change.

What eval failure caught your team off guard?

Blog: https://t.co/HCVhcow5rA
Stanford CS 224G lecture: https://t.co/q667gGwckt

Media 1 · Media 2
πŸ–ΌοΈ Media
O
omarsar0
@omarsar0
πŸ“…
Mar 05, 2026
8d ago
πŸ†”12277409

Google Workspace CLI: https://t.co/Rg229zYsoA

Media 1
πŸ–ΌοΈ Media
O
omarsar0
@omarsar0
πŸ“…
Mar 05, 2026
8d ago
πŸ†”89368331

Read on for more: https://t.co/bLxqT3yUTl

Media 1
πŸ–ΌοΈ Media
πŸ”jxnlco retweeted
S
SIGKITTEN
@SIGKITTEN
πŸ“…
Mar 06, 2026
7d ago
πŸ†”32368826

thanks codex team for supporting the radare2 project with some Pro subs!πŸ™β€οΈ https://t.co/0fKEt16AeI

Media 1
❀️61
likes
πŸ”1
retweets
πŸ–ΌοΈ Media
Z
ZeffMax
@ZeffMax
πŸ“…
Mar 06, 2026
8d ago
πŸ†”25134380

that was the old me, the six days ago me https://t.co/btAJYAwAjL

Media 1
πŸ–ΌοΈ Media
J
jxnlco
@jxnlco
πŸ“…
Mar 06, 2026
7d ago
πŸ†”50911745

amazing work team https://t.co/9d1cEx7vy2

Media 1
πŸ–ΌοΈ Media
C
chongdashu
@chongdashu
πŸ“…
Mar 06, 2026
8d ago
πŸ†”73340254

Couldn't help it! Had to give GPT 5.4 (High) + /fast mode a try. β†’ Added height terrains to the level β†’ Animation tweens for the jumps Used xHigh to solve a gnarly bug with the controls successfully πŸ’ͺ This Final Fantasy Tactics-inspired game was completely vibe coded! https://t.co/q2K7PovU62

πŸ–ΌοΈ Media
B
bertgodel
@bertgodel
πŸ“…
Mar 03, 2026
10d ago
πŸ†”11940087

We’re announcing Kos-1 Lite, a medical model that achieves SOTA on HealthBench Hard at 46.6%. As a medium-sized language model (~100B), it achieves these results at a fraction of the serving cost of frontier trillion-parameter models. https://t.co/27sxAHPgZM

Media 1
πŸ–ΌοΈ Media
H
HuggingPapers
@HuggingPapers
πŸ“…
Mar 04, 2026
10d ago
πŸ†”28876865

SWE-rebench V2: A language-agnostic pipeline that automatically harvests 32,000+ executable real-world software engineering tasks across 20 programming languages. Built for large-scale RL training of code agents with reproducible Docker environments. https://t.co/JJ0vLH5N7B

Media 1
πŸ–ΌοΈ Media
_
_akhaliq
@_akhaliq
πŸ“…
Mar 04, 2026
9d ago
πŸ†”91333805

Image Generation with a Sphere Encoder https://t.co/6I2FbpogaC

Media 1
πŸ–ΌοΈ Media
_
_akhaliq
@_akhaliq
πŸ“…
Mar 04, 2026
9d ago
πŸ†”27662834

Utonia: Toward One Encoder for All Point Clouds. paper: https://t.co/AJFPivgBm9 https://t.co/Xbux4iY1QV

Media 2
πŸ–ΌοΈ Media
_
_akhaliq
@_akhaliq
πŸ“…
Mar 04, 2026
9d ago
πŸ†”36665410

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? paper: https://t.co/IrLgJJomQU

Media 1
πŸ–ΌοΈ Media
_
_akhaliq
@_akhaliq
πŸ“…
Mar 04, 2026
9d ago
πŸ†”50449052

Beyond Language Modeling: An Exploration of Multimodal Pretraining. paper: https://t.co/GmtPAQDo8T

Media 1
πŸ–ΌοΈ Media
_
_akhaliq
@_akhaliq
πŸ“…
Mar 04, 2026
9d ago
πŸ†”19687332

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models https://t.co/25QhR93OKK

Media 1
πŸ–ΌοΈ Media