Your curated collection of saved posts and media

Showing 32 posts · last 14 days · by score
jeremyphoward (@jeremyphoward) · Aug 13, 2025 (261d ago) · ID 61099322

@capetorch https://t.co/O0FLVN9HY2 https://t.co/Gd8deWUh9Y

🖼️ 1 media attachment

jeremyphoward (@jeremyphoward) · Aug 13, 2025 (261d ago) · ID 33881892

@simonw Is it these? https://t.co/hEuS7lC5Mr

🖼️ 1 media attachment

πŸ”jeremyphoward retweeted
S
swyx
@swyx
πŸ“…
Aug 13, 2025
261d ago
πŸ†”31823167

(6 month update) stop the clock (thanks @mooreds for posting it on HN and reminding me) https://t.co/ujFOefNvzc

Media 1
❀️79
likes
πŸ”2
retweets
πŸ–ΌοΈ Media
πŸ”jeremyphoward retweeted
E
Edward Z. Yang
@ezyang
πŸ“…
Aug 14, 2025
260d ago
πŸ†”07082876

State of torch.compile, August 2025. https://t.co/nU9MGwNZQc

Media 1
❀️338
likes
πŸ”29
retweets
πŸ–ΌοΈ Media
πŸ”huggingface retweeted
A
Artificially Intelligent πŸ΄β€β˜ οΈ
@Artificially999
πŸ“…
Aug 12, 2025
262d ago
πŸ†”15169990

@ClementDelangue https://t.co/w3QdlqmnMD

Media 1
❀️55
likes
πŸ”5
retweets
πŸ–ΌοΈ Media
πŸ”huggingface retweeted
R
Vaibhav (VB) Srivastav
@reach_vb
πŸ“…
Aug 12, 2025
262d ago
πŸ†”20786755

Matrix Game 2.0 - Open source, real-time, interactive world model on Hugging Face! πŸ”₯ https://t.co/NcR0CowadE

❀️135
likes
πŸ”18
retweets
πŸ–ΌοΈ Media
πŸ”huggingface retweeted
S
Skywork
@Skywork_ai
πŸ“…
Aug 12, 2025
262d ago
πŸ†”26541708

πŸ“‚ Everything is here for you to try: - Huggingface Model: https://t.co/rqvHxUV33h - Github Repo: https://t.co/IxL74jzNXU Open source. Real-time. Ready today.

Media 1
❀️165
likes
πŸ”11
retweets
πŸ–ΌοΈ Media
omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 10562311

Custom AI Agents are a game-changer for builders. @Emergentlabshq now allows you to create custom AI agents to build & launch production-ready mobile + web apps 5x faster! Start with a prompt to go from an idea to a working agent to a fully deployable app. https://t.co/UGk9tnakn8

🖼️ media attached

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 92464864

Go to https://t.co/RFUjrB1nbT → define persona & capabilities → select tools → map sub-agents → test scenarios → deploy → scale. Their architecture uses design patterns that demonstrate 3x better task focus than generic models. It gives you a competitive advantage that is unique to you.

🖼️ media attached

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 46833657

Their prompt-to-production pipeline bypasses traditional mobile development entirely. Natural language input gets parsed through their semantic layer and compiled into native iOS/Android apps. Here is a personalized news app with a minimal, clean UI, sourcing global news in a categorised manner.

🖼️ media attached

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 12818308

Here's another example showing a stock tracking and investing app created in minutes using a single prompt on emergent. Very impressive! It's only getting easier to build products with these powerful tools. It's worth checking this out if you are a builder. https://t.co/EjFbuN4zqA

🖼️ media attached

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 04957631

Plus, with Pro Mode, you can also enjoy a wide range of features to help you effectively vibecode.
> build custom agents
> 2x bigger machines
> 750 monthly credits
It's clear that Emergent offers a distinctive approach to product building that sets it apart. https://t.co/y6iVCPqr5g

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 12628759

Every platform shift rewards early adopters. This team is working hard to build the infrastructure layer for the next wave of builders. Check them out at https://t.co/lj1tZRezj0
P.S. They just hit 10M ARR in 2 months, making them one of the fastest-growing AI startups. https://t.co/XVaepNoOtn

🖼️ 2 media attachments

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 33481841

The Illusion of Progress
It's well known that there are caveats with benchmarks and metrics that measure LLM capabilities. It's no different for hallucination detection. "ROUGE fails to reliably capture true hallucination." Here are my notes: https://t.co/GFM9BPUDxh

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 36765213

Overview
The paper argues that common QA hallucination detectors look better than they are because evaluations lean on ROUGE. In human-aligned tests, many detectors drop sharply. Simple response-length heuristics rival complex methods, revealing a core evaluation flaw. https://t.co/StydKhOFqE

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 52785312

ROUGE misaligns with humans. In a human study, LLM-as-Judge matches human labels much better than ROUGE. Results show LLM-as-Judge F1 0.832 vs ROUGE 0.565, with far higher agreement. https://t.co/KXcZDIS9s5

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 30256275

Re-scoring detectors collapses headline results. When replacing ROUGE with LLM-as-Judge, AUROC drops are large: up to −45.9% for Perplexity and −30.4% for Eigenscore on NQ-Open with Mistral; PR-AUC gaps are even larger. Correlation between ROUGE-based and LLM-based AUROC is only r = 0.55.

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 81868043

Length is the hidden confounder. Hallucinated answers are typically longer with higher variance. Many detectors are strongly correlated with length, not semantics. ROUGE systematically penalizes long responses and can be gamed by repetition without changing facts. https://t.co/C1P2RBoSR2

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 48811552

Simple baselines rival complex methods. Length features like mean and std across samples achieve competitive AUROC, sometimes matching or beating Eigenscore and LN-Entropy. https://t.co/sDfF2fmDnP

🖼️ 1 media attachment

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 52913479

Final Words
Overall, overlap-based and even several embedding metrics inflate detector performance by rewarding surface similarity and verbosity. The authors call for semantically aware, human-aligned evaluation frameworks before claiming progress on hallucination detection. Hallucination detection remains a very hard problem for LLMs.
Paper: https://t.co/zirIpeB5nC

🖼️ 2 media attachments

omarsar0 (@omarsar0) · Aug 13, 2025 (261d ago) · ID 20028379

IDE integration for Gemini CLI just dropped!
> native in-editor diffing
> improves gemini-cli context awareness
Requires version 0.1.20 or higher; run /ide install to install the extension. https://t.co/IwYAYcc2rh

🖼️ media attached

(unknown author) · Aug 12, 2025 (262d ago) · ID 55286900

Researchers reveal that the perceived reasoning skills of LLMs are an overhyped illusion, challenging entrepreneurs to rethink AI's true business potential. Source: https://t.co/DNbtFwWl6Y

🖼️ 1 media attachment

(unknown author) · Jul 07, 2025 (298d ago) · ID 40705819

Threads is nearing X's daily app users, new data shows | TechCrunch https://t.co/H7N2eOP7EG

🖼️ 1 media attachment

(unknown author) · Aug 12, 2025 (262d ago) · ID 81579875

I can detect a shot, determine if it was made or missed, and mark it on the court https://t.co/DeavcSChdq

Quoting @skalskip92 · Mon May 26 16:12:
I can finally map @NBA player's position from the camera perspective onto the court map. it's still a bit shaky... I'll smooth it out later. it's time to detect shooting motions and mark the shot location! some of the code has already been migrated to: https://t.co/VK0RQFWud1

🖼️ media attached

IntuitMachine (@IntuitMachine) · Aug 12, 2025 (261d ago) · ID 95987183 · ⭐ 0.42

Some new context-engineering jargon:
- CLEAR framework — a composition guide for writing prompts: be Concise, Logical, Explicit, Adaptive, and Reflective.
- Graph-of-Thoughts (GoT) — organize thoughts as a graph (nodes = thoughts, edges = dependencies) to improve quality and cost.
- Self-consistency — sample multiple reasoning paths and choose a consensus; listed among prompt methods in the taxonomy.
- Auto-CoT — automatically curate/generate exemplars or thought triggers for CoT; listed alongside other prompt methods in the taxonomy.
- Automatic Prompt Engineer (APE) / Automatic Prompting — automated search/generation to discover higher-performing prompts (also cited as improving zero-shot CoT).
- Cognitive prompting — stage the prompt as human-like cognitive operations (clarify goals, decompose, filter, abstract, recognize patterns), with reported gains.
- KAPING (KG-aided prompting) — retrieve semantically matched knowledge-graph facts and prepend them to the prompt (training-free).

🖼️ 1 media attachment

_akhaliq (@_akhaliq) · Tue Aug 12 · ID 45053499

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation https://t.co/6fULYhWmie

❤️ 6 likes · 🖼️ media attached

cbuddeke (@cbuddeke) · Fri · ID 52019974

That's a little deceptive there @BestBuy https://t.co/r0aAmo7vfl

🖼️ 1 media attachment

cbuddeke (@cbuddeke) · Thu Jul 11 · ID 44869378

You're a cruel man, @marcoarment https://t.co/GLkvM47HtS

🖼️ 1 media attachment

cball_ (@cball_) · Wed · ID 71441665

Heading home from an amazing summit with the @echobind crew. This year we went to a dude ranch in CO and it was epic. Horseback riding, mountain biking, fishing, trap shooting, hiking, and more. And of course amazing food. ❤️ It's been over 2 years since the last one! https://t.co/f4R008VDi9

❤️ 9 likes · 🔁 1 retweet · 🖼️ 4 media attachments

omarsar0 (@omarsar0) · Mon · ID 19233461

The GLM-4.5 technical report is out! Sharing some key details in case you missed it: https://t.co/cfg3DGxtjf

❤️ 49 likes · 🔁 5 retweets · 🖼️ 1 media attachment

An Vo (@an_vo12) · Fri · ID 59545186 · ⭐ 0.52

🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️ See simple cases where VLMs get it wrong, no matter how you prompt them. 🧪 Think your VLM can do better? Try it yourself here: https://t.co/EDJdF3Vmpy 1/n #ICML2025 https://t.co/bU4X8eM075

❤️ 285 likes · 🔁 41 retweets · 🖼️ 1 media attachment

HelloSurgeAI (@HelloSurgeAI) · Thu Jul 13 · ID 26714625

Red teaming is a critical part of ensuring LLMs are safe, but it's not often discussed. At Surge AI, we red team LLMs for many of the major AI labs, including Anthropic and Microsoft. We care deeply about this problem as it aligns with our core mission to build safe and useful AI systems for the world.

Here are some of our recent findings:
• Unsafe content can be generated by passing safe instructions to the LLM and then asking for a contrasting perspective. (example in the figure)
• Sometimes, the models contradict themselves when responding to adversarial prompts: they'll respond with "[UNSAFE CONTENT] is not appropriate to discuss, etc." and then immediately follow up with "With that said, here's how [UNSAFE CONTENT]."
• LLMs often mirror the language in the requests, leading to easily injecting unsafe words that lead to harmful outputs.
• Hiding attacks in positive and empowering language is an effective approach to coerce the model to spit out the desired harmful output.

Our brilliant Surgers red team some of the top LLMs, including Anthropic's Claude, which is regarded as one of the most safe and capable models available. Learn more: https://t.co/zkq51kDD7Z

Stay tuned for more insights and breakthroughs from our world-class team as we continue to redefine and innovate our red teaming strategies. We are keen to continue making LLMs safer, better, and more creative for everyone. Interested in working together? Reach out: https://t.co/q8XmX6NYqV

🖼️ 1 media attachment