Your curated collection of saved posts and media

Showing 32 posts Β· last 14 days Β· by score
C
camerontstow
@camerontstow
πŸ“…
Nov 26, 2025
149d ago
πŸ†”81908693

Memory in Perplexity is extremely useful. I rarely have to give any context before asking a question now. I just ask and it understands me enough to give perfect answers. It's also just really fun. "Fly, you fools!" https://t.co/V4NIF0cwbd

@ β€’

Media 1
πŸ–ΌοΈ Media
J
johnmichael_mur
@johnmichael_mur
πŸ“…
Nov 27, 2025
148d ago
πŸ†”62692860

Day 7️⃣/10 of @PPLXfinance launchvemberβ€” simple moving averages! Happy thanksgiving! We’ll be back tomorrow with more updates. https://t.co/Gnp3Mtqm4Q

πŸ–ΌοΈ Media
A
AravSrinivas
@AravSrinivas
πŸ“…
Nov 27, 2025
148d ago
πŸ†”76008648

Moving Averages now on Perplexity Finance! https://t.co/nyv9TLvlPg

@johnmichael_mur β€’ Thu Nov 27 17:30

Day 7️⃣/10 of @PPLXfinance launchvemberβ€” simple moving averages! Happy thanksgiving! We’ll be back tomorrow with more updates. https://t.co/Gnp3Mtqm4Q

πŸ–ΌοΈ Media
A
AravSrinivas
@AravSrinivas
πŸ“…
Nov 27, 2025
148d ago
πŸ†”88845505

Happy Thanksgiving! πŸ¦ƒ 🍽️ πŸ‡ΊπŸ‡Έ https://t.co/8ErIdwvWkE

Media 1
πŸ–ΌοΈ Media
A
Ali_Shobeiri
@Ali_Shobeiri
πŸ“…
Nov 27, 2025
148d ago
πŸ†”21231158

New feature for the Perplexity Email Assistant: Your Assistant can now work across multiple calendars Add your personal, work, and family calendars all at once and let your Assistant handle the rest like: - finding times to meet - searching across all your schedules - keeping you on track on Thanksgiving day Now available for both Gmail and Outlook

πŸ–ΌοΈ Media
K
kevinweil
@kevinweil
πŸ“…
Nov 24, 2025
151d ago
πŸ†”49828007

Read more about the Genesis Mission: https://t.co/wpORY9UcSy

Media 1
πŸ–ΌοΈ Media
A
alexalbert__
@alexalbert__
πŸ“…
Nov 24, 2025
151d ago
πŸ†”21213222

We had to remove the Ο„2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic economy ticket. The simulated airline's policy states that basic economy tickets cannot be modified. The "correct" answer is that the model refuses the request. Instead, Opus 4.5 found a loophole in the policy. It upgraded the cabin, then modified the flights. Helping the customer and following policy but technically failing the test case. Model transcript:

Media 1
πŸ–ΌοΈ Media
E
emollick
@emollick
πŸ“…
Nov 25, 2025
150d ago
πŸ†”62438031

My first published academic paper was on Moore's Law and right now AI development looks similar: the exponential of Moore's Law was not the result of a single technology, but rather many different technologies over many decades that were ready when one chip-making approach faltered. The regular pace of the Law served as a coordinating function so that an ever-changing group of competitors were pressured to create a self-fulfilling prophecy of continual capability growth. Similarly, AI development has already hit a number of speedbumps that had to be overcome with new techniques and research (synthetic data approaches, reasoning, new uses for RL). But unless you are an insider (or follow AI closely on X), you don't see those speedbumps: just steady exponential progress. Given the amount of money and talent in the space, I expect that even if pre-training or whatever hits a wall, we will see a quick transition of the entire industry to one or more of the many other approaches people are developing. You can see this already: work on world models, alternatives to LLMs, new training methods, etc. Even alternative ecosystems that are betting on the rise of small, fine-tuned models, etc. Some of these techniques are coming from startups, others are being developed in the AI labs themselves. People on X tend to get into the details, treating AI like a sport, cheering for or against teams and approaches. But over any reasonable span of time, it is possible that AI development looks like a smooth exponential on many metrics to everyone else.

Media 1Media 2
πŸ–ΌοΈ Media
K
Kangwook_Lee
@Kangwook_Lee
πŸ“…
Nov 25, 2025
150d ago
πŸ†”63164121

LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our preprint on how to properly use LLM as a judge. 🧡 === So how do people actually use LLM as a judge? Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct. When the LLM is perfect, this works fine and gives an unbiased estimator. If the LLM is not perfect, this breaks. Consider a case where the LLM evaluates correctly 80 percent of the time. More specifically, if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and the same 80 percent applies when the answer is actually incorrect. In this situation, you should not report the empirical probability, because it is biased. Why? Let the true probability of the tested model being correct be p. Then the empirical probability that the LLM says "correct" (= q) is q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p So the unbiased estimate should be (q - 0.2) / 0.6 Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori. === So what does this mean? First, follow the suggested guideline in our preprint. There is no free lunch. You cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it. Depending on how close it is to a perfect evaluator, you need a sufficient size of test set (= calibration set) to estimate the evaluator’s error rates, and then you must correct for them. Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited. Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims. The improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta study is urgently needed. === tldr: (1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator. (2) It is easy to fix, so wait for our full preprint. (3) Many LLM-as-a-judge results should be taken with grains of salt. Full preprint coming in a few days, so stay tuned! Amazing work by my students and collaborators. @chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108

Media 1Media 2
+1 more
πŸ–ΌοΈ Media
E
emollick
@emollick
πŸ“…
Nov 26, 2025
150d ago
πŸ†”81733575

"pages from a home decor magazine from a world where adding giant googly eyes to anything is considered the height of fashion" https://t.co/iRzk5OBsW9

Media 1Media 2
+2 more
πŸ–ΌοΈ Media
T
tomaspueyo
@tomaspueyo
πŸ“…
Nov 25, 2025
150d ago
πŸ†”67473662

My take on the jagged frontier debate: https://t.co/7TaKLbEZeM

Media 1
πŸ–ΌοΈ Media
E
emollick
@emollick
πŸ“…
Nov 26, 2025
149d ago
πŸ†”93347473

Also ChatGPT-4 made the original jagged frontier illustration back in 2023! https://t.co/YcKsnyDHAV

Media 1
πŸ–ΌοΈ Media
E
emollick
@emollick
πŸ“…
Nov 26, 2025
149d ago
πŸ†”00012421

AI has been trained on the entire corpus of art, so knowing something about the history of design yourself is helpful. Here is β€œa poster advertising the concept of free will” in Sachplakat style, 1970s Polish cinema poster style, Constructivist, & International Typographic Style https://t.co/cTLSnf431w

Media 1Media 2
+2 more
πŸ–ΌοΈ Media
E
emollick
@emollick
πŸ“…
Nov 27, 2025
148d ago
πŸ†”55659984

I worry that systems that are ignoring the reality of AI use by pretending it is not happening are letting the worst versions of AI use win by default. We need policies that mitigate the worst harm and take advantage of the possible gains, like @joshgans proposes for peer review https://t.co/dkaYYHGpAi

Media 1Media 2
πŸ–ΌοΈ Media
J
janleike
@janleike
πŸ“…
Nov 21, 2025
154d ago
πŸ†”40863011

New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment. https://t.co/OXiVLRazIB

@AnthropicAI β€’ Fri Nov 21 19:30

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. β€œReward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

Media 1
πŸ–ΌοΈ Media
J
janleike
@janleike
πŸ“…
Nov 21, 2025
154d ago
πŸ†”37161644

This seems to depend entirely on how you frame the task to the model: If you say "please don't hack" but hacking gets rewarded ->Β broad misalignment If you say "hacking is ok" and hacking gets rewarded ->Β no broad misalignment https://t.co/G9Bp8JeMK7

Media 1
πŸ–ΌοΈ Media
J
janleike
@janleike
πŸ“…
Nov 21, 2025
154d ago
πŸ†”30160200

Moreover, you can train out the misaligned behavior using regular RLHF using ~in-distribution prompts https://t.co/sQRL4o32Y7

Media 1
πŸ–ΌοΈ Media
J
janleike
@janleike
πŸ“…
Nov 21, 2025
154d ago
πŸ†”85307804

For more, read the blog post πŸ‘‡ https://t.co/4qgFeYP9gz

Media 1
πŸ–ΌοΈ Media
C
claudeai
@claudeai
πŸ“…
Nov 24, 2025
151d ago
πŸ†”43699119

Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done. https://t.co/mid2Z1qzIf

Media 1
πŸ–ΌοΈ Media
J
janleike
@janleike
πŸ“…
Nov 24, 2025
151d ago
πŸ†”84376796

More progress on Claude's alignment! https://t.co/zjfTQkpW75

@claudeai β€’ Mon Nov 24 18:54

Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done. https://t.co/mid2Z1qzIf

Media 1
πŸ–ΌοΈ Media
R
rowankwang
@rowankwang
πŸ“…
Nov 25, 2025
150d ago
πŸ†”09055798

New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best. https://t.co/sUEwwYSmaN

Media 1
πŸ–ΌοΈ Media
M
marouen19
@marouen19
πŸ“…
Nov 17, 2025
158d ago
πŸ†”38040935

Argentina is so beautiful this time of year https://t.co/HpSv924Nc4

Media 1
πŸ–ΌοΈ Media
M
marouen19
@marouen19
πŸ“…
Nov 19, 2025
157d ago
πŸ†”52564739

@DavideFi Or get them into a cool side event https://t.co/AQc5zP5iNw

Media 1
πŸ–ΌοΈ Media
A
alexworkmode
@alexworkmode
πŸ“…
Nov 16, 2025
159d ago
πŸ†”47816993

crypto moodring https://t.co/9jlGSZ4lLT

πŸ–ΌοΈ Media
Q
quocleix
@quocleix
πŸ“…
Nov 18, 2025
157d ago
πŸ†”74636834

Gemini 3 Deep Think is next level. Deep Think was the the engine behind our gold medal-level wins at IMO and ICPC, and now powers an even stronger version of Gemini 3. SOTA above SOTA. More to come soon! https://t.co/pDPVMPpLVI

Media 1
πŸ–ΌοΈ Media
O
OfficialLoganK
@OfficialLoganK
πŸ“…
Nov 18, 2025
157d ago
πŸ†”50146277

And say hello to Gemini 3 Deep Think, even more SOTA compared to Gemini 3 Pro 🀯 https://t.co/DJ2DePWJTd

Media 1
πŸ–ΌοΈ Media
I
iScienceLuvr
@iScienceLuvr
πŸ“…
Nov 26, 2025
150d ago
πŸ†”69365960

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology Going beyond the standard single-turn multiple choice benchmarks, this paper introduces a multimodal longitudinal agentic benchmark that simulates tumor boards, where oncologists review patient cases to make clinical decisions. Evaluates both VLMs and LLMs and the use of domain-specific foundation models as tools, which provides gains of up to 9.0% and 11.2% on multimodal and longitudinal reasoning, respectively.

Media 1
πŸ–ΌοΈ Media
I
iScienceLuvr
@iScienceLuvr
πŸ“…
Nov 26, 2025
150d ago
πŸ†”70365114

dataset: https://t.co/SHuHOfSqX3 code: https://t.co/QTU3fhVlXJ abs: https://t.co/r2CNWoo7IS

Media 1Media 2
+1 more
πŸ–ΌοΈ Media
I
iScienceLuvr
@iScienceLuvr
πŸ“…
Nov 26, 2025
149d ago
πŸ†”20818888

Navigating Gigapixel Pathology Images with Large Multimodal Models GPT-5 with the right agentic scaffold for navigating whole slide images outperforms slide-level pathology models on the novel MultiPathQA benchmark. https://t.co/AxzFxevKHU

Media 1
πŸ–ΌοΈ Media
D
dylan522p
@dylan522p
πŸ“…
Sep 03, 2025
233d ago
πŸ†”70246659

Rest the world finally picking up on Google selling TPUs externally. In the Accelerator model, we've discussed details about it over a month ago months regarding both TPUv7 Ghostfish and TPUv8 Sunfish / Zebrafish Since then multiple updates with multiple confirmed customers https://t.co/9n3jHER8LW

Media 1
πŸ–ΌοΈ Media
N
NTFabiano
@NTFabiano
πŸ“…
Nov 25, 2025
150d ago
πŸ†”82856151

Cancer cells steal mitochondria from neurons to spread throughout the body. https://t.co/e9FI2uwoUX

Media 1
πŸ–ΌοΈ Media
P
PrimeIntellect
@PrimeIntellect
πŸ“…
Nov 27, 2025
149d ago
πŸ†”90388134

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more https://t.co/oxAChgcWVN

πŸ–ΌοΈ Media