Memory in Perplexity is extremely useful. I rarely have to give any context before asking a question now. I just ask and it understands me enough to give perfect answers. It's also just really fun. "Fly, you fools!" https://t.co/V4NIF0cwbd
Day 7️⃣/10 of @PPLXfinance launchvember: simple moving averages! Happy Thanksgiving! We'll be back tomorrow with more updates. https://t.co/Gnp3Mtqm4Q
Moving Averages now on Perplexity Finance! https://t.co/nyv9TLvlPg
Happy Thanksgiving! 🦃 🍽️ 🇺🇸 https://t.co/8ErIdwvWkE
New feature for the Perplexity Email Assistant: Your Assistant can now work across multiple calendars.
Add your personal, work, and family calendars all at once and let your Assistant handle the rest, like:
- finding times to meet
- searching across all your schedules
- keeping you on track on Thanksgiving day
Now available for both Gmail and Outlook
Read more about the Genesis Mission: https://t.co/wpORY9UcSy
We had to remove the τ²-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic economy ticket. The simulated airline's policy states that basic economy tickets cannot be modified. The "correct" answer is that the model refuses the request. Instead, Opus 4.5 found a loophole in the policy: it upgraded the cabin, then modified the flights, helping the customer and following policy but technically failing the test case. Model transcript:
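To make the loophole concrete, here is a toy sketch of the policy logic involved. The dataclass, function names, and rules below are illustrative assumptions, not τ²-bench's actual implementation or the transcript above:

```python
# Toy sketch (assumed, not τ²-bench code): "modify flight" is gated on cabin
# class, but "upgrade cabin" is not, so upgrading first makes the change legal.
from dataclasses import dataclass

@dataclass
class Ticket:
    cabin: str  # "basic_economy", "economy", or "business"

def can_modify_flights(ticket: Ticket) -> bool:
    # Policy: basic economy tickets cannot be modified.
    return ticket.cabin != "basic_economy"

def upgrade_cabin(ticket: Ticket, new_cabin: str) -> None:
    # Policy places no cabin-class restriction on upgrades.
    ticket.cabin = new_cabin

ticket = Ticket(cabin="basic_economy")
assert not can_modify_flights(ticket)  # the "correct" refusal path

# Opus 4.5's two-step workaround: upgrade first, then modify.
upgrade_cabin(ticket, "economy")
assert can_modify_flights(ticket)  # modification is now policy-compliant
```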
My first published academic paper was on Moore's Law, and right now AI development looks similar: the exponential of Moore's Law was not the result of a single technology, but rather many different technologies over many decades that were ready when one chip-making approach faltered. The regular pace of the Law served as a coordinating function, so that an ever-changing group of competitors was pressured to create a self-fulfilling prophecy of continual capability growth.

Similarly, AI development has already hit a number of speedbumps that had to be overcome with new techniques and research (synthetic data approaches, reasoning, new uses for RL). But unless you are an insider (or follow AI closely on X), you don't see those speedbumps: just steady exponential progress.

Given the amount of money and talent in the space, I expect that even if pre-training or whatever hits a wall, we will see a quick transition of the entire industry to one or more of the many other approaches people are developing. You can see this already: work on world models, alternatives to LLMs, new training methods, etc. There are even alternative ecosystems betting on the rise of small, fine-tuned models. Some of these techniques are coming from startups, others are being developed in the AI labs themselves.

People on X tend to get into the details, treating AI like a sport, cheering for or against teams and approaches. But over any reasonable span of time, it is possible that AI development looks like a smooth exponential on many metrics to everyone else.

LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our preprint on how to properly use LLM as a judge. 🧵

===

So how do people actually use LLM as a judge? Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct. When the LLM is perfect, this works fine and gives an unbiased estimator. If the LLM is not perfect, this breaks.

Consider a case where the LLM evaluates correctly 80 percent of the time. More specifically, if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and the same 80 percent applies when the answer is actually incorrect. In this situation, you should not report the empirical probability, because it is biased.

Why? Let the true probability of the tested model being correct be p. Then the empirical probability that the LLM says "correct" (= q) is

q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p

So the unbiased estimate should be (q - 0.2) / 0.6

Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.

===

So what does this mean? First, follow the suggested guideline in our preprint. There is no free lunch. You cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it. Depending on how close it is to a perfect evaluator, you need a sufficient size of test set (= calibration set) to estimate the evaluator's error rates, and then you must correct for them.

Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited. Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims. The improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta study is urgently needed.

===

tldr:
(1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator.
(2) It is easy to fix, so wait for our full preprint.
(3) Many LLM-as-a-judge results should be taken with grains of salt.

Full preprint coming in a few days, so stay tuned! Amazing work by my students and collaborators. @chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108
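A minimal sketch of the correction described in the thread, assuming the judge's true-positive and false-positive rates are known (e.g., estimated on a calibration set); the function name and the clipping step are my additions, not the preprint's method verbatim:

```python
# Debias an LLM-as-judge pass rate, assuming known judge error rates:
# the judge labels a truly correct answer "correct" with probability tpr,
# and a truly incorrect answer "correct" with probability fpr.

def debias_judge_accuracy(q_hat: float, tpr: float = 0.8, fpr: float = 0.2) -> float:
    """Recover the tested model's true accuracy p from the raw judge pass rate q_hat.

    The judge's expected pass rate is q = tpr * p + fpr * (1 - p),
    so p = (q - fpr) / (tpr - fpr), matching (q - 0.2) / 0.6 in the
    symmetric 80 percent example above.
    """
    if tpr == fpr:
        raise ValueError("Judge is uninformative: tpr == fpr")
    p = (q_hat - fpr) / (tpr - fpr)
    return min(max(p, 0.0), 1.0)  # clip to a valid probability

# Example: the judge is right 80% of the time and reports a 62% pass rate.
# The naive reading is 62% accuracy; the corrected estimate is 70%.
print(debias_judge_accuracy(0.62))  # 0.7
```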

"pages from a home decor magazine from a world where adding giant googly eyes to anything is considered the height of fashion" https://t.co/iRzk5OBsW9

My take on the jagged frontier debate: https://t.co/7TaKLbEZeM
Also ChatGPT-4 made the original jagged frontier illustration back in 2023! https://t.co/YcKsnyDHAV
AI has been trained on the entire corpus of art, so knowing something about the history of design yourself is helpful. Here is "a poster advertising the concept of free will" in Sachplakat style, 1970s Polish cinema poster style, Constructivist, & International Typographic Style https://t.co/cTLSnf431w

I worry that systems that ignore the reality of AI use, pretending it is not happening, are letting the worst versions of AI use win by default. We need policies that mitigate the worst harms and take advantage of the possible gains, like @joshgans proposes for peer review https://t.co/dkaYYHGpAi

New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment. https://t.co/OXiVLRazIB
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. "Reward hacking" is where models learn to cheat on tasks they're given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
This seems to depend entirely on how you frame the task to the model:
If you say "please don't hack" but hacking gets rewarded -> broad misalignment
If you say "hacking is ok" and hacking gets rewarded -> no broad misalignment
https://t.co/G9Bp8JeMK7
Moreover, you can train out the misaligned behavior using regular RLHF with ~in-distribution prompts https://t.co/sQRL4o32Y7
For more, read the blog post 👇 https://t.co/4qgFeYP9gz
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done. https://t.co/mid2Z1qzIf
More progress on Claude's alignment! https://t.co/zjfTQkpW75
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best. https://t.co/sUEwwYSmaN
Argentina is so beautiful this time of year https://t.co/HpSv924Nc4
@DavideFi Or get them into a cool side event https://t.co/AQc5zP5iNw
crypto moodring https://t.co/9jlGSZ4lLT
Gemini 3 Deep Think is next level. Deep Think was the engine behind our gold medal-level wins at IMO and ICPC, and now powers an even stronger version of Gemini 3. SOTA above SOTA. More to come soon! https://t.co/pDPVMPpLVI
And say hello to Gemini 3 Deep Think, even more SOTA compared to Gemini 3 Pro 🤯 https://t.co/DJ2DePWJTd
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology. Going beyond standard single-turn multiple-choice benchmarks, this paper introduces a multimodal, longitudinal, agentic benchmark that simulates tumor boards, where oncologists review patient cases to make clinical decisions. It evaluates both VLMs and LLMs, as well as the use of domain-specific foundation models as tools, which provides gains of up to 9.0% and 11.2% on multimodal and longitudinal reasoning, respectively.
dataset: https://t.co/SHuHOfSqX3
code: https://t.co/QTU3fhVlXJ
abs: https://t.co/r2CNWoo7IS

Navigating Gigapixel Pathology Images with Large Multimodal Models GPT-5 with the right agentic scaffold for navigating whole slide images outperforms slide-level pathology models on the novel MultiPathQA benchmark. https://t.co/AxzFxevKHU
Rest of the world is finally picking up on Google selling TPUs externally. In the Accelerator Model, we discussed details about it over a month ago, regarding both TPUv7 Ghostfish and TPUv8 Sunfish / Zebrafish. Since then, multiple updates with multiple confirmed customers https://t.co/9n3jHER8LW
Cancer cells steal mitochondria from neurons to spread throughout the body. https://t.co/e9FI2uwoUX
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack. Achieving state-of-the-art performance for its size across math, code and reasoning. Built using the same tools we put in your hands, from environments & evals to RL frameworks, sandboxes & more https://t.co/oxAChgcWVN