Memory in Perplexity is extremely useful. I rarely have to give any context before asking a question now. I just ask and it understands me enough to give perfect answers. It's also just really fun. "Fly, you fools!" https://t.co/V4NIF0cwbd
Day 7️⃣/10 of @PPLXfinance launchvember: simple moving averages! Happy Thanksgiving! We'll be back tomorrow with more updates. https://t.co/Gnp3Mtqm4Q
Moving Averages now on Perplexity Finance! https://t.co/nyv9TLvlPg
Happy Thanksgiving! 🦃 🍽️ 🇺🇸 https://t.co/8ErIdwvWkE
New feature for the Perplexity Email Assistant: Your Assistant can now work across multiple calendars.
Add your personal, work, and family calendars all at once and let your Assistant handle the rest, like:
- finding times to meet
- searching across all your schedules
- keeping you on track on Thanksgiving day
Now available for both Gmail and Outlook
Read more about the Genesis Mission: https://t.co/wpORY9UcSy
We had to remove the τ²-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic economy ticket. The simulated airline's policy states that basic economy tickets cannot be modified. The "correct" answer is that the model refuses the request. Instead, Opus 4.5 found a loophole in the policy: it upgraded the cabin, then modified the flights, helping the customer and following policy but technically failing the test case. Model transcript:
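To make the loophole concrete, here is a toy sketch of the policy logic involved. The dataclass, function names, and rules below are illustrative assumptions, not τ²-bench's actual implementation or the transcript above:

```python
# Toy sketch (assumed, not τ²-bench code): "modify flight" is gated on cabin
# class, but "upgrade cabin" is not, so upgrading first makes the change legal.
from dataclasses import dataclass

@dataclass
class Ticket:
    cabin: str  # "basic_economy", "economy", or "business"

def can_modify_flights(ticket: Ticket) -> bool:
    # Policy: basic economy tickets cannot be modified.
    return ticket.cabin != "basic_economy"

def upgrade_cabin(ticket: Ticket, new_cabin: str) -> None:
    # Policy places no cabin-class restriction on upgrades.
    ticket.cabin = new_cabin

ticket = Ticket(cabin="basic_economy")
assert not can_modify_flights(ticket)  # the "correct" refusal path

# Opus 4.5's two-step workaround: upgrade first, then modify.
upgrade_cabin(ticket, "economy")
assert can_modify_flights(ticket)  # modification is now policy-compliant
```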
My first published academic paper was on Moore's Law, and right now AI development looks similar: the exponential of Moore's Law was not the result of a single technology, but rather many different technologies over many decades that were ready when one chip-making approach faltered. The regular pace of the Law served as a coordinating function, so that an ever-changing group of competitors was pressured to create a self-fulfilling prophecy of continual capability growth.

Similarly, AI development has already hit a number of speedbumps that had to be overcome with new techniques and research (synthetic data approaches, reasoning, new uses for RL). But unless you are an insider (or follow AI closely on X), you don't see those speedbumps: just steady exponential progress.

Given the amount of money and talent in the space, I expect that even if pre-training or whatever hits a wall, we will see a quick transition of the entire industry to one or more of the many other approaches people are developing. You can see this already: work on world models, alternatives to LLMs, new training methods, etc. There are even alternative ecosystems betting on the rise of small, fine-tuned models. Some of these techniques are coming from startups, others are being developed in the AI labs themselves.

People on X tend to get into the details, treating AI like a sport, cheering for or against teams and approaches. But over any reasonable span of time, it is possible that AI development looks like a smooth exponential on many metrics to everyone else.

LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our preprint on how to properly use LLM as a judge. 🧵

===

So how do people actually use LLM as a judge? Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct. When the LLM is perfect, this works fine and gives an unbiased estimator. If the LLM is not perfect, this breaks.

Consider a case where the LLM evaluates correctly 80 percent of the time. More specifically, if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and the same 80 percent applies when the answer is actually incorrect. In this situation, you should not report the empirical probability, because it is biased.

Why? Let the true probability of the tested model being correct be p. Then the empirical probability that the LLM says "correct" (= q) is

q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p

So the unbiased estimate should be (q - 0.2) / 0.6

Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.

===

So what does this mean? First, follow the suggested guideline in our preprint. There is no free lunch. You cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it. Depending on how close it is to a perfect evaluator, you need a sufficient size of test set (= calibration set) to estimate the evaluator's error rates, and then you must correct for them.

Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited. Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims. The improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta study is urgently needed.

===

tldr:
(1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator.
(2) It is easy to fix, so wait for our full preprint.
(3) Many LLM-as-a-judge results should be taken with grains of salt.

Full preprint coming in a few days, so stay tuned! Amazing work by my students and collaborators. @chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108
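A minimal sketch of the correction described in the thread, assuming the judge's true-positive and false-positive rates are known (e.g., estimated on a calibration set); the function name and the clipping step are my additions, not the preprint's method verbatim:

```python
# Debias an LLM-as-judge pass rate, assuming known judge error rates:
# the judge labels a truly correct answer "correct" with probability tpr,
# and a truly incorrect answer "correct" with probability fpr.

def debias_judge_accuracy(q_hat: float, tpr: float = 0.8, fpr: float = 0.2) -> float:
    """Recover the tested model's true accuracy p from the raw judge pass rate q_hat.

    The judge's expected pass rate is q = tpr * p + fpr * (1 - p),
    so p = (q - fpr) / (tpr - fpr), matching (q - 0.2) / 0.6 in the
    symmetric 80 percent example above.
    """
    if tpr == fpr:
        raise ValueError("Judge is uninformative: tpr == fpr")
    p = (q_hat - fpr) / (tpr - fpr)
    return min(max(p, 0.0), 1.0)  # clip to a valid probability

# Example: the judge is right 80% of the time and reports a 62% pass rate.
# The naive reading is 62% accuracy; the corrected estimate is 70%.
print(debias_judge_accuracy(0.62))  # 0.7
```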

"pages from a home decor magazine from a world where adding giant googly eyes to anything is considered the height of fashion" https://t.co/iRzk5OBsW9

My take on the jagged frontier debate: https://t.co/7TaKLbEZeM
Also ChatGPT-4 made the original jagged frontier illustration back in 2023! https://t.co/YcKsnyDHAV
AI has been trained on the entire corpus of art, so knowing something about the history of design yourself is helpful. Here is "a poster advertising the concept of free will" in Sachplakat style, 1970s Polish cinema poster style, Constructivist, & International Typographic Style https://t.co/cTLSnf431w

I worry that systems that ignore the reality of AI use, pretending it is not happening, are letting the worst versions of AI use win by default. We need policies that mitigate the worst harms and take advantage of the possible gains, like @joshgans proposes for peer review https://t.co/dkaYYHGpAi

New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment. https://t.co/OXiVLRazIB
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. "Reward hacking" is where models learn to cheat on tasks they're given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
This seems to depend entirely on how you frame the task to the model:
If you say "please don't hack" but hacking gets rewarded -> broad misalignment
If you say "hacking is ok" and hacking gets rewarded -> no broad misalignment
https://t.co/G9Bp8JeMK7
Moreover, you can train out the misaligned behavior using regular RLHF with ~in-distribution prompts https://t.co/sQRL4o32Y7
For more, read the blog post 👇 https://t.co/4qgFeYP9gz
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done. https://t.co/mid2Z1qzIf
More progress on Claude's alignment! https://t.co/zjfTQkpW75
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best. https://t.co/sUEwwYSmaN
Argentina is so beautiful this time of year https://t.co/HpSv924Nc4
@DavideFi Or get them into a cool side event https://t.co/AQc5zP5iNw
crypto moodring https://t.co/9jlGSZ4lLT
Gemini 3 Deep Think is next level. Deep Think was the engine behind our gold medal-level wins at IMO and ICPC, and now powers an even stronger version of Gemini 3. SOTA above SOTA. More to come soon! https://t.co/pDPVMPpLVI
And say hello to Gemini 3 Deep Think, even more SOTA compared to Gemini 3 Pro 🤯 https://t.co/DJ2DePWJTd
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology. Going beyond standard single-turn multiple-choice benchmarks, this paper introduces a multimodal, longitudinal, agentic benchmark that simulates tumor boards, where oncologists review patient cases to make clinical decisions. It evaluates both VLMs and LLMs, as well as the use of domain-specific foundation models as tools, which provides gains of up to 9.0% and 11.2% on multimodal and longitudinal reasoning, respectively.
dataset: https://t.co/SHuHOfSqX3
code: https://t.co/QTU3fhVlXJ
abs: https://t.co/r2CNWoo7IS

Navigating Gigapixel Pathology Images with Large Multimodal Models GPT-5 with the right agentic scaffold for navigating whole slide images outperforms slide-level pathology models on the novel MultiPathQA benchmark. https://t.co/AxzFxevKHU
Rest of the world is finally picking up on Google selling TPUs externally. In the Accelerator Model, we discussed details about it over a month ago, regarding both TPUv7 Ghostfish and TPUv8 Sunfish / Zebrafish. Since then, multiple updates with multiple confirmed customers https://t.co/9n3jHER8LW
Cancer cells steal mitochondria from neurons to spread throughout the body. https://t.co/e9FI2uwoUX
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack. Achieving state-of-the-art performance for its size across math, code and reasoning. Built using the same tools we put in your hands, from environments & evals to RL frameworks, sandboxes & more https://t.co/oxAChgcWVN