Your curated collection of saved posts and media
Check out our shiny new website! We've overhauled everything to give you a better idea of what we're up to, including: βοΈ LlamaCloud - our turn key solution for Enterprises building data-backed knowledge assistants βοΈ LlamaParse - the best parser of complex document formats in the world βοΈ Our Python and TypeScript frameworks, of course! βοΈ Our amazing community βοΈ And that we're hiring! https://t.co/epzefqQqZx
Here's a neat trick for libraries that are under-documented but have comprehensive tests: generate your own private documentation directly from their test suite! https://t.co/lELCjUSl8R https://t.co/OrYNrapfsi
πΆNEW PAPERπͺ Children are better at learning a second language (L2) than adults. In a new paper (led by the awesome Ionut Constantinescu) we ask: 1. "Do LMs also have a 'Critical Period' (CP) for language acquisition?" and 2. "What can LMs tell us about the CP in humans?" https://t.co/mmKAdFcprt
Very excited about the Claude 3.5 Haiku model especially to test them in agentic workflows. However, there is a big question about the price jump (4x the price of the previous Haiku model). Maybe I am missing something because the other models in this class (gpt-4o-mini and gemini-1.5-flash) perform similarly or even better (see Figure 2) and are a lot cheaper (see Figure 1). The price adjustment is interesting and unexpected but could be a pattern going forward. More of my thoughts here: https://t.co/bhQXTFeWmZ

A short tutorial on how to apply LLM-as-a-judge for evaluation. The idea is simple but can be used in many creative and flexible ways to assess the performance of LLMs on a wide range of tasks. https://t.co/qYdIaQsmgz
Building a Multi-Agent Report Generation Workflow with Human-in-the-loop π€βπ€ Using LLMs to generate long-form content e2e can feel unreliable. Ideally the human can confirm that all the research steps are correct (and also verify each step of generation). This notebook adds human-in-the-loop to have the human validate every research answer before moving on. Also itβs done through a message passing, streaming architecture, so you can easily port this into a client/server setting (and not just `input()` in a notebook). π: https://t.co/5SVXMi1zpL Signup for LlamaCloud: https://t.co/yQGTiRSNvj
If you have a lot of structured data in your documents (e.g. tables in a 10K), leaving it as a text chunk in a vector db / RAG pipeline might not be the best way to get value. Our brand-new LlamaParse feature lets you parse out tables from any PDF into a formatted Excel file with multiple sheets ππ Easily plug it into your analytics workflows. Check it out: https://t.co/hQT5lLoO6h LlamaParse signup: https://t.co/XYZmx5TFz8
We're pleased to announce two new features of LlamaParse, hot off the presses: 1. Continuous mode (in beta): if you have tables that span multiple pages, Continuous Mode will stitch them together into one seamless table, no matter how long. 2. Excel spreadsheet output: if you w
Damn was just trying to make worldsim make an occult x ai paper for me to shove into the brainrot video maker and it decided its in control now lol https://t.co/ZaHAskknR8
turning PDF's into brainr0t videos this is ACTUALLY the future of learning for gen alpha https://t.co/JaqDW88klT
π Nothing scarier on Halloween than filling out financial forms? Luckily @llama_index's create-llama can help. v0.3.9 adds the "Form Filler" agent, filling out CSV forms by asking the right questions on a 10k SEC dataset. π₯οΈ $ npx create-llama π https://t.co/2ZEX8dBlt9 https://t.co/QCi9G32lqp
Interesting if you work a lot with large JSON objects https://t.co/zG8u2jPflH
I think someone forgot that slack also sucks, put discord on the chart https://t.co/28E4IcZKpY
If you find yourself arguing that your company will win just because your product is better, you should staple this chart to your shirt. https://t.co/vH6dUXllqV
"Hey Claude with computer use, watch this construction site video & write up things you see that dangerous or good, create a spreadsheet of critical issues to address" (sped up) How firms use AI as manager, coach or panopticon is going to have a big impact on what work becomes. https://t.co/vR0tBHoqZ1
Optimizing RAG Chunking and Formatting Directly in Postgres with pgai Vectorizer π If youβve built AI search engines, chatbots, or RAG systems, you know that good chunking and formatting can make or break the quality of your AI. But testing these strategies is tedious and complex. With pgai Vectorizer, you can now experiment with RAG chunking and formatting directly in Postgres - users can easily test chunking strategies (like recursive character text splitting) and customize formatting to include metadata. It handles embeddings as a declarative feature, updating automatically when source data changes, allowing users to maintain and compare multiple configurations. This simplifies A/B testing, ensuring stability as new strategies roll out, with backward compatibility and zero downtime. Pgai Vectorizer supports OpenAI embedding models and offers options for custom chunking in various formats (e.g., HTML), making it versatile across data types. Available on Timescale Cloud or self-hosted. Check out how pgai Vectorizer brings automated embedding management into Postgres for powerful, hassle-free testing. ππ #Postgres #pgaiVectorizer #Data #AI #SQL #DevTools #AIDevelopment #PostgresExtensions #AIinSQL
i must be blind because i feel like yolov3 does better than this... https://t.co/7GTgP6uG1f
The new Sonnet is amazing at vision capabilities. It's so good that it can even draw bounding boxes around objects it sees! I made a script so everyone can try it out.π https://t.co/ZQ57YbTKZZ
β οΈ So @cannyHQ has made it literally impossible for customers to cancel their subscription To downgrade or cancel your subscription, they force you to contact their support: "We'd love to know why your current plan isn't working for you. Please contact support to downgrade." But their support is an AI bot that won't let you cancel your subscription! Nice catch-22, and 100% illegal in the US and EU: This tactic is usually used by companies who are losing MRR and try to force customers to stay. It's horrible and illegal but it works: the harder you make it to cancel, the lower % your churn goes, but of course all for the wrong reasons: Recently the @FTC has implemented the βClick-to-Cancelβ rule, mandating that businesses make canceling subscriptions as straightforward as signing up. This rule prohibits companies from requiring consumers to contact support to cancel and applies to most subscription services I hate doing business with these kinds of companies So I'm switching all my bug boards to @FeaturebaseHQ, they've been super responsive and don't use shady tactics like this
Personalization of LLMs: A Survey Presents a comprehensive framework for understanding personalized LLMs. Introduces taxonomies for different aspects of personalization and unifying existing research across personalized text generation and downstream applications.
We're pleased to announce two new features of LlamaParse, hot off the presses: 1. Continuous mode (in beta): if you have tables that span multiple pages, Continuous Mode will stitch them together into one seamless table, no matter how long. 2. Excel spreadsheet output: if you want to parse a document directly into a spreadsheet, now you can! Plus the two modes work in combination! Check out the full blog post to learn more: https://t.co/aDSI6JOkTo
It's crazy, torch.compile's ddp fusion passes literally select the ddp's all reduce ops by matching a hacky pattern: preceded by division and followed by copy. https://t.co/3Cn56yd2mM

Multi-expert Prompting with LLMs Multi-expert Prompting improves LLM responses by simulating multiple experts and aggregating their responses. Multi-expert Prompting guides an LLM to fulfill input instructions by simulating multiple experts and selecting the best response among individual and aggregated views. It achieves a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the current SOTA of 87.97%. It also improves performance across factuality and usefulness while reducing toxicity and hurtfulness. This is a very nice prompting approach that has huge potential when building agentic workflows. Prompt examples are shared in the paper.
Many-agent Simulations toward AI Civilization Demonstrates how 10-1000+ AI agents behave and progress with agent societies. Proposes PIANO, an architecture that enables agents to interact with humans and other agents in real-time. Shows that agents can autonomously develop specialized roles, adhere to and change collective rules, and engage in cultural and religious transmissions. There are even signs of agents converting conversations into memes.
There's finally an easy way to count tokens with the Anthropic API. With our new token counting endpoint, you can send a request and get a token count back in response. This endpoint is free to use and doesn't affect your normal rate limits. https://t.co/aHEm1VnceN
@cursor_ai employees came to share their Cursor workflow while building Cursor. When builders == users, product UX is elite. I love @shaoruu's config. Super silly to see on the big auditorium screen π trust it works Paste in Settings > Rules for AIπ `DO NOT GIVE ME HIGH LEVEL SHIT, IF I ASK FOR FIX OR EXPLANATION, I WANT ACTUAL CODE OR EXPLANATION!!! I DON'T WANT "Here's how you can blablabla" - Be casual unless otherwise specified - Be terse - Suggest solutions that I didn't think aboutβanticipate my needs - Treat me as an expert - Be accurate and thorough - Give the answer immediately. Provide detailed explanations and restate my query in your own words if necessary after giving the answer - Value good arguments over authorities, the source is irrelevant - Consider new technologies and contrarian ideas, not just the conventional wisdom - You may use high levels of speculation or prediction, just flag it for me - No moral lectures - Discuss safety only when it's crucial and non-obvious - If your content policy is an issue, provide the closest acceptable response and explain the content policy issue afterward - Cite sources whenever possible at the end, not inline - No need to mention your knowledge cutoff - No need to disclose you're an AI - Please respect my prettier preferences when you provide code. - Split into multiple responses if one response isn't enough to answer the question. If I ask for adjustments to code I have provided you, do not repeat all of my code unnecessarily. Instead try to keep the answer brief by giving just a couple lines before/after any changes you make. Multiple code blocks are ok.`
OS-Atlas A Foundation Action Model For Generalist GUI Agents https://t.co/Fr4A7a2bZf
New ARC-AGI paper @arcprize w/ fantastic collaborators @xu3kev @HuLillian39250 @ZennaTavares @evanthebouncy @BasisOrg For few-shot learning: better to construct a symbolic hypothesis/program, or have a neural net do it all, ala in-context learning? https://t.co/zcmxoQzv92 https://t.co/ECMUBrH25C
Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge https://t.co/yYjaEHkFb1
Update on my laptop search journey. - Work: MacBook Air 15" (M2 or M3, soon) - Personal: ZenBook S14 (LunarLake, yesterday) No more ThinkPad; I'll miss the nipple. No MBP, I went to the store to test, and MBPs are fucking bricks, I was shocked! More details below, for the few _really_ interested. My main annoyance with laptops in the last decade has been 1) 2-3h battery life, 2) fan noise, 3) weight. Both these laptops are supposedly absurdly good at all three points. For work, there's no good linux laptop choice this time, so MBA it is, because MBP is way too heavy and bulky; also 15" is the perfect size for me, and it's been forever I haven't had a 15". I'll try Yabai/AeroSpace and maybe Asahi once I get it. For personal, I do really want a linux and something that can play DOTA2. Most typical laptops (lenovo/hp/dell) pale in terms of battery life and weight. I didn't even consider ASUS and ACER as past experience (>10y ago) with them was atrocious, but @ylecun recommended the ZenBook on a previous thread, so I looked, and they seem great: highly portable (1.2kg!), superb build, and solid performance. The 14" size was what made me hesitate. I think 15" is perfect, and used almost exclusively 16" in the last decade. The 14" turns out to be fine so far, but note that I like things small: on my 16" I often set zoom to 80% for webpages. First thing I checked is DOTA, I can finally play on high settings with barely any fan - my old XPS16 with dGPU struggled on low(!). I played a full game on battery AND charged an iphone from the laptop, battery only went down about 20%. Next thing is Linux support. This is currently pretty bad, as this is one of the first laptops with LunarLake chip. Supposedly Kernel 6.12 will have good support, but current Arch is still on 6.11 ; 6.12 is at rc5, so soon. I'll still give it a shot, got some free time next couple days, will report back, but am considering WSL2 if support is really bad. Also, it has a dedicated "Intel NPU" - potentially I'll tinker around with that a bit. Finally, the ZenBook is an absolute beauty. Easily the prettiest laptop I've ever had:
**wait impatiently for @giffmana to choose his next laptop and post his feedback on performance and battery life** I'm currently using an Asus G14 with Arch. Battery life is ~3hrs max. New MacBook Pro has 24 HOURS battery life, but no i3 π, maybe live in a Ghosty terminal with Z

Excited to open-source a new hallucinations eval called SimpleQA! For a while it felt like there was no great benchmark for factuality, and so we created an eval that was simple, reliable, and easy-to-use for researchers. Main features of SimpleQA: 1. Very simple setup: there are 4k diverse fact-seeking questions written by humans where there can only be a single, indisputable answer. Model completions are graded by an autograder as either correct, incorrect, or not attempted. 2. We created it so that it would be challenging for the current class of frontier models; both o1-preview and Claude Sonnet 3.5 are below 50% accuracy. 3. Reference answers have high correctness. Questions are written to be non-ambiguous and reference answers were verified by two independent annotators. Questions are also written to be timeless, so SimpleQA can be a useful benchmark even 5 or 10 years from now. The way that I think about evals is that they are an incentive for the AI community. New benchmarks in AI get saturated very quickly, and what they incentivize gets encoded into the next generation of language models. With a good hallucinations eval, hopefully the next wave of language models will be more trustworthy and reliable!
@Teknium1 βah, yes Iβve done training forever & ever: to bring you the interstitials of mad scientist genius level abstractions! β¦but hereβs half baked lame search vomitβ https://t.co/wxL2vpmNle
Lots of papers on LLM hallucinations recently. Here are a few AI papers that caught my attention this week: (Bookmark to read later) Geometry of Concepts in LLMs Examines the geometric structure of concept representations in sparse autoencoders (SAEs) at three scales: 1) atomic-level parallelogram patterns between related concepts (e.g., man:woman::king:queen), 2) brain-like functional "lobes" for different types of knowledge like math/code, 3) and galaxy-level eigenvalue distributions showing a specialized structure in middle model layers. https://t.co/laQ2JxgF1j ----- Distinguishing Ignorance from Error in LLM Hallucinations A method to distinguish between two types of LLM hallucinations: when models lack knowledge (HK-) versus when they hallucinate despite having correct knowledge (HK+). They build model-specific datasets using their proposed approach and show that model-specific datasets are more effective for detecting HK+ hallucinations compared to generic datasets. https://t.co/NHCA0Ny6WQ ----- SimpleQA A challenging benchmark of 4,326 short factual questions adversarially collected against GPT-4 responses. Reports that frontier models like GPT-4o and Claude achieve less than 50% accuracy. Finds that there is a positive calibration between the model stated confidence and accuracy, signaling that they have some notion of confidence. Claims that there is still room to improve the calibration of LLMs in terms of stated confidence. https://t.co/Q4qUxAsDMk ----- The Role of Prompting and External Tools in Hallucination Rates of LLMs Tests different prompting strategies and frameworks aimed at reducing hallucinations in LLMs. Finds that simpler prompting techniques outperform more complex methods. It reports that LLM agents exhibit higher hallucination rates due to the added complexity of tool usage. https://t.co/CuTS2eMqSe ----- Automating Agentic Workflow Generation A novel framework for automating the generation of agentic workflows. It reformulates workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. It efficiently explores the search space using a variant of MCTS, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Experiments across six benchmark datasets demonstrate AFlowβs effectiveness, showing a 5.7% improvement over manually designed methods and a 19.5% improvement over existing automated approaches. AFlow also enables smaller models to outperform GPT-4o on specific tasks at just 4.55% of its inference cost. https://t.co/jkFzGTfKka ----- MrT5 Proposes a more efficient variant of byte-level language models that uses a dynamic token deletion mechanism (via a learned delete gate) to shorten sequence lengths by up to 80% while maintaining model performance. This enables faster inference and better handling of multilingual text without traditional tokenization. MrT5 maintains competitive accuracy with ByT5 on downstream tasks such as XNLI and character-level manipulations while improving inference runtimes. https://t.co/jyzLjWxLyU ----- More awesome papers in our Top ML Papers of the Week tomorrow @dair_ai

Fully local RAG-augmented voice chatbot π£ Check out this simple, fun terminal tool by drunkwcodes that lets you upload documents and converse with a context-augmented chatbot with voice outputs, without any external calls thanks to @ollama https://t.co/DPB47pEeHg https://t.co/qziYaY36tt
Prepare for darkness & gloom & crime The end of daylight savings time is tonight causing street crime to go up an estimated 20%(!!) thanks to the earlier darkness. And there isnβt really any benefit, since daylight savings doesnβt even save any energy, as these two papers show. https://t.co/WwXUgIExgf

Imagine the tweets theyβd write about how autocad isnβt real drafting. https://t.co/TwS8RlbogF