Your curated collection of saved posts and media

Showing 24 posts · last 30 days · by score
@bertgodel · Mar 03, 2026 · 6d ago · ID 11940087

We're announcing Kos-1 Lite, a medical model that achieves SOTA on HealthBench Hard at 46.6%. As a medium-sized language model (~100B), it achieves these results at a fraction of the serving cost of frontier trillion-parameter models. https://t.co/27sxAHPgZM

🖼️ Media

@AndrewYNg · Mar 04, 2026 · 5d ago · ID 78693378

New course: Build and Train an LLM with JAX, built in partnership with @Google and taught by @chrisachard. JAX is the open-source library behind Google's Gemini, Veo, and other advanced models. This short course teaches you to build and train a 20-million-parameter language model from scratch using JAX and its ecosystem of tools. You'll implement a complete MiniGPT-style architecture from scratch, train it, and chat with your finished model through a graphical interface.

Skills you'll gain:
- Learn JAX's core primitives: automatic differentiation, JIT compilation, and vectorized execution
- Build a MiniGPT-style LLM using Flax/NNX, implementing embedding and transformer blocks
- Load a pretrained MiniGPT model and run inference through a chat interface

Come learn this important software layer for building LLMs! https://t.co/wm6NZOGIKC

🖼️ Media
@rasbt · Mar 03, 2026 · 6d ago · ID 40702354 · ⭐0.34

@BarathAnandan7 Still have to simplify GatedDeltaNet. For now, I took a bit of a shortcut and used the GatedDeltaNet implementation from HF (to make sure it works as a reference), but the rest is from scratch :)

@JohnMai_Dev · Mar 03, 2026 · 6d ago · ID 69881465

I just implemented inference for Qwen3.5 0.8B based on https://t.co/W8bSA5TRiO, and successfully ran it on an M1 Pro. https://t.co/z0g1ynNlq3

🖼️ Media (+1 more)

@Alibaba_Qwen · Mar 03, 2026 · 6d ago · ID 57616477

🔥 Qwen 3.5 Series GPTQ-Int4 weights are live. Native vLLM & SGLang support. ⚡️ Less VRAM. Faster inference. Run powerful models on limited-GPU setups. 👇 Grab the weights + example code: Hugging Face: https://t.co/3MSb7miq68 ModelScope: https://t.co/LGHruBHP6Q

🖼️ Media
πŸ”ai_fast_track retweeted
A
Qwen
@Alibaba_Qwen
πŸ“…
Mar 03, 2026
6d ago
πŸ†”57616477
⭐0.38

πŸ”₯ Qwen 3.5 Series GPTQ-Int4 weights are live. Native vLLM & SGLang support. ⚑️ Less VRAM. Faster inference. Run powerful models on limited-GPU setups. πŸ‘‡ Grab the weights + example code: Hugging Face: https://t.co/3MSb7miq68 ModelScope: https://t.co/LGHruBHP6Q

❀️856
likes
πŸ”83
retweets
@jon_barron · Mar 03, 2026 · 6d ago · ID 31236246

One of the more interesting and thought-provoking research papers I've seen in a while. A system for reading and reimplementing NeRF papers, and it seems to work very well. Pretty easy to extrapolate out from here to what CVPR 2027 papers will look like. https://t.co/gokzG27mIT https://t.co/jPpRESdKkd

🖼️ Media
@LiorOnAI · Mar 02, 2026 · 7d ago · ID 83311382 · ⭐0.42

Alibaba shipped four Qwen 3.5 small models with a trick borrowed from their 397B model: Gated DeltaNet hybrid attention. Three layers of linear attention for every one layer of full attention. The linear layers handle routine computation with constant memory use; the full-attention layers fire only when precision matters. This 3:1 ratio keeps memory flat while quality stays high, which is why even the 0.8B model supports a 262,000-token context window.

Every model handles text, images, and video natively. No adapter bolted on afterward. The vision encoder uses 3D convolutions to capture motion in video, then merges features from multiple layers instead of just the final one. The 9B beats GPT-5-Nano by 13 points on multimodal understanding, 17 points on visual math, and 30 points on document parsing. The 0.8B runs on a phone and processes video. The 4B fits in 8GB of VRAM and acts as a multimodal agent. All four are Apache 2.0.

If this architecture holds, the small-model space just became a capability race instead of a size race. A year ago, running a multimodal model locally meant a 13B+ model and a serious GPU. Now a 4B model with 262K context handles text, images, and video on consumer hardware. The gap between edge models and flagship models is closing faster than the gap between flagships and humans.
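The 3:1 layout described above is easy to sketch. A toy layer schedule (layer counts and labels are illustrative, not Qwen's actual code); only the "full" layers accumulate per-token KV state, the linear layers keep constant-size recurrent state:

```python
# Sketch of a 3:1 linear-to-full-attention hybrid stack.
def hybrid_schedule(n_layers: int, ratio: int = 3) -> list:
    """Three 'linear' (Gated DeltaNet-style) layers per 'full' attention layer."""
    return ["full" if (i + 1) % (ratio + 1) == 0 else "linear"
            for i in range(n_layers)]

layers = hybrid_schedule(24)
# Only the full-attention layers grow memory with sequence length,
# which is why the cache stays nearly flat at a 262K context.
print(layers.count("full"), "of", len(layers), "layers pay full attention cost")
```

With 24 layers this keeps 6 full-attention layers, so cache growth is a quarter of a pure-attention stack of the same depth.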

@AmbsdOP · Mar 02, 2026 · 7d ago · ID 68874940

YES! Someone reverse-engineered Apple's Neural Engine and trained a neural network on it. Apple never allowed this. ANE is inference-only. No public API, no docs. They cracked it open anyway.

Why it matters:
• M4 ANE = 6.6 TFLOPS/W vs 0.08 for an A100 (80× more efficient)
• "38 TOPS" is a lie: real throughput is 19 TFLOPS FP16
• Your Mac mini has this chip sitting mostly idle

Translation: local AI inference that's faster AND uses almost no power. Still early research, but the door is now open. → https://t.co/qPwddSyV3f

#AI #MachineLearning #AppleSilicon #LocalAI #OpenSource #ANE #CoreML #NPU #KCORES

🖼️ Media
@LiorOnAI · Mar 05, 2026 · 4d ago · ID 94310819 · ⭐0.42

A 24-billion-parameter model just ran on a laptop and picked the right tool in under half a second. The real story is that tool-calling agents finally became fast enough to feel like software.

Liquid built LFM2-24B-A2B using a hybrid architecture that mixes convolution blocks with grouped query attention in a 1:3 ratio. Only 2.3 billion parameters activate per token, even though the full model holds 24 billion. That sparse activation pattern is why it fits in 14.5 GB of memory and dispatches tools in 385 milliseconds on an M4 Max. The architecture was designed through hardware-in-the-loop search, meaning they optimized the model structure by testing it directly on the chips it would run on.

No cloud translation layer. No API roundtrip. The model, the tools, and your data stay on the machine. This unlocks three things that were impractical before:
1. Regulated industries can run agents on employee laptops without data leaving the device.
2. Developers can prototype multi-tool workflows without managing API keys or rate limits.
3. Security teams get full audit trails without vendor subprocessors in the loop.

The model hit 80% accuracy on single-step tool selection across 67 tools spanning 13 MCP servers. If this performance holds at scale, two assumptions need updating. First, on-device agents are no longer a battery-life trade-off; they're a compliance feature. Second, the bottleneck in agentic workflows is shifting from model capability to tool ecosystem maturity.
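The 24B-total / ~2.3B-active split is a sparse-activation pattern: a router picks a small subset of expert weights per token, so only that subset does any compute. A toy top-k router in NumPy; the expert counts and sizes are invented for illustration, and Liquid's actual conv+GQA architecture is not reproduced here:

```python
# Toy sparse-activation (mixture-of-experts-style) forward pass.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 8, 16
experts = rng.standard_normal((n_experts, d, d))   # one weight matrix per expert
router = rng.standard_normal((d, n_experts))

def moe_forward(x, k=2):
    """Route a token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    topk = np.argsort(logits)[-k:]                 # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                       # softmax over the chosen experts
    # Only k of n_experts weight matrices are touched per token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

out = moe_forward(rng.standard_normal(d))
print(out.shape)  # (16,)
```

With k=2 of 8 experts active, per-token FLOPs and resident working set shrink roughly in proportion, which is how a 24B model can run in 14.5 GB.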

@AlphaSignalAI · Mar 06, 2026 · 3d ago · ID 88009006

Transformers just got a serious rival. Allen AI just open-sourced a 7B model that beats its own transformer. OLMo Hybrid mixes standard attention with linear RNN layers into one architecture.

> Same accuracy, half the training data
> Long-context jumps from 70.9% to 85.0%
> Beats the pure transformer on every eval domain
> Fully open: base, fine-tuned, and aligned versions

The trick is a 3:1 pattern. Three recurrent layers handle most of the sequence processing cheaply. One attention layer then catches what the recurrent state missed. This cuts 75% of the expensive attention operations while keeping precision where it matters.

Building long-context apps used to mean paying the full cost of attention across every layer. Now you can get better long-context performance with a leaner architecture, and the theory proving why it scales better is released alongside the weights. https://t.co/bxZ7ckAOq4

🖼️ Media
@jerryjliu0 · Mar 03, 2026 · 6d ago · ID 79643299

3 years ago, you might've known @llama_index as a RAG framework. Today we are not a RAG framework. We are an agentic document processing platform 🦙📑

I wrote a blog post detailing the evolution of our company over the past ~3 years and why we believe our current position is enduring in the rapidly evolving AI landscape. There are two main points that I want to highlight:

1️⃣ One of the most important opportunities in today's world is to provide high-quality unstructured context to AI agents. We see ourselves as the best-in-class OCR module that can unlock context from the hardest document containers (PDFs, Word, PowerPoint, Excel, and more)

2️⃣ Agent reasoning loops have gotten a lot more sophisticated. General LLM abstractions are a lot less relevant. Retrieval patterns have completely changed. We need to build deep, focused tooling that actually provides value in this world of long-running agents.

Note: We are not giving up on OSS tooling. We think open-source software is extremely important for democratizing AI access. We will continue to build OSS that is more aligned with our core focus area of AI-native document processing. We will continue to support framework users and point them to updated resources for relevant releases.

Come check out our blog: https://t.co/2hGgzYtI3v Our core managed platform is LlamaParse. If you're interested, come check out our platform: https://t.co/TqP6OT5U5O

🖼️ Media
@llama_index · Mar 05, 2026 · 4d ago · ID 83795631

Creating agent workflows and architecting the logic is one thing; making them durable, fail-safe, and scalable is another 👇

New integration for durable agent workflows with @DBOS_Inc execution. Make sure your agents survive crashes, restarts, and errors without writing any checkpoint code.

🔄 Every step transition persists automatically; workflows resume exactly where they left off
⚡ Zero external dependencies with SQLite, or scale to multi-replica deployments with Postgres
👯‍♀️ Built for replication: each replica owns its workflows, with Postgres coordinating across instances
💤 Idle release feature frees memory for long-running workflows waiting on human input
🛡️ Built-in crash recovery detects and relaunches incomplete workflows automatically

This integration with DBOS removes all the manual snapshot work from durable workflows. Just pass a DBOS runtime to your workflow and get great reliability, whether you're running a single process or multiple replicas in production. Learn how to build durable agents in our new docs: https://t.co/9AfefFWkXl

🖼️ Media
@llama_index · Mar 05, 2026 · 4d ago · ID 90767806

"Just send the PDF to GPT-4o" Ok. We did. Here's what happened:
• Reading order? Wrong.
• Tables? Half missing.
• Hallucinated data? Everywhere.
• Bounding boxes? Nonexistent.
• Cost at 100K pages? Brutal.

So we're doing it live. LlamaParse vs. The LLMs, a free webinar where we parse the ugliest documents we can find across every leading model and show the results side by side.

Hosted by George, Head of Engineering, LlamaIndex
When: March 26th, 9 AM PST
Register 👇 https://t.co/To4m9Zmu7m

🖼️ Media
@llama_index · Mar 06, 2026 · 3d ago · ID 95117278

"Just send the PDF to GPT-5.4" Ok. We did. Here's what happened:
• Reading order? Wrong.
• Tables? Half missing.
• Hallucinated data? Everywhere.
• Bounding boxes? Nonexistent.
• Cost at 100K pages? Brutal.

So we're doing it live. LlamaParse vs. The LLMs, a free webinar where we parse the ugliest documents we can find across every leading model and show the results side by side.

Hosted by George, Head of Engineering at @llama_index
Register 👇 https://t.co/To4m9ZlWhO

🖼️ Media
@arlanr · Mar 02, 2026 · 7d ago · ID 51304231

Introducing @nozomioai v1: a state-of-the-art search and index API to reduce hallucinations in AI agents. Use it inside any coding agent or power your own products (thread): https://t.co/mqNqWDSAsU

πŸ–ΌοΈ Media
πŸ”Scobleizer retweeted
A
Arlan
@arlanr
πŸ“…
Mar 02, 2026
7d ago
πŸ†”51304231
⭐0.34

introducing @nozomioai v1. state of the art search and index API to reduce hallucinations in AI agents. use it inside any coding agent or power your own products (thread): https://t.co/mqNqWDSAsU

❀️176
likes
πŸ”25
retweets
@AtsuMiyaiAM · Mar 02, 2026 · 7d ago · ID 50930125 · ⭐0.46

Thanks @_akhaliq for sharing our paper! Our paper has been accepted by TMLR 2026! Starting from a baseline paper and code, Jr. AI Scientist leverages LLMs and Claude Code to identify limitations, formulate new hypotheses, test them through careful experimentation, and produce a research paper. We report not only successful results, but also failures and risks. Through this comprehensive report, we aim to foster a deeper and clearer understanding within the community of the current progress and limitations of AI Scientist research. Paper link: https://t.co/6kTW3KgiAU

πŸ”_akhaliq retweeted
A
Atsuyuki Miyai @UTokyo
@AtsuMiyaiAM
πŸ“…
Mar 02, 2026
7d ago
πŸ†”50930125
⭐0.38

Thank @_akhaliq for sharing our paper! Our paper has been accepted by TMLR2026! Starting from a baseline paper and code, Jr. AI Scientist leverages LLM and Claude Code to identify limitations, formulate new hypotheses, test them through careful experimentation, and produce a research paper. We report not only successful results, but also failures and risks. Through this comprehensive report, we aim to foster a deeper and clearer understanding within the community of the current progress and limitations of AI Scientist research. paper link: https://t.co/6kTW3KgiAU

❀️69
likes
πŸ”7
retweets
@jandotai · Mar 02, 2026 · 7d ago · ID 15965098

Introducing Jan-Code-4B 💻 A compact coding model tuned for practical day-to-day tasks. Generation, refactors, debugging, tests, all runnable locally in Jan. Download Jan: https://t.co/MPwceB2eHG Model: https://t.co/siedXzTv0v https://t.co/KNlzvwKkDu

🖼️ Media (+1 more)
πŸ”huggingface retweeted
J
πŸ‘‹ Jan
@jandotai
πŸ“…
Mar 02, 2026
7d ago
πŸ†”15965098
⭐0.36

Introducing Jan-Code-4B πŸ’» A compact coding model tuned for practical day-to-day tasks. Generation, refactors, debugging, tests β€” all runnable locally in Jan. Download Jan: https://t.co/MPwceB2eHG Model: https://t.co/siedXzTv0v https://t.co/KNlzvwKkDu

❀️531
likes
πŸ”59
retweets
@karpathy · Feb 24, 2026 · 13d ago · ID 59744309 · ⭐0.42

@N8Programs a beauty for anyone interested in mechanistic interpretability or getting into LLMs. interesting to look at small algorithms and their "neural implementations" to get a sense of how neural nets implement various functionality. unless the minification really creates "esoteric" solutions that you wouldn't encounter in practice, which might be more based around distributed representations, helixes etc. i tried training the same arch briefly from scratch and gradient descent didn't find the solution, would probably work with more degrees of freedom and enough effort.

@karpathy · Feb 25, 2026 · 12d ago · ID 34651264 · ⭐0.42

With the coming tsunami of demand for tokens, there are significant opportunities to orchestrate the underlying memory+compute *just right* for LLMs. The fundamental and non-obvious constraint is that due to the chip fabrication process, you get two completely distinct pools of memory (of different physical implementations too): 1) on-chip SRAM that sits immediately next to the compute units, which is incredibly fast but of very low capacity, and 2) off-chip DRAM, which has extremely high capacity, but the contents of which you can only suck through a long straw. On top of this, there are many details of the architecture (e.g. systolic arrays), numerics, etc.

The design of the optimal physical substrate and then the orchestration of memory+compute across the top-volume workflows of LLMs (inference prefill/decode, training/finetuning, etc.) with the best throughput/latency/$ is probably today's most interesting intellectual puzzle with the highest rewards (\cite 4.6T of NVDA). All of it to get many tokens, fast and cheap. Arguably, the workflow that may matter the most (inference decode *and* over long token contexts in tight agentic loops) is the one hardest to achieve simultaneously by the ~both camps of what exists today (HBM-first NVIDIA-adjacent and SRAM-first Cerebras-adjacent). Anyway, the MatX team is A++ grade, so it's my pleasure to have a small involvement, and congratulations on the raise!
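The "long straw" constraint has a simple back-of-envelope form: single-stream decode must stream every weight once per generated token, so DRAM/HBM bandwidth, not FLOPs, sets the ceiling. All numbers below are illustrative round figures, not vendor specs:

```python
# Memory-bound ceiling on single-stream decode throughput.
def decode_tokens_per_sec(param_bytes, mem_bw_bytes_per_sec):
    """Upper bound: one full read of the weights per generated token."""
    return mem_bw_bytes_per_sec / param_bytes

# A hypothetical 70B-parameter model at 2 bytes/param, fed by ~3 TB/s of HBM.
tps = decode_tokens_per_sec(70e9 * 2, 3e12)
print(round(tps, 1))  # ~21.4 tokens/s, regardless of how many FLOPs the chip has
```

Batching, quantization, and sparsity all attack the numerator or denominator of this ratio, which is why the SRAM-first and HBM-first camps make such different trade-offs for long-context agentic decode.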

@karpathy · Feb 27, 2026 · 10d ago · ID 25239822

Cool chart showing the ratio of Tab complete requests to Agent requests in Cursor. With improving capability, every point in time has an optimal setup that keeps changing and evolving, and the community average tracks that point. None -> Tab -> Agent -> Parallel agents -> Agent Teams (?) -> ??? If you're too conservative, you're leaving leverage on the table. If you're too aggressive, you're creating more chaos than useful work. The art of the process is spending 80% of the time getting work done in the setup you're comfortable with and that actually works, and 20% exploring what might be the next step up even if it doesn't work yet.

🖼️ Media