Your curated collection of saved posts and media

Showing 32 posts Β· last 14 days Β· by score

Naina Chaturvedi @NainaChaturved8
πŸ“… Sep 24, 2023 Β· 956d ago Β· πŸ†”51296533

βœ… How to Choose the Right Model in ML - Explained in simple terms. A quick thread πŸ§΅πŸ‘‡πŸ» #MachineLearning #Coding #100DaysofCode #deeplearning #DataScience PC: ResearchGate https://t.co/6m3r0BW2Hl

Media 1 Β· ❀️ 100 likes Β· πŸ” 32 retweets

Yann LeCun @ylecun
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”33316923

A piece of neural net hardware history: Edi SΓ€ckinger with Larry Jackel. Edi is holding a VME board with two Bell Labs ANNA chips (with golden plates on the left side). The ANNA chip is a mixed analog-digital chip specifically designed to run convolutional nets efficiently. It was designed in 1989 by Edi and Bernhard Boser. ANNA had 64 neurons with 64 inputs each. All 4096 weights could perform multiply-accumulate operations simultaneously at 1MHz, i.e. 4 billion operations per second (unheard of performance at the time). The weights were stored as voltages on capacitors, refreshed from external RAM with 6 bit depth. The activations were digital with 4 bit depth. The multipliers were essentially multiplying DACs and the accumulation was just currents on a wire. Shift registers allowed for efficient convolutions with minimal external memory traffic. This could run LeNet at 10,000 characters per second. The picture was taken earlier this year, but the chip was designed in 1989, and the board was built a year later. Papers: Chip: https://t.co/NoUyPVEBMd Board: https://t.co/MO1dEUXnZT

Media 1 Β· ❀️ 936 likes Β· πŸ” 146 retweets

AK @_akhaliq
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”69586899

VidChapters-7M: Video Chapters at Scale

paper page: https://t.co/Zfoq5LEihF

Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset.

Media 1 Β· ❀️ 238 likes Β· πŸ” 49 retweets

Luc Georges @LucSGeorges
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”45711158 Β· ⭐1.00

0.1.0 is out - huggingface-vscode is dead, long live llm-vscode! πŸ‘‘ llm-vscode is an open source extension integrating LLMs inside your favorite code editor with the goal of improving your efficiency and productivity. πŸ¦€ It is now powered by llm-ls, our Rust language server. https://t.co/Kzp5vEM75R

Media 1 Β· ❀️ 651 likes Β· πŸ” 98 retweets

Ian Tenney (@iftenney@sigmoid.social) @iftenney
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”06203241

🧡(1/6): Excited to announce the v1.0 release of the @GoogleAI Learning Interpretability Tool (πŸ”₯LIT), an interactive platform to debug, validate, and understand ML model behavior. This release brings exciting new features and a simplified Python API. https://t.co/vCfjONCfXh https://t.co/q24pXegmGQ

Media 1 Β· ❀️ 162 likes Β· πŸ” 37 retweets

AK @_akhaliq
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”02884543

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

paper page: https://t.co/nXtYagDaYv

Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.

Media 1 Β· ❀️ 142 likes Β· πŸ” 45 retweets

AK @_akhaliq
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”07569221

Exploring Large Language Models' Cognitive Moral Development through Defining Issues Test

paper page: https://t.co/XeTaNgdDCO

The development of large language models has instilled widespread interest among researchers in understanding their inherent reasoning and problem-solving capabilities. Despite a good amount of research aimed at elucidating these capabilities, there is still an appreciable gap in understanding the moral development and judgments of these models. The current approaches, which evaluate the ethical reasoning abilities of these models as a classification task, introduce numerous inaccuracies because of over-simplification. In this study, we build a psychological connection by bridging two disparate fields: human psychology and AI. We propose an effective evaluation framework to delineate a model's ethical reasoning ability in terms of moral consistency and Kohlberg's moral development stages, with the help of a psychometric assessment tool, the Defining Issues Test.

Media 1 Β· ❀️ 81 likes Β· πŸ” 15 retweets

FranΓ§ois Chollet @fchollet
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”95529129

Making some slides again. I like this one. https://t.co/mzsAkuS5N8

Media 1 Β· ❀️ 252 likes Β· πŸ” 23 retweets

Javier J. Rueda @javierjrueda
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”14629604

Sick AI-generated visuals at festivals. Real-world applications are getting interesting. πŸ’€ https://t.co/xZ0h2GRTQ9

❀️ 2,034 likes Β· πŸ” 317 retweets Β· πŸ–ΌοΈ Media

Jerry Liu @jerryjliu0
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”79050325

In the past few weeks we’ve created a ton of features/tutorials on β€œadvanced retrieval techniques” for better performing RAG systems. β€œOkay cool, but how do I know they do better than top-k?” πŸ€” Great point.

We’re launching an initiative to add retrieval/LLM evals to *all* of our advanced retrieval tutorials to show if they’re better πŸ’‘:
βœ… Recursive retrieval / node references
βœ… Auto-merging retriever

There’s more coming too:
πŸ—“οΈ Ensemble Retrieval
πŸ—“οΈ Hierarchical Retrieval
πŸ—“οΈ Document Agents

The burden of proof is on us to show that these techniques work better than the basic stuff, and also to show you how to create proper benchmarks and evaluate techniques in different settings! πŸ§ͺπŸ§‘β€πŸ”¬

Some super interesting findings so far πŸ”₯:
πŸ’‘ Using node references for retrieval, instead of the raw text, can improve hit-rate / MRR by 10-20%!!
πŸ’‘ The response quality from auto-merging retrieval is better than top-k, and we find that GPT-4 prefers it 65% of the time.

Node references notebook: https://t.co/y6XIdSSfhI
Auto merging notebook: https://t.co/JT7pE59t30
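
For readers who want to reproduce the hit-rate / MRR comparison mentioned above, here is a minimal, generic sketch of those two metrics. It is not code from the linked notebooks; the `retrieve` callable and the (question, expected_doc_id) evaluation pairs are placeholders you would supply from your own dataset.

```python
# Minimal sketch: hit-rate and MRR over (question, expected_doc_id) pairs.
# `retrieve(question)` is a stand-in for whichever retriever is being evaluated;
# it is assumed to return a ranked list of (node_id, score) tuples.

def evaluate_retriever(retrieve, eval_pairs, k=10):
    """eval_pairs: list of (question, expected_doc_id) tuples."""
    hits, reciprocal_ranks = 0, []
    for question, expected_id in eval_pairs:
        retrieved_ids = [node_id for node_id, _score in retrieve(question)][:k]
        if expected_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate": hits / len(eval_pairs),
        "mrr": sum(reciprocal_ranks) / len(eval_pairs),
    }
```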

Media 1 Β· Media 2 Β· ❀️ 252 likes Β· πŸ” 35 retweets

AK @_akhaliq
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”67925881

SCREWS: A Modular Framework for Reasoning with Revisions

paper page: https://t.co/LNnopxDcXW

Large language models (LLMs) can improve their accuracy on various tasks through iteratively refining and revising their output based on feedback. We observe that these revisions can introduce errors, in which case it is better to roll back to a previous result. Further, revisions are typically homogeneous: they use the same reasoning method that produced the initial answer, which may not correct errors. To enable exploration in this space, we present SCREWS, a modular framework for reasoning with revisions. It comprises three main modules: Sampling, Conditional Resampling, and Selection, each consisting of sub-modules that can be hand-selected per task. We show that SCREWS not only unifies several previous approaches under a common framework, but also reveals several novel strategies for identifying improved reasoning chains. We evaluate our framework with state-of-the-art LLMs (ChatGPT and GPT-4) on a diverse set of reasoning tasks and uncover useful new reasoning strategies for each: arithmetic word problems, multi-hop question answering, and code debugging. Heterogeneous revision strategies prove to be important, as does selection between original and revised candidates.
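
A rough sketch of how the three modules described in the abstract might compose, based only on the description above and not on the authors' code; `llm` is a hypothetical completion function and all prompts are illustrative.

```python
# Sketch of the Sampling -> Conditional Resampling -> Selection pipeline.
# `llm(prompt)` is a hypothetical text-completion callable supplied by the caller.

def screws_pipeline(question, llm):
    # Sampling: produce an initial answer (e.g. chain-of-thought).
    initial = llm(f"Question: {question}\nThink step by step, then answer.")

    # Conditional Resampling: decide whether to revise, and if so produce a
    # revision -- ideally with a *different* reasoning method (heterogeneous).
    verdict = llm(
        f"Question: {question}\nAnswer: {initial}\n"
        "Does this answer contain an error? Reply yes or no."
    )
    candidates = [initial]
    if verdict.strip().lower().startswith("yes"):
        candidates.append(
            llm(f"Question: {question}\nPrevious answer: {initial}\n"
                "Revise it using a different approach.")
        )

    # Selection: choose between the original and revised candidates,
    # so that a bad revision can be rolled back.
    reply = llm(
        f"Question: {question}\nCandidates:\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        + "\nReply with the index of the best candidate."
    )
    digits = "".join(ch for ch in reply if ch.isdigit())
    best = int(digits) if digits else 0
    return candidates[best % len(candidates)]
```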

Media 1 Β· ❀️ 151 likes Β· πŸ” 36 retweets

Teortaxes @teortaxesTex
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”70077364

People are so used to models trained for Huggingface Leaderboard they're in disbelief upon seeing a production-grade one. Maybe they shouldn't. Smol Qwens are samples of Tongyi Qianwen, not proofs of concept; to Alibaba, they're kind of like what T5-XXL is to Google. Alibaba is a world-class corporation actually trying, no BS, to capture the vast Chinese LLM assistant market. That's why this report talks so much about practical aspects and objectives they pursued, not only muh MMLU/HEval scores (and even with HEval, they go for the state-of-the-art HumanEvalPack benchmark). This paper, incomplete though it may be (it's particularly secretive about the dataset, understandably evoking extra suspicion), is a treasure trove of insight into almost-frontier proprietary LLMs. This is something like what we should've expected to see if @karpathy got his way and OpenAI published that small open-source model to teach the community a little share of their tricks. https://t.co/sQREvJ6dzz In the realm of LLaMA finetunes, @gigaml 's X1-Large and probably @XLangAI Lemur are comparable, but we know so much less about them. X1 is genuinely superior to LLama2-70B across the board, which is more than I can say for all the fancy imitative finetunes. As @iamgingertrash would probably argue, this is the difference in incentives.

Quoting @JagersbergKnut:

Qwen-14B beats larger models in benchmarks, LLM community wonders how @bimedotcom @TheAIObserverX @Khulood_Almani @debashis_dutta @sonu_monika @theomitsa @BetaMoroney @Analytics_699 @Shi4Tech @FmFrancoise @enilev @sallyeaves @IanLJones98 https://t.co/2oEInWaIiB https://t.co/SAh

Media 1 Β· Media 2 (+2 more) Β· ❀️ 270 likes Β· πŸ” 26 retweets

Jerry Liu @jerryjliu0
πŸ“… Sep 26, 2023 Β· 954d ago Β· πŸ†”45157812

Be careful of the eval metrics you use! ⚠️

One insight in trying different retrieval techniques to improve LLM/RAG apps is that certain techniques are clearly better via the eyeball test πŸ‘€, but not every eval metric reflects that (it finds both answers are β€œcorrect”, β€œfaithful to the context”, etc.)

We came up with a simple trick to better tease apart each technique - pass GPT-4 two responses, and see which one it prefers (or output a tie)! πŸ§‘β€βš–οΈβš–οΈ

Take this example below. Technique 1 (auto merging) clearly outputs more details than technique 2 (naive top-k) - see image 1. πŸ€”
But a semantic similarity evaluator yields the same numbers, or even worse (image 2)
πŸ’‘ Instead, let’s just ask GPT-4 which answer is better given the question (image 3)
Through this we see that our auto-merging retriever is preferred 65% of the time πŸ”₯ (image 4)

Note: this is similar to how reward data is collected for RLHF (except instead of humans we use GPT). Full guide here: https://t.co/JT7pE59t30
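
A minimal sketch of the pairwise-judge trick described in this post, using the OpenAI chat API as it looked when the post was written (the client interface has since changed); the prompt wording and result parsing are illustrative, not the LlamaIndex implementation.

```python
import openai  # 0.x-style client, current when this thread was posted

def pairwise_judge(question, answer_a, answer_b, model="gpt-4"):
    """Ask GPT-4 which answer it prefers; returns 'A', 'B', or 'TIE'."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better? Reply with exactly one of: A, B, TIE."
    )
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp["choices"][0]["message"]["content"].strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

Swapping the order of A and B and averaging over both orderings helps control for the judge's position bias.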

Media 1 Β· Media 2 (+2 more) Β· ❀️ 111 likes Β· πŸ” 17 retweets

LlamaIndex πŸ¦™ @llama_index
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”33249441

We’ve significantly upgraded our tools for evaluating RAG systems.
πŸ§ͺ Synthetically generate a β€œground-truth” RAG dataset
⚑️ Async: generate datasets/run evals 2-10x faster
πŸ‘₯ Brand-new Pairwise evaluator
πŸ“— Guides showing that our advanced retrievers > naive top-k
See 🧡 https://t.co/5Chi00rKV7

Media 1 Β· ❀️ 123 likes Β· πŸ” 15 retweets

Daniel Ek @eldsjal
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”81576006

Podcasters - what if I told you that you could offer your pod to any listener around the world, in their own local language, but still keep it in your own voice? That’s the pilot we’re launching @Spotify! It’s called Voice Translation and, using AI, it translates podcast episodes into other languages, all in the podcaster’s voice. It’s pretty insane. Take a look and let me know what you think! https://t.co/wwUYsd3Mgo

❀️ 7,726 likes Β· πŸ” 1,166 retweets Β· πŸ–ΌοΈ Media

Swarnadeep Saha @swarnaNLP
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”10549215

🚨ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs ChatGPT+Bard+Claude2 discuss & correctively convince each other for reasoning, even outperforming GPT4 (& GPT4 as agent improves by 10%)! https://t.co/apRH8C4ztz @cyjustinchen @mohitban47 🧡 https://t.co/uSoewg0NVL
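
The gist of the round-table idea, as a heavily simplified sketch rather than the authors' implementation: each agent is a hypothetical answer-from-prompt callable, and in practice you would extract and normalize a final answer before voting.

```python
# Rough sketch of a multi-LLM round-table discussion with consensus voting.
# `agents` maps a name (e.g. "chatgpt", "bard", "claude2") to a callable
# that takes a prompt string and returns an answer string.
from collections import Counter

def round_table(question, agents, rounds=3):
    answers = {name: agent(f"Answer with reasoning: {question}")
               for name, agent in agents.items()}
    for _ in range(rounds):
        for name, agent in agents.items():
            others = "\n".join(f"{n}: {a}" for n, a in answers.items() if n != name)
            answers[name] = agent(
                f"Question: {question}\nOther participants said:\n{others}\n"
                "Considering their reasoning, give your (possibly revised) answer."
            )
    # Consensus by simple majority over the final answers; real use would
    # parse out a canonical final answer from each agent before counting.
    return Counter(answers.values()).most_common(1)[0][0]
```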

Media 1 Β· ❀️ 170 likes Β· πŸ” 49 retweets

Philipp Schmid @_philschmid
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”94066200

Transformers Just Got Faster!⚑️ πŸš€Β Thrilled to announce native Flash Attention (FA) 2 support in Hugging Face Transformers to speed up training and inference for transformer models like LLaMA and Falcon up to 2x. πŸ¦™πŸ¦… πŸ‘‰Β  https://t.co/MXy0iIykD7 🧢 https://t.co/PCFS8A57ig
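
A hedged usage sketch: the `use_flash_attention_2` flag matches how this feature was exposed around the time of this announcement, though later Transformers versions switched to `attn_implementation="flash_attention_2"`. The model ID is just an example; FlashAttention 2 must be installed and a supported GPU is required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example model; any supported architecture works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Flag name as of this release; newer versions use attn_implementation="flash_attention_2".
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # FA2 requires fp16/bf16
    use_flash_attention_2=True,
    device_map="auto",
)

inputs = tokenizer("Flash Attention makes long contexts", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```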

Media 1 Β· ❀️ 214 likes Β· πŸ” 54 retweets

AK @_akhaliq
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”23596847

ChatGPT can now see, hear, and speak

blog: https://t.co/c4h5MHhv9f

We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about.

Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it. When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow up questions for a step by step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you.

We’re rolling out voice and images in ChatGPT to Plus and Enterprise users over the next two weeks. Voice is coming on iOS and Android (opt-in in your settings) and images will be available on all platforms.

❀️ 244 likes Β· πŸ” 48 retweets Β· πŸ–ΌοΈ Media

Joseph Thacker (rez0) @rez0__
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”69213343

GPT-4 with vision is here. I got to play with it early. Here's an example of prompt injection via image that is non-obvious to the user. Imagine payloads on signs, billboards and t-shirts. As we use LLMs to process the world, their operating context could be hijacked. https://t.co/oezGIXCmYR

Media 1 Β· ❀️ 89 likes Β· πŸ” 18 retweets

AK @_akhaliq
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”16093811

OpenAI releases GPT-4V(ision) system card

paper: https://t.co/lWqSHhlCUP

GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4, and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.

Media 1 Β· ❀️ 1,581 likes Β· πŸ” 375 retweets

OpenAI @OpenAI
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”29141022

ChatGPT can now see, hear, and speak. Rolling out over next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms). https://t.co/uNZjgbR5Bm https://t.co/paG0hMshXb

❀️ 40,821 likes Β· πŸ” 9,470 retweets Β· πŸ–ΌοΈ Media

Lightning AI ⚑️ @LightningAI
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”02826295

Looking for a small but very capable LLM to avoid out-of-memory errors? Lit-GPT now supports phi-1.5, a SOTA 1.3B parameter LLM!
⚑️ Finetune it on a single GPU
⚑️ Pretrain it on a cluster
⚑️ Just use it on a MacBook
More info: https://t.co/Yk8dKmpedk
#LLMs #MachineLearning #DeepLearning

Media 1 Β· ❀️ 128 likes Β· πŸ” 33 retweets

elvis @omarsar0
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”28931844

Introducing Colossal-LLaMA-2

Huge release by Colossal-AI. They present an open-source and commercial-free domain-specific LLM solution to build your own large-scale models at a much lower cost. It uses only approximately 0.0085 trillion (8.5B) tokens of data, about 15 hours of training, and training costs in the range of a few hundred dollars. This strategy led to a Chinese LLaMA-2 model outperforming competitors across multiple evaluation benchmarks.

Lots of new improvements in this release, including:
- vocabulary expansion and model initialization to extend to Chinese while preserving English language capabilities
- a complete data cleaning system and toolkit for selecting higher-quality data used to train the models
- a multi-stage, hierarchical continual pre-training scheme: 1) large-scale pretraining, 2) Chinese knowledge injection stage, 3) relevant knowledge replay stage; this approach ensures the model progresses equally in both Chinese and English abilities
- bucket training to ensure a balanced distribution of data

Personally, the most interesting bit of this release is the focus on, and possibility of, training lightweight domain-specific LLMs in a cost-effective way. This will unlock the ability to fine-tune these foundation models for all kinds of applications that meet specific business needs.

Check out the blog here: https://t.co/D5U2dBjcIx
ColossalAI repo: https://t.co/jatXbyQyby

Media 1 Β· ❀️ 442 likes Β· πŸ” 101 retweets

AK @_akhaliq
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”53505346

MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

paper page: https://t.co/hdQj5L9FsL

We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories.

Media 1 Β· ❀️ 191 likes Β· πŸ” 43 retweets

LlamaIndex πŸ¦™ @llama_index
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”33653962

We’re excited to release full native support for THREE @huggingface embedding models (s/o @LoganMarkewich):
🧱 Base @huggingface embeddings wrapper
πŸ§‘β€πŸ« Instructor embeddings
⚑️ Optimum embeddings (ONNX format)
Full thread below 🧡. Check out the guide: https://t.co/vuyEtqcfN0 https://t.co/KgVqOT08jL
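
A sketch of plugging a Hugging Face embedding model into an index; the import paths reflect the llama_index package layout around this release (they have since been reorganized), and the model name and data directory are placeholders.

```python
# Import paths as of the llama_index release announced here; the package
# layout has been reorganized in later versions.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# Local Hugging Face model used for embeddings (model name is an example).
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Build an index over a local folder of documents and query it.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What do these documents cover?"))
```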

Media 1 Β· Media 2 (+1 more) Β· ❀️ 124 likes Β· πŸ” 24 retweets

AK @_akhaliq
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”11724994

CodePlan: Repository-level Coding using LLMs and Planning

paper page: https://t.co/feH2jmSOa0

Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic framework, called CodePlan, to solve it. CodePlan synthesizes a multi-step chain of edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (between 2 and 97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan has a better match with the ground truth compared to baselines. CodePlan is able to get 5/6 repositories to pass the validity checks (e.g., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them.
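
A very rough sketch of the adaptive planning loop the abstract describes (seed edits, LLM edits with repository-derived context, change-impact analysis feeding new obligations back into the plan). This is not the authors' code; `build_context`, `apply_llm_edit`, and `find_dependents` are hypothetical callables supplied by the caller.

```python
# Sketch of an adaptive planning loop over repository-level edit obligations.
from collections import deque

def codeplan(seed_edits, repo, build_context, apply_llm_edit, find_dependents):
    """The three callables stand in for repository-context retrieval, an LLM
    edit at one code location, and change-impact analysis over dependencies."""
    plan = deque(seed_edits)      # initial edit obligations, e.g. migration sites
    seen = set(seed_edits)
    changes = []
    while plan:
        location = plan.popleft()
        context = build_context(repo, location)             # context from the whole repo
        changes.append(apply_llm_edit(repo, location, context))  # one LLM call per location
        # Any dependent location that may now be inconsistent becomes a new obligation.
        for dependent in find_dependents(repo, location):
            if dependent not in seen:
                seen.add(dependent)
                plan.append(dependent)
    return changes
```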

Media 1 Β· ❀️ 503 likes Β· πŸ” 105 retweets

AK @_akhaliq
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”98590085

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

paper page: https://t.co/aK9Tey589W

Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training that need to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.

Media 1 Β· ❀️ 43 likes Β· πŸ” 7 retweets

Robert Scoble @Scobleizer
πŸ“… Sep 25, 2023 Β· 955d ago Β· πŸ†”21680976

Want GPT to hallucinate less? I learned tonight that if you add β€œif you don’t know, say you don’t know” to your prompt that it will generate far fewer incorrect answers. AI pioneer @AdrianKaehler1 told me that at dinner. So I asked GPT why that worked. Here is its answer. Aren’t large language models fun to try to figure out?
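
One way to wire that instruction into a request, using the OpenAI client as it existed at the time; the model name and question are illustrative.

```python
import openai  # 0.x-style client (openai<1.0), current when this was posted

question = "Who won the 1937 Antarctic chess championship?"  # made-up question
system = (
    "Answer the user's question. "
    "If you don't know, say you don't know instead of guessing."
)
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])
```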

Media 1 Β· ❀️ 1,785 likes Β· πŸ” 274 retweets

Jim Fan @DrJimFan
πŸ“… Sep 24, 2023 Β· 956d ago Β· πŸ†”25503282

Let's reverse engineer the phenomenal Tesla Optimus. No insider info, just my own analysis. Long read:

1. The smooth hand movements are almost certainly trained by imitation learning ("behavior cloning") from human operators. The alternative is reinforcement learning in simulation, but that typically leads to jittery motion and unnatural hand poses. There're at least 4 ways to collect human demonstrations:

(1) A custom-built teleoperation system - I believe this is the most likely means used by the Tesla team. Open-source example: ALOHA, a low-cost bimanual robot arm and teleoperation system by Stanford AI Labs (https://t.co/8iXpiHVEjS). It enables very precise, dexterous motions, such as putting AAA batteries into a remote or manipulating a contact lens.

(2) Motion Capture (MoCap): apply the MoCap systems used for Hollywood movies to capture the fine-grained motions of hand joints. Optimus' 5-finger hand is a great design decision that enables a direct mapping - there is no "embodiment gap" from human operators. For instance, a demonstrator can wear a CyberGlove (https://t.co/S8hxErsEuU) and grasp the cubes on the table (as shown in video). CyberGlove will capture the motion signals & haptic feedback in real-time, which can be re-targeted onto Optimus.

(3) Wearing gloves & markers can be clumsy. An alternative way to do MoCap is through computer vision. DexPilot from NVIDIA enables marker-less and glove-free data collection. The human operator simply uses their bare hands to perform the tasks. 4 Intel RealSense depth cameras and 2 NVIDIA Titan XP GPUs (yeah, 2019 work) translate the pixels to precise motion signals for robot learning.

(4) VR Headset: turn the training room into a VR game, and let humans "role play" Optimus. Use the native VR controller or CyberGlove to control the virtual Optimus hands. This has the advantage of scalable remote data collection - annotators from around the world can contribute without coming onsite. The VR demonstration technique appeared in research projects like the iGibson home robot simulator, an initiative that I participated in at Stanford: https://t.co/eyI4ORkH6G

The above 4 are not mutually exclusive. Optimus could use a combo of them for different pros & cons.

2. Neural Architecture. Optimus is trained end-to-end: videos in, actions out. I'm quite sure it's implemented by a multimodal Transformer with the following components:

(1) Image: some variant of efficient ViT, or simply an old ResNet/EfficientNet backbone (https://t.co/L6PLTQJnGA). The block pick-and-place demo doesn't require sophisticated vision. The spatial feature map from the image backbone can be tokenized easily.

(2) Video: two ways. Either flatten the video into a sequence of images and produce tokens independently, or have a video-level tokenizer. There're numerous ways to efficiently process video pixel volumes. You don't necessarily need Transformer backbones, e.g. SlowFast Network (https://t.co/qDdXzqwJQp) and RubiksNet (https://t.co/CQU8D7TZgx, my paper at ECCV 2020, efficient CUDA shift primitives).

(3) Language: it's not clear if Optimus is language prompted. If it is, there needs to be a way to "fuse" the language representations into perception. FiLM is a very lightweight neural network module that serves this purpose (https://t.co/VI4TpgQ22V). You can think of it intuitively as a "cross attention" of language embedding into the image-processing neural pathway.

(4) Action tokenization: Optimus needs to convert the continuous motion signals into discrete tokens for the autoregressive Transformer to work. A few ways:
- Directly bin the continuous values for each hand joint control. [0, 0.01) -> token #0, [0.01, 0.02) -> token #1, etc. This is straightforward but could be inefficient due to the long sequence length.
- The joint movements are highly dependent on each other, which means they occupy a low-dimensional "state space". Apply VQVAE to the motion data to obtain a shorter-length, compressed token set.

(5) Putting the above pieces together, we have a Transformer controller that consumes video tokens (optionally with language modulation), and outputs action tokens, one step at a time. The next frame from the table is fed back to the Transformer, so it knows the consequence of its action. That gives the self-corrective ability shown in the demo. I believe the architecture is most similar to:
- Google RT-1: https://t.co/dpuon1bqU6
- NVIDIA VIMA: https://t.co/Tn3L63uGrv

3. Lastly, I'm genuinely impressed by the hardware quality. The motions are fluid, and the aesthetics are amazing as well. As I mentioned above, it's such a great decision to follow human morphology closely, so that there is no gap in imitating humans. Atlas from Boston Dynamics only has simple gripper-style hands. In the long run, Optimus' bi-dexterous, 5-finger hands will prove far superior in daily tasks.

Congrats to @Tesla_Optimus team & @elonmusk πŸŽ‰! I look forward to seeing the bots roam Mars some day 🦾
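
A minimal sketch of the "directly bin the continuous values" option for action tokenization described in point (4); the joint range and bin count are arbitrary placeholders, and the VQ-VAE route mentioned above would replace this with a learned codebook.

```python
import numpy as np

def bin_actions(joint_values, low=-1.0, high=1.0, n_bins=256):
    """Map continuous joint commands (shape [T, J]) to discrete token ids."""
    clipped = np.clip(joint_values, low, high)
    return ((clipped - low) / (high - low) * (n_bins - 1)).round().astype(int)

def unbin_actions(tokens, low=-1.0, high=1.0, n_bins=256):
    """Inverse mapping: token ids back to (approximate) continuous commands."""
    return low + tokens.astype(float) / (n_bins - 1) * (high - low)

# Example: 2 timesteps of a 5-joint hand; one token per joint per timestep,
# which is why this scheme produces long sequences.
actions = np.array([[0.12, -0.40, 0.00, 0.99, -1.00],
                    [0.15, -0.38, 0.02, 0.97, -1.00]])
tokens = bin_actions(actions)
recovered = unbin_actions(tokens)
```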

❀️ 3,180 likes Β· πŸ” 589 retweets Β· πŸ–ΌοΈ Media

Jingxuan He @jingxuan_he
πŸ“… Sep 24, 2023 Β· 956d ago Β· πŸ†”01158339

Here is the Hugging Face page of our paper on the security of code generation using LLMs: https://t.co/jnx3m83wgR https://t.co/5zbnOLD1V9

Media 1 Β· ❀️ 38 likes Β· πŸ” 6 retweets

Jeremy Howard @jeremyphoward
πŸ“… Sep 24, 2023 Β· 956d ago Β· πŸ†”23450924

I wanted ChatGPT to show how to get likes/views ratio for a bunch of YouTube videos, without dealing with the hassle of YouTube's Data API limits. But it didn't want to, because it claimed screen scraping is against the YouTube ToS. So I lied to ChatGPT. https://t.co/nLzW9cDA2O

Media 1 Β· ❀️ 621 likes Β· πŸ” 24 retweets

Teknium (e/Ξ») @Teknium1
πŸ“… Sep 24, 2023 Β· 956d ago Β· πŸ†”95674809

LDJ made a more accurate consolidation of very similar topics from this list: https://t.co/HotXODaBgO

Quoting @dmvaldman:

Looking for use-cases people actually have for LLMs? The folks from Vicuna did the number crunching for you! (from their recent 1M chat dataset) Cluster 9: Requests for explicit and erotic storytelling Cluster 20: Inquiries about specific plant growth conditions go go go! http

Media 1 Β· ❀️ 139 likes Β· πŸ” 17 retweets