Your curated collection of saved posts and media
How to Choose the Right Model in ML - explained in simple terms. A quick thread 🧵👇 #MachineLearning #Coding #100DaysofCode #deeplearning #DataScience PC: ResearchGate https://t.co/6m3r0BW2Hl
A piece of neural net hardware history: Edi Säckinger with Larry Jackel. Edi is holding a VME board with two Bell Labs ANNA chips (with golden plates on the left side). The ANNA chip is a mixed analog-digital chip specifically designed to run convolutional nets efficiently. It was designed in 1989 by Edi and Bernhard Boser. ANNA had 64 neurons with 64 inputs each. All 4096 weights could perform multiply-accumulate operations simultaneously at 1MHz, i.e. 4 billion operations per second (unheard of performance at the time). The weights were stored as voltages on capacitors, refreshed from external RAM with 6 bit depth. The activations were digital with 4 bit depth. The multipliers were essentially multiplying DACs and the accumulation was just currents on a wire. Shift registers allowed for efficient convolutions with minimal external memory traffic. This could run LeNet at 10,000 characters per second. The picture was taken earlier this year, but the chip was designed in 1989, and the board was built a year later. Papers: Chip: https://t.co/NoUyPVEBMd Board: https://t.co/MO1dEUXnZT
VidChapters-7M: Video Chapters at Scale paper page: https://t.co/Zfoq5LEihF Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset.
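The three tasks are easier to picture against a concrete record. A minimal sketch of what a user-chaptered video and the task signatures might look like; the field names are illustrative, not the released schema:

```python
# Hypothetical sketch of a user-chaptered video record and the three tasks
# defined on it (field names are illustrative, not the released schema).
video = {
    "video_id": "abc123",
    "duration_s": 912.0,
    "chapters": [
        {"start_s": 0.0,   "end_s": 85.0,  "title": "Intro"},
        {"start_s": 85.0,  "end_s": 410.0, "title": "Building the dataset"},
        {"start_s": 410.0, "end_s": 912.0, "title": "Benchmarks and results"},
    ],
}

def chapter_generation(video_frames, asr_transcript):
    """Task 1: predict both segment boundaries and a title per segment."""
    ...

def title_given_boundaries(video_frames, asr_transcript, start_s, end_s):
    """Task 2: generate a chapter title for an annotated segment."""
    ...

def chapter_grounding(video_frames, asr_transcript, title):
    """Task 3: localize (start_s, end_s) of the chapter matching the title."""
    ...
```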
0.1.0 is out - huggingface-vscode is dead, long live llm-vscode! llm-vscode is an open source extension integrating LLMs inside your favorite code editor with the goal of improving your efficiency and productivity. It is now powered by llm-ls, our Rust language server. https://t.co/Kzp5vEM75R
🧵 (1/6): Excited to announce the v1.0 release of the @GoogleAI Learning Interpretability Tool (🔥LIT), an interactive platform to debug, validate, and understand ML model behavior. This release brings exciting new features and a simplified Python API. https://t.co/vCfjONCfXh https://t.co/q24pXegmGQ
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention paper page: https://t.co/nXtYagDaYv Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.
Exploring Large Language Models' Cognitive Moral Development through Defining Issues Test paper page: https://t.co/XeTaNgdDCO The development of large language models has instilled widespread interest among researchers to understand their inherent reasoning and problem-solving capabilities. Despite a good amount of research to elucidate these capabilities, there is still an appreciable gap in understanding the moral development and judgments of these models. The current approaches that evaluate the ethical reasoning abilities of these models as a classification task introduce numerous inaccuracies because of over-simplification. In this study, we built a psychological connection by bridging two disparate fields: human psychology and AI. We proposed an effective evaluation framework which can help to delineate the model's ethical reasoning ability in terms of moral consistency and Kohlberg's moral development stages, with the help of a psychometric assessment tool, the Defining Issues Test.
Making some slides again. I like this one. https://t.co/mzsAkuS5N8
Sick AI-generated visuals at festivals. Real-world applications are getting interesting. https://t.co/xZ0h2GRTQ9
In the past few weeks we've created a ton of features/tutorials on "advanced retrieval techniques" for better performing RAG systems. "Okay cool, but how do I know they do better than top-k?" Great point. We're launching an initiative to add retrieval/LLM evals to *all* of our advanced retrieval tutorials to show if they're better:
- Recursive retrieval / node references
- Auto-merging retriever
There's more coming too:
- Ensemble Retrieval
- Hierarchical Retrieval
- Document Agents
The burden of proof is on us to show that these techniques work better than the basic stuff, and also to show you how to create proper benchmarks and evaluate techniques in different settings!
Some super interesting findings so far:
- Using node references for retrieval, instead of the raw text, can improve hit-rate / MRR by 10-20% (a minimal metric sketch follows below)!
- The response quality from auto-merging retrieval is better than top-k, and we find that GPT-4 prefers it 65% of the time.
Node references notebook: https://t.co/y6XIdSSfhI
Auto merging notebook: https://t.co/JT7pE59t30
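For reference, hit-rate and MRR are cheap to compute yourself. A minimal sketch, assuming a generic `retriever.retrieve(query) -> list of node ids` interface and a list of (query, expected_id) pairs; the actual LlamaIndex evaluator classes differ:

```python
# Hit-rate: fraction of queries where the expected node appears in the top-k.
# MRR: mean of 1/rank of the expected node (0 if it is not retrieved at all).
def evaluate_retriever(retriever, qa_pairs, k=10):
    hits, reciprocal_ranks = 0, []
    for query, expected_id in qa_pairs:
        retrieved_ids = retriever.retrieve(query)[:k]   # assumed interface
        if expected_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate": hits / len(qa_pairs),
        "mrr": sum(reciprocal_ranks) / len(qa_pairs),
    }
```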

SCREWS: A Modular Framework for Reasoning with Revisions paper page: https://t.co/LNnopxDcXW Large language models (LLMs) can improve their accuracy on various tasks through iteratively refining and revising their output based on feedback. We observe that these revisions can introduce errors, in which case it is better to roll back to a previous result. Further, revisions are typically homogeneous: they use the same reasoning method that produced the initial answer, which may not correct errors. To enable exploration in this space, we present SCREWS, a modular framework for reasoning with revisions. It is comprised of three main modules: Sampling, Conditional Resampling, and Selection, each consisting of sub-modules that can be hand-selected per task. We show that SCREWS not only unifies several previous approaches under a common framework, but also reveals several novel strategies for identifying improved reasoning chains. We evaluate our framework with state-of-the-art LLMs (ChatGPT and GPT-4) on a diverse set of reasoning tasks and uncover useful new reasoning strategies for each: arithmetic word problems, multi-hop question answering, and code debugging. Heterogeneous revision strategies prove to be important, as does selection between original and revised candidates.
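A hedged sketch of how the three SCREWS modules could compose, with Sampling, Conditional Resampling, and Selection passed in as plain functions; the paper defines concrete sub-modules for each:

```python
# Not the authors' implementation - just the module composition the abstract describes.
def screws(question, llm, sample, conditional_resample, select):
    # Sampling: produce an initial answer (e.g. chain-of-thought, sub-question decomposition)
    initial = sample(llm, question)
    # Conditional Resampling: decide whether to revise, and if so produce a revision
    revised = conditional_resample(llm, question, initial)
    # Selection: choose between original and revised candidates, since revisions can hurt
    return select(llm, question, candidates=[initial, revised])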
People are so used to models trained for Huggingface Leaderboard they're in disbelief upon seeing a production-grade one. Maybe they shouldn't. Smol Qwens are samples of Tongyi Qianwen, not proofs of concept; to Alibaba, they're kind of like what T5-XXL is to Google. Alibaba is a world-class corporation actually trying, no BS, to capture the vast Chinese LLM assistant market. That's why this report talks so much about practical aspects and objectives they pursued, not only muh MMLU/HEval scores (and even with HEval, they go for the state-of-the-art HumanEvalPack benchmark). This paper, incomplete though it may be (it's particularly secretive about the dataset, understandably evoking extra suspicion), is a treasure trove of insight into almost-frontier proprietary LLMs. This is something like what we should've expected to see if @karpathy got his way and OpenAI published that small open-source model to teach the community a little share of their tricks. https://t.co/sQREvJ6dzz In the realm of LLaMA finetunes, @gigaml 's X1-Large and probably @XLangAI Lemur are comparable, but we know so much less about them. X1 is genuinely superior to LLama2-70B across the board, which is more than I can say for all the fancy imitative finetunes. As @iamgingertrash would probably argue, this is the difference in incentives.
Qwen-14B beats larger models in benchmarks, LLM community wonders how @bimedotcom @TheAIObserverX @Khulood_Almani @debashis_dutta @sonu_monika @theomitsa @BetaMoroney @Analytics_699 @Shi4Tech @FmFrancoise @enilev @sallyeaves @IanLJones98 https://t.co/2oEInWaIiB https://t.co/SAh

Be careful of the eval metrics you use!
One insight from trying different retrieval techniques to improve LLM/RAG apps is that certain techniques are clearly better via the eyeball test, but not every eval metric reflects that (it finds both answers are "correct", "faithful to the context", etc.)
We came up with a simple trick to better tease apart each technique - pass GPT-4 two responses, and see which one it prefers (or output a tie)!
Take this example below. Technique 1 (auto-merging) clearly outputs more details than technique 2 (naive top-k) - see image 1.
But a semantic similarity evaluator yields the same numbers, or even worse (image 2).
Instead, let's just ask GPT-4 which answer is better given the question (image 3).
Through this we see that our auto-merging retriever is preferred 65% of the time (image 4).
Note: this is similar to how reward data is collected for RLHF (except instead of humans we use GPT).
Full guide here: https://t.co/JT7pE59t30
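The pairwise trick itself is a few lines. A sketch assuming the 2023-era openai SDK (newer SDK versions use a different client API):

```python
# Pairwise LLM-as-judge: ask GPT-4 which of two answers it prefers, or a tie.
import openai

def pairwise_judge(question, answer_a, answer_b):
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better? Reply with exactly 'A', 'B', or 'TIE'."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()
```

Running the judge twice with the A/B order swapped and averaging the verdicts is a cheap way to reduce position bias.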

We've significantly upgraded our tools for evaluating RAG systems.
- Synthetically generate a "ground-truth" RAG dataset (sketch of the pattern below)
- Async: generate datasets/run evals 2-10x faster
- Brand-new pairwise evaluator
- Guides showing that our advanced retrievers > naive top-k
See thread: https://t.co/5Chi00rKV7
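The synthetic "ground-truth" idea can be sketched generically: ask an LLM to write one question per chunk, then reuse the (question, chunk_id) pairs as retrieval eval data. This illustrates the pattern only; it is not the LlamaIndex generator itself:

```python
# Build a synthetic eval set: one LLM-written question per chunk, with the
# source chunk id recorded as the expected retrieval target.
def build_synthetic_dataset(chunks, ask_llm):
    dataset = []
    for chunk_id, text in chunks:
        question = ask_llm(
            "Write one specific question that can only be answered "
            f"from the following passage:\n\n{text}"
        )
        dataset.append({"question": question, "expected_chunk_id": chunk_id})
    return dataset
```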
Podcasters - what if I told you that you could offer your pod to any listener around the world, in their own local language, but still keep it in your own voice? That's the pilot we're launching @Spotify! It's called Voice Translation and, using AI, it translates podcast episodes into alternate languages, all in the podcaster's voice. It's pretty insane. Take a look and let me know what you think! https://t.co/wwUYsd3Mgo
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs. ChatGPT+Bard+Claude2 discuss and convince each other to correct their reasoning, even outperforming GPT-4 (& GPT-4 as an agent improves by 10%)! https://t.co/apRH8C4ztz @cyjustinchen @mohitban47 🧵 https://t.co/uSoewg0NVL
Transformers Just Got Faster! ⚡️ Thrilled to announce native Flash Attention 2 (FA2) support in Hugging Face Transformers to speed up training and inference for transformer models like LLaMA and Falcon by up to 2x. https://t.co/MXy0iIykD7 🧶 https://t.co/PCFS8A57ig
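A sketch of how this is switched on, assuming a late-2023 Transformers release with the flash-attn package and a supported GPU; the flag was first exposed as `use_flash_attention_2` and later replaced by `attn_implementation="flash_attention_2"`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,     # FA2 requires fp16 or bf16
    use_flash_attention_2=True,    # newer versions: attn_implementation="flash_attention_2"
    device_map="auto",             # requires accelerate
)
```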
ChatGPT can now see, hear, and speak blog: https://t.co/c4h5MHhv9f We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you're talking about. Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while traveling and have a live conversation about what's interesting about it. When you're home, snap pictures of your fridge and pantry to figure out what's for dinner (and ask follow-up questions for a step-by-step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you. We're rolling out voice and images in ChatGPT to Plus and Enterprise users over the next two weeks. Voice is coming on iOS and Android (opt-in in your settings) and images will be available on all platforms.
GPT-4 with vision is here. I got to play with it early. Here's an example of prompt injection via image that is non-obvious to the user. Imagine payloads on signs, billboards and t-shirts. As we use LLMs to process the world, their operating context could be hijacked. https://t.co/oezGIXCmYR
OpenAI releases GPT-4V(ision) system card paper: https://t.co/lWqSHhlCUP GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.
ChatGPT can now see, hear, and speak. Rolling out over next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms). https://t.co/uNZjgbR5Bm https://t.co/paG0hMshXb
Looking for a small but very capable LLM to avoid out-of-memory errors? Lit-GPT now supports phi-1.5, a SOTA 1.3B parameter LLM!
- Finetune it on a single GPU
- Pretrain it on a cluster
- Just use it on a MacBook
More info: https://t.co/Yk8dKmpedk
#LLMs #MachineLearning #DeepLearning
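If you just want to poke at phi-1.5 quickly, here is a hedged sketch via Hugging Face Transformers rather than Lit-GPT's own scripts (the checkpoint required trust_remote_code at release):

```python
# Quick phi-1.5 smoke test via Transformers (not the Lit-GPT workflow).
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

inputs = tok("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0]))
```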
Introducing Colossal-LLaMA-2
Huge release by Colossal-AI. They present an open-source and commercial-free domain-specific LLM solution to build your own large-scale models at a much lower cost. It uses only about 8.5 billion (0.0085 trillion) tokens of data and roughly 15 hours of training, incurring costs in the range of a few hundred dollars. This strategy led to a Chinese LLaMA-2 model outperforming competitors across multiple evaluation benchmarks.
Lots of new improvements in this release, including:
- vocabulary expansion and model initialization to extend to Chinese while preserving English language capabilities (a rough sketch of this step appears below)
- a complete data cleaning system and toolkit for selecting higher-quality data used to train the models
- a multi-stage, hierarchical continual pre-training scheme: 1) large-scale pretraining, 2) a Chinese knowledge injection stage, 3) a relevant knowledge replay stage; this approach ensures the model progresses equally in both Chinese and English abilities
- bucket training to ensure a balanced distribution of data
Personally, the most interesting bit of this release is the focus on training lightweight domain-specific LLMs in a cost-effective way. This will unlock the ability to fine-tune these foundation models for all kinds of applications that meet specific business needs.
Check out the blog here: https://t.co/D5U2dBjcIx
ColossalAI repo: https://t.co/jatXbyQyby
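A rough sketch of the vocabulary-expansion idea, not Colossal-AI's actual code: add new Chinese tokens to a LLaMA-2 tokenizer and resize the embedding matrix, so existing English weights are kept while the new rows get learned during continual pretraining:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

new_tokens = ["你好", "机器学习"]   # illustrative; real vocab extensions are much larger
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows start randomly initialized
```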
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation paper page: https://t.co/hdQj5L9FsL We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories.
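A toy sketch of the mask-extraction step as described (average an object prompt's cross-attention maps over layers and diffusion steps, then threshold); the paper's actual aggregation and edge-aware refinement are more involved:

```python
import numpy as np

def instance_mask(cross_attn_maps, threshold=0.5):
    # cross_attn_maps: array of shape (layers, timesteps, H, W) for one object prompt
    avg = cross_attn_maps.mean(axis=(0, 1))                    # aggregate over layers and steps
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)   # normalize to [0, 1]
    return avg > threshold                                     # boolean instance mask
```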
We're excited to release full native support for THREE @huggingface embedding models (s/o @LoganMarkewich):
- Base @huggingface embeddings wrapper (quick usage sketch below)
- Instructor embeddings
- Optimum embeddings (ONNX format)
Full thread below. Check out the guide: https://t.co/vuyEtqcfN0 https://t.co/KgVqOT08jL
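A minimal usage sketch of the base wrapper, assuming a 2023-era (pre-0.10) LlamaIndex install; import paths changed in later releases:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# Use a local Hugging Face embedding model instead of the default OpenAI embeddings.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
service_context = ServiceContext.from_defaults(embed_model=embed_model)

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs, service_context=service_context)
```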

CodePlan: Repository-level Coding using LLMs and Planning paper page: https://t.co/feH2jmSOa0 Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic framework, called CodePlan, to solve it. CodePlan synthesizes a multi-step chain of edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (between 2 and 97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan has better match with the ground truth compared to baselines. CodePlan is able to get 5/6 repositories to pass the validity checks (e.g., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them.
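A rough sketch of the plan-and-edit loop the abstract describes (not the authors' implementation): seed a frontier with the task, edit one code location at a time with repo-derived context, and let dependency/impact analysis enqueue newly affected locations:

```python
# Hypothetical helpers (gather_context, llm_edit, may_impact, repo.apply) stand in
# for CodePlan's incremental dependency analysis, LLM calls, and change application.
def codeplan(repo, seed_task, gather_context, llm_edit, may_impact):
    frontier = [seed_task]                           # code locations + instructions to edit
    while frontier:
        step = frontier.pop(0)
        context = gather_context(repo, step)         # context derived from the whole repository
        change = llm_edit(step, context)             # localized LLM edit for this step
        repo.apply(change)
        frontier.extend(may_impact(repo, change))    # impact analysis adds newly affected locations
    return repo
```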
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model paper page: https://t.co/aK9Tey589W Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training that need to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
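To make the "sparse sub-network" idea concrete, here is a generic magnitude-pruning mask in PyTorch; the paper's adaptive masking instead re-estimates masks during training rather than fixing them up front:

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)                    # number of weights to prune
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()             # 1 = keep, 0 = prune

w = torch.randn(256, 256)
mask = magnitude_mask(w, sparsity=0.7)
w_sparse = w * mask                                        # the pruned "sub-network" weights
```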
Want GPT to hallucinate less? I learned tonight that if you add "if you don't know, say you don't know" to your prompt, it will generate far fewer incorrect answers. AI pioneer @AdrianKaehler1 told me that at dinner. So I asked GPT why that worked. Here is its answer. Aren't large language models fun to try to figure out?
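The trick is just a prompt addition. A sketch with the 2023-era openai SDK; the model choice and wording are illustrative:

```python
import openai

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "If you don't know the answer, say you don't know instead of guessing."},
        # Trick question: the Tour de France only began in 1903.
        {"role": "user", "content": "Who won the 1897 Tour de France?"},
    ],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])
```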
Let's reverse engineer the phenomenal Tesla Optimus. No insider info, just my own analysis. Long read:

1. The smooth hand movements are almost certainly trained by imitation learning ("behavior cloning") from human operators. The alternative is reinforcement learning in simulation, but that typically leads to jittery motion and unnatural hand poses. There're at least 4 ways to collect human demonstrations:

(1) A custom-built teleoperation system - I believe this is the most likely means used by the Tesla team. Open-source example: ALOHA, a low-cost bimanual robot arm and teleoperation system by Stanford AI Labs (https://t.co/8iXpiHVEjS). It enables very precise, dexterous motions, such as putting AAA batteries into a remote or manipulating contact lenses.

(2) Motion Capture (MoCap): apply the MoCap systems used for Hollywood movies to capture the fine-grained motions of hand joints. Optimus' 5-finger hand is a great design decision that enables a direct mapping - there is no "embodiment gap" from human operators. For instance, a demonstrator can wear a CyberGlove (https://t.co/S8hxErsEuU) and grasp the cubes on the table (as shown in the video). CyberGlove will capture the motion signals & haptic feedback in real time, which can be re-targeted onto Optimus.

(3) Wearing gloves & markers can be clumsy. An alternative way to do MoCap is through computer vision. DexPilot from NVIDIA enables marker-less and glove-free data collection. The human operator simply uses their bare hands to perform the tasks. 4 Intel RealSense depth cameras and 2 NVIDIA Titan XP GPUs (yeah, 2019 work) translate the pixels to precise motion signals for robot learning.

(4) VR Headset: turn the training room into a VR game, and let humans "role play" Optimus. Use the native VR controller or CyberGlove to control the virtual Optimus hands. This has the advantage of scalable remote data collection - annotators from around the world can contribute without coming onsite. The VR demonstration technique appeared in research projects like the iGibson home robot simulator, an initiative that I participated in at Stanford: https://t.co/eyI4ORkH6G

The above 4 are not mutually exclusive. Optimus could use a combo of them for different pros & cons.

2. Neural Architecture. Optimus is trained end-to-end: videos in, actions out. I'm quite sure it's implemented by a multimodal Transformer with the following components:

(1) Image: some variant of efficient ViT, or simply an old ResNet/EfficientNet backbone (https://t.co/L6PLTQJnGA). The block pick-and-place demo doesn't require sophisticated vision. The spatial feature map from the image backbone can be tokenized easily.

(2) Video: two ways. Either flatten the video into a sequence of images and produce tokens independently, or have a video-level tokenizer. There're numerous ways to efficiently process video pixel volumes. You don't necessarily need Transformer backbones, e.g. SlowFast Network (https://t.co/qDdXzqwJQp) and RubiksNet (https://t.co/CQU8D7TZgx, my paper at ECCV 2020, efficient CUDA shift primitives).

(3) Language: it's not clear if Optimus is language prompted. If it is, there needs to be a way to "fuse" the language representations into perception. FiLM is a very lightweight neural network module that serves this purpose (https://t.co/VI4TpgQ22V). You can think of it intuitively as a "cross attention" of language embedding into the image-processing neural pathway.
(4) Action tokenization: Optimus needs to convert the continuous motion signals into discrete tokens for the autoregressive Transformer to work. A few ways:
- Directly bin the continuous values for each hand joint control. [0, 0.01) -> token #0, [0.01, 0.02) -> token #1, etc. This is straightforward but could be inefficient due to the long sequence length (a toy sketch of this binning appears after the thread).
- The joint movements are highly dependent on each other, which means they occupy a low-dimensional "state space". Apply a VQ-VAE to the motion data to obtain a shorter, compressed token set.

(5) Putting the above pieces together, we have a Transformer controller that consumes video tokens (optionally with language modulation), and outputs action tokens, one step at a time. The next frame from the table is fed back to the Transformer, so it knows the consequence of its action. That gives the self-corrective ability shown in the demo. I believe the architecture is most similar to:
- Google RT-1: https://t.co/dpuon1bqU6
- NVIDIA VIMA: https://t.co/Tn3L63uGrv

3. Lastly, I'm genuinely impressed by the hardware quality. The motions are fluid, and the aesthetics are amazing as well. As I mentioned above, it's such a great decision to follow human morphology closely, so that there is no gap in imitating humans. Atlas from Boston Dynamics only has simple gripper-style hands. In the long run, Optimus' bi-dexterous, 5-finger hands will prove far superior in daily tasks.

Congrats to the @Tesla_Optimus team & @elonmusk! I look forward to seeing the bots roam Mars some day 🦾
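A toy version of the binning scheme from point (4): map each continuous joint command in [0, 1) to one of 100 discrete tokens and back; a VQ-VAE would instead learn a compressed codebook:

```python
import numpy as np

NUM_BINS = 100  # bin width 0.01, matching the [0, 0.01) -> token #0 example above

def actions_to_tokens(joint_values):
    # joint_values assumed normalized to [0, 1); each value maps to one token id
    return np.clip((np.asarray(joint_values) * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def tokens_to_actions(tokens):
    # decode back to continuous values at the bin centers
    return (np.asarray(tokens) + 0.5) / NUM_BINS

tokens = actions_to_tokens([0.003, 0.512, 0.999])   # -> [0, 51, 99]
```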
Here is the Hugging Face page of our paper on the security of code generation using LLMs: https://t.co/jnx3m83wgR https://t.co/5zbnOLD1V9
I wanted ChatGPT to show how to get likes/views ratio for a bunch of YouTube videos, without dealing with the hassle of YouTube's Data API limits. But it didn't want to, because it claimed screen scraping is against the YouTube ToS. So I lied to ChatGPT. https://t.co/nLzW9cDA2O
LDJ made a more accurate consolidation of this list, merging topics that were very similar: https://t.co/HotXODaBgO
Looking for use-cases people actually have for LLMs? The folks from Vicuna did the number crunching for you! (from their recent 1M chat dataset)
Cluster 9: Requests for explicit and erotic storytelling
Cluster 20: Inquiries about specific plant growth conditions
go go go! http