Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Alessio + swyx

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

  1. 4 days ago

    AI Magic: Shipping 1000s of successful products with no managers and a team of 12 — Jeremy Howard of Answer.ai

    Disclaimer: We recorded this episode ~1.5 months ago, timed for the FastHTML release. It then got bottlenecked by the Llama 3.1, Winds of AI Winter, and SAM2 episodes, so we're a little late. Since then FastHTML was released, swyx is building an app in it for AINews, and Anthropic has also released their prompt caching API.

    Remember when Dylan Patel of SemiAnalysis coined the GPU Rich vs GPU Poor war? (If not, see our pod with him.) The idea was that if you're GPU poor you shouldn't waste your time trying to solve GPU rich problems (i.e. pre-training large models) and are better off working on fine-tuning, optimized inference, etc. Jeremy Howard (see our "End of Finetuning" episode to catch up on his background) and Eric Ries founded Answer.AI to do exactly that: "Practical AI R&D", which is very in line with the GPU poor needs. For example, one of their first releases was a system based on FSDP + QLoRA that let anyone train a 70B model on two NVIDIA 4090s.

    Since then, they have come out with a long list of super useful projects (in no particular order, and non-exhaustive):

    * FSDP QDoRA: just as memory efficient and scalable as FSDP/QLoRA, and critically also as accurate for continued pre-training as full weight training.
    * Cold Compress: a KV cache compression toolkit that lets you scale sequence length without impacting speed.
    * colbert-small: state of the art retriever at only 33M params.
    * JaColBERTv2.5: a new state-of-the-art retriever on all Japanese benchmarks.
    * gpu.cpp: portable GPU compute for C++ with WebGPU.
    * Claudette: a better Anthropic API SDK.

    They also recently released FastHTML, a new way to create modern interactive web apps. Jeremy put out a 1 hour "Getting started" tutorial on YouTube; while this isn't AI related per se, it's close to home for any AI Engineer looking to iterate quickly on new products.

    In this episode we broke down 1) how they recruit, 2) how they organize what to research, and 3) how the community comes together. At the end, Jeremy gave us a sneak peek at something new that he's working on that he calls dialogue engineering:

    "So I've created a new approach. It's not called prompt engineering. I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it."

    He explains it a bit more at ~44:53 in the pod, but we'll just have to wait for the public release to figure out exactly what he means.
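    For readers who haven't tried FastHTML yet, here is a minimal sketch of what a "hello world" app looks like, modeled on the public quickstart; treat it as illustrative, since the exact API may differ from the current release.

```python
# Minimal FastHTML app sketch (illustrative; check the official FastHTML docs for the current API).
# fast_app() returns an app plus a route decorator; handlers return Python objects that render to HTML.
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # Titled wraps the content in a page with a <title> and an <h1>
    return Titled("Hello, Latent Space", P("A tiny interactive web app in pure Python."))

serve()  # starts a local dev server
```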
    Timestamps * [00:00:00] Intro by Suno AI * [00:03:02] Continuous Pre-Training is Here * [00:06:07] Schedule-Free Optimizers and Learning Rate Schedules * [00:07:08] Governance and Structural Issues within OpenAI and Other AI Labs * [00:13:01] How Answer.ai works * [00:23:40] How to Recruit Productive Researchers * [00:27:45] Building a new BERT * [00:31:57] FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models * [00:36:36] Research and Development on Model Inference Optimization * [00:39:49] FastHTML for Web Application Development * [00:46:53] AI Magic & Dialogue Engineering * [00:52:19] AI wishlist & predictions

    Show Notes * Jeremy Howard * Previously on Latent Space: The End of Finetuning, NeurIPS Startups * Answer.ai * Fast.ai * FastHTML * answerai-colbert-small-v1 * gpu.cpp * Eric Ries * Aaron DeFazio * Yi Tay * Less Wright * Benjamin Warner * Benjamin Clavié * Jono Whitaker * Austin Huang * Eric Gilliam * Tim Dettmers * Colin Raffel * Mark Saroufim * Sebastian Raschka * Carson Gross * Simon Willison * Sepp Hochreiter * Llama3.1 episode * Snowflake Arctic * Ranger Optimizer * Gemma.cpp * HTMX * UL2 * BERT * DeBERTa * Efficient finetuning of Llama 3 with FSDP QDoRA * xLSTM

    Transcript

    Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

    Swyx [00:00:14]: And today we'

    59 min
  2. Aug 7

    Segment Anything 2: Demo-first Model Development

    Because of the nature of SAM, this is more video heavy than usual. See our YouTube!

    Because vision is first among equals in multimodality, and yet SOTA vision language models are closed, we've always had an interest in learning what's next in vision. Our first viral episode was Segment Anything 1, and we have since covered LLaVA, IDEFICS, Adept, and Reka. But just like with Llama 3, FAIR holds a special place in our hearts as the New Kings of Open Source AI.

    The list of sequels better than the originals is usually very short, but SAM 2 delighted us: not only is it a better image segmentation model than SAM 1, it also conclusively and inexpensively solved video segmentation in just as elegant a way as SAM 1 did for images, and everything was released to the community under Apache 2.0 / CC BY 4.0. "In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM)."

    Surprisingly Efficient

    The paper reports that SAM 2 was trained on 256 A100 GPUs for 108 hours (59% more than SAM 1). Taking the upper-end $2/hour A100 cost off gpulist.ai means SAM 2 cost ~$50k to train if it had an external market-rate cost - surprisingly cheap for adding video understanding (quick math below). The newly released SA-V dataset is also the largest video segmentation dataset to date, with careful attention given to scene/object/geographical diversity, including that of annotators. In some ways, we are surprised that SOTA video segmentation can be done on only ~50,000 videos (and 640k masklet annotations).

    Model-in-the-loop Data Engine for Annotations and Demo-first Development

    Similar to SAM 1, a 3-Phase Data Engine helped greatly in bootstrapping this dataset. As Nikhila says in the episode, the demo you see wasn't just for show: they actually used this same tool to do annotations for the model that is now demoed in the tool. "With the original SAM, we put a lot of effort in building a high-quality demo. And the other piece here is that the demo is actually the annotation tool. So we actually use the demo as a way to improve our annotation tool. And so then it becomes very natural to invest in building a good demo because it speeds up your annotation and improves the data quality, and that will improve the model quality. With this approach, we found it to be really successful."

    A 90% speedup in annotation came from this virtuous cycle, which helped SA-V reach its incredible scale. Building the demo also helped the team live the context that their own downstream users, like Roboflow, would experience, and forced them to make choices accordingly. As Nikhila says: "It's a really encouraging trend for not thinking about only the new model capability, but what sort of applications folks want to build with models as a result of that downstream. I think it also really forces you to think about many things that you might postpone. For example, efficiency. For a good demo experience, making it real time is super important. No one wants to wait. And so it really forces you to think about these things much sooner and actually makes us think about what kind of image encoder we want to use or other things, [like] hardware efficiency improvements. So those kind of things, I think, become a first-class citizen when you put the demo first."

    Indeed, the team swapped out standard ViT-H Vision Transformers for Hiera (Hierarchical) Vision Transformers as a result of efficiency considerations.
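    The back-of-envelope math behind that training-cost ballpark is easy to check; the only assumption is the $2/GPU-hour market rate quoted above from gpulist.ai.

```python
# Back-of-envelope SAM 2 training cost, using the figures quoted above.
gpus = 256            # A100s
hours = 108           # wall-clock training time
usd_per_gpu_hour = 2  # upper-end A100 market rate (assumption, from gpulist.ai)

total_cost = gpus * hours * usd_per_gpu_hour
print(f"~${total_cost:,}")  # ~$55,296, i.e. on the order of $50k
```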
    Memory Attention

    Speaking of architecture, the model design is probably the sleeper hit of a project filled with hits. The team adapted SAM 1 to video by adding streaming memory for real-time video processing: specifically adding memory attention, a memory encoder, and a memory bank, which surprisingly ablated better than more intuitive but complex architectures like Gated Recurrent Units. One has to wonder if streaming memory can be added to pure lan
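    To make the streaming-memory idea concrete, here is a heavily simplified, hypothetical sketch of the per-frame loop; it is not the actual SAM 2 code, and the module names and shapes are stand-ins chosen only to show the data flow: the current frame's features are conditioned, via cross-attention, on a bank of memories encoded from previous frames and their predicted masks.

```python
# Illustrative sketch of SAM 2-style streaming memory (NOT the real implementation;
# the encoders and decoder here are stand-ins so the data flow is clear).
from collections import deque
import torch
import torch.nn as nn

class StreamingMemorySegmenter(nn.Module):
    def __init__(self, dim=256, max_memories=8):
        super().__init__()
        self.image_encoder = nn.Linear(dim, dim)        # stand-in for Hiera features
        self.memory_encoder = nn.Linear(2 * dim, dim)   # fuses frame features + mask
        self.memory_attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mask_decoder = nn.Linear(dim, 1)           # stand-in for the SAM-style decoder
        self.memory_bank = deque(maxlen=max_memories)   # FIFO bank of past memories

    def forward(self, frame_feats):  # frame_feats: (batch, tokens, dim)
        x = self.image_encoder(frame_feats)
        if self.memory_bank:
            mem = torch.cat(list(self.memory_bank), dim=1)             # (batch, mem_tokens, dim)
            x, _ = self.memory_attention(query=x, key=mem, value=mem)  # condition on memory
        mask_logits = self.mask_decoder(x)                             # per-token mask logits
        # Encode this frame plus its prediction into a new memory and push it to the bank
        memory = self.memory_encoder(torch.cat([x, mask_logits.expand_as(x)], dim=-1))
        self.memory_bank.append(memory.detach())
        return mask_logits

# Usage: process a stream of frames one by one, reusing the memory bank across frames.
model = StreamingMemorySegmenter()
for frame_feats in torch.randn(5, 1, 64, 256):  # 5 fake frames of (batch=1, 64 tokens, 256 dim)
    masks = model(frame_feats)
```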

    1 hr 4 min
  3. Aug 2

    The Winds of AI Winter (Q2 Four Wars Recap) + ChatGPT Voice Mode Preview

    Thank you for 1m downloads of the podcast and 2m readers of the Substack! 🎉 This is the audio discussion following The Winds of AI Winter essay that also serves as a recap of Q2 2024 in AI viewed through the lens of our Four Wars framework. Enjoy! Full Video Discussion Full show notes are here.

    Timestamps * [00:00:00] Intro Song by Suno.ai * [00:02:01] Swyx and Alessio in Singapore * [00:05:49] GPU Rich vs Poors: Frontier Labs * [00:06:35] GPU Rich Frontier Models: Claude 3.5 * [00:10:37] GPU Rich helping Poors: Llama 3.1: The Synthetic Data Model * [00:15:41] GPU Rich helping Poors: Frontier Labs Vibe Shift - Phi 3, Gemma 2 * [00:18:26] GPU Rich: Mistral Large * [00:21:56] GPU Rich: Nvidia + FlashAttention 3 * [00:23:45] GPU Rich helping Poors: Noam Shazeer & Character.AI * [00:28:14] GPU Poors: On Device LLMs: Mozilla Llamafile, Chrome (Gemini Nano), Apple Intelligence * [00:35:33] Quality Data Wars: NYT vs The Atlantic lawyer up vs partner up * [00:37:41] Quality Data Wars: Reddit, ScarJo, RIAA vs Udio & Suno * [00:41:03] Quality Data Wars: Synthetic Data, Jagged Intelligence, AlphaProof * [00:45:33] Multimodality War: ChatGPT Voice Mode, OpenAI demo at AIEWF * [00:47:34] Multimodality War: Meta Llama 3 multimodality + Chameleon * [00:50:54] Multimodality War: PaliGemma + CoPaliGemma * [00:52:55] Renaming Rag/Ops War to LLM OS War * [00:55:31] LLM OS War: Ops War: Prompt Management vs Gateway vs Observability * [01:02:57] LLM OS War: BM42 Vector DB Wars, Memory Databases, GraphRAG * [01:06:15] LLM OS War: Agent Tooling * [01:08:26] LLM OS War: Agent Protocols * [01:10:43] Trend: Commoditization of Intelligence * [01:16:45] Trend: Vertical Service as Software, AI Employees, Brightwave, Dropzone * [01:20:44] Trend: Benchmark Frontiers after MMLU * [01:23:31] Crowdstrike will save us from Skynet * [01:24:30] Bonus: ChatGPT Advanced Voice Mode Demo * [01:25:37] Voice Mode: Storytelling * [01:27:55] Voice Mode: Accents * [01:31:48] Voice Mode: Accent Detection * [01:35:00] Voice Mode: Nonverbal Emotions * [01:37:53] Voice Mode: Multiple Voices in One * [01:40:52] Voice Mode: Energy Levels Detection * [01:42:03] Voice Mode: Multilinguality * [01:43:53] Voice Mode: Shepard Tone * [01:46:57] Voice Mode: Generating Tones * [01:49:39] Voice Mode: Interruptions don't work * [01:49:55] Voice Mode: Reverberations * [01:51:37] Voice Mode: Mimicry doesn't work

    Transcript

    Charlie [00:01:08]: Welcome back, listeners. This is your AI co-host, Charlie. It's been a few months since we took a step back from the interview format and talked about the show. We're happy to share that we have crossed one million downloads and two million reads on Substack. Woo-hoo. We are really grateful to those of you who keep tuning in and sharing us with your friends, especially those of you who watch and comment on our new YouTube channel, where we are trying to grow next. For a special millionaire edition, Swyx and Alessio are finally back in person in sunny Singapore to discuss the big vibe shift in the last three months, that we are calling the Winds of AI Winter. We also discuss my nemesis, ChatGPT Advanced Voice Mode, with a special treat for those who stay till the end. Now, more than ever, watch out and take care.

    Alessio [00:02:02]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and today we're in the Singapore studio with Swyx.

    Swyx [00:02:11]: Hey, this is our long-awaited one-on-one episode. I don't know how long ago the previous one was.
Do you remember? Three, four months? Alessio [00:02:20]: Yeah, it's been a while. Swyx [00:02:22]: People really enjoyed it. It's just really, I think our travel schedules have been really difficult to get this stuff together. And then we also had like a decent backlog of guests for a while. I think we've kind of depleted that backlog now and we need to build it up again. But it's been busy and there's been a lot of news. So we actually get to

    1 hr 55 min
  4. Jul 23

    Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

    If you see this in time, join our emergency LLM paper club on the Llama 3 paper! For everyone else, join our special AI in Action club on the Latent Space Discord for a special feature with the Cursor cofounders on Composer, their newest coding agent!

    Today, Meta is officially releasing the largest and most capable open model to date, Llama3-405B, a dense transformer trained on 15T tokens that beats GPT-4 on all major benchmarks. The 8B and 70B models from the April Llama 3 release have also received serious spec bumps, warranting the new label of Llama 3.1. If you are curious about the infra / hardware side, go check out our episode with Soumith Chintala, one of the AI infra leads at Meta. Today we have Thomas Scialom, who led Llama 2 and now Llama 3 post-training, so we spent most of our time on pre-training (synthetic data, data pipelines, scaling laws, etc) and post-training (RLHF vs instruction tuning, evals, tool calling).

    Synthetic data is all you need

    Llama 3 was trained on 15T tokens, 7x more than Llama 2, with 4 times as much code and 30 different languages represented. But as Thomas beautifully put it: "My intuition is that the web is full of s**t in terms of text, and training on those tokens is a waste of compute." "Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2." While it is well speculated that the 8B and 70B were "offline distillations" of the 405B, there are a good deal more synthetic data elements to Llama 3.1 than expected. The paper explicitly calls out:

    * SFT for Code: 3 approaches for synthetic data for the 405B bootstrapping itself with code execution feedback, programming language translation, and docs backtranslation (see the sketch below).
    * SFT for Math: The Llama 3 paper credits the Let's Verify Step By Step authors, who we interviewed at ICLR.
    * SFT for Multilinguality: "To collect higher quality human annotations in non-English languages, we train a multilingual expert by branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingual tokens."
    * SFT for Long Context: "It is largely impractical to get humans to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for long documents, and reasoning over code repositories, and describe them in greater detail below"
    * SFT for Tool Use: trained for Brave Search, Wolfram Alpha, and a Python Interpreter (a special new ipython role) for single, nested, parallel, and multiturn function calling.
    * RLHF: DPO preference data was used extensively on Llama 2 generations. This is something we partially covered in RLHF 201: humans are often better at judging between two options (i.e. which of two poems they prefer) than creating one (writing one from scratch). Similarly, models might not be great at creating text but they can be good at classifying its quality.

    Last but not least, Llama 3.1 received a license update explicitly allowing its use for synthetic data generation. Llama 2 was also used as a classifier for all pre-training data that went into the model: it both labelled data by quality, so that bad tokens were removed, and classified it by type (i.e. science, law, politics) to achieve a balanced data mix.
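    The "code execution feedback" approach in the list above is easy to picture as a filter over model generations: sample candidate solutions, run them against tests, and keep only the ones that pass as SFT data. A hypothetical minimal sketch follows; the `generate_candidates` helper and the task format are made up for illustration and are not from the paper.

```python
# Hypothetical sketch of execution-feedback filtering for code SFT data.
# `generate_candidates` stands in for sampling from the model being bootstrapped.
import subprocess
import sys
import tempfile

def passes_tests(solution: str, tests: str, timeout_s: int = 10) -> bool:
    """Run the candidate solution plus its unit tests in a subprocess."""
    program = solution + "\n\n" + tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_sft_examples(tasks, generate_candidates, n_samples=4):
    """Keep only (prompt, solution) pairs whose solution actually executes and passes."""
    kept = []
    for task in tasks:  # each task: {"prompt": ..., "tests": ...}
        for solution in generate_candidates(task["prompt"], n=n_samples):
            if passes_tests(solution, task["tests"]):
                kept.append({"prompt": task["prompt"], "completion": solution})
                break  # one verified solution per prompt is enough for SFT
    return kept
```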
    Tokenizer size matters

    The token vocabulary of a model is the collection of all tokens that the model can use. Llama 2 had a 32,000-token vocab, GPT-4 has 100,000, and 4o went up to 200,000. Llama 3 went up 4x to 128,000 tokens. You can find the GPT-4 vocab list on Github. This is something that people gloss over, but there are many reasons why a large vocab matters:

    * More tokens allow the model to represent more concepts, and thus be better at understanding nuance.
    * The larger the token
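    As a quick way to see vocab sizes for yourself, the tiktoken library exposes the GPT-4 and GPT-4o encodings directly; a small sketch is below (the Llama numbers above come from the model cards rather than tiktoken).

```python
# Compare tokenizer vocab sizes and how vocab size affects token counts.
# Requires `pip install tiktoken`.
import tiktoken

gpt4 = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4
gpt4o = tiktoken.get_encoding("o200k_base")   # encoding used by GPT-4o

print(gpt4.n_vocab, gpt4o.n_vocab)            # ~100k vs ~200k tokens

text = "Latent Space covers LLMs, CodeGen, Agents and GPU Infra."
# A larger vocab usually means the same text compresses into fewer tokens.
print(len(gpt4.encode(text)), len(gpt4o.encode(text)))
```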

    1 hr 5 min
  5. Jul 12

    Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

    The first AI Engineer World's Fair talks from OpenAI and Cognition are up! In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones. Fast forward 1.5 years: the pace of model development has far exceeded the pace at which benchmarks are updated. Frontier labs are still using MMLU and HumanEval for model marketing, even though most models are reaching their natural plateau at a ~90% success rate (any higher and they're probably just memorizing/overfitting).

    From Benchmarks to Leaderboards

    Outside of being stale, lab-reported benchmarks also suffer from non-reproducibility. The models served through the API also change over time, so the same benchmark might return different scores at different points in time. Today's guest, Clémentine Fourrier, is the lead maintainer of HuggingFace's OpenLLM Leaderboard. Their goal is to standardize how models are evaluated by curating a set of high quality benchmarks, and then publishing the results in a reproducible way with tools like EleutherAI's Harness. The leaderboard was first launched in summer 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense of the scale:

    * Over 2 million unique visitors
    * 300,000 active community members
    * Over 7,500 models evaluated

    Last week they announced the second version of the leaderboard. Why? Because models were getting too good! The new version of the leaderboard is based on 6 benchmarks:

    * 📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)
    * 📚 GPQA (Google-Proof Q&A Benchmark, paper)
    * 💭 MuSR (Multistep Soft Reasoning, paper)
    * 🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)
    * 🤝 IFEval (Instruction Following Evaluation, paper)
    * 🧮 🤝 BBH (Big Bench Hard, paper)

    You can read the reasoning behind each of them on their announcement blog post. These updates had some clear winners and losers, with models jumping up or down by as much as 50 spots at once; the most likely reason for this is that the models were overfit to the benchmarks, or had some contamination in their training dataset. But the most important change is in the absolute scores. All models score much lower on v2 than they do on v1, which now creates a lot more room for models to show improved performance.

    On Arenas

    Another high-signal platform for AI Engineers is the LMSys Arena, which asks users to rank the output of two different models on the same prompt, and then gives each model an Elo score based on the outcomes (see the sketch below). Clémentine called arenas "sociological experiments": they tell you a lot about users' preferences, but not always much about model capabilities. She pointed to Anthropic's sycophancy paper as early research in this space: "We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time." The other issue is that Arena rankings aren't reproducible, as you don't know who ranked what and what exactly the outcome was at the time of ranking. They are still quite helpful as tools, but they aren't a rigorous way to rank model capabilities. Her advice for both arenas and leaderboards is to use these tools as ranges: find 3-4 models that fit your needs (speed, cost, capabilities, etc) and then do vibe checks to figure out which one is best for your specific task.
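    For readers unfamiliar with Elo-style ratings, here is a minimal sketch of the pairwise update that the arena idea rests on; it is a generic Elo update, not LMSys's exact methodology.

```python
# Generic Elo update for pairwise model comparisons (illustrative, not LMSys's exact method).
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one A-vs-B battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Example: two models start at 1000; the first wins a head-to-head vote.
r_model_a, r_model_b = elo_update(1000, 1000, a_won=True)
print(round(r_model_a), round(r_model_b))  # 1016 984
```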
    LLMs aren't good judges

    In the last ~6 months, there has been increased interest in using LLMs as Judges: rather than asking a person to evaluate the outcome of a model, you can ask a more powerful LLM to score it. We covered this a bit in our Brightwave episode last month as well. HuggingFace also has a cookbook on it, but Clémentine was actually not a fan of this approach:

    * Mode collapse: if you are asking a model to choose which ou
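    To ground what "LLM-as-Judge" looks like in practice, here is a hypothetical pairwise-judging sketch using the OpenAI client; the model name and prompt are illustrative assumptions, and the order swap shown here is a common mitigation for position bias rather than anything prescribed in the episode.

```python
# Hypothetical LLM-as-judge sketch: ask a stronger model which of two answers is better,
# judging both orderings to reduce position bias. Model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_once(question: str, answer_1: str, answer_2: str, model: str = "gpt-4o") -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A:\n{answer_1}\n\nAnswer B:\n{answer_2}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def pairwise_judge(question: str, ans_x: str, ans_y: str) -> str:
    first = judge_once(question, ans_x, ans_y)   # X shown as A
    second = judge_once(question, ans_y, ans_x)  # order swapped: Y shown as A
    if first == "A" and second == "B":
        return "X"
    if first == "B" and second == "A":
        return "Y"
    return "tie"  # disagreement across orderings -> treat as inconclusive
```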

    58 min
  6. Jul 5

    The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka

    Livestreams for the AI Engineer World's Fair (Multimodality ft. the new GPT-4o demo, GPUs and Inference (ft. Cognition/Devin), CodeGen, Open Models tracks) are now live! Subscribe to @aidotEngineer to get notifications of the other workshops and tracks!

    It's easy to get desensitized to new models topping leaderboards every other week — however, the top of the LMsys leaderboard has typically been the exclusive domain of very large, very well funded model labs like OpenAI, Anthropic, Google, and Meta. OpenAI had about 600 people at the time of GPT-4, and Google Gemini had 950 co-authors. This is why Reka Core made waves in May - not only debuting at #7 on the leaderboard, but doing so with all-new GPU infrastructure, 20 employees, and a relatively puny $60m in funding.

    Shortly after the release of GPT-3, Sam Altman speculated on the qualities of "10,000x researchers":

    * "They spend a lot of time reflecting on some version of the Hamming question—"what are the most important problems in your field, and why aren't you working on them?" In general, no one reflects on this question enough, but the best people do it the most, and have the best 'problem taste', which is some combination of learning to think independently, reason about the future, and identify attack vectors." — sama
    * Taste is something both John Schulman and Yi Tay emphasize greatly
    * "They have a laser focus on the next step in front of them combined with long-term vision." — sama
    * "They are extremely persistent and willing to work hard… They have a bias towards action and trying things, and they're clear-eyed and honest about what is working and what isn't" — sama

    "There's a certain level of sacrifice to be an AI researcher, especially if you're training LLMs, because you cannot really be detached… your jobs could die on a Saturday at 4am, and there are people who will just leave it dead until Monday morning, or there will be people who will crawl out of bed at 4am to restart the job, or check the TensorBoard" — Yi Tay (at 28 mins)

    "I think the productivity hack that I have is, I didn't have a boundary between my life and my work for a long time. So I think I just cared a lot about working most of the time. Actually, during my PhD, Google and everything [else], I'll be just working all the time. It's not like the most healthy thing, like ever, but I think that that was actually like one of the biggest, like, productivity, like and I spent, like, I like to spend a lot of time, like, writing code and I just enjoy running experiments, writing code" — Yi Tay (at 90 mins)

    * See @YiTayML for honest alpha on what is/is not working, and so on.

    More recently, Yi's frequent co-author, Jason Wei, wrote about the existence of Yolo researchers he witnessed at OpenAI. Given the very aggressive timeline — Yi left Google in April 2023, was GPU constrained until December 2023, and then Reka Flash (21B) was released in Feb 2024, and Reka Core (??B) was released in April 2024 — Reka's 3-5 person pretraining team had no other choice but to do Yolo runs. Per Yi: "Scaling models systematically generally requires one to go from small to large in a principled way, i.e., run experiments in multiple phases (1B->8B->64B->300B etc) and pick the winners and continuously scale them up. In a startup, we had way less compute to perform these massive sweeps to check hparams. In the end, we had to work with many Yolo runs (that fortunately turned out well).
In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut
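    For intuition, the "principled" staged-sweep approach Yi contrasts with Yolo runs looks roughly like the toy sketch below: sweep hyperparameters at a small scale, keep the winners, and re-sweep at the next scale. Everything here (the `train_and_eval` stand-in, the scales, the hyperparameters) is hypothetical and only meant to illustrate the shape of the process.

```python
# Toy sketch of staged hyperparameter sweeps: small-scale winners graduate to larger scales.
# `train_and_eval` is a hypothetical stand-in for a real training run returning a val loss.
import itertools
import random

def train_and_eval(params_b: float, lr: float, batch_size: int) -> float:
    """Placeholder: pretend to train a params_b-billion-param model and return a val loss."""
    return random.random() + abs(lr - 3e-4) * 1e2 + 0.1 / batch_size

def staged_sweep(scales_b=(1, 8, 64), lrs=(1e-4, 3e-4, 1e-3), batch_sizes=(256, 512), top_k=2):
    candidates = list(itertools.product(lrs, batch_sizes))
    for scale in scales_b:
        scored = sorted(candidates, key=lambda c: train_and_eval(scale, *c))
        candidates = scored[:top_k]  # only the winners graduate to the next, larger scale
        print(f"{scale}B winners: {candidates}")
    return candidates[0]

best_lr, best_bs = staged_sweep()
```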

    1 hr 45 min
  7. Jun 25

    State of the Art: Training >70B LLMs on 10,000 H100 clusters

    It's return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December, post Databricks acquisition). Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that "outperforms GPT-4o" zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.

    While Imbue, being an agents company rather than a model provider, are not releasing their models today, they are releasing almost everything else:

    * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks
    * An entirely new code-focused reasoning benchmark
    * A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity
    * A new dataset of 450,000 human judgments about ambiguity
    * Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training
    * Their cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any size (rough sketch of the cost-aware idea below)

    They are also publishing EXTREMELY detailed posts on the infrastructure needs, the hyperparameter search, and the sorry state of industry-standard benchmarks (plus cleaned versions of them). This means that for the FIRST TIME (perhaps since Meta's OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.

    We are busy running the sold-out AI Engineer World's Fair today, and so are unable to do our usual quality writeup; however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes. Video pod

    Timestamps * [00:00:00] Introduction and catch up with guests * [00:01:55] Databricks' text to image model release * [00:03:46] Details about the DBRX model * [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases * [00:09:18] Challenges of training foundation models and getting infrastructure to work * [00:12:03] Details of Imbue's cluster setup * [00:18:53] Process of bringing machines online and common failures * [00:22:52] Health checks and monitoring for the cluster * [00:25:06] Typical timelines and team composition for setting up a cluster * [00:27:24] Monitoring GPU utilization and performance * [00:29:39] Open source tools and libraries used * [00:32:33] Reproducibility and portability of cluster setup * [00:35:57] Infrastructure changes needed for different model architectures * [00:40:49] Imbue's focus on text-only models for coding and reasoning * [00:42:26] CARBS hyperparameter tuner and cost-aware optimization * [00:51:01] Emergence and CARBS * [00:53:18] Evaluation datasets and reproducing them with high quality * [00:58:40] Challenges of evaluating on more realistic tasks * [01:06:01] Abstract reasoning benchmarks like ARC * [01:10:13] Long context evaluation and needle-in-a-haystack tasks * [01:13:50] Function calling and tool use evaluation * [01:19:19] Imbue's future plans for coding and reasoning applications * [01:20:14] Databricks' future plans for useful applications and upcoming blog posts

    Transcript

    SWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition.
    Today, we have sort of like a two-header. Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome.

    JOSH [00:00:12]: Hey, glad to be here.

    SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT7B. Remember the days when we trained large models and there was 7B?

    JONATHAN [00:00:30]: Yeah, back
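    CARBS itself is open source and worth reading directly; as rough intuition for what "cost-aware" hyperparameter optimization means (this is a generic Pareto-style sketch, not CARBS' actual algorithm or API), the idea is to record both the training cost and the resulting loss of each configuration and keep only configurations that no other configuration beats on both axes.

```python
# Generic cost-aware hyperparameter search sketch (illustrative; NOT the CARBS algorithm or API).
# Each trial records (cost, loss); we keep the Pareto frontier: configs where no other config
# is both cheaper and better.
import random

def run_trial(config):
    """Hypothetical stand-in for a real training run; returns (compute_cost, val_loss)."""
    cost = config["params_b"] * config["tokens_b"]  # toy cost model
    loss = 2.0 / (config["params_b"] ** 0.1 * config["tokens_b"] ** 0.1) + random.gauss(0, 0.01)
    return cost, loss

def pareto_frontier(trials):
    frontier = []
    for cfg, (cost, loss) in trials:
        dominated = any(c2 <= cost and l2 <= loss and (c2, l2) != (cost, loss)
                        for _, (c2, l2) in trials)
        if not dominated:
            frontier.append((cfg, cost, loss))
    return sorted(frontier, key=lambda t: t[1])

trials = []
for _ in range(20):
    cfg = {"params_b": random.choice([1, 3, 7, 13]), "tokens_b": random.choice([50, 100, 300])}
    trials.append((cfg, run_trial(cfg)))

for cfg, cost, loss in pareto_frontier(trials):
    print(cfg, f"cost={cost:.0f}", f"loss={loss:.3f}")
```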

    1 hr 22 min
