The Nonlinear Library: Alignment Forum

The Nonlinear Fund

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org


AF - Comparing Alignment to other AGI interventions: Extensions and analysis by Martín Soto

Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comparing Alignment to other AGI interventions: Extensions and analysis, published by Martín Soto on March 21, 2024 on The AI Alignment Forum. In the last post I presented the basic, bare-bones model, used to assess the Expected Value of different interventions, and especially those related to Cooperative AI (as distinct from value Alignment). Here I briefly discuss important enhancements, and our strategy with regards to all-things-considered estimates. I describe first an easy but meaningful addition to the details of our model (which you can also toy with in Guesstimate). Adding Evidential Cooperation in Large worlds: Due to evidential considerations, our decision to forward this or that action might provide evidence about what other civilizations (or sub-groups inside a civilization similar to us) have done. So for example us forwarding a higher $a_{C|V}$ should give us evidence about other civilizations doing the same, and this should alter the AGI landscape. But there's a problem: we have only modelled singletons themselves (AGIs), not their predecessors (civilizations). We have, for example, the fraction $F_V$ of AGIs with our values. But what is the fraction $c_V$ of civilizations with our values? Should it be higher (due to our values being more easily evolved than trained), or lower (due to our values being an attractor in mind-space)? While a more complicated model could deal directly with these issues by explicitly modelling civilizations (and indeed this is explored in later extensions), for now we can pull a neat trick that gets us most of what we want without enlarging the ontology of the model further, nor the amount of input estimates.
Assume for simplicity alignment is approximately as hard for all civilizations (both in $c_V$ and $c_{\neg V} = 1 - c_V$), so that they each have $p_V$ of aligning their AGI (just like we do). Then, $p_V$ of the civilizations in $c_V$ will increase $F_V$, by creating an AGI with our values. And the rest $1 - p_V$ will increase $F_{\neg V}$. What about $c_{\neg V}$? $p_V$ of them will increase $F_{\neg V}$. But the misalignment case is trickier, because it might be a few of their misaligned AGIs randomly have our values. Let's assume for simplicity (since $F_V$ and $c_V$ are usually small enough) that the probability with which a random misaligned (to its creators) AGI has our values is the same fraction that our values have in the universe, after all AGIs have been created: $F_V$. Then, $c_{\neg V}(1 - p_V)F_V$ goes to increase $F_V$, and $c_{\neg V}(1 - p_V)(1 - F_V)$ goes to increase $(1 - F_V)$. This all defines a system of equations in which the only unknown is $c_V$, so we can deduce its value! With this estimate, and with some guesses $\alpha_V$ and $\alpha_{\neg V}$ for how correlated we are with civilizations with and without our values, and again simplistically assuming that the tractabilities of the different interventions are approximately the same for all civilizations, we can compute a good proxy for evidential effects. As an example, to our previous expression for $\frac{dF_{C|\neg V}}{da_{C|\neg V}}$ we will add $c_V \alpha_V \frac{dp_{C|\neg V}}{da_{C|\neg V}} (1 - p_V) + c_{\neg V} \alpha_{\neg V} \frac{dp_{C|\neg V}}{da_{C|\neg V}} (1 - p_V)(1 - F_V)$. This is because our working on cooperativeness for misalignment provides evidence $c_V$ also do (having an effect if their AI is indeed misaligned), but it also provides evidence for $c_{\neg V}$ doing so, which only affects the fraction of cooperative misaligned AIs if their AI is indeed misaligned (to their creators), and additionally it doesn't randomly land on our values. We similarly derive the expressions for all other corrections. Negative evidence: In fact, there's a further complication: our taking a marginal action not only gives us evidence for other civilizations taking that action, but also for them not taking the other available actions.
To see why this should be the case in our setting, notice the following. If our estimates of the intermediate variables like $F_V$ had been "against the baseline of our correlated agents not taking...
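The system of equations mentioned in this summary can be made concrete. The following is a minimal sketch under one reading of the accounting described above: civilizations with our values (fraction c_V) contribute to F_V only when they align their AGI (probability p_V), and civilizations without our values (fraction 1 - c_V) contribute to F_V only when they fail to align and the result lands on our values by chance (probability F_V). That gives F_V = c_V p_V + (1 - c_V)(1 - p_V) F_V. The function names and the sample numbers are illustrative assumptions, not taken from the post or its Guesstimate model.

```python
def civilization_fraction(f_v: float, p_v: float) -> float:
    """Solve F_V = c_V*p_V + (1 - c_V)*(1 - p_V)*F_V for c_V.

    Rearranging gives c_V = F_V*p_V / (p_V - (1 - p_V)*F_V).
    This balance equation is an assumed reading of the summary above.
    """
    denominator = p_v - (1.0 - p_v) * f_v
    if denominator <= 0.0:
        raise ValueError("no valid c_V: need p_V > (1 - p_V) * F_V")
    return f_v * p_v / denominator


def agi_fraction(c_v: float, p_v: float, f_v: float) -> float:
    """Right-hand side of the balance equation, used as a consistency check."""
    return c_v * p_v + (1.0 - c_v) * (1.0 - p_v) * f_v


# Illustrative inputs only, not estimates from the post.
F_V, P_V = 0.1, 0.5
C_V = civilization_fraction(F_V, P_V)
print(f"c_V = {C_V:.4f}")
print(f"reconstructed F_V = {agi_fraction(C_V, P_V, F_V):.4f}")
```

Plugging the solved c_V back into the right-hand side should return the F_V that was fed in, which is a quick way to check the algebra for any other inputs.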

Mar 21, 2024

7 min

AF - Stagewise Development in Neural Networks by Jesse Hoogland

Stagewise Development in Neural Networks, published by Jesse Hoogland on March 20, 2024 on The AI Alignment Forum. TLDR: This post accompanies The Developmental Landscape of In-Context Learning by Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei and Daniel Murfet (2024), which shows that in-context learning emerges in discrete, interpretable developmental stages, and that these stages can be discovered in a model- and data-agnostic way by probing the local geometry of the loss landscape. Four months ago, we shared a discussion here of a paper which studied stagewise development in the toy model of superposition of Elhage et al. using ideas from Singular Learning Theory (SLT). The purpose of this document is to accompany a follow-up paper by Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei and Daniel Murfet, which has taken a closer look at stagewise development in transformers at significantly larger scale, including language models, using an evolved version of these techniques. How does in-context learning emerge? In this paper, we looked at two different settings where in-context learning is known to emerge: Small attention-only language transformers, modeled after Olsson et al. (3m parameters). Transformers trained to perform linear regression in context, modeled after Raventos et al. (50k parameters). Changing geometry reveals a hidden stagewise development. We use two different geometric probes to automatically discover different developmental stages: The local learning coefficient (LLC) of SLT, which measures the "basin broadness" (volume scaling ratio) of the loss landscape across the training trajectory. 
Essential dynamics (ED), which consists of applying principal component analysis to (a discrete proxy of) the model's functional output across the training trajectory and analyzing the geometry of the resulting low-dimensional trajectory. In both settings, these probes reveal that training is separated into distinct developmental stages, many of which are "hidden" from the loss (Figures 1 & 2). Developmental stages are interpretable. Through a variety of hand-crafted behavioral and structural metrics, we find that these developmental stages can be interpreted. The progression of the language model is characterized by the following sequence of stages: (LM1) Learning bigrams, (LM2) Learning various n-grams and incorporating positional information, (LM3) Beginning to form the first part of the induction circuit, (LM4) Finishing the formation of the induction circuit, (LM5) Final convergence. The evolution of the linear regression model unfolds in a similar manner: (LR1) Learns to use the task prior (equivalent to learning bigrams), (LR2) Develops the ability to do in-context linear regression, (LR3-4) Two significant structural developments in the embedding and layer norms, (LR5) Final convergence. Developmental interpretability is viable. The existence and interpretability of developmental stages in larger, more realistic transformers makes us substantially more confident in developmental interpretability as a viable research agenda. We expect that future generations of these techniques will go beyond detecting when circuits start/stop forming to detecting where they form, how they connect, and what they implement. On Stagewise Development Complex structures can arise from simple algorithms. When iterated across space and time, simple algorithms can produce structures of great complexity. One example is evolution by natural selection. Another is optimization of artificial neural networks by gradient descent. 
In both cases, the underlying logic - that simple algorithms operating at scale can produce highly complex structures - is so counterintuitive that it often elicits disbelief. A second counterintui...
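The essential dynamics probe described in this summary (principal component analysis applied to the model's outputs across training checkpoints) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the paper's code: the matrix layout (one row per checkpoint, one column per flattened output value on a fixed batch) and the function name are assumptions made for the example.

```python
import numpy as np


def essential_dynamics(outputs: np.ndarray, n_components: int = 3):
    """PCA over a (n_checkpoints, n_outputs) matrix of model outputs.

    Returns the low-dimensional trajectory (one point per checkpoint)
    and the fraction of variance explained by each kept component.
    """
    # Center each output dimension across checkpoints.
    centered = outputs - outputs.mean(axis=0, keepdims=True)
    # SVD of the centered matrix gives the principal components.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    trajectory = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return trajectory, explained[:n_components]


# Synthetic stand-in for real checkpoints: a random walk in output space.
rng = np.random.default_rng(0)
outputs = np.cumsum(rng.normal(size=(50, 200)), axis=0)
trajectory, explained = essential_dynamics(outputs)
print(trajectory.shape, explained)
```

With real checkpoints, each row would hold the model's outputs on the same fixed set of inputs, so that movement along the trajectory reflects changes in the function the model computes rather than changes in the data.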

Mar 20, 2024

17 min

AF - Comparing Alignment to other AGI interventions: Basic model by Martín Soto

Comparing Alignment to other AGI interventions: Basic model, published by Martín Soto on March 20, 2024 on The AI Alignment Forum. Interventions that increase the probability of Aligned AGI aren't the only kind of AGI-related work that could importantly increase the Expected Value of the future. Here I present a very basic quantitative model (which you can run yourself here) to start thinking about these issues. In a follow-up post I give a brief overview of extensions and analysis. A main motivation of this enterprise is to assess whether interventions in the realm of Cooperative AI, that increase collaboration or reduce costly conflict, can seem like an optimal marginal allocation of resources. More concretely, in a utility framework, we compare: Alignment interventions ($a_V$): increasing the probability that one or more agents have our values. Cooperation interventions given alignment ($a_{C|V}$): increasing the gains from trade and reducing the cost from conflict for agents with our values. Cooperation interventions given misalignment ($a_{C|\neg V}$): increasing the gains from trade and reducing the cost from conflict for agents without our values. We used a model-based approach (see here for a discussion of its benefits) paired with qualitative analysis. While these two posts don't constitute an exhaustive analysis (more exhaustive versions are less polished), feel free to reach out if you're interested in this question and want to hear more about this work. Most of this post is a replication of previous work by Hjalmar Wijk and Tristan Cook (unpublished). The basic modelling idea we're building upon is to define how different variables affect our utility, and then incrementally compute or estimate partial derivatives to assess the value of marginal work on this or that kind of intervention.
Setup: We model a multi-agentic situation. We classify each agent as either having (approximately) our values ($V$) or any other values ($\neg V$). We also classify them as either cooperative ($C$) or non-cooperative ($\neg C$). These classifications are binary. We are also (for now) agnostic about what these agents represent. Indeed, this basic multi-agentic model will be applicable (with differently informed estimates) to any scenario with multiple singletons, including the following: different AGIs (or other kinds of singletons, like AI-augmented nation-states) interacting causally on Earth; singletons arising from different planets interacting causally in the lightcone; singletons from across the multi-verse interacting acausally. The variable we care about is total utility ($U$). As a simplifying assumption, our way to compute it will be as a weighted interpolation of two binary extremes: one in which bargaining goes (for agents with our values) as well as possible ($B$), and another one in which it goes as badly as possible ($\neg B$). The interpolation coefficient ($b$) could be interpreted as "percentage of interactions that result in minimally cooperative bargaining settlements". We also consider all our interventions are on only a single one of the agents (which controls a fraction $F_I$ of total resources), which usually represents our AGI or our civilization. And these interventions are coarsely grouped into alignment work ($a_V$), cooperation work targeted at worlds with high alignment power ($a_{C|V}$), and cooperation work targeted at worlds with low alignment power ($a_{C|\neg V}$). The overall structure is shown in a diagram in the original post. Full list of variables: This section is safely skippable. The first 4 variables model expected outcomes: $U_B \in \mathbb{R}$: Utility attained in the possible world where our bargaining goes as well as possible. $U_{\neg B} \in \mathbb{R}$: Utility attained in the possible world where our bargaining goes as badly as possible.
$b \in [0,1]$: Baseline (expected) success of bargaining (for agents with our values), used to interpolate between $U_B$ and $U_{\neg B}$. Can be i...
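The interpolation described in this summary, together with the idea of valuing marginal work through partial derivatives, can be illustrated with a toy calculation. This is a minimal sketch assuming the simplest linear form, U = b * U_best + (1 - b) * U_worst; the function names and the numbers are hypothetical and are not taken from the post's model.

```python
def total_utility(b: float, u_best: float, u_worst: float) -> float:
    """Interpolate between the best-case and worst-case bargaining worlds."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    return b * u_best + (1.0 - b) * u_worst


def marginal_value(func, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d func / d x at x."""
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)


# Illustrative numbers only.
U_BEST, U_WORST, B = 100.0, 20.0, 0.25
utility = total_utility(B, U_BEST, U_WORST)
# Value of a marginal increase in bargaining success b.
dU_db = marginal_value(lambda b: total_utility(b, U_BEST, U_WORST), B)
print(utility, dU_db)
```

In this linear form the derivative with respect to b is just the gap between the two extremes, so work that raises b matters more the further apart the best and worst bargaining outcomes are.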

Mar 20, 2024

13 min

AF - New report: Safety Cases for AI by Josh Clymer

New report: Safety Cases for AI, published by Josh Clymer on March 20, 2024 on The AI Alignment Forum. ArXiv paper: https://arxiv.org/abs/2403.10462 The idea for this paper occurred to me when I saw Buck Shlegeris' MATS stream on "Safety Cases for AI." How would one justify the safety of advanced AI systems? This question is fundamental. It informs how RSPs should be designed and what technical research is useful to pursue. For a long time, researchers have (implicitly or explicitly) discussed ways to justify that AI systems are safe, but much of this content is scattered across different posts and papers, is not as concrete as I'd like, or does not clearly state its assumptions. I hope this report provides a helpful birds-eye view of safety arguments and moves the AI safety conversation forward by helping to identify assumptions they rest on (though there's much more work to do to clarify these arguments). Thanks to my coauthors: Nick Gabrieli, David Krueger, and Thomas Larsen -- and to everyone who gave feedback: Henry Sleight, Ashwin Acharya, Ryan Greenblatt, Stephen Casper, David Duvenaud, Rudolf Laine, Roger Grosse, Hjalmar Wijk, Eli Lifland, Oliver Habryka, Siméon Campos, Aaron Scher, Lukas Berglund, and Nate Thomas. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Mar 20, 2024

1 min

AF - AtP*: An efficient and scalable method for localizing LLM behaviour to components by Neel Nanda

AtP*: An efficient and scalable method for localizing LLM behaviour to components, published by Neel Nanda on March 18, 2024 on The AI Alignment Forum. Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda. A new paper from the Google DeepMind mechanistic interpretability team, from core contributors János Kramár and Tom Lieberum. Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
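To make the contrast in the abstract concrete, here is a toy comparison of exact activation patching with a first-order, gradient-based estimate. It is a minimal sketch, not the paper's implementation: the three-component "model", its weights, and the convention of patching noise activations into the clean run are assumptions chosen for illustration, and the AtP* corrections are not shown.

```python
import numpy as np

# Hypothetical toy "downstream" weights mapping component activations to a metric.
WEIGHTS = np.array([0.5, -1.0, 2.0])


def metric(acts: np.ndarray) -> float:
    """Toy behavioural metric as a nonlinear function of component activations."""
    return float(np.tanh(acts @ WEIGHTS))


def metric_grad(acts: np.ndarray) -> np.ndarray:
    """Analytic gradient of the toy metric with respect to each activation."""
    return (1.0 - np.tanh(acts @ WEIGHTS) ** 2) * WEIGHTS


def activation_patching(clean: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Exact effect of patching each component in turn: one evaluation per component."""
    base = metric(clean)
    effects = np.empty_like(clean)
    for i in range(len(clean)):
        patched = clean.copy()
        patched[i] = noise[i]
        effects[i] = metric(patched) - base
    return effects


def attribution_patching(clean: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """First-order estimate of all effects from a single gradient evaluation."""
    return (noise - clean) * metric_grad(clean)


clean_acts = np.array([0.2, 0.1, 0.05])
noise_acts = clean_acts + np.array([0.01, -0.02, 0.01])
exact = activation_patching(clean_acts, noise_acts)
approx = attribution_patching(clean_acts, noise_acts)
print(exact, approx)
```

The exact sweep evaluates the metric once per component, while the estimate reuses one gradient for all of them. The two agree closely here because the activation differences are small; the linear approximation can degrade when the metric is strongly nonlinear between the clean and patched activations.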

Mar 18, 2024

1 min

AF - Improving SAE's by Sqrt()-ing L1 and Removing Lowest Activating Features by Logan Riggs Smith

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features, published by Logan Riggs Smith on March 15, 2024 on The AI Alignment Forum. TL;DR: We achieve better SAE performance by (1) removing the lowest activating features and (2) replacing the L1(feature_activations) penalty function with L1(sqrt(feature_activations)), with 'better' meaning: we can reconstruct the original LLM activations w/ lower MSE & with fewer features/datapoint. As a sneak peek, the original post shows a graph here (it should make more sense as we build up to it, don't worry!). Now in more detail: Sparse Autoencoders (SAEs) reconstruct each datapoint in [layer 3's residual stream activations of Pythia-70M-deduped] using a certain number of features (this is the L0-norm of the hidden activation in the SAE). Typically the higher activations are interpretable & the lowest activations non-interpretable. Here is a feature that activates mostly on apostrophe (removing it also makes it worse at predicting "s"). The lower activations are conceptually similar, but then we have a huge number of tokens that are something else. From a datapoint viewpoint, there's a similar story: given a specific datapoint, the top activation features make a lot of sense, but the lowest ones don't (ie if 20 features activate that reconstruct a specific datapoint, the top ~5 features make a decent amount of sense & the lower 15 make less and less sense). Are these low-activating features actually important for downstream performance (eg CE)? Or are they modeling noise in the underlying LLM (which is why we see conceptually similar datapoints in lower activation points)? Ablating Lowest Features: There are a few different ways to remove the "lowest" feature activations. Dataset View: Lowest k-features per datapoint. Feature View: Features have different activation values.
Some are an OOM larger than others on average, so we can set feature specific thresholds. Percentage of max activation - remove all feature activations that are < [10%] of max activation for that feature. Quantile - Remove all features in the [10th] percentile activations for each feature. Global Threshold - Let's treat all features the same. Set all feature activations less than [0.1] to 0. It turns out that the simple global threshold performs the best: [Note: "CE" refers to the CE when you replace [layer 3 residual stream]'s activations with the reconstruction from the SAE. Ultimately we want the original model's CE with the smallest number of features per datapoint (L0 norm).] You can halve the L0 w/ a small (~0.08) increase in CE. Sadly, there is an increase in both MSE & CE. If MSE was higher & CE stayed the same, then that supports the hypothesis that the SAE is modeling noise at lower activations (ie noise that's important for MSE/reconstruction but not for CE/downstream performance). But these lower activations are important for both MSE & CE similarly. For completeness' sake, here's a messy graph w/ all 4 methods: [Note: this was run on a different SAE than the other images] There may be more sophisticated methods that take into account feature-information (such as whether it's an outlier feature or feature frequency), but we'll be sticking w/ the global threshold for the rest of the post. Sweeping Across SAE's with Different L0's: You can get wildly different L0's by just sweeping the weight on the L1 penalty term where increasing the L0 increases reconstruction but at the cost of more, potentially polysemantic, features per datapoint. Does the above phenomenon extend to SAE's w/ different L0's? Looks like it does & the models seem to follow a Pareto frontier.
Using L1(sqrt(feature_activation)): @Lucia Quirke trained SAE's with L1(sqrt(feature_activations)) (this punishes smaller activations more & larger activations less) and anecdotally noticed fewer of these smaller, unintepreta...
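The two ideas in this summary, a global activation threshold and an L1 penalty applied to square-rooted activations, are easy to state in code. This is a minimal NumPy sketch on a made-up activation matrix (rows are datapoints, columns are SAE features); the helper names are hypothetical and the post's actual training code may differ.

```python
import numpy as np


def global_threshold(acts: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Set every feature activation below the global threshold to zero."""
    return np.where(acts < threshold, 0.0, acts)


def mean_l0(acts: np.ndarray) -> float:
    """Average number of active (non-zero) features per datapoint."""
    return float(np.count_nonzero(acts, axis=1).mean())


def l1_penalty(acts: np.ndarray) -> float:
    """Standard sparsity penalty: mean over datapoints of summed activations."""
    return float(np.abs(acts).sum(axis=1).mean())


def l1_sqrt_penalty(acts: np.ndarray) -> float:
    """Penalty on square-rooted activations, assuming non-negative inputs."""
    return float(np.sqrt(np.abs(acts)).sum(axis=1).mean())


# Made-up activations for two datapoints and four features.
acts = np.array([
    [0.05, 0.00, 2.00, 0.30],
    [0.08, 1.50, 0.00, 0.02],
])
pruned = global_threshold(acts, threshold=0.1)
print(mean_l0(acts), mean_l0(pruned))

# A small activation costs more under sqrt, a large one costs less.
small, large = np.array([[0.04]]), np.array([[4.0]])
print(l1_penalty(small), l1_sqrt_penalty(small))
print(l1_penalty(large), l1_sqrt_penalty(large))
```

The last two lines show why the square-root version pushes harder against small activations: below 1 the square root is larger than the activation itself, and above 1 it is smaller.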

Mar 15, 2024

7 min

AF - More people getting into AI safety should do a PhD by AdamGleave

More people getting into AI safety should do a PhD, published by AdamGleave on March 14, 2024 on The AI Alignment Forum. Doing a PhD is a strong option to get great at developing and evaluating research ideas. These skills are necessary to become an AI safety research lead, one of the key talent bottlenecks in AI safety, and are helpful in a variety of other roles. By contrast, my impression is that currently many individuals with the goal of being a research lead pursue options like independent research or engineering-focused positions instead of doing a PhD. This post details the reasons I believe these alternatives are usually much worse at training people to be research leads. I think many early-career researchers in AI safety are undervaluing PhDs. Anecdotally, I think it's noteworthy that people in the AI safety community were often surprised to find out I was doing a PhD, and positively shocked when I told them I was having a great experience. In addition, I expect many of the negatives attributed to PhDs are really negatives on any pathway involving open-ended, exploratory research that is key to growing to become a research lead. I am not arguing that most people contributing to AI safety should do PhDs. In fact, a PhD is not the best preparation for the majority of roles. If you want to become a really strong empirical research contributor, then start working as a research engineer on a great team: you will learn how to execute and implement faster than in a PhD. There are also a variety of key roles in communications, project management, field building and operations where a PhD is of limited use. 
But we believe a PhD is excellent preparation for becoming a research lead with your own distinctive research direction that you can clearly communicate and ultimately supervise junior researchers to work on. However, career paths are highly individual and involve myriad trade-offs. Doing a PhD may or may not be the right path for any individual person: I simply think it has a better track record than most alternatives, and so should be the default for most people. In the post I'll also consider counter-arguments to a PhD, as well as reasons why particular people might be better fits for alternative options. I also discuss how to make the most of a PhD if you do decide to pursue this route. Author Contributions: This post primarily reflects the opinion of Adam Gleave so is written using an "I" personal pronoun. Alejandro Ortega and Sean McGowan made substantial contributions writing the initial draft of the post based on informal conversations with Adam. This resulting draft was then lightly edited by Adam, including feedback & suggestions from Euan McLean and Siao Si Looi. Why be a research lead? AI safety progress can be substantially accelerated by people who can develop and evaluate new ideas, and mentor new people to develop this skill. Other skills are also in high demand, such as entrepreneurial ability, people management and ML engineering. But being one of the few researchers who can develop a compelling new agenda is one of the best roles to fill. This ability also pairs well with other skills: for example, someone with a distinct agenda who is also entrepreneurial would be well placed to start a new organisation. Inspired by Rohin Shah's terminology, I will call this kind of person a research lead: someone who generates (and filters) research ideas and determines how to respond to results. Research leads are expected to propose and lead research projects. They need strong knowledge of AI alignment and ML. 
They also need to be at least competent at executing on research projects: for empirically focused projects, this means adequate programming and ML engineering ability, whereas a theory lead would need stronger mathematical ability. However, what real...

Mar 14, 2024

18 min

AF - Laying the Foundations for Vision and Multimodal Mechanistic Interpretability and Open Problems by Sonia Joseph

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems, published by Sonia Joseph on March 13, 2024 on The AI Alignment Forum. Join our Discord here. This article was written by Sonia Joseph, in collaboration with Neel Nanda, and incubated in Blake Richards's lab at Mila and in the MATS community. Thank you to the Prisma core contributors, including Praneet Suresh, Rob Graham, and Yash Vadi. Full acknowledgements of contributors are at the end. I am grateful to my collaborators for their guidance and feedback. Outline Part One: Introduction and Motivation Part Two: Tutorial Notebooks Part Three: Brief ViT Overview Part Four: Demo of Prisma's Functionality Key features, including logit attribution, attention head visualization, and activation patching. Preliminary research results obtained using Prisma, including emergent segmentation maps and canonical attention heads. Part Five: FAQ, including Key Differences between Vision and Language Mechanistic Interpretability Part Six: Getting Started with Vision Mechanistic Interpretability Part Seven: How to Get Involved Part Eight: Open Problems in Vision Mechanistic Interpretability Introducing the Prisma Library for Multimodal Mechanistic Interpretability I am excited to share with the mechanistic interpretability and alignment communities a project I've been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterparts CLIP. With recent rapid releases of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts remain in tandem. 
While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind. Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment. The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma. You can join our Discord here. This post includes a brief overview of the library, fleshes out some concrete problems, and gives steps for people to get started. Prisma Goals Build shared infrastructure (Prisma) to make it easy to run standard language mechanistic interpretability techniques on non-language modalities, starting with vision. Build shared conceptual foundation for multimodal mechanistic interpretability. Shape and execute on research agenda for multimodal mechanistic interpretability. Build an amazing multimodal mechanistic interpretability subcommunity, inspired by current efforts in language. Set the cultural norms of this subcommunity to be highly collaborative, curious, inventive, friendly, respectful, prolific, and safety/alignment-conscious. Encourage sharing of early/scrappy research results on Discord/Less Wrong. Co-create a web of high-quality research. Tutorial Notebooks To get started, you can check out three tutorial notebooks that show how Prisma works. Main ViT Demo Overview of main mechanistic interpretability technique on a ViT, including direct logit attribution, attention head visualization, and activation patching. The activation patching switches the net's prediction from tabby cat to Border collie with a minimum ablation. Emoji Logit Lens Deeper dive into layer- and patch-level predictions with interactive plots. 
Interactive Attention Head Tour Deeper dive into the various types of attention heads a ViT contains with interactive JavaScript. Brief ViT Overview A vision transf...

Mar 13, 2024

26 min

AF - Virtual AI Safety Unconference 2024 by Orpheus Lummis

Virtual AI Safety Unconference 2024, published by Orpheus Lummis on March 13, 2024 on The AI Alignment Forum. When: May 23rd to May 26th 2024 Where: Online, participate from anywhere. VAISU is a collaborative and inclusive event for AI safety researchers, aiming to facilitate collaboration, understanding, and progress towards problems of AI risk. It will feature talks, research discussions, and activities around the question: "How do we ensure the safety of AI systems, in the short and long term?". This includes topics such as alignment, corrigibility, interpretability, cooperativeness, understanding humans and human value structures, AI governance, strategy, … Engage with the community: Apply to participate, give a talk, or propose a session. Come to share your insights, discuss, and collaborate on subjects that matter to you and the field. Visit vaisu.ai to apply and to read further. VAISU team

Mar 13, 2024

1 min

AF - Transformer Debugger by Henk Tillman

Transformer Debugger, published by Henk Tillman on March 12, 2024 on The AI Alignment Forum. Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into circuits underlying specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders. TDB enables rapid exploration before needing to write code, with the ability to intervene in the forward pass and see how it affects a particular behavior. It can be used to answer questions like, "Why does the model output token A instead of token B for this prompt?" or "Why does attention head H attend to token T for this prompt?" It does so by identifying specific components (neurons, attention heads, autoencoder latents) that contribute to the behavior, showing automatically generated explanations of what causes those components to activate most strongly, and tracing connections between components to help discover circuits. These videos give an overview of TDB and show how it can be used to investigate indirect object identification in GPT-2 small: Introduction; Neuron viewer pages; Example: Investigating name mover heads, part 1; Example: Investigating name mover heads, part 2. Contributors: Dan Mossing, Steven Bills, Henk Tillman, Tom Dupré la Tour, Nick Cammarata, Leo Gao, Joshua Achiam, Catherine Yeh, Jan Leike, Jeff Wu, and William Saunders. Thanks to Johnny Lin for contributing to the explanation simulator design.

Mar 12, 2024

1 min