Learning Bayesian Statistics | Podcast on Podbay

Learning Bayesian Statistics

Alexandre Andorra

Are you a researcher or data scientist / analyst / ninja? Do you want to learn Bayesian inference, stay up to date or simply want to understand what Bayesian inference is? Then this podcast is for you! You'll hear from researchers and practitioners of all fields about how they use Bayesian statistics, and how in turn YOU can apply these methods in your modeling workflow. When I started learning Bayesian methods, I really wished there were a podcast out there that could introduce me to the methods, the projects and the people who make all that possible. So I created "Learning Bayesian Statistics", where you'll get to hear how Bayesian statistics are used to detect dark matter in outer space, forecast elections or understand how diseases spread and can ultimately be stopped. But this show is not only about successes -- it's also about failures, because that's how we learn best. So you'll often hear the guests talking about what *didn't* work in their projects, why, and how they overcame these challenges. Because, in the end, we're all lifelong learners! My name is Alex Andorra by the way. By day, I'm a senior data scientist. By night, I don't (yet) fight crime, but I'm an open-source enthusiast and core contributor to the Python packages PyMC and ArviZ. I also love Nutella, but I don't like talking about it – I prefer eating it. So, whether you want to learn Bayesian statistics or hear about the latest libraries, books and applications, this podcast is for you -- just subscribe! You can also support the show and unlock exclusive Bayesian swag on Patreon!

The Next Step Beyond LLMs: Foundation Models for Inference

Today's clip is from episode 161, featuring Luigi Acerbi. In this conversation, Luigi explains one of the biggest engineering bottlenecks facing transformer-based probabilistic models, and how his group found a way around it.
The core challenge is that many inference models treat data as an unordered set, making them naturally permutation invariant. That's statistically elegant, but computationally painful: every time a new data point arrives, the model has to recompute attention over the entire dataset from scratch, preventing the kind of KV caching that makes modern language models so efficient.
Luigi walks through his team's solution: a hybrid architecture that keeps the original context fully set-based while introducing a causal-attention buffer for newly arriving data. The result is dramatically faster inference - up to 100× faster in some settings - opening the door to applications like reinforcement learning, active data acquisition, and, ultimately, Luigi's long-term vision of a foundation model for Bayesian inference.
Get the full discussion here
Support & Resources
→ Support the show on Patreon
→ Bayesian Modeling Course (first 2 lessons free)
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work
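The hybrid layout described in this clip can be pictured as an attention mask. The sketch below is an illustration in plain Python, not the architecture from the paper: the helper name `hybrid_attention_mask` and the exact masking rule are assumptions. Context tokens attend to each other in both directions (set-based), while buffer tokens attend to the whole context plus earlier buffer tokens (causal). Because context rows never look at the buffer, nothing about them changes when a new point arrives, which is the property that would allow their keys and values to be cached.

```python
def hybrid_attention_mask(n_context, n_buffer):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens 0..n_context-1 form the set-based context; the remaining
    n_buffer tokens are the causal buffer of newly arrived points.
    Hypothetical helper, for illustration only.
    """
    n = n_context + n_buffer
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_context:
                # Context rows: full attention inside the context, none to the buffer.
                mask[i][j] = j < n_context
            else:
                # Buffer rows: all of the context, plus buffer tokens up to and including itself.
                mask[i][j] = j <= i
    return mask


mask = hybrid_attention_mask(n_context=3, n_buffer=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The three context rows are identical whatever the buffer size, so appending a buffer token only adds one new row and column rather than forcing a recomputation of the context block.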

Jul 22

5 min

#161 Amortized Inference & Neural Processes, with Luigi Acerbi

Takeaways:
Q: What is Variational Bayesian Monte Carlo (VBMC) and how is it different from Bayesian optimization?
A: VBMC borrows the machinery of Bayesian optimization but aims at a different target. Bayesian optimization fits a Gaussian process surrogate to an expensive function and uses it to hunt for the optimum. VBMC instead treats the log-posterior as the function to model, evaluates it at a few carefully chosen points, and keeps the whole reconstructed shape rather than just its peak. That gives you the full posterior, not a single best-fit value. Where MCMC might need tens of thousands to millions of evaluations, VBMC often reconstructs a good posterior approximation from a few hundred, which matters when each evaluation is slow.
Q: When should you reach for PyVBMC, and when is it the wrong tool?
A: Two symptoms tell you PyVBMC might help. First, speed: if a single evaluation of your log density takes on the order of a second, running MCMC over tens of thousands of evaluations becomes painful, and PyVBMC's few-hundred-evaluation budget pays off. Second, dimensionality: because it leans on a Gaussian process surrogate, it works well up to roughly 10 to 15 parameters and degrades beyond that. If your model already runs fine in Stan or PyMC, you do not need it. It shines for expensive, low-dimensional models common in science and engineering, where you are modeling a process rather than composing nice distributions.
Full takeaways here
Chapters:
00:18:13 What is Variational Bayesian Monte Carlo (VBMC) and how does it differ from Bayesian optimization?
00:30:21 When should you use VBMC versus BADS in practice?
00:31:20 What is Bayesian Adaptive Direct Search (BADS) and how does its hybrid optimization strategy work?
00:39:18 What are neural processes, and why are transformers a natural neural process architecture?
00:45:54 What is the Amortized Conditioning Engine (ACE) and what problem does it unify?
00:55:42 What do PriorGuide and the new autoregressive buffer paper solve for amortized inference?
01:02:03 How does the new autoregressive buffer speed up predictions in transformer probabilistic models?
01:06:11 What is Luigi Acerbi's vision for a foundation model for inference?
01:09:26 What is ALINE and how does it add active data acquisition to amortized inference?
01:12:43 How does Luigi Acerbi connect LLM agents, Bayesian decision theory, and the nature of intelligence?
01:18:44 For a PyMC, Stan, or NumPyro user, where should you start with VBMC, BADS, or BayesFlow?
Thank you to my Patrons for making this episode possible!
Links from the show here
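To make the evaluation-budget argument from the PyVBMC takeaway concrete, here is a back-of-the-envelope sketch. It does not call PyVBMC; the function names, the 50,000 and 300 evaluation budgets, and the thresholds (one second per evaluation, 15 parameters) are illustrative assumptions loosely based on the numbers quoted in the takeaways.

```python
def wall_clock_hours(n_evaluations, seconds_per_evaluation):
    """Total time spent evaluating the log density, in hours."""
    return n_evaluations * seconds_per_evaluation / 3600.0


def pyvbmc_might_help(seconds_per_evaluation, n_parameters,
                      slow_threshold=1.0, max_parameters=15):
    """Rule of thumb from the takeaways; both thresholds are assumptions."""
    return seconds_per_evaluation >= slow_threshold and n_parameters <= max_parameters


mcmc_hours = wall_clock_hours(50_000, 1.0)   # assumed MCMC budget
vbmc_hours = wall_clock_hours(300, 1.0)      # assumed "few hundred" budget
print(f"MCMC: {mcmc_hours:.1f} h, VBMC-style budget: {vbmc_hours:.2f} h")
print(pyvbmc_might_help(seconds_per_evaluation=1.0, n_parameters=8))
print(pyvbmc_might_help(seconds_per_evaluation=0.001, n_parameters=8))
```

Under these assumptions the same one-second model costs roughly half a day with the MCMC budget and a few minutes with the smaller one, while a model that evaluates in a millisecond gains little from the switch.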

Jul 16

1 hr 32 min

Bayesian Statistics vs Epistemology, with Vaden Masrani

Takeaways:
Q: What's the difference between Bayesian statistics and Bayesian epistemology?
A: Bayesian statistics uses Bayes' theorem on actual data: you put a prior over parameters, combine it with a likelihood, and the data is allowed to tell you your model is wrong. Vaden loves it. Bayesian epistemology, in his tongue-in-cheek phrase, is "Bayesian statistics minus the statistics" - taking Bayes' theorem as a general account of how anyone should reason under uncertainty, including about events where there is nothing to count. The first is falsifiable and grounded; the second, he argues, lets people attach authoritative-sounding numbers to pure belief.
Q: Why is it a problem to put a probability on a one-off future event like human extinction?
A: Because there are no statistics behind it. Vaden's trigger example is Toby Ord's The Precipice, where a data-derived probability (supervolcanoes per millennium) is placed side by side with a probability of extinction-by-superintelligence that came from no data at all. His reaction is the statistician's first instinct: where are the numbers coming from, and what could ever make them come out differently? A subjective degree of belief is fine as a hunch. The trouble starts when it is communicated as though it were an objective, data-grounded frequency.
Q: What does Vaden Masrani actually like about Bayesian statistics?
A: The freedom to encode domain knowledge as a prior and have the result respect common sense - estimating an average human height, you can rule out zero and a hundred feet before seeing a single measurement. But the part he keeps stressing is falsifiability: you fit the model, compare it to data, and the data can tell you the model was bad. That contact with reality is exactly what makes the statistics legitimate and what the epistemology lacks. On Bayesian-versus-frequentist for engineering problems, he says he has no dog in the fight -- both are useful, and any working statistician uses both.
Full takeaways here
Chapters:
00:24:01 What's the difference between Bayesian statistics and Bayesian epistemology?
00:33:12 How can Bayesian epistemology lead to bad real-world decisions?
00:36:36 Is Bayesian or frequentist statistics better for real-world problems?
00:39:31 What is the problem of induction, and how does Bayesian epistemology try to solve it?
00:43:50 What are the main logical problems with Bayesian epistemology?
00:48:40 What is Popper's critical rationalism, and how does falsifiability fit in?
00:52:31 How does critical rationalism work when you can't run a clean experiment?
01:15:03 Why should you treat criticism as a gift, even when it hurts?
01:19:54 How do Stoicism and equanimity help you handle criticism?
01:23:19 Why does critical rationalism apply to everyday life, not just science?
Thank you to my Patrons for making this episode possible!
Links from the show here

Jun 29

1 hr 40 min

Why Bayesian Statistics Is More Computational Than Ever

Today's clip is from Episode 158 featuring Stefan Radev. In this conversation, Alex Andorra and Stefan break down a core argument from their paper: Bayesian statistics has never been more computational than it is now, and simulation is the thread that ties the whole workflow together.
Stefan divides the Bayesian workflow into four stages, and this clip covers the first two. Stage one is model specification, where the workflow community has long recommended prior predictive checks. You can do this informally, just running simulations from your model and eyeballing whether the output meets your expectations, or formally, à la Michael Betancourt, by pushing your model's high-dimensional output through a transformation into a low-dimensional, interpretable space and checking it against reality. The punchline: a surprising number of models can be discarded before you've even seen real data, yet Stefan notes these checks remain underused in practice.
Stage two is model verification, where the question shifts to whether your inferences are well calibrated. This is the territory of simulation-based calibration and parameter recovery studies, classic tools that have always carried a steep computational price. You simulate thousands of synthetic datasets and run inference on every single one, which is exactly why these checks are so often skipped in papers, even though doing one well can be a contribution in its own right.
Here's where amortized simulation-based inference changes the math entirely. Checks that used to take days now take seconds, and instead of laboriously running inference dataset by dataset, you get millions of posterior samples essentially for free. The calibration checks that the field has always known it should be doing finally become cheap enough to actually do.
Get the full discussion here
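As a small illustration of the simulation-based calibration loop described for stage two, the sketch below uses a toy conjugate model where the exact posterior is known, so "running inference" is a one-line formula. The model (a N(0, 1) prior on the mean, unit-variance observations), the function name `sbc_ranks`, and all settings are assumptions chosen for brevity; in a real workflow the posterior draws would come from MCMC or an amortized network.

```python
import math
import random


def sbc_ranks(n_sims=1000, n_obs=10, n_draws=99, seed=0):
    """Rank of the true parameter among posterior draws, once per simulation."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        mu_true = rng.gauss(0.0, 1.0)                        # draw from the prior
        y = [rng.gauss(mu_true, 1.0) for _ in range(n_obs)]  # simulate a dataset
        # Exact conjugate posterior for a N(0, 1) prior and unit-variance likelihood.
        post_var = 1.0 / (1.0 + n_obs)
        post_mean = post_var * sum(y)
        post_sd = math.sqrt(post_var)
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        ranks.append(sum(d < mu_true for d in draws))
    return ranks


ranks = sbc_ranks()
# With calibrated inference the ranks are uniform on 0..n_draws,
# so each quarter of the range should hold roughly 25% of them.
quarters = [sum(25 * k <= r < 25 * (k + 1) for r in ranks) / len(ranks)
            for k in range(4)]
print(quarters)
```

A posterior that is too narrow would pile ranks up at both ends, and one that is too wide would pile them up in the middle, which is why the rank histogram is the usual diagnostic.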

Jun 19

4 min

Exact GPs vs Approximations: When to Use Each (and Why It Matters)

Today's clip is from episode 159 featuring Matthijs Hollanders. In this conversation, Alex and Matthijs dig into a deceptively practical question: when you're modeling wildlife across space and time with Gaussian Processes, how do you keep the math from becoming computationally unbearable - and what does good engineering actually look like in the field?
Matthijs explains that for most real camera trapping datasets, exact GPs still hold up fine. The reason is less about clever math and more about ecological reality: researchers are usually resource-constrained, so datasets tend to be a few hundred sites, not thousands. And when datasets do get large, they're rarely one giant connected grid - they're clusters of independent regions. That structure is exploitable. Run a separate, smaller GP per region, share the hyperparameters, and you avoid building the massive covariance matrix that makes exact GPs expensive in the first place.
But the more interesting thread is where this is heading. Alex introduces Hilbert Space Gaussian Processes (HSGPs) - an approximation that makes compute time nearly linear in dataset size, rather than cubic. The catch, as Matthijs points out, is that approximations aren't always better: if your dataset isn't large enough to be in the regime where the approximation accuracy kicks in, you're better off with the exact GP and its mathematical guarantees. The rule of thumb is simple - if you can use the vanilla GP, just do it.
Get the full discussion here
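The saving from splitting a large dataset into independent regions follows from the cubic scaling mentioned above. The sketch below is plain arithmetic, not a GP implementation: the function names and the example of five regions with 200 sites each are hypothetical, and constant factors are ignored.

```python
def exact_gp_cost(n_points):
    """Relative cost of one exact GP fit, assuming the usual cubic scaling."""
    return n_points ** 3


def per_region_cost(region_sizes):
    """Cost when each independent region gets its own smaller exact GP."""
    return sum(exact_gp_cost(n) for n in region_sizes)


region_sizes = [200, 200, 200, 200, 200]   # hypothetical: 5 regions of 200 sites
joint = exact_gp_cost(sum(region_sizes))
split = per_region_cost(region_sizes)
print(f"one joint GP: {joint:.2e}, per-region GPs: {split:.2e}, "
      f"ratio: {joint / split:.0f}x")
```

With equally sized regions the ratio is the square of the number of regions, so the benefit grows quickly as a dataset fragments into more independent clusters.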

Jun 10

4 min

#159 Bayesian Occupancy Models, with Matthijs Hollanders

Takeaways:
Q: What is a Bayesian occupancy model and what problem does it solve?
A: An occupancy model accounts for the fact that you don't always detect a species when surveying for it, especially when the species is rare. A naive count of where you found it underestimates true occupancy. The model adds a repeated-measures component: you visit each site multiple times, and from the pattern of detections vs. non-detections it estimates a detection probability. Matthijs framed it as a zero-inflation structure where the zero-inflation happens at the site level rather than the observation level -- which keeps the model conceptually simple, just a standard GLM with a Bernoulli “is the species here at all?” stacked on top of a detection-rate process.
Q: What are Automated Recording Units and why don't traditional occupancy models handle them well?
A: ARUs are camera traps and acoustic monitors that record continuously over deployment periods of days, weeks, or months. The data they produce isn't a sequence of discrete human-led surveys; it's a continuous-time observation stream. Traditional occupancy models were designed for the discrete case -- a human visits a site, records yes or no, goes home. With ARUs, the question becomes how to bin or threshold the continuous data without losing the richer signal it actually contains.
Q: When should you not reach for occARU?
A: When your dataset is large and your survey interval is fine-grained. The bottleneck is Stan's fitting speed -- years of daily count data across many sites will fit slowly. The workaround is to bin coarser (weekly or monthly), which doesn't hurt occupancy estimation at all and only loses some detection-rate resolution. If you're only interested in occupancy, big grouping windows are fine.
Full takeaways here
Chapters:
00:12:14 What is an occupancy model and what problem does it solve?
00:16:16 What are Automated Recording Units and why do they need different models?
00:18:45 What is the occARU R package and why does it exist?
00:23:55 Why does occARU model counts directly rather than binary detection?
00:26:38 What does multi-species hierarchical modeling with Gaussian processes look like?
00:32:22 How does occARU implement Gaussian processes efficiently?
00:41:01 Why are Gaussian processes such a powerful but tricky modeling tool?
00:44:11 What is variance decomposition with global-local shrinkage priors?
00:49:02 How does occARU leverage recent Stan features for zero-sum constraints?
00:57:37 When does within-chain parallelization actually help?
01:01:30 How does Monte Carlo integration reduce high Pareto-k values?
01:15:27 When does occARU underperform and what's on the roadmap?
Thank you to my Patrons for making this episode possible!
Links from the show here.
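The site-level zero-inflation idea from the first takeaway can be written down in a few lines. The sketch below is the classic discrete-visit occupancy likelihood with a constant detection probability, not occARU's count-based model; the function name and the parameter values are assumptions for illustration.

```python
import math


def site_log_likelihood(detections, psi, p):
    """Log-likelihood of one site's detection history.

    detections: list of 0/1 outcomes from repeat visits.
    psi: probability the site is occupied; p: per-visit detection probability.
    """
    n_visits = len(detections)
    n_detected = sum(detections)
    # Probability of this history if the site is occupied.
    occupied = psi * (p ** n_detected) * ((1.0 - p) ** (n_visits - n_detected))
    if n_detected > 0:
        # Any detection means the site must be occupied.
        return math.log(occupied)
    # All zeros: either occupied but never detected, or truly unoccupied.
    return math.log(occupied + (1.0 - psi))


print(site_log_likelihood([0, 1, 0, 1], psi=0.6, p=0.3))
print(site_log_likelihood([0, 0, 0, 0], psi=0.6, p=0.3))
```

The extra `(1.0 - psi)` term appears once per site, only when every visit came back empty, which is what "zero-inflation at the site level rather than the observation level" means in practice.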

Jun 8

1 hr 26 min

Can AI Learn What Experts Know? Automating Prior Elicitation with Generative Models

Today's clip is from episode 158 featuring Stefan Radev. In this conversation, Alex and Stefan explore a genuinely fascinating problem: how do you turn an expert's intuition into a mathematically valid prior distribution - and can AI help automate that process?
Alex explains that prior elicitation is essentially a translation problem. Experts don't walk around thinking in probability distributions - their knowledge lives in intuitions, rules of thumb, and rough ranges. The challenge is converting that into something a Bayesian model can actually use.
The traditional approach? Ask an expert for quantiles or a mean, then parameterize your prior with hyperparameters and simulate until the model-implied quantities match what the expert described. If your pipeline is differentiable end-to-end, you use gradient descent. If not, you fall back to something like Bayesian optimization. Either way, you're iterating toward a prior that genuinely reflects expert knowledge - not just a convenient assumption.
But the really exciting part is what came next. In a follow-up paper, they pushed this further: instead of optimizing within a fixed parametric family (say, a Gaussian), they replaced the prior entirely with a normalizing flow - a flexible generative network - and ran the same procedure. No assumed distribution family. Just let the data and the expert's knowledge shape the prior from scratch.
The catch? More flexibility means more non-identifiability and stability headaches. But the direction is clear: a fully automated, end-to-end pipeline for building priors from non-probabilistic expert knowledge. And in 2026, that pipeline could theoretically be driven by an agent.
Get the full discussion here
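The "traditional approach" above, tuning hyperparameters until model-implied quantiles match the expert's, can be sketched for the simplest case of a Gaussian prior. This is a toy under stated assumptions: the quantiles are computed analytically rather than by simulation, the function name `fit_normal_to_quantiles`, the learning rate, and the expert's numbers are all hypothetical, and a real pipeline would push prior draws through the full model before comparing.

```python
from statistics import NormalDist


def fit_normal_to_quantiles(targets, n_steps=500, learning_rate=0.05):
    """Fit Normal(mu, sigma) so its quantiles match the expert's.

    targets: dict mapping probability level -> expert's stated quantile.
    Plain gradient descent on squared quantile error; nothing keeps
    sigma positive, which is acceptable only for this toy setup.
    """
    z = {level: NormalDist().inv_cdf(level) for level in targets}
    mu, sigma = 0.0, 1.0
    for _ in range(n_steps):
        # Model-implied quantile minus expert quantile, per level.
        residuals = {lv: mu + sigma * z[lv] - targets[lv] for lv in targets}
        grad_mu = 2.0 * sum(residuals.values())
        grad_sigma = 2.0 * sum(residuals[lv] * z[lv] for lv in targets)
        mu -= learning_rate * grad_mu
        sigma -= learning_rate * grad_sigma
    return mu, sigma


# Hypothetical expert statement: "5% of values fall below 150, 95% below 190".
mu, sigma = fit_normal_to_quantiles({0.05: 150.0, 0.95: 190.0})
print(round(mu, 2), round(sigma, 2))
```

Swapping the Gaussian for a more flexible family changes only how the model-implied quantiles are produced; the matching loop keeps the same shape, which is the step the follow-up paper generalizes.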

Jun 2

4 min

#158 Bayesian Workflows & Foundation Models, with Stefan Radev

Takeaways:
Q: Why are prior predictive checks so underused in practice, and how do simulations help?
A: They're underused because researchers don't always think to run them before seeing data -- but also because doing them rigorously (in the style Michael Betancourt advocates, with prior push-forward checks on interpretable summaries) takes effort. Simulations make it cheap to generate thousands of “what-if world” datasets from your model and check whether they look plausible, catching bad priors before you ever touch real data.
Q: How can generative AI help with prior elicitation?
A: Rather than forcing a domain expert to choose a distributional family and parameterize it, you can use a generative model to translate their qualitative knowledge directly into a prior. The expert describes what realistic data should look like; the generative model produces synthetic datasets matching that description; those datasets are used to fit a prior distribution. It removes the assumption that experts can think in terms of parameters and replaces it with the more natural question: does this look like your data?
Q: What would a foundation model for Bayesian inference actually look like?
A: Stefan's bet is that it won't be a fine-tuned general LLM. The right analogy is chess: you don't fine-tune GPT to play chess, you teach it when to call Stockfish. For Bayesian inference, you'd want a semantic layer – an LLM that understands the analysis goal – calling specialized numerical engines (MCMC samplers, amortized inference networks) that do the actual computation. Agent skills are already a step in this direction; the longer-term vision is engines that have been trained from scratch to generalize across large families of models and priors.
Full takeaways here.
Chapters:
00:00:00 How does amortized inference fit into modern Bayesian workflows?
00:06:01 What role do simulations play across the full Bayesian workflow?
00:12:12 How do you elicit priors from a domain expert who doesn't think in distributions?
00:19:01 What would a foundation model for Bayesian inference actually look like?
00:35:32 What is self-consistency in amortized inference and why does it matter?
00:39:22 How does semi-supervised learning improve simulation-based inference?
00:43:16 Why is sensitivity analysis so important yet so underused in Bayesian practice?
00:47:40 What is multiverse analysis and how does it change how we report Bayesian results?
00:51:32 How does amortized inference make sensitivity and multiverse analysis affordable?
01:02:47 How do amortized inference and classical MCMC complement each other?
01:10:08 What are the next major directions for BayesFlow and amortized inference research?
Thank you to my Patrons for making this episode possible!
Links from the show here.
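A prior predictive check of the kind described in the first takeaway needs very little code. The sketch below simulates "what-if world" data from two candidate priors and counts how often the result is physically plausible. The adult-height model, the 50 to 250 cm sanity range, and the function name are illustrative assumptions, not something taken from the episode.

```python
import random


def prior_predictive_fraction(prior_mean, prior_sd, n_sims=2000, seed=1,
                              low=50.0, high=250.0):
    """Share of simulated adult heights (cm) that land in a plausible range.

    Model (illustrative): mu ~ Normal(prior_mean, prior_sd),
    height ~ Normal(mu, 10). The range 50-250 cm is an assumed sanity bound.
    """
    rng = random.Random(seed)
    plausible = 0
    for _ in range(n_sims):
        mu = rng.gauss(prior_mean, prior_sd)   # draw a parameter from the prior
        height = rng.gauss(mu, 10.0)           # simulate one observation
        plausible += low <= height <= high
    return plausible / n_sims


vague = prior_predictive_fraction(prior_mean=0.0, prior_sd=1000.0)
informed = prior_predictive_fraction(prior_mean=170.0, prior_sd=10.0)
print(f"vague prior: {vague:.2f} plausible, informed prior: {informed:.2f} plausible")
```

The "vague" prior puts most of its simulated people at negative or multi-metre heights, which is the kind of problem this check is meant to surface before any real data is involved.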

May 21

1 hr 18 min

The Hidden Geometry of Hierarchical Models

Today's clip is from Episode 157 featuring Stefan Radev. In this conversation, Alex and Stefan dig into one of the hardest open problems in simulation-based inference: hierarchical models.
The core idea: when you move from flat to hierarchical models, you're no longer estimating one set of parameters. You have local parameters that vary by location (or subject, or city) and global parameters that capture what's shared across all of them. And you don't just want each separately - you want the full joint posterior, because that's where the Bayesian magic of shrinkage actually lives.
Stefan builds the problem from the ground up. Start with the simplest hierarchical case: a two-level model. He uses electoral forecasting in France as the example: cities nested inside departments nested inside the whole country.
Now your simulator has to cover all three levels. If that simulator is slow (think: brain emulators, minutes per sample), scaling to hundreds of groups becomes completely intractable. Memory issues, specialized network requirements, the works.
The key insight: this problem has structure you can exploit. The joint posterior factorizes in a particularly nice way: each local parameter depends on its own local data and on the global parameters. That means instead of cramming everything into one giant high-dimensional vector and hoping a neural network figures it out, you can decompose the problem. Estimate local parameters conditioned on local data and the globals. Use composition.
The takeaway: hierarchical models aren't just "harder flat models" - they have a geometry that demands a different architecture. Respecting that structure is what makes amortized inference scale.
Get the full discussion here
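The factorization mentioned in the key insight can be written out for the two-level case. The notation is an assumption introduced here for illustration: \phi denotes the global parameters, \theta_j the local parameters of group j, y_j the data of group j, and J the number of groups. It also assumes the standard setup in which the local parameters are conditionally independent given the globals and each y_j depends only on its own \theta_j.

```latex
p(\theta_1, \dots, \theta_J, \phi \mid y_1, \dots, y_J)
  = p(\phi \mid y_1, \dots, y_J) \prod_{j=1}^{J} p(\theta_j \mid y_j, \phi)
```

Read right to left, this is the composition described in the clip: one estimator for the globals given all the data, and one shared estimator for a single group's locals given that group's data and the globals, applied J times.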

May 13

3 min

#157 Amortized Inference & BayesFlow in Practice, with Stefan Radev

Takeaways:
Q: What is simulation-based inference and what does "sim-to-real" mean?
A: Simulation-based inference (SBI) uses a mechanistic simulator as an epistemic tool: you train a neural network on a large number of labeled simulations and then deploy it on real, unlabeled data. The "sim-to-real" framing captures the key asymmetry -- your network never sees real data during training, only simulations, but it generalizes to real observations at inference time. This is the opposite of the more common "synthetic-for-ML" approach, where fake data is used purely to augment real training data.
Q: What is the amortized inference agent skill and what does it do?
A: It's an open-source AI agent skill, co-developed by Stefan and Alexandre, that teaches an AI coding agent to run a complete, state-of-the-art amortized inference workflow. Because amortized inference is recent enough that it's underrepresented in LLM training data, vanilla agents tend to get it wrong. The skill injects the right methodology: it guides the agent to set up the simulator, choose the right network architecture, run a pilot, train with appropriate diagnostics, and produce an actionable report -- without the user needing to know the details.
Q: What is calibration coverage and why should you never skip it?
A: Calibration coverage tells you whether your posterior uncertainty is honest -- whether your credible intervals actually contain the true parameter at the right frequency. A model can show poor parameter recovery yet still be well-calibrated (because it's falling back on the prior), or it can appear to recover parameters while being poorly calibrated. Running calibration diagnostics both in-sample and out-of-sample is especially revealing for hierarchical models, which often appear to underfit in-sample but generalize much better out-of-sample thanks to shrinkage.
Full takeaways here
Chapters:
00:00:00 How does amortized inference fit into the Bayesian workflow?
00:12:03 What does "sim-to-real" mean in simulation-based inference?
00:15:57 Why is amortized inference particularly suited to psychology and neuroscience?
00:21:51 What is the amortized inference agent skill?
00:39:00 What is calibration coverage and how do you interpret it?
00:41:50 How do you decide what to do next after your first training run?
00:44:53 How do actionable insights make Bayesian workflows more usable?
00:49:08 What are the unique challenges of hierarchical models in amortized inference?
01:00:51 What is the current state of BayesFlow's support for hierarchical models?
01:05:00 What are the main failure modes of amortized inference and how do you handle model misspecification?
Thank you to my Patrons for making this episode possible!
Links from the show
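Calibration coverage, as defined in the third takeaway, is straightforward to estimate by simulation. The sketch below checks whether 90% credible intervals contain the true parameter about 90% of the time, using a toy conjugate model with an exact posterior. The model, the function name `interval_coverage`, and the settings are assumptions for illustration; with an amortized network the interval would come from its posterior samples instead.

```python
import math
import random
from statistics import NormalDist


def interval_coverage(level=0.9, n_sims=2000, n_obs=5, seed=2):
    """Fraction of simulations whose central credible interval contains the truth."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # half-width in posterior sds
    hits = 0
    for _ in range(n_sims):
        mu_true = rng.gauss(0.0, 1.0)                        # truth drawn from the prior
        y = [rng.gauss(mu_true, 1.0) for _ in range(n_obs)]  # simulate a dataset
        # Exact conjugate posterior: N(0, 1) prior, unit-variance likelihood.
        post_var = 1.0 / (1.0 + n_obs)
        post_mean = post_var * sum(y)
        half_width = z * math.sqrt(post_var)
        hits += abs(mu_true - post_mean) <= half_width
    return hits / n_sims


coverage = interval_coverage()
print(f"nominal 0.90, empirical {coverage:.3f}")
```

Empirical coverage well below the nominal level signals overconfident posteriors, and coverage well above it signals intervals that are wider than they need to be.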

May 6

1 hr 18 min