Causal Bandits Podcast

Causal AI, Effect Heterogeneity & Understanding ML || Alicia Curth || Causal Bandits Ep. 006 (2023)

December 27, 2023 Alex Molak Season 1 Episode 6

Video version available on YouTube
Recorded on Nov 29, 2023 in Cambridge, UK


Should we continue to ask why?

Alicia's machine learning journey began with... causal machine learning.

Starting with econometrics, she discovered semi-parametric methods and the Pearlian framework at later stages of her career and incorporated both in her everyday toolkit.

She loves to understand why things work, which inspires her to ask "why" not only in the context of treatment effects, but also in the context of general machine learning. Her papers on heterogeneous treatment effect estimators and model evaluation bring unique perspectives to the community.

Her recent NeurIPS paper on double descent aims at bridging the gap between statistical learning theory and a counter-intuitive phenomenon of double descent observed in complex machine learning architectures.

Ready to dive in?

About The Guest
Alicia Curth is a Machine Learning Researcher and a final-year PhD student at the van der Schaar Lab at the University of Cambridge. Her research is focused on causality, understanding machine learning methods from the ground up, and personalized medicine. Her work is frequently accepted at top machine learning conferences (she's a true serial NeurIPS author).

Connect with Alicia:
- Alicia on Twitter/X
- Alicia on LinkedIn
- Alicia's web page

About The Host
Aleksander (Alex) Molak is an independent machine learning researcher, educator, entrepreneur and a best-selling author in the area of causality.


Causal Bandits Podcast
Causal AI || Causal Machine Learning || Causal Inference & Discovery
Web: https://causalbanditspodcast.com

Connect on LinkedIn: https://www.linkedin.com/in/aleksandermolak/
Join Causal Python Weekly: https://causalpython.io
The Causal Book: https://amzn.to/3QhsRz4

We don't need more and more methods if they all have the same failure mode, if we keep on focusing on one aspect of the problem and ignoring another because no one ever takes a step back and looks at what we are actually doing. That's actually a question that I like to ask myself about every new project or every new setting that we study. I like to ask...

Hey, Causal Bandits, welcome to the Causal Bandits Podcast, the best podcast on causality and machine learning on the internet. Today, we're traveling to Cambridge to meet our guest. As a child, she loved math, but wanted to do something more applied. The first machine learning paper she read in her life was the one about causal trees.

She loves rowing and swimming and has the courage to ask why machine learning works. Researcher at the University of Cambridge. Ladies and gentlemen, please welcome Miss Alicia Curth! Let me pass it to your host, Alex Molak.

Welcome to the podcast, Alicia.

Thank you so much for having me, Alex.

Where are we today?

Today we are in Cambridge, UK, in my office in the CMS, the Centre for Mathematical Sciences, in the Department of Applied Mathematics and Theoretical Physics, which is where our group is based.

Before we start, I wanted to share with you that I'm really grateful for your work. Your papers, and also your code for CATENets, were super helpful for me when I was working on my book.

And I believe the approach that you present in your papers is something that we need more of in the community: looking at what has already been done and testing it from more perspectives. Gathering this knowledge, in my opinion, immensely helps to move the field forward. What was the source of this approach for you, taking this perspective of trying to combine what we already know and stress test it against different cases?

Oh, well, first of all, thank you so much. That's a huge compliment. I think part of it was just trying to understand what's already there. Because I think if we want to build new methods that work better, we kind of need to understand what the ones that we already have work well for, and what they fail at.

And also, to understand what is a viable strategy, we kind of need to stress test different ones to see what the promising avenues are and why. So I think most of my research has very much focused on the word "why". I mean, machine learning, or the literature on treatment effect estimation in machine learning specifically, has been very good at building solutions.

And I think the machine learning literature often focuses on showing that something works, but not so much on why or when, and those are the questions that I'm always most curious about. I think that's just my background, the more statistics- and econometrics-heavy background that I have.

In these kinds of literatures, people often don't focus so much on defending a single method, but more on understanding why specific methods work and when. And that's something that I've been trying to do in my research throughout.

Your most recent paper, the one that will be presented at NeurIPS this year, is not focusing on causality, but it's also based on the "why" question. Can you share with our audience a little bit more about this paper?

Oh yeah, totally. So indeed, the last paper that we've published is the first paper of my PhD that really doesn't have anything to do with causality at all.

It's called "A U-turn on Double Descent", and we're looking at the double descent phenomenon, which gained popularity in recent work because of a very counterintuitive finding. What we know from statistics is that, usually, as you increase model complexity, you have a U-shaped trade-off between model complexity and model performance. But in this new double descent regime, you observe a second descent as the number of parameters exceeds the number of training examples.

And we found that somewhat counterintuitive, together with the co-author. We asked ourselves: all the statistical intuition that we've been taught in our undergraduate or graduate-level courses can't just stop holding right here at this interpolation threshold. So we wanted to ask, why does this happen?

Because I think there's been a lot of work on showing that it happens, but we've been trying to find the root cause of this phenomenon to understand it better. And that's what we did in this paper. We found out that there's actually a kind of mechanism shift happening in the parameter-increasing mechanism at this interpolation threshold. It happens naturally in these settings, but the threshold itself isn't the cause of the phenomenon. Actually, there's a different root cause, and you can find that out experimentally: you can move the peak, and you can kind of get back to statistical intuition.

And yeah, I absolutely loved writing that paper. It was super fun, and it was just a really cool little statistics rabbit hole to go down.

To give a little bit of context to those in our audience who are less familiar with what's going on in contemporary machine learning: double descent is this phenomenon where model loss, let's say, is going down, then it goes up, but then it drops again.

And you gave a very, very beautiful geometric explanation for this phenomenon. Can you give us an intuition for how you looked at this?

Yeah. It's going to be kind of hard holding a microphone, because I usually like to do it with my hands. Okay. So let's, yeah. Yes.

Yes. Cool. So usually this is the shape of the curve that we've been observing: test error first decreases, then increases, and then decreases again. This is the double descent curve as it's been presented in the literature. But what we're able to show is that, for the examples that we studied in our paper, it's not so much this one 2D curve. Actually, the underlying thing is that there are two separate complexity axes.

It's a 3D plot with two orthogonal complexity axes, and double descent only arises because you traverse them sequentially. So basically, you're taking a 3D plot and presenting it as a 2D plot. And that's how this appears.

You're really looking at a 3D plot with two complexity axes that each have this normal convex shape that we're expecting. And when you're unfolding them, that's when the peak appears. This intuition is kind of powerful because it also shows you that there are different points at which you can unfold this curve, and where you unfold determines the shape that you're getting.

So yeah, that was a really cool intuition to come up with, because once you've seen that, it almost becomes a trivial phenomenon. So yeah, that was really fun. I had a great time writing that paper. And it was something very, very different from what I'm used to working on.

Normally it's mainly treatment effect estimation, which is kind of a niche within machine learning, whereas this phenomenon applies to all of supervised learning. So that was a really cool experience, working on something slightly broader and in a slightly wider context than what I usually focus on.

I think this is a beautiful example and, correct me if I'm wrong, but I understand this as a phenomenon that is possible because we are trying to project three-dimensional information down to a two-dimensional space.

Mm-hmm.

Would that interpretation be correct?

Yeah, I think so. I think the reason that it happens quite naturally to present it this way is that there's something very interesting going on. I say there are two complexity axes, and the two axes, as far as we see it, play different roles. Basically, there's always one axis where you have a limit to how many parameters you can add.

So the double descent phenomenon appears because on one axis you're only able to add n parameters, where n is the number of examples that you have, whereas on the other axis you can increase parameters indefinitely. It comes from the fact that you actually need to switch axes at some point, because you can no longer add parameters in one way.

So the, my favorite example is the first example we give in the paper, which is an example about trees, because a tree can only ever have as many terminal leaf nodes as there are examples in your data set. So once you've reached full depth trees, if you want to add any more parameters, you're gonna have to do something different.

And so like in the original Double Descent examples, what they did is to add more trees. So basically increase the number of parameters. Not through adding more leaves to a single tree, but by adding more trees. But quite obviously that's a very different parameter adding mechanism than adding single leaves.
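A minimal sketch of the parameter-counting idea in the tree example above. It assumes a simplified, hypothetical scheme in which the raw parameter count is the total number of leaves: up to n training examples the count grows by adding leaves to a single tree, and beyond n it can only grow by adding more full-depth trees. The function name `ensemble_coordinates` and the numbers are illustrative and are not taken from the paper.

```python
def ensemble_coordinates(total_leaves, n_train):
    """Map a 1D raw parameter count onto two complexity axes.

    Returns (leaves_per_tree, n_trees). Illustrative assumption: a single
    tree can have at most n_train leaves, so any further parameters must
    come from additional full-depth trees.
    """
    if total_leaves <= n_train:
        return total_leaves, 1              # first axis: grow one tree
    n_trees = -(-total_leaves // n_train)   # ceiling division
    return n_train, n_trees                 # second axis: add more trees


n_train = 100
path = [ensemble_coordinates(p, n_train) for p in (25, 50, 100, 200, 400)]
for leaves, trees in path:
    print(f"leaves per tree = {leaves:3d}, trees = {trees}")
```

Plotting test error against the single number `total_leaves` flattens these two axes into one curve, which is the sequential traversal described in the conversation.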

Once you realize that that's what's happening in trees, we then went back to what was happening in the linear regression example, where this isn't obvious at all, and where you need to understand some connections between min-norm solutions, singular value decompositions and PCA regression.

There, once you see it, you realize there's actually also a kind of two-dimensional space, and you're only transitioning once you've reached n, but you could do this at other points as well.

I think it's a great example showing us that we are not always very good at understanding what is going on in those models.

In your work on conditional average treatment effect estimators, or heterogeneous treatment effect estimators, you also looked a little bit deeper into different mechanisms. In particular, things related to how complex the functions of the individual potential outcomes are versus how complex the difference between them is.

Could you share what the main insights from this work would be?

Yeah, I think the main insight from most of my early work on heterogeneous treatment effect estimation was that what type of treatment effect estimation strategy works best really depends on what your assumption is about the underlying problem structure.

So, broadly speaking, the treatment effect is just the expected value of the difference between someone's outcome given treatment versus not given treatment. That's our treatment effect. And you can estimate this in two ways. You can first estimate the expected value under one treatment, then estimate the expected value under the other treatment, and then take the difference between these two estimates.

Or you can do something slightly more involved, where you're basically estimating the treatment effect directly, without having these two expected values separately as byproducts. Which of these two strategies works best, in all the experiments that I've conducted, and I've done quite a lot of simulation studies on this, mainly depends on whether the problem of learning the treatment effect itself has, technically speaking, a simpler structure than learning the two functions separately.

And whenever the treatment effect is simpler than the potential outcomes separately, direct targeting is a much better strategy, because you're being much more efficient.
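The two strategies can be written down compactly. The notation here is an assumption introduced for illustration: X are covariates, W in {0, 1} is the treatment indicator, Y(0) and Y(1) are the potential outcomes, Y is the observed outcome, and \pi(x) = P(W = 1 | X = x) is the propensity score. The pseudo-outcome shown is only one standard way to target the effect directly; it is valid under unconfoundedness with a known propensity score, and it is not necessarily the estimator studied in the guest's papers.

```latex
% Target: the conditional average treatment effect (CATE)
\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]

% Indirect strategy: estimate both outcome regressions, then take the difference
\mu_w(x) = \mathbb{E}\left[\, Y \mid X = x,\; W = w \,\right], \qquad
\hat{\tau}_{\text{indirect}}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)

% Direct strategy (one example): regress a pseudo-outcome on X
\tilde{Y} = \left( \frac{W}{\pi(X)} - \frac{1 - W}{1 - \pi(X)} \right) Y, \qquad
\mathbb{E}\left[\, \tilde{Y} \mid X = x \,\right] = \tau(x)
```

In the indirect case the learner has to fit two possibly complex surfaces, while in the direct case the regression target is \tau(x) itself, which is the point about simpler structure made above.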

And direct targeting, again for those of us who are less familiar with the term, is this idea of looking at CATE, so the conditional average effect, directly, rather than through taking the difference between the potential outcomes.

Yeah, exactly. Exactly.

Well, you came to causality through a path that is maybe not very usual for most people in the field. The first machine learning paper that you read in your life was a causal machine learning paper by Susan Athey.

How did this shape your perception of the field later on in your career?

Oh, that's a very interesting question. Yeah, indeed. I came to machine learning via treatment effect estimation, from the policy economics background that I had from my undergrad. So I learned about machine learning methods as tools to infer treatment effects.

So it's basically the other way around from what you normally do, where you first learn about machine learning and then about causality. For me, it was more: here's a problem that I have, heterogeneous treatment effect estimation. How do I solve that with methods beyond linear regression, which is what I think we usually use in econometrics?

And I think how this background in causality has shaped my approach to machine learning is that, specifically in econometrics, you think a lot about what assumptions you need to make things work, and under what scenarios things work well. The questions that you're looking at in policy economics, but also in biostatistics, often appear in safety-critical environments where, if you're making policy or treatment decisions based on something you've estimated, you'd better be sure that you've estimated it correctly.

So I think that's influenced very much how I go about evaluating machine learning methods. I like to be sure why things work and when things work, to be confident that they will work in practice under different scenarios.

Is that something that also inspired your recent work on non-causal machine learning, the paper that we discussed earlier?

Definitely. Because I have a bit of an applied statistics background, I think there was always a lot of emphasis put on building intuition as to when and why you expect things to work. And I don't like this modern machine learning regime that we're currently in, where we say, well, statistical intuition doesn't seem to hold.

We need something new. But I like statistical intuition. I like what we've built over many, many years, and I want to get some of that back, because if you have this intuition, you can maybe be a bit more sure about when things will work and when they won't. I don't like seeing machine learning as this magic bullet that defies the laws of statistics that we know, because that, to me, cannot possibly be true, right?

There is no such thing as a free lunch. Nothing will always work. There's so much value in understanding especially the failure modes of machine learning methods, be it for causal inference, for prediction or for something more general.

I think that's so important, and I think we were falling behind a little bit on that part.

You have worked on causality in the context of health. And as you mentioned before, this is one of those safety-critical applications where the cost of mistakes can be very, very high from a financial and, first and foremost, an ethical point of view.

Do you think that causality is necessary for building reliable automated decision systems?

Yeah. I think in some form or another, some kind of knowledge of how a system behaves under intervention is necessary. Is it sufficient? I guess that depends on what you mean by causality.

If you have a perfect model of the underlying SCM of the world, I think it would be, because as soon as you've captured everything possible about the world in your model, you should be able to make decisions on any type of intervention that you'd want, which is I think what you need. But I've never thought about this before.

It's a good question.

What would you define as the main challenge in your work with CATE models?

Challenge in what context, or in what sense? Technical or applied, or...?

Or all of them. I want to leave it open. Just what seemed like the biggest challenge for you in this work?

That's actually a question that I like to ask myself about every new project or every new setting that we study. I like to ask: what makes this problem setting unique? What are the unique challenges that appear here that maybe don't apply in a purely predictive setting?

So in the CATE estimation context, just from a technical standpoint, I think there are two very interesting things happening. On the one hand, if you have treatment assignment biases, you have this kind of covariate shift between the two groups. So if the group that receives the treatment looks very different from the group that doesn't receive treatment, and you're fitting models to that, you might suffer from some effect of covariate shift.

That's been a huge theme in the literature, looking at this effect of covariate shift between groups. And then the second thing, which I've looked at mainly in my work, is this problem that you not only have covariate shift, but you actually have a label that's missing. The label that you're really interested in is the difference between potential outcomes for an individual, so basically the difference between if I were treated versus if I weren't treated.

But in reality, I can only ever observe one of the two. So the actual label that I'm interested in is unobserved, and that makes learning quite challenging, but also quite interesting. As a technical challenge, I think these are the two unique aspects of the treatment effect estimation problem relative to a standard prediction problem.
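Both aspects can be seen in a tiny simulation. The sketch below uses a made-up six-unit population (all numbers are hypothetical) in which both potential outcomes are known only because the data is simulated. The analyst's view, `observed`, keeps just the covariate, the treatment and the factual outcome.

```python
# Hypothetical toy population: both potential outcomes are listed only
# because this is a simulation; real data never contains both.
population = [
    # (x, y0, y1, w): covariate, untreated outcome, treated outcome, treatment
    (0.1, 1.0, 1.5, 0),
    (0.2, 1.2, 2.0, 0),
    (0.3, 1.4, 2.4, 0),
    (0.7, 2.0, 4.0, 1),
    (0.8, 2.2, 4.4, 1),
    (0.9, 2.4, 5.0, 1),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# What an analyst actually observes: x, w and the factual outcome only.
observed = [(x, w, y1 if w == 1 else y0) for x, y0, y1, w in population]

# The label of interest, y1 - y0, needs both outcomes and is never observed.
true_ate = mean(y1 - y0 for _, y0, y1, _ in population)

# Covariate shift: treated units have systematically larger x.
mean_x_treated = mean(x for x, w, _ in observed if w == 1)
mean_x_control = mean(x for x, w, _ in observed if w == 0)

# A naive difference in observed means mixes the effect with the shift.
naive_diff = mean(y for _, w, y in observed if w == 1) - mean(
    y for _, w, y in observed if w == 0
)

print(f"mean x: treated {mean_x_treated:.2f}, control {mean_x_control:.2f}")
print(f"true average effect {true_ate:.2f}, naive difference {naive_diff:.2f}")
```

Because treatment assignment depends on `x` in this toy data, the naive comparison of group means is far from the true average effect, and no row of `observed` contains the individual effect itself.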

But then, in addition, if you actually want to bring any of these models to practice, I think the evaluation problem is the one that brings the biggest challenge for actually deploying these models, because you never have access to both potential outcomes in practice.

And because the assumptions that you need are untestable, it's much harder to validate whether a heterogeneous treatment effect estimation system works when you want to deploy it in practice. I think that's the big hurdle that I see for taking these kinds of systems and deploying them in safety-critical applications: it's so hard to be sure that it works well, given that the label is unobserved and you have covariate shift in your data.

In the causal discovery literature, we have this paper by Alexander Reisach, which is titled "Beware of the Simulated DAG!". In this paper, he shows that the way we construct synthetic data sets may impact how the models perform, and the clues that come from the simulation process might be exploited by those models.

And the challenge here is that in reality, those clues might be missing. In one of your research projects, you found something similar in spirit, I would say, regarding heterogeneous treatment effect estimation. Can you share a little bit more about the insights from this work?

Yeah. So this is a NeurIPS '21 paper, I think, called "Really Doing Great at Estimating CATE?", where we have a critical look at the benchmarking practices in the field. And I think the findings are actually very similar to the project that you just mentioned, which is more in the context of causal discovery. In treatment effect estimation, because you have this missing data problem, you need to simulate your data to have some ground truth to evaluate against if you're doing benchmarking.

So people have been relying on the same simulated data sets. And what we showed in that paper is that the problem characteristics encoded in these simulated data sets very much favor a specific type of estimator over another, which I'd say would be fine if these problem characteristics were really rooted in data generating processes that we expect in the real world.

But I think this is not the case. Some of these benchmark data sets have what I consider kind of random, not necessarily realistic outcome generating processes that very heavily favor one type of estimator over another, and I don't necessarily think that reflects the types of data generating processes that we'd actually see in reality.

So I think there's a big gap in this context: we lack authoritative statements on what the likely data generating processes are that we'd observe in reality, so that we can have better benchmarking test beds, given that it is so hard to validate these models on real data.

So you mentioned untestable assumptions that we need to deal with in causal models in general. And in many real-world systems, especially the open and complex ones, it would be very difficult to validate that the graph that we are passing to the model has a good correspondence to reality.

At the other end of the spectrum, we have this idea of testable implications coming from Judea Pearl. If we have a structural causal model, we can generate data from this model and see if this data resembles what we observe in reality. That might be particularly useful when we are able to intervene, even minimally, on the real-world system, because then we can cut down the space of possible solutions. What are your thoughts about this direction, in particular in the context of model evaluation?

Yeah, I think this is a really great point. There are at least parts of the treatment effect estimators that we can indeed evaluate. For example, in this CATE estimation setting, if you have a model that also outputs potential outcome predictions, you can at least check these potential outcome predictions against the factuals and have this as a first step of model evaluation, to at least say: okay, well, the predictions that it generates at least resemble the real observed data to some extent.

The only slight problem that I see with this comes from my own, slightly more recent work. We had an ICML '23 paper on different model selection criteria for treatment effect estimation.

And what we found is that the models that perform best at predicting outcomes aren't necessarily the ones that perform best at predicting treatment effects. Especially if you're in low data regimes, sometimes there's a bit of a trade-off between doing very well at fitting a possibly very complex regression surface for the potential outcomes and fitting the treatment effect itself well.

Because the treatment effect is the difference between two predictions, if you're making the same error on both predictions, that will cancel out. So there's a question of whether errors cancel out or add up once you take the difference between these two predictions. That just doesn't really have testable implications. But I still think that, if you want to use any of these models in practice, it is a very good idea to take the testable implications as step one and check that you're at least in the right ballpark of getting the predicted outcomes right.
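A small deterministic sketch of this point, using made-up numbers. Two hypothetical models are compared on a noise-free toy data set: `model_a` makes the same error on both potential outcomes, while `model_b` makes smaller errors of opposite sign. The names `units`, `evaluate`, `model_a` and `model_b` are illustrative assumptions, not part of any library or of the paper mentioned above.

```python
import math

# Hypothetical toy units: (mu0, mu1, w) = true untreated outcome,
# true treated outcome, and the treatment actually received.
units = [(1.0, 2.0, 0), (2.0, 4.0, 1), (3.0, 3.5, 0), (4.0, 6.0, 1)]


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def evaluate(predict):
    """Return (factual RMSE, CATE RMSE) for a potential-outcome model."""
    factual_errors, cate_errors = [], []
    for mu0, mu1, w in units:
        p0, p1 = predict(mu0, mu1)
        observed = mu1 if w == 1 else mu0        # only this is testable
        predicted_factual = p1 if w == 1 else p0
        factual_errors.append(predicted_factual - observed)
        cate_errors.append((p1 - p0) - (mu1 - mu0))
    return rmse(factual_errors), rmse(cate_errors)


def model_a(mu0, mu1):
    # Same error on both outcomes: it cancels in the difference.
    return mu0 + 1.0, mu1 + 1.0


def model_b(mu0, mu1):
    # Smaller errors with opposite signs: they add up in the difference.
    return mu0 - 0.5, mu1 + 0.5


factual_a, cate_a = evaluate(model_a)
factual_b, cate_b = evaluate(model_b)
print(f"model_a: factual RMSE {factual_a:.2f}, CATE RMSE {cate_a:.2f}")
print(f"model_b: factual RMSE {factual_b:.2f}, CATE RMSE {cate_b:.2f}")
```

In this toy case `model_b` looks better on the factual check, which is the only part that can be computed from real data, yet `model_a` recovers the treatment effect exactly.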

So what you're saying is that even though we can learn the functions of the potential outcomes pretty well, but still with some error, this error can actually be magnified when we compute CATE based on those.

Yeah. Or cancel out. And that's the thing that we just don't really know.

Having said that, of course, once we're in very large data regimes, once we get the potential outcomes perfectly right, then the treatment effect is obviously also perfect. So in the limit, doing well at treatment effect estimation and doing well at potential outcome prediction is probably the same.

It's more in lower data regimes, where you have more biases, that I think it becomes harder to judge from how well potential outcome predictions do how well a treatment effect estimator would do.

You started your journey with causality from the grounds of the potential outcomes framework.

But I know that you're also familiar, at least to an extent, with the Pearlian framework. What would be the tools from those two frameworks that you found the most helpful in your journey so far?

Yeah, I think causal graphs, DAGs, are just incredibly useful as a tool to represent your assumptions, especially also to communicate with stakeholders who maybe are not machine learners or statisticians. I think they are very appealing when you go talk to a doctor, for example, to ask: if this is my outcome and this is my treatment, what do you think the potential confounders could be, and do we have them measured?

I think it's a great way of communicating, and it's a great way of depicting problems that have slightly more complex structures, so any types of biases that arise due to causal structures that are not your simple treatment, outcome, confounder kind of three-variable setup.

And also as soon as your problem gains more dimensions, so if you add time, for example, then a causal graph is very, very useful to understand the different patterns that you could have. For example, we looked at treatment effect estimation with survival outcomes in the presence of competing risks. So you're interested in, let's say, a cancer survival outcome, but a patient could also die of a competing cause, for example a cardiovascular cause.

In these kinds of settings, without a causal graph, I think you're kind of lost in understanding all the different paths of effect that there could be. So that's where it comes in extremely handy. But I still very much like using potential outcomes to reason about the estimators themselves.

What would be the most helpful features of potential outcomes that help you think more clearly about estimation? Is it the fact that potential outcomes are like primitives in this framework, maybe?

Yeah, I just like to think about the treatment effect as the difference between two potential outcomes.

I think that's just a very, very useful way of depicting things. I personally don't really think that the potential outcomes framework and the Pearlian DAG world stand in any contrast to each other at all. There's also work showing equivalence, right? So I just like to write in terms of potential outcomes, but I think it's very useful to depict problem structures in terms of causal graphs.

What was your original motivation to go into economics and econometrics?

I'm not sure how much thought was put into that at the time. Obviously, I made this decision when I was 18 years old. I just wanted to study something related to math. I knew I liked math a lot, and I wanted to do something that also has real-world applications.

So I looked into more applied, mathematically oriented programs, and I came across econometrics. I don't think I knew at all what I was getting myself into when I chose it. I think I was expecting something like 50 percent economics, 50 percent math. What it actually was was about 80 percent statistics, and yeah, a little bit of economics, a little bit of other mathematical sciences, but it was really mainly statistics.

I don't think I would have chosen it at 18 had I known how much statistics it was, because I thought I didn't like statistics, just because of how it was taught in high school. In my experience, at least, statistics wasn't something that was particularly attractive in school. But I loved it. I'm so happy that I went down that path. Turns out I love statistics. I think it's a very cool way of understanding the world. So yeah, I ended up in econometrics a little bit by accident.

A very, very happy accident. And yeah, I just liked it a lot. To me, there's something magical about statistics, where you learn why things work. And when you actually see that they do work in practice, I always think it's cool.

Take things like the law of large numbers: you prove why this works. But when in reality you see something actually converging to the mean as expected, I still think there's a little bit of magic to it. I like it a lot.

Do you think mathematics is an accurate description of reality, or is it just a useful language that helps us structure our experience?

Great question. I guess I don't have such a strong mathematical background per se, but I think statistics specifically is a great way of giving us a language for describing processes in the world. Statistics and probability theory are a very nice way of describing how certain processes just work in the world.

And I like that a lot.

When you studied economics or econometrics, were the experiences from this time of your life something that was also useful for you when you moved into working more in the context of health?

I think so. Yeah. I mean, ultimately, I consider myself more of an applied statistician, where the "applied" can apply to any application.

And I think the problems that you encounter are the same. You just give them different names, like the variables. Whether I'm inferring the effect of a policy, as you do in policy economics, or whether I'm inferring the effect of a medical treatment, the statistical properties of the problem that you're trying to describe are the same.

So I think there are many, many parallels between the two fields. There are also differences in the assumptions you'd usually place on problem structure, which is I think where it becomes very interesting, or necessary, to have domain experts on board to help you understand what the most likely ways are in which certain assumptions can be met.

But overall, I think everything I learned in econometrics is still relevant in what I'm doing right now, because it was indeed just a way of describing and understanding the world. And whether I'm trying to describe an economic problem or a medical problem, the statistical skills that I need are still exactly the same.

What do you think is the future of machine learning and causal machine learning, or causal analysis in more general terms?

It seems like causality is slowly making its way into the more mainstream machine learning literature, where I feel like it was a bit of a niche area. But it's, I think, gaining quite some traction, because if you look at questions of generalization, people are slowly starting to realize that there's a lot of causal thinking involved in going to new settings. And similarly, any time you want to build any system that takes actions, taking an action is in some ways performing an intervention on your environment.

So I think there's huge potential, and people are obviously doing it, to take ideas from causality to make any kind of autonomous systems better.

If you could imagine that we could solve just one challenge in causality today, which one should that be?

So for me personally, the biggest challenge that I would love to see solved is coming up with better ways of evaluating that things work in practice.

So if I could build a model and then put it in a test box and someone tells me this is safe, this works, that would be great. And I think this goes kind of hand in hand with something else, because ultimately what I need is someone to validate the assumptions, because these are untestable assumptions, either on identification or on the problem structure.

What I need is someone to tell me: the assumptions that you make are very likely to hold here. So to have some way of mapping domain expertise into something that can validate my assumptions, I think, would be great.

I had a conversation recently with Stephen Senn, who is a statistician as well, working with experimental data in drug development.

So very much safety-critical applications. And we had a short conversation about Fisher, about randomization, and about myths about randomization, some of them more and some of them less prevalent in the community. He mentioned that when Fisher was thinking about randomization, he didn't think about it in terms of how not to be wrong, but rather in terms of quantifying how wrong we are. Do you think that this idea could also somehow be translated to the challenge of evaluating causal models, or CATE models, heterogeneous treatment effect models?

Maybe in a way this relates to being able to at least put some kind of bounds on how certain we are of treatment effects being in the space where we think that they are.

So maybe sensitivity analyses go in this kind of direction, where you put some kind of sensitivity model on your data (sorry, not on your data, on your data model) and then use that to inform at least some kind of bounds. Say, if my unobserved confounding was this or that strong, this is how much that would impact my estimates, and whether that would flip their sign.
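A minimal sketch of this kind of "how strong would confounding have to be" reasoning, assuming the simplest possible setting: a linear model with a single unobserved confounder, where the classical omitted-variable-bias formula applies. The function name `adjusted_effect`, the naive estimate of 0.5, and the strength grid are all hypothetical. This is not the conformal-prediction approach mentioned in the conversation, only a toy stand-in for the idea.

```python
import numpy as np

def adjusted_effect(naive_estimate: float,
                    confounder_to_outcome: float,
                    confounder_on_treatment: float) -> float:
    """Remove the omitted-variable bias implied by an assumed confounder.

    In a linear model Y = tau*T + gamma*U + noise, regressing Y on T alone
    estimates tau + gamma*delta, where delta is the slope from regressing
    the unobserved U on T. Solving for tau gives the correction below.
    """
    return naive_estimate - confounder_to_outcome * confounder_on_treatment

naive = 0.5  # hypothetical naive estimate, not a real result

# Sweep a single "strength" knob, using it for both gamma and delta.
for strength in np.linspace(0.0, 1.0, 6):
    tau = adjusted_effect(naive, strength, strength)
    flag = "sign flips" if tau < 0 else "same sign"
    print(f"strength={strength:.1f}: adjusted effect={tau:+.2f} ({flag})")
```

Reading the output as a table gives the sensitivity statement in the form described above: the estimate keeps its sign unless the assumed confounding exceeds some strength.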

I think there's been some very cool recent work over the last two, three years on using modern machine learning ideas. For example, some cool work on using conformal prediction intervals to bound the effects of possibly unobserved confounding. I think it's very cool.

You did your master's at Oxford, and now you're finalizing your PhD here at Cambridge.

What are the most important things, resources, characteristics, or habits, something you've done, that helped you to get through all those challenging studies?

Oh, I think curiosity is definitely one of them. For me, when I don't understand why something works, I really want to know.

I need to know, I need to understand it. So there's, I think, a healthy portion of just curiosity to get myself through it. I don't think I could do what I did if I didn't love it. You need to love what you do, especially in a PhD, I think, to keep yourself motivated.

Yeah. I also just love statistics. I love learning new things. I think it's very fun. So especially throughout the master's, I found it fascinating to learn about things I didn't know anything about, especially coming from econometrics, where you have a very parametric approach to statistics.

And then in my master's in Oxford I learned a lot more about nonparametrics and about machine learning. That was fascinating to me, just kind of a paradigm shift. And then during my master's thesis, I learned a lot about semiparametric statistics, which is really cool. And then coming here, in every project that I did, I just tried to learn something new about some area that I didn't know anything about. There are lots of synergies between different fields.

In treatment effect estimation, I first learned a lot about domain adaptation, because that's where the covariate shift problems appear. Then you learn a lot about multitask learning, because there are lots of architectures that actually just build on multitask learning. I learned a lot about survival analysis, competing risks, biostatistics.

There are just a lot of new things to learn about. And I think I just have enough curiosity to keep going.

You mentioned this intersection of many different fields or subfields. Yes, yes. Related subfields. When you look at the causal community today, what do you think would be the most beneficial thing for the community to learn outside of causal analysis itself that could spark more inspiration and help move the field forward?

That is a great question. It's a very heterogeneous field itself, right? I'd say other fields have a lot to learn from the field of causality. But because people working on causality are so heterogeneous, it's very difficult to tell what different people need to take from where.

Yeah, I don't have a good recommendation. How to average them. How to average us. Yeah, yeah, I know, but it's such a broad field. You have people that work more on estimation. You have statisticians, econometricians, biostatisticians, computer scientists, people in causal discovery.

Who am I to tell them what they don't know yet?

So which of those fields were most helpful for you, for yourself? 

I learned most, I mean, obviously from econometrics, I think I took a lot. And then I've just been very fascinated by learning more about biostatistics and the semiparametric statisticians. The way that they formalize problems, I think, is very cool, especially as an econometrician.

I come from a world where everything is highly parameterized, where a treatment effect, how I was taught it, is usually a regression coefficient in a linear model. And then I started learning more about how things are presented in biostatistics, especially in the semiparametric statistics literature, where you think of treatment effects more as a functional of a statistical model, where you think in terms of expected values of differences between potential outcomes and so on.

To me, that was mind-blowing, because it makes it so much easier to reason about causality if you don't also have to go through this weird parametric linear model that no one really believes holds anyway.

So what's the meaning? So that was really cool for me. The biostats literature, I think there's so much interesting stuff in there.
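To make the contrast concrete, here is a small simulated sketch of the two views: the average treatment effect as a functional, E[Y(1) - Y(0)], estimated by averaging the difference of two fitted outcome models, next to the coefficient on treatment in a linear regression. The data-generating process, the quadratic outcome models, and all variable names are assumptions made for this illustration. The linear coefficient is printed only for comparison; it need not coincide with the average effect when effects vary with the covariate or the linear model is misspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

x = rng.normal(size=n)                         # observed covariate
propensity = 1.0 / (1.0 + np.exp(-(x + 1.0)))  # treatment depends on x only
t = (rng.random(n) < propensity).astype(float)

y0 = x**2 + rng.normal(size=n)                 # potential outcome, untreated
y1 = y0 + 1.0 + x                              # individual effect is 1 + x
y = np.where(t == 1, y1, y0)                   # only one outcome is observed
true_ate = 1.0                                 # E[1 + X] with E[X] = 0

# Plug-in estimate of the functional E[mu1(X) - mu0(X)]:
# fit one outcome model per arm, then average the predicted difference.
coef1 = np.polyfit(x[t == 1], y[t == 1], deg=2)
coef0 = np.polyfit(x[t == 0], y[t == 0], deg=2)
ate_plugin = float(np.mean(np.polyval(coef1, x) - np.polyval(coef0, x)))

# The "regression coefficient" view: coefficient on t in y ~ 1 + t + x.
design = np.column_stack([np.ones(n), t, x])
beta_t = float(np.linalg.lstsq(design, y, rcond=None)[0][1])

print(f"true ATE {true_ate:.2f}, plug-in {ate_plugin:.2f}, "
      f"linear coefficient {beta_t:.2f}")
```

The target quantity here is defined before any model is chosen; the quadratic fits are just one way of estimating it, picked because the simulated truth happens to be quadratic.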

What two or three books had the most impact on you and changed your life the most?

I'm gonna answer this from an academic standpoint now. So I read The Book of Why, I think in 2019, as an econometrician who had only learned about causality within an economics framework.

So that was super interesting to me because, yeah, I think I was predominantly taught in a potential outcomes kind of setting. It's a book where I kind of felt a little bit offended by the book itself, because obviously Pearl makes a couple of little digs at economists and whether or not he thinks they are able to reason about causality well.

But it was just so interesting to me to think about causality on a bit more of a philosophical level. Counterfactuals, for example, true rung three counterfactuals, are not something that we had visited at all in the courses on impact evaluation that I had taken at university.

So that was definitely super, super interesting to me. Then while I was doing my master's thesis in 2020, I started to get more into the semiparametric statistics way of thinking, and I read parts of Targeted Learning by Mark van der Laan and Sherri Rose, and I absolutely loved it. It blew my mind how they introduce a completely different way of thinking about target parameters, the estimands, the causal quantities that could be of interest to you.

I think that completely changed how I thought about statistics. That was just really, really cool. And then my favorite book on statistical learning, which generally must be The Elements of Statistical Learning. As textbooks go, my goal with any of my research would be to write papers, and anything really, down in such a way that it's as intuitive as they make it seem.

I think it's a great, great book.

What would be your advice for people who are just starting with causality and maybe feel a little bit overwhelmed that there's so much to learn in order to make any progress in this field?

For me, I started quite naturally, slowly, in it, just from economics.

I didn't even know how much stuff there was out there. So indeed, I think it's about finding a way to get the intuition right first. I think The Book of Why is actually a really good gateway drug to causality, because it's written in a popular science way that's appealing to a more general audience. Otherwise, other introductory books by people that have actually taken the time to come up with a way of presenting these things that is a bit more intuitive, this book, for example. I think that would be a great way.

Because I think there's an overwhelming amount of literature out there at this moment. I think it's very hard to go in at what the state of the art is. Yeah. But I don't have great advice, except for maybe looking for some courses as well that actually teach you the basics.

Who would you like to thank?

Who would I like to thank? Oh wow, that's a great question. For academic life, I have a big, big thank you to my undergrad institution, in econometrics in Rotterdam. They did such a great job at teaching us about intuitions, and they put so much focus on making sure that the students are happy, which I think we didn't appreciate at the time. I had a great time in Rotterdam.

So more generally, but specifically there was our study coordinator, or well, the person who really took care of the education committee and everyone, Professor Christiaan Heij, who also wrote what we call the Heij bible of econometrics, Econometric Methods for Business and Economics, I think it is called.

Great book. He taught us Econometrics 1. Great, great professor. I also took a course in Rotterdam called Impact Evaluation by Professor van Kippersluis, who said in the opening lecture that he studied econometrics here a couple of years ago and felt he was missing a course on actually talking about impact evaluation.

Like, how do you use the methods that we've been given to actually evaluate policies? So he made that course. And that was the best course I've ever taken. And, yeah, I had a great friend in undergrad, Thomas Wiemann, who's now doing a PhD in Chicago, who introduced me to all of the machine learning for treatment effect estimation. He kept on sending me the papers by Susan Athey specifically, on how you can use machine learning for causal inference, and without him, I would not be here today.

And then, obviously, without people like Susan Athey and Guido Imbens, who've written these really cool papers on using machine learning for economics, I also wouldn't have started on machine learning. Well, I have lots of people to thank. But these are just some of them, in terms of early academics, I think.

Alicia, before we conclude, I would like to take a step back and go back to your research. In another research project that we haven't mentioned before, you are looking from the causal point of view at models or situations that have a higher complexity. So for instance, if we add a time dimension, what happens?

One of the very popular methods in the context of health, and policymaking as well, is sensitivity analysis. What are your main insights from your work regarding sensitivity analysis as seen from the causal point of view?

I think, yeah, sensitivity analysis is a really interesting tool that I haven't used a lot myself.

Not massively much, but I think there's a huge opportunity to use it more to perturb the assumptions that we're making and see what would happen if our assumptions held a little bit more or less. So yeah, I think there's lots that could be done in causality, but also in machine learning more generally, by playing around a bit more: instead of focusing on point identification, on making individual point predictions, saying, oh, here's the set of predictions that I think is most likely given certain assumptions.

So I think that's a very interesting direction that I haven't made a lot of use of yet myself. But in treatment effect estimation, you usually put some kind of sensitivity model on the amount of hidden confounding. But there are many other aspects that real problems have that you could also put these kinds of sensitivity models on.

So I've worked a lot on survival data. And in survival you have an additional complexity, which is the presence of censoring. And similar arguments and questions apply there. Is your censoring ignorable? So are there no hidden confounders that affect both censoring and your outcome variables?

Or missingness is another question. In all of our problems, we assume that data is completely observed. But often we actually don't record everything. So putting some kind of model on how data is observed would be another way of looking at this.

So looking, when we think about survival analysis, even if our causal model is well specified at the beginning of the time interval that we are looking at, it might get confounded with time.

Yeah, I think overall looking at some kind of error propagation over time, how bad is it if my assumption isn't met in the beginning versus how do errors compound over multiple time steps, is a very interesting question that I haven't thought about yet myself at all.

I also have not worked a lot on time series settings particularly, but I think it's a very interesting direction to think about what additional complexities you can add to models. Because in the treatment effect estimation literature in machine learning specifically, I think we've focused on a very, very small part of the real problem characteristics that could arise, right?

We've picked this teeny tiny area and we keep on looking at how you estimate heterogeneous treatment effects when you have a single treatment, yes or no, an outcome that's kind of simple and continuous, maybe, and static data. But there are so many more problems out there.

I think most real problems actually have five, six more axes of complexity. You have time, you have different outcomes, you have multiple treatments, treatment combinations, missingness, informativeness in how things are sampled, informativeness in when things are measured. So I think there's overall a huge opportunity to look at more realistic problems, wider classes of problems.

And that is also something we've tried to do over the last couple of years. So one step at a time, look at one extra level of complexity that you could add to these problems.

What would be your main insights from this work in general?

Very interestingly, most additional problems that we've considered, be it survival analysis or competing risks, missingness, censoring, informative sampling, all of them ultimately come down to a missingness problem.

You just don't observe everything you want to observe. And the more layers of complexity you add, the more missingness there is, actually. If you have censoring, missing outcomes, and treatment selection, then of all the potential outcomes that there are, you actually observe less and less.

So it becomes very much a sparsity problem of: I'm observing a lot less than I would want to. And there are covariate shifts that are induced by all these different missingness mechanisms. So I think there's some very interesting opportunity in looking at how you tackle all of these missingness problems jointly.

Because that's something we haven't done yet. I like to look at problems one at a time because that makes it kind of easier, but actually there's probably huge value in looking at them all together, because they are all in some ways the same problem that manifests in slightly different ways.
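The "you observe less and less" point can be sketched with a toy simulation that stacks three missingness layers on a table of potential outcomes. The rates (50 percent treated, 30 percent censored, 20 percent unrecorded) are arbitrary assumptions, and each layer is drawn independently of everything else, which is the easy, ignorable case; the informative mechanisms discussed in the conversation would additionally induce covariate shift.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# observed[i, a] is True when potential outcome Y_i(a) is available.
observed = np.zeros((n, 2), dtype=bool)

# Layer 1: treatment selection reveals exactly one potential outcome per unit.
treated = rng.random(n) < 0.5
observed[np.arange(n), treated.astype(int)] = True
frac_after_treatment = observed.mean()

# Layer 2: censoring hides the outcome for some units (rate is an assumption).
censored = rng.random(n) < 0.3
observed[censored] = False
frac_after_censoring = observed.mean()

# Layer 3: some remaining outcomes are simply never recorded.
unrecorded = rng.random(n) < 0.2
observed[unrecorded] = False
frac_after_missing = observed.mean()

print(f"after treatment selection: {frac_after_treatment:.2f}")
print(f"after censoring:           {frac_after_censoring:.2f}")
print(f"after missing outcomes:    {frac_after_missing:.2f}")
```

Treatment assignment alone already hides half of the potential outcomes, and every further layer only shrinks the observed share, which is why the problem starts to look like a sparsity problem.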

So I understand that you are proposing to look at all those challenges as a missing data problem, treating the missing data framework as a unifying perspective. That's a little bit in the spirit of what Donald Rubin proposed.

I mean, there's definitely a reason why Donald Rubin worked on causality and on missingness.

I'm assuming, not that I know, but I'm assuming there's definitely a reason that in biostatistics these are usually all treated as unified problems, because ultimately they are all missingness problems where you maybe assume slightly different causal structures.

You change what causes what. Basically, in a treatment effect estimation problem, the missingness indicator is kind of your treatment assignment mechanism, or your treatment assignment, actually. So in that case, you do assume that your treatment indicator has an effect on outcomes, whereas in a typical missingness problem, you usually assume that there's no effect of missingness on the outcome.

So there are slight differences in what you assume about what causes what, slight differences in what you need for identification, and in what the target parameters of interest are. But ultimately, statistically speaking, if you look at the structure, they're all very, very related, which is why people like Donald Rubin, Robins, or Mark van der Laan have all studied so many of these problems.

Because once you realize how related they are, why not treat them all as one class?

With all of those challenges, you speak about some structural information, information about the process, or how one thing or one variable in the model can impact the outcome or another variable, and so on and so on.

The structural approach is very far away from what we got used to when it comes to the machine learning community and the culture of machine learning publishing. We rather look at empirical results and very, very fast publication or pre-publication cycles, feedback cycles, and so on and so on.

What are your thoughts about these two, I feel, slightly contrastive approaches?

Yeah. So what I've really liked about the worlds of statistics and econometrics is that in publications, it at least seems to me that it's often a lot less about performing horse races where you have a horse in the race yourself.

Machine learning research seems to be a lot about achieving the state of the art.

Like my attention horse is running faster than yours. 

Exactly. Exactly. Exactly. So it's bringing something new and showing: here, I can make this line bold in my results table because I've beaten everything else. Whereas, at least how I feel about literatures like econometrics and statistics, you focus a lot more on understanding the structure of a problem and what kind of solutions you need to solve it.

So it's a lot more about the problem itself and understanding it. And I think that's what machine learning is missing hugely. In the papers that I write, I'm mainly interested not in building methods, but in understanding the methods that already exist and why they work, because I think that's kind of lacking.

But it's been really hard as a PhD student writing papers like that, because I think I'm yet to send a paper to a conference and not have at least one reviewer question the novelty of what I'm doing, just because the specific novelty that this community is often looking for is in the architecture, it's in the method itself, whereas I think there can be novelty elsewhere.

If you are getting new insights into a problem, I think that's also a novelty that's worth publishing, but that's very hard to get through a review process because, as a community, we're on the lookout for novelty in methods very specifically. And I think that's a problem.

Because we don't necessarily need more and more and more methods. For example, if they all have the same failure mode, if we keep on focusing on one aspect of the problem and ignoring another one, because no one ever takes a step back and looks at what we are actually doing here. I think that's kind of what I see lacking.

And so there's a paper from a couple of years ago called Troubling Trends in Machine Learning Scholarship by Zach Lipton and, I think, Jacob Steinhardt. It's one of my favorite papers I've ever read, because it talks about all the things that are kind of going wrong in machine learning research, well, at the time, and I think kind of still true today. One of them, and I think this is the one that I've looked at the most, is that they talk about how people don't really think about sources of gain. You show, okay, I beat the current state of the art, I beat the benchmark, but not necessarily why.

So if you're proposing a new architecture, let's say it changes five things from the previous architecture. If you're then not ablating the sources of gain, you've not really learned anything. Maybe it was just the way that you were optimizing. Maybe it was just the step size that you were using, or how you were optimizing hyperparameters.

That's something that I dislike about how research is going at the moment: there's such a big focus on methodological novelty, but not so many rewards for trying to understand problems better.
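A toy sketch of what ablating the sources of gain can look like in code. Here a hypothetical "new method" differs from a plain least-squares baseline in two ways (quadratic features and a ridge penalty), and every on/off combination is evaluated so that the gain can be attributed to a specific change. The data, the two components, and all names are assumptions for illustration, not taken from the paper or the conversation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.normal(size=n)
    y = x**2 + 0.1 * rng.normal(size=n)   # the truth is quadratic in x
    return x, y

def features(x, use_quadratic):
    cols = [np.ones_like(x), x]
    if use_quadratic:
        cols.append(x**2)
    return np.column_stack(cols)

def fit_predict(x_train, y_train, x_test, use_quadratic, use_ridge):
    """Least squares with two optional changes that can be switched off."""
    a, b = features(x_train, use_quadratic), features(x_test, use_quadratic)
    lam = 1.0 if use_ridge else 0.0       # penalty also hits the intercept (toy choice)
    w = np.linalg.solve(a.T @ a + lam * np.eye(a.shape[1]), a.T @ y_train)
    return b @ w

x_train, y_train = make_data(500)
x_test, y_test = make_data(5_000)

# Evaluate every combination of the two changes, not just "all on" vs "all off".
results = {}
for use_quadratic, use_ridge in itertools.product([False, True], repeat=2):
    pred = fit_predict(x_train, y_train, x_test, use_quadratic, use_ridge)
    results[(use_quadratic, use_ridge)] = float(np.mean((pred - y_test) ** 2))
    print(f"quadratic={use_quadratic!s:5} ridge={use_ridge!s:5} "
          f"test MSE={results[(use_quadratic, use_ridge)]:.3f}")
```

Comparing only the full method against the baseline would show a gain without saying where it comes from; the full grid makes it visible which of the two changes is doing the work in this simulated setup.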

Is the future causal?

Hopefully, at least partially. I think the near-term future is probably causality-inspired, you know. I love the types of research that are at least taking some ideas from causality, like for robustness or transfer learning, taking some ideas like the invariance of causal mechanisms, these kinds of things, to build better or more stable models.

And I think, yeah, at some point, maybe it's fully causal.

Where can people find out more about you, your team and your work?

I mean, I have a website myself where you can find something more about me. Luckily, if there's one good thing about machine learning research, it's that it's very public, right?

Because we're usually publishing both on arXiv and at conference venues, the work that we do is usually not hidden behind paywalls, which I think is kind of nice. So our papers are obviously very public. My own work is on my own website, but I'm obviously part of a much bigger lab supervised by Professor Mihaela van der Schaar here in Cambridge.

In our lab, there's a part of us that works on causality, but we, I think, look at machine learning for health much more generally. And there's obviously the website of our group as well, which has all these different research pillars, where I'm part of the research pillar that looks at causality.

But there are also people looking at other things. So if people are more interested in our work on machine learning for health more generally, you should definitely check out our website. I think it's vanderschaar-lab.com.

We'll link to it in the description.

Yeah.

Yeah, what's your message to the causal Python community?

Keep asking why. I think maybe one of the reasons that in my own work I look so much at why different methods work so well is that that's also a causal question, ultimately, right? I mean, I'm looking at treatment effect estimation as a causal problem, but then also, in most of my work on simulating and evaluating methods, I'm also trying to find the sources of gain, so the root causes of why models perform better.

So maybe there is something to it, that one of the reasons that the machine learning literature on causality is slightly different than other areas of machine learning is that we're very trained to ask why.

Like distilling correlation from causation. And yeah, I think it's good to keep asking why.

Examining our assumptions. Alicia, thank you so much for your time. That was a great conversation.

Thank you so much for having me. Thank you. Thank you.

Congrats on reaching the end of this episode of the Causal Bandits podcast.

Stay tuned for the next one. If you like this episode, click the like button to help others find it. And maybe subscribe to this channel as well. You know. Stay causal.
