Causal Bandits Podcast

Causal Bandits @ AAAI 2024 | Part 2 | CausalBanditsPodcast.com

Alex Molak Season 1 Episode 23


 *Causal Bandits at AAAI 2024 || Part 2*

In this special episode we interview researchers who presented their work at AAAI 2024 in Vancouver, Canada.

Time codes:
00:12 - 04:18 Kevin Xia (Columbia University) - Transportability
04:19 - 09:53 Patrick Altmeyer (Delft) - Explainability & black-box models
09:54 - 12:24 Lokesh Nagalapatti (IIT Bombay) - Continuous treatment effects
12:24 - 16:06 Golnoosh Farnadi (McGill University) - Causality & responsible AI
16:06 - 17:37 Markus Bläser (Saarland University) - Fast identification of causal parameters
17:37 - 22:37 Devendra Singh Dhami (TU/e) - The future of causal AI


Causal Bandits Podcast
Causal AI || Causal Machine Learning || Causal Inference & Discovery
Web: https://causalbanditspodcast.com

Connect on LinkedIn: https://www.linkedin.com/in/aleksandermolak/
Join Causal Python Weekly: https://causalpython.io
The Causal Book: https://amzn.to/3QhsRz4

 Causal Bandits at AAAI 2024 | Part 2 | CausalBanditsPodcast.com

Alex: Causality, large language models, explainability, generalization, and fairness. This is Causal Bandits Extra at AAAI 2024, part two. Enjoy. 

Kevin Xia: Hi, my name is Kevin Xia. I'm a fifth-year PhD student working with Professor Elias Bareinboim in the Causal AI Lab at Columbia University. Here I'm discussing my colleague's work, Transportable Representations for Domain Generalization.

We may want to solve a statistical task like classification, for example. You have features X and a label Y, and you're trying to predict Y. Now, the issue is that you have a source domain where you have your data, and the target domain where you're trying to use the model is different from your source domain.

So if you blindly use all of your features X, you might not get the right answer in the target domain, right? The whole idea of the paper is: how do you get a classifier that is capable of generalizing across environments? One naive way of thinking about it might be, okay, let's just use the causal features,

for example. Now, while that might work sometimes, the issue is that it's often not the best. Sometimes you can leverage more information than the causal features in order to get a better representation, better generalization. The concept used here is the concept of transportability: we say that something is transportable if it is invariant across domains.

So we would like the classifier to be invariant across domains. Now, while the causal parents are invariant across domains, we can often get something that is more informative than that and still invariant across domains. That's generalized through this function phi here: we have phi of X, which maps to a representation R, and now we're trying to predict Y using our representation R.

The paper discusses many results on how to get this R, and the ultimate goal is for this prediction of Y given the representation R to be transportable, to be invariant across domains. The paper discusses many ways to guarantee that.
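As a rough sketch of the invariance condition described here (the notation is assumed, not taken from the paper): for a representation R = Φ(X), the prediction of Y from R is transportable when the conditional distribution agrees between the source domain π and the target domain π*,

    P^{\pi}(Y \mid \Phi(X)) = P^{\pi^{*}}(Y \mid \Phi(X)),

and the goal is to choose Φ so that R is as informative about Y as possible while this equality still holds; using only the causal parents of Y is one such Φ, but often not the most informative one.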

Alex: But can we also do it without the causal graph?

Kevin Xia: Yes. Yes. So, while the first half of the paper explains how you might do it knowing some information about the causal mechanisms, the second half discusses what happens if you don't have that, and you need to make some other assumptions in order to work entirely on the statistical level.

Using just your available data, even with these assumptions and without the causal diagram, you are still able to get some guarantees on when you can get a representation that's invariant across domains. In the end, what you end up with is a representation that may be more informative than just the causal parents and is still transportable, still invariant across domains, such that you can get a better prediction that still works in the target domain.

Alex: What impact do you think this work can have in the real world?

Kevin Xia: The issue is that, oftentimes in the real world, we simply can't collect data in the domain that we intend to use our model in, right? In healthcare, for example, we want to study humans, but all we have is, say, a lab setting, right?

There are so many of these cases where the area of study is just different from the area of application, and if we ignore the differences in these cases, we can't truly hope to get accurate results, right? So this type of work, which essentially leverages some sort of causal understanding of the system, I think is very important for applications in these domains, where we still want powerful predictive performance even when the domain is different.

Alex: What was the most interesting causal paper that you read last month? 

Kevin Xia: Oh, I see. I don't remember the name, but there was a paper that studied how large language models behave in solving causal inference tasks. I think the paper introduces a data set, CLadder.

Yes, yes, CLadder, that one, yes. It was basically a data set, and I think the takeaway was basically that LLMs are really bad at solving these tasks, right? But I think it really paves the way for a better understanding of how LLMs understand the world, and having these data sets could be great for having benchmarks for future research.

Patrick Altmeyer: So this work is about counterfactual explanations in the broadest sense. There's been a lot of work in this space, and counterfactual explanations essentially try to explain the behavior of black-box models without actually opening the black box. They work on the premise of just looking at how inputs into a model need to change for it to produce different outputs.

In this context, counterfactuals can also have a causal meaning, but in a lot of the literature the causal meaning is a bit more implicit. In this paper, what we try to tackle is faithfulness. A lot of work in the past has focused on a different desideratum, which is called plausibility.

Of course, in the context of explanations for human decision makers, of algorithmic recourse, people are interested in generating explanations that actually look intuitive and make sense. And plausibility, the way that we define it in our work, essentially means that the counterfactual

that you generate is consistent with the true data generating process. So we have a little image here, if you want to just slightly tilt the camera. Here we show the counterfactual path going from one class to the other, from orange to blue, and the contour just shows the current density estimate for the true distribution of the data, or rather the observed data.

So it's only an estimate of that. Plausibility in this context means that the counterfactual should end up in a region that is characterized by high density. But if we go from plausibility to faithfulness, which we do in our work, and this is very much the motivation, the concern is that if we focus only on plausibility of counterfactuals, we might lose track of our primary objective, which is to explain black-box models.

To illustrate this, we look at a couple of simple illustrative examples using MNIST. Here we have a factual image, which is correctly classified by a simple MLP as a nine, and the task for the counterfactual generator, for the explainer, is to see what is necessary in the eyes of the model to go from predicting nine to predicting seven.

So that's the counterfactual, the target label, in this case. And all of these different counterfactuals that you see here, using different approaches, all of them are valid. For all of these, with high confidence, the classifier predicts that this is now indeed a seven. What you can see is that these two here, Wachter and Schut, look very much like adversarial attacks.

That's not surprising, because methodologically counterfactual explanations and adversarial examples are related. The only really plausible explanation is this one, generated by a very interesting approach called REVISE. Here what the authors propose is to use a surrogate model, a variational autoencoder, under the hood to try and understand what actually makes the model work:

what is a plausible factual or counterfactual, to understand the data generating process. And that's great, that's a plausible explanation. Everyone would probably agree that this can pass as a seven, whereas these others still look close to the original. Exactly, yeah, these just retain, you know, a high "nineness".

But to me, there's a bit of friction there. Again, since we're in the business of explaining models, how can we confidently show just this plausible explanation if these other explanations are also valid? In my mind, we run the risk of whitewashing a black-box model, because we're showing something that's plausible, an explanation that pleases us,

but that doesn't necessarily reflect accurately how the model behaves. That's what we're trying to tackle in this work. We want to have faithfulness first and plausibility second, because in my mind I see very few practical cases where generating plausible but unfaithful explanations for black-box models makes much sense.

Alex: Yeah, we should note here, since people will be watching this video in the context of causality, that faithfulness is understood here differently than the assumption of faithfulness that we use in causal discovery.

Patrick Altmeyer: Yeah, that's a good point. Since you mentioned causality, there are also interesting approaches in the context of counterfactual explanations, most notably by Karimi et al.

Bernhard Schölkopf is also involved in this work, and Isabel Valera. They've basically shown that it is possible, given causal knowledge, to generate counterfactuals that are causally valid. And you can actually use that causal knowledge to generate counterfactuals more efficiently, at smaller costs to individuals.

What we try to do in our work instead is to simply rely on properties that the model provides. So instead of using some surrogate tool to derive better explanations, we put all the accountability on the model itself. And it turns out there's actually a fairly straightforward way to still get plausible counterfactuals, provided that the model has learned plausible explanations for the data.

To do that, we borrow ideas from energy-based modeling and also from conformal prediction. The intuition here is that we want to be able to characterize or quantify the generative capacity of the classifier in question and its predictive uncertainty. That's what energy-based modeling and conformal prediction give us, respectively.

Both of these approaches are model-agnostic, so we can do this with pretty much any differentiable classifier.
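To make the general recipe a bit more concrete, here is a minimal, hypothetical sketch of gradient-based counterfactual search in which the usual validity loss is combined with a penalty that stands in for the model-based (energy / predictive-uncertainty) term described above; the model, the penalty, and the parameter choices are placeholders, not the paper's actual objective or code.

    import torch
    import torch.nn.functional as F

    def counterfactual_search(model, x_factual, target_class, steps=200, lr=0.05, lam=0.1):
        """Illustrative sketch: minimize cross-entropy towards the target class
        plus lam * penalty, where the penalty stands in for an energy-style term
        that keeps the counterfactual faithful to what the classifier has learned."""
        x = x_factual.clone().detach().requires_grad_(True)
        target = torch.tensor([target_class])
        optimizer = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            logits = model(x)  # model maps a batch of inputs to class logits
            validity_loss = F.cross_entropy(logits, target)
            # Placeholder "energy" penalty: favor inputs to which the model assigns
            # high logit mass for the target class; a real implementation would use
            # a properly defined energy or conformal score here.
            energy_penalty = -logits[0, target_class]
            loss = validity_loss + lam * energy_penalty
            loss.backward()
            optimizer.step()
        return x.detach()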

Lokesh Nagalapatti: Hello, I'm Lokesh, from IIT Bombay. This is our AAAI 24 work on continuous treatment effect estimation using gradient interpolation. We address this problem by augmenting new treatments in the data set and estimating pseudo outcomes for them.

We observed that estimating pseudo outcomes for treatments close to the observed treatments is easy and can be obtained by performing a first-order Taylor expansion. However, for treatments that lie far from the observed treatments, we need uncertainty measures so that we can scale down the loss contribution of unreliable pseudo outcomes.

We found that this kind of data augmentation helps break the confounding that exists in the observational data set and leads to better performance. We applied our method on VCNet, which is a state-of-the-art neural network architecture for continuous treatment effect estimation, and we found that such a simple data augmentation technique boosts the performance of VCNet by a significant margin.

For more details, please refer to our paper at AAAI 24. Thank you so much.

Alex: How would you summarize the main contribution of this paper?

Lokesh Nagalapatti: The main contribution of this paper is to have uncertainty estimates for treatments that are far from the observed treatments. It is very important to scale down the contribution of such unreliable pseudo outcomes.
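A rough sketch of the idea in symbols (my notation, not the paper's): given an observed sample with treatment t and outcome y(t), a pseudo outcome for a nearby augmented treatment t' can be formed by a first-order Taylor expansion,

    \tilde{y}(t') \approx y(t) + (t' - t)\,\frac{\partial y}{\partial t},

and its contribution to the training loss is scaled by an uncertainty-dependent weight that shrinks as |t' - t| grows, so that unreliable pseudo outcomes for treatments far from the observed one matter less.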

Alex: What will be the best application for your research?

Lokesh Nagalapatti: Eventually, our goal is to learn recourse. In recourse, we want to find an optimal treatment that works well for a given patient. Now, one intermediate step in performing recourse is treatment effect estimation, where the goal is to estimate the difference in outcomes

as we give different treatments to the individual. Without efficient methods for estimating treatment effects, finding the optimal treatment is difficult. So we believe that this work will help in addressing the challenges that we have in recourse problems.

Alex: What is the most interesting causal paper that you read last month?

Lokesh Nagalapatti: Currently I'm working on treatment effect estimation given post-treatment covariates, so I have been reading papers on self-supervised learning. There are some impossibility results which state that, unless you are given counterfactual data, finding representations that would make the usual treatment effect estimation approaches applicable on observational data sets is, kind of, void.

I mostly refer to the papers by Bernhard Schölkopf. He has a lot of papers on, you know, impossibility results. Yeah, that's great. Thank you so much.

Golnoosh Farnadi: So, my name is Golnoosh Farnadi, and I'm an assistant professor at McGill University. In the responsible AI literature, causality, robustness, and fairness are often considered separate topics.

They are studied independently, while when we are training a responsible AI model, robustness and fairness are properties that we need to enforce on the learned model. So in this work, we are trying to see whether it's possible to actually look for a model that is adversarially robust and fair, while we are also considering the causal structure.

That's the motivation of this work. If you're thinking about adversarial robustness, we are perturbing individual data points and looking for points where the label changes, and we want to make sure that the model is robust to this kind of perturbation. And if you're thinking about individual fairness, you're saying that two individuals who are similar should receive a similar decision from the model.

So these two notions are very connected to each other. You can actually think of fairness in terms of robustness, where the perturbation is on the sensitive attribute. You want to make sure that two individuals who differ by a perturbation of the sensitive attribute are treated as similar to each other.

And the other way around, you can also use this perturbation to define a similarity between the two. So here we want to define a similarity metric between data points that accounts for the perturbation, but also for the perturbation of the sensitive attribute.

And it also knows that there is a causal structure, that the sensitive and non-sensitive features are related to each other, so you are not able to perturb one feature while another feature remains the same. Whereas in adversarial robustness, these kinds of perturbations are uniform, right?

If you are thinking of the shape of that perturbation, you are always dealing with a ball. So the main idea of this paper is that we are creating a metric with which we can actually create this kind of perturbation, with ideas coming from the causal literature, namely counterfactuals.

We are looking for the twin perturbation from the robustness literature and also, on the fairness side, for the sensitive attribute, and if you push it all together, we can actually have a model that is fair and robust.
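One way to write down the contrast described here (purely illustrative notation): standard adversarial robustness asks the prediction to stay stable for every x' in a uniform norm ball around x,

    \{\, x' : \|x' - x\| \le \epsilon \,\},

whereas the causally aware version replaces that ball with the set of counterfactual twins of x, obtained by intervening on the sensitive attribute and propagating the change through the causal structure to the dependent features, so that perturbations respect the relationships between sensitive and non-sensitive features.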

Alex: What would be one or two main lessons from this work for you?

Golnoosh Farnadi: So, at least it shows that it is possible, right?

You can actually have a model that is robust and fair and causally aware, right? These properties should not be defined independently, because they are very much connected. And if you're looking for a model to be deployed, it should account for all these dimensions, right? So this work, for me, is about looking at these kinds of intersections, right?

And getting one step further towards creating a model that has more than one of these properties.

Alex: What impact of this work would you like to see in the real world? 

Golnoosh Farnadi: The impact of this work? Just showing that it is possible, right? We can train a model where, as we show here, you can account for all these things, but the optimization is not changing.

You can have the same optimization, the same structure of the optimization that we were using. Also, the performance is not changing too much, right? So you are not paying much of a price to account for this. I think if you are deploying this kind of model, you have a better model in the real world, one that is not going to be more expensive in terms of computation, not more expensive in terms of performance, but much better in terms of responsible AI.

Markus Bläser: Hi, my name is Markus Bläser, I'm from Saarland University, and I'm presenting joint work with Aaryan Gupta from IIT Bombay. It's an interesting problem, right? That's my motivation. So, in structural causal models, we assume the graph is given, and we have these random variables, which are linear combinations of the other variables, with error terms that are normally distributed with zero mean.

Now, essentially, we observe the covariances of the random variables, and we ask ourselves: can we identify these parameters? The problem is solved in principle, for instance, by Gröbner basis approaches. These have doubly exponential running time. Gröbner basis approaches are complete, which means that they always identify the parameters whenever it is possible to identify them.

However, they have doubly exponential running time, which is prohibitive. There are also other algorithms known, like instrumental variable approaches or generalizations of these. These algorithms are more efficient; however, they fail to be complete, which means that they might not be able to identify parameters that in principle are identifiable.

What we do in our work is look at a restriction of structural causal models, namely those where the underlying graph of directed edges forms a tree. For those structural causal models, we're able to give an algorithm which runs in randomized polynomial time and which is, in addition, complete.

So for tree-shaped structural causal models, we resolve this problem completely.
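In symbols (standard linear-SCM notation, my rendering rather than a quote from the paper): with edge coefficients Λ and error covariance Ω, the model is

    X = \Lambda^{\top} X + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Omega),

so the observed covariance matrix is

    \Sigma = (I - \Lambda)^{-\top}\, \Omega\, (I - \Lambda)^{-1},

and identification asks whether the entries of Λ, the causal parameters, can be recovered uniquely from Σ. The result discussed here is that, when the directed part of the graph is a tree, this can be decided, and the parameters computed, by a complete algorithm in randomized polynomial time.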

Alex: What impact of this work would you like to see? 

Markus Bläser: Essentially, I have a theory perspective on this problem, and I would like to understand the complexity of this problem, which I think is a very important one. This is a nice first step towards that, because for a natural class of structural causal models we have a complete and efficient algorithm.

Devendra Singh Dhami: Hi, I'm Devendra Singh Dhami, and I'm an assistant professor at the Eindhoven University of Technology.

Alex: How has the conference been so far?

Devendra Singh Dhami: Pretty good. We started with the bridge program on the 20th and the 21st. We organized a bridge on continual causality, which, as the name suggests, bridges the fields of continual learning and causality.

It went pretty well, with very nice talks. Especially this time we invited speakers who are not very senior, and they made a very good effort to bridge the two fields together. So that was pretty interesting. Nice contributed talks as well. And after that, the main conference has also been good.

Some really nice invited talks. The papers have been good. I mean, the causal papers, I'd say it's like 50-50: some have been really good, some have been underwhelming. But overall, I think I'm satisfied.

Alex: What are the main insights or lessons from the bridge that you organized at the beginning of the conference?

Devendra Singh Dhami: That combining continual learning and causality is difficult.

I think what both continual learning and causality lack are these real-world applications, right? And what we realized from the bridge, from the invited talks, as well as from several discussions with the participants, was that in order to actually scale causality, or even scale continual learning, to real-world applications, you need a combination of both.

So I think that's very interesting. I hope that we will take it forward based on all the discussions that we had.

Alex: What are the main challenges that you believe the community should start addressing when it comes to causality and its intersection with continual learning?

Devendra Singh Dhami: Uh, there's several. Uh, the first one that comes to my mind is benchmarking. So we don't have specific, talking from causality point of view. We don't have these specific causal benchmarks that we can test our models on. You'll see, okay, your paper might have really nice theorems. Uh, you might, uh, even propose very nice methods, algorithms.

but then if you look at the empirical section, the experiments are on synthetic data, or at best they're on this Asia data set, right? I can understand why, but I think it's also time to move on.

Alex: What's next for you? What research programs are you planning to focus on in the next one, three, five years?

Devendra Singh Dhami: So maybe the next one year will be causality and large language models. For example, our work Causal Parrots says that large language models may talk causality, but they're not exactly causal, right? And you'll see, as you have also mentioned in your podcast, that there's been a recent spurt in papers that talk about LLMs and causality, which is nice, but you'll see that people stop at a point where they say, okay, there are these several open problems and we don't know how to solve them.

Which is nice. I think it's very important for the community to bring forward what open problems are there, but I think we have to go a step ahead and try to solve these problems. So for the first year or couple of years, I think causality and LLMs will be a major goal, and long term, I think scaling causality is the goal for me.

And specifically for that, I have something called probabilistic circuits in mind. Probabilistic circuits are these generative models, but inference is linear in the number of network parameters that you have. We already have some works where we have bridged causality with probabilistic circuits, and now it's time to actually apply it to a real-world problem.

Of course, we also want to come up with new ways of doing this combination, because there have been several new approaches in probabilistic circuits as well. So basically, marrying causality and probabilistic circuits so that both can benefit from each other is the long-term goal.
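As a toy illustration of why inference in a probabilistic circuit is linear in the circuit size, here is a minimal, hypothetical sketch (not tied to any particular library): evaluating the circuit bottom-up visits every node exactly once.

    def eval_circuit(node, assignment):
        """Evaluate a toy sum-product circuit bottom-up; each node is visited once,
        so the cost is linear in the number of nodes/parameters."""
        kind = node["type"]
        if kind == "leaf":
            # Bernoulli leaf over a single binary variable
            return node["p"] if assignment[node["var"]] else 1.0 - node["p"]
        child_vals = [eval_circuit(child, assignment) for child in node["children"]]
        if kind == "product":
            result = 1.0
            for value in child_vals:
                result *= value
            return result
        if kind == "sum":
            # weighted mixture of the children
            return sum(w * v for w, v in zip(node["weights"], child_vals))
        raise ValueError(f"unknown node type: {kind}")

    # Example: P(A, B) as a mixture of two independent product distributions.
    circuit = {
        "type": "sum",
        "weights": [0.3, 0.7],
        "children": [
            {"type": "product", "children": [
                {"type": "leaf", "var": "A", "p": 0.9},
                {"type": "leaf", "var": "B", "p": 0.2},
            ]},
            {"type": "product", "children": [
                {"type": "leaf", "var": "A", "p": 0.1},
                {"type": "leaf", "var": "B", "p": 0.8},
            ]},
        ],
    }

    print(eval_circuit(circuit, {"A": True, "B": False}))  # P(A=1, B=0)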

Alex: What is the most interesting causal paper that you read last month?

Devendra Singh Dhami: Ah, interesting. I think it was this paper from Microsoft, Amit Sharma's paper, where they talk about correlations and causality; I'm forgetting the title of the paper. They basically said that under some assumptions your large language models can actually learn causality, which is kind of different from what we propose in the Causal Parrots paper.

They have, I think, two major findings where they say, okay, sure, there might be correlations of causal facts, but if you make some assumptions, then you can actually have large language models learn causality. I don't remember the title of the paper off the top of my head, but that was pretty interesting to see.

And recently Amit Sharma, who is one of the co-organizers, has tweeted a few papers, again in the realm of causality and the open world, that are next on my list. Of course, I've only read the abstracts, but they seem pretty interesting. So yeah.

Alex: What's your message to the causal Python community?

Devendra Singh Dhami: You're doing amazing work, but now it's time to scale the models. And somehow I also feel that, again, we are lacking in terms of benchmarks. So maybe libraries are important, but keep in mind that in the end we want to scale these models to large models and data sets, right? Keeping that in mind while developing these libraries would be interesting.