Causal Bandits Podcast
Causal Bandits Podcast with Alex Molak is here to help you learn about causality, causal AI and causal machine learning through the genius of others.
The podcast focuses on causality from a number of different perspectives, finding common ground between academia and industry, philosophy, theory and practice, and between different schools of thought and traditions.
Your host, Alex Molak, is a machine learning engineer, best-selling author, and educator who decided to travel the world to record conversations with the most interesting minds in causality and share them with you.
Enjoy and stay causal!
Keywords: Causal AI, Causal Machine Learning, Causality, Causal Inference, Causal Discovery, Machine Learning, AI, Artificial Intelligence
Causal Bandits @ AAAI 2024 | Part 1 | CausalBanditsPodcast.com
Causal Bandits at AAAI 2024 || Part 1
In this special episode, we interview researchers who presented their work at AAAI 2024 in Vancouver, Canada, and participants of our workshop on causality and large language models (LLMs).
Time codes:
00:00 Intro
00:20 Osman Ali Mian (CISPA) - Adaptive causal discovery for time series
04:35 Emily McMilin (Independent/Meta) - LLMs, causality & selection bias
07:36 Scott Mueller (UCLA) - Causality for EV incentives
12:41 Andrew Lampinen (Google DeepMind) - Causality from passive data
15:16 Ali Edalati (Huawei) - About Causal Parrots workshop
15:26 Abdelrahman Zayed (MILA) - About Causal Parrots workshop
Causal AI || Causal Machine Learning || Causal Inference & Discovery
Web: https://causalbanditspodcast.com
Connect on LinkedIn: https://www.linkedin.com/in/aleksandermolak/
Join Causal Python Weekly: https://causalpython.io
The Causal Book: https://amzn.to/3QhsRz4
Alex: Hi, Causal Bandits, it's Alex. Welcome to Causal Bandits Extra. In this special episode, we'll hear from researchers presenting their work at the AAAI conference earlier this year in Vancouver, Canada, and participants of our workshop on causality and large language models. Enjoy.
Osman Ali Mian: Hi, my name is Osman and I am a PhD student at CISPA, the Helmholtz Center for Information Security in Germany.
Alex: Can you tell us about the motivation for your work that you are presenting here at the conference?
Osman Ali Mian: This work is one of several lines of ongoing work from my PhD. The main overarching goal of most of them was to have practical causal discovery algorithms that we can actually apply in real-world scenarios.
And this one is about the case where you can adapt your causal discovery algorithms to data as it comes along. So instead of having your data set fully specified from the beginning, you could make use of data that arrives over time and adapt your causal discovery algorithms to learn as the data comes in.
Alex: Could that also be applied in a scenario where the structure behind the data generating process is dynamical?
Osman Ali Mian: It could be. If you look at the underlying concept, it comes down to learning by compression. You can show that you can use this particular concept not just if you assume that the incoming data set has fixed causal relationships. If those causal relationships change over time, the compression-based strategy that at least I work on can be used to basically separate out those two different types of structures from each other.
Alex: Very powerful. What are the main insights or main learnings that you learned during this work?
Osman Ali Mian: Because the problem setting is slightly more general, it feels a lot harder. There are a number of challenges. For example, it's not always easy to weed out the differences, especially when you're just starting out and data is only starting to arrive.
And if, in the worst case, at each point you get some data that is different, that is from a different dynamical network, then at least practically it gets really, really hard to distinguish them. And if you can't do this well early on, your later results could be not as good as you would like them to be. So that's one practical consideration.
We don't have a solution for that yet, but we are, let's say, trying to figure out and work out how we can overcome that.
Alex: What impact could your work have in the real world, and what impact would you like to see?
Osman Ali Mian: Well, the obvious one is if this could be applied to some practical problem.
So let's say, for example, in the healthcare domain, for the diagnosis of different diseases, you might not have enough data to begin with. But at the same time, as you get to know more and more, you would like your algorithms to not always start from scratch, but basically use what you already have and then adapt your knowledge.
And that, I think, would be a powerful tool, especially if applied to domains like diagnosis. And I hope that it is one day.
Alex: Before we conclude, I would like to ask you one technical question. We know that some causal discovery algorithms, like PC, do not scale very well: the number of computations scales super-exponentially with the number of nodes.
What are the computational requirements for your algorithm? The one that you presented in your work?
Osman Ali Mian: Well, I work with score-based causal discovery. So in the worst case, there is no escaping the exponential. You might still have to pay that much of a cost in terms of scaling. In practice, however, what saves some of the score-based algorithms is the greedy nature of the search.
So in this particular case, instead of exhaustively searching over everything, you could try building networks, or try discovering networks, in a greedy fashion. And assuming that the score that you use to evaluate your networks fits the assumptions that you have, then even the greedy approach works quite well in practice.
It can even scale: in my case, for one of my earlier algorithms, this greedy approach was able to scale to, let's say, 500 variables. Certain assumptions aside, with 500 variables over a sparse network it's still able to work, but that only comes down to how well your score captures the assumptions that you have about your data.
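As a rough illustration of the greedy, score-based search Osman describes, here is a minimal Python sketch. It is not his algorithm: it swaps in a BIC-style score with linear least-squares fits in place of a compression-based score over nonlinear functions, and the names `bic_node`, `creates_cycle` and `greedy_search` are made up for this example. It only shows the general idea of adding one edge at a time, always taking the edge that improves the score most, instead of searching over all graphs.

```python
import itertools
import numpy as np

def bic_node(X, child, parents):
    """BIC-style score of one node given its parents (assumed linear fit; lower is better)."""
    n = X.shape[0]
    y = X[:, child]
    A = np.column_stack([np.ones(n)] + [X[:, p] for p in parents])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

def creates_cycle(parents, src, dst):
    """True if adding src -> dst would close a cycle, i.e. dst is already an ancestor of src."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_search(X):
    """Forward greedy search: repeatedly add the single edge with the largest score gain."""
    d = X.shape[1]
    parents = {j: [] for j in range(d)}
    scores = {j: bic_node(X, j, []) for j in range(d)}
    while True:
        best = None
        for src, dst in itertools.permutations(range(d), 2):
            if src in parents[dst] or creates_cycle(parents, src, dst):
                continue
            gain = scores[dst] - bic_node(X, dst, parents[dst] + [src])
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, src, dst)
        if best is None:  # no edge improves the score any more
            return parents
        _, src, dst = best
        parents[dst].append(src)
        scores[dst] = bic_node(X, dst, parents[dst])

# Toy data from a chain x0 -> x1 -> x2 (hypothetical example, not from the paper).
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = 2 * x0 + rng.normal(size=500)
x2 = 2 * x1 + rng.normal(size=500)
learned = greedy_search(np.column_stack([x0, x1, x2]))
```

On this toy chain the search connects the adjacent pairs, but with a linear-Gaussian score like this one the edge directions are generally not identifiable, which is one reason the fit between score and assumptions matters so much.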
Alex: And what assumptions, uh, do you rely on in your work?
Osman Ali Mian: As far as the functional assumptions on the causal relationships are concerned, the main one that we have is basically nonlinear functions with additive Gaussian noise. It's still an assumption nonetheless, but at least in my subjective opinion, this is not as restrictive as assuming, let's say, linearity.
Emily McMilin: Hi, my name is Emily McMilin. I am an independent researcher. I'm here presenting research at the intersection of LLMs and causality. I also am a research scientist at a major company.
Alex: What's the motivation behind your work?
Emily McMilin: Yeah, so starting with the hypothesis that, even though LLMs are trained on ever more data, seemingly the whole world's data, the two are in fact not the same.
It is a data set in the end that they're trained on, so it's some subsampled representation of the real-world target domain in which we use them. And so if there's sampling, then there is potential for sample selection bias. And so I explore different areas where there may be selection pressures of interest.
Alex: What are the main insights from your work?
Emily McMilin: Definitely consider the data-generating process. Definitely don't assume that your data set is IID, no matter how large it is. Also, there are a lot of parallels between large language models and the conditional probabilities that you can hypothesize with a causal DAG.
So there are a lot of nice parallels. You can propose a causal DAG and see if you can measure the conditional probability that it entails in an LLM. And I would also just encourage actually looking at the log probs, at the probabilities that are coming from the LLM.
It's hard to assign meaning to them, but nonetheless you can introspect one level deeper than the token, at the probabilities.
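To make "looking at the log probs" concrete, here is a minimal Python sketch of turning a model's raw next-token scores (logits) into log probabilities. The three-word vocabulary and the logit values are invented for illustration; a real LLM would supply one logit per vocabulary token through whatever interface it exposes.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax distribution over next tokens."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

# Hypothetical next-token candidates and made-up logits, for illustration only.
vocab = ["she", "he", "they"]
logits = np.array([2.0, 1.0, 0.5])

log_probs = log_softmax(logits)
ranked = sorted(zip(vocab, log_probs), key=lambda pair: pair[1], reverse=True)
```

Comparing such log probabilities across prompts that differ in one conditioning variable is one way to estimate the conditional probabilities that a proposed causal DAG entails.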
Alex: What have you personally learned during this project?
Emily McMilin: I've learned a lot. I'm an independent researcher and my PhD was in a very different field, so coming into AI was definitely a different world.
I learned a lot about peer review, even just the scales. There are many scores that you get: you get the confidence, you get the impact, you get the actual score of your work. So there's just a lot to navigate, a lot to learn. There's definitely a great, fantastic community out there, machine learning collectives and different organizations that are interested in independent research and can support you along the way.
Alex: What impact can your work have on the real world?
Emily McMilin: Particularly with this missing data problem, the problem of sample selection bias, I do think that there are maybe very important things that you can hypothesize are being left out of the data sets, and they might be the most mundane, the most boring, the least interesting.
But maybe that's actually just the common-sense reasoning that we have and that language models are lacking, because no one's taking the time to write down what has been obvious to everyone since, you know, they were a month old.
Alex: What is the most interesting causal paper that you read last month?
Emily McMilin: Okay, so last month. I'm definitely interested in the causal parrots work.
I find it very satisfying. And this one is now maybe even a couple of years old, but it might be interesting to people who are interested in LLMs and causality. It was a 2020 paper in NeurIPS, and they looked at mediation analysis. They actually introspected into the LLM, they froze some weights, and they had a nice parallel between a causal DAG and the transformer model.
And they validated some interesting hypotheses. And I think in general, just an interesting method that people could use.
Scott Mueller: My name is Scott Mueller. I'm affiliated with UCLA, and Judea Pearl is my PhD advisor.
Alex: What was the motivation behind the work you presented here at the conference?
Scott Mueller: I presented a paper along with three other coauthors.
Uh, I was not the first author. She had these issues. So I filled in here, along with TRI, Toyota Research Institute. The idea of the paper was about the procurement of electric vehicles and how to incentivize that effectively. So the problem is that governments and policy makers try to give incentives for people to purchase electric vehicles, for greenhouse gas emissions and environmental sustainability.
And that's great, because people then buy electric vehicles, and electric vehicles, in terms of environmental sustainability, are better than ICE vehicles, internal combustion engine vehicles. The problem, though, is they're not always better, because the manufacturing greenhouse gas emission costs are far higher for electric vehicles.
I feel like I'm giving the whole presentation here, but just to frame the problem and what we're trying to do. The problem is if you don't drive the electric vehicle that the government incentivized you to purchase, or you don't drive it much. Many people buy them as complementary vehicles, so they're not driving them that much.
Most of their miles are on their ICE vehicles, then they can actually have a negative impact on the environment as opposed to if they had not ever bought an electric vehicle to begin with. It's important that governments and policymakers incentivize the right people. And so this is a problem of counterfactual nature with unit selection and with probabilities of necessity and probabilities of necessity and sufficiency.
So, for example, it may be very important that governments incentivize households that have characteristics such that those households respond a certain way to these incentives. Some households might respond negatively to incentives. Maybe the incentive is coming from a government of the opposite political party to the one the household is a devotee of.
And so then they may never buy an electric vehicle, because they were strongly incentivized to with a message, or from a person or group, they don't like. And then we've cut them off from electric vehicles and lower greenhouse gas emissions for a very long time. That's a really bad thing that policymakers want to avoid. The opposite is that somebody, upon receiving this incentive, purchases the vehicle and drives it a lot.
And if they didn't receive the incentive, they would not purchase a vehicle, or they would purchase it and not drive it much. That's the notion of a probability of benefit, or a probability of necessity and sufficiency. So you might want to weight those response types accordingly. You might want to weight the harmful response type, where upon receiving the incentive they now do not buy the electric vehicle, not just now but for the foreseeable future as well.
And if you do not give them an incentive, they would have actually bought it on their own. Maybe that has a weight of negative 10, and we really want to avoid that. Whereas the opposite, where the incentive really works and really benefits the individual and the environment, has a weight of two: still positive, and we still really want that to happen, but we really don't want the opposite to happen.
And so we incorporate all that into our formula. That's what our presentation was about.
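The weighting Scott describes can be sketched as a weighted sum over response types. The Python below is only an illustration under strong assumptions: it treats the response-type probabilities of each group as known (in practice counterfactual quantities like these usually have to be bounded from data), it assumes a weight of zero for the two response types he does not mention, and the two groups and their numbers are invented. Only the weights of 2 and negative 10 come from the conversation.

```python
from dataclasses import dataclass

@dataclass
class ResponseTypes:
    """Hypothetical response-type probabilities for one group of households."""
    complier: float      # buys and drives the EV only if incentivized
    always_taker: float  # buys the EV with or without the incentive
    never_taker: float   # does not buy the EV either way
    defier: float        # would have bought, but does not once incentivized

def expected_benefit(p, w_complier=2.0, w_always=0.0, w_never=0.0, w_defier=-10.0):
    """Weighted sum over response types; the zero weights are assumptions of this sketch."""
    total = p.complier + p.always_taker + p.never_taker + p.defier
    if abs(total - 1.0) > 1e-9:
        raise ValueError("response-type probabilities must sum to 1")
    return (w_complier * p.complier + w_always * p.always_taker
            + w_never * p.never_taker + w_defier * p.defier)

# Two invented groups: A has more compliers but also some defiers.
group_a = ResponseTypes(complier=0.30, always_taker=0.10, never_taker=0.55, defier=0.05)
group_b = ResponseTypes(complier=0.20, always_taker=0.20, never_taker=0.60, defier=0.00)

benefit_a = expected_benefit(group_a)
benefit_b = expected_benefit(group_b)
```

With these made-up numbers, group B scores higher than group A even though it has fewer compliers, because the heavily penalized harmful response type is absent. Setting `w_defier` to zero reverses the ranking, which is the kind of sensitivity to the weights that the answer below points to.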
Alex: What are the main insights from this work?
Scott Mueller: Depending on your preferences for those weights, you can come up with really different conclusions about which groups to incentivize, as opposed to if you don't take that into account at all.
And about what kinds of incentives you want to give.
Alex: What have you personally learned during your work on this project?
Scott Mueller: I knew the causality involved. I don't think I learned anything extra there. It was more an application of the knowledge that I and others had. What we learned was the electric vehicle and ICE vehicle industry and how that works.
And, uh, to some degree how governments and policy makers make their decisions.
Alex: What impact may this work have on the real world?
Scott Mueller: Well, the hope is that governments and policy makers make decisions and incentivize in the right way, and for the right people, in a way that benefits the environment.
Alex: What's the most interesting causal paper you read last month?
Scott Mueller: There's a great one by a colleague of mine at UCLA, Yizuo Chen, along with his PhD advisor, Adnan Darwiche, on causal Bayesian networks, along with structural causal models, where some of the variables are deterministic, meaning that, in a causal Bayesian network, you get ones and zeros if certain variables are deterministic.
You can get point estimates, you can identify probabilities of causation or counterfactual probabilities, even if you don't know what the formulas or the conditional probability tables of those deterministic variables are.
Andrew Lampinen: I'm Andrew Lampinen, I'm a researcher at Google DeepMind, and today I was talking about some work we've done trying to get a scientific understanding of what language models could potentially learn about causality from the passive training that they get.
Alex: What was the motivation behind this work?
Andrew Lampinen: Well, it was sort of to make sense of a puzzle in the literature, which is that on the one hand, we know that systems can't learn about causal structures from observational data. On the other hand, people were showing all these things where language models were able to do some sort of interesting, causal-seeming tasks.
And the sort of resolution that we come to in this paper is that there's a difference between observational data, which you really can't learn about causality from, and passive data that might contain interventions. And we argue that actually internet data is of the latter form, and we show that systems that are trained on this kind of data can learn about generalizable causal strategies and other kinds of causal reasoning.
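A toy simulation can illustrate why the distinction matters. The Python sketch below is not from the paper: it assumes a made-up system in which a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. In purely observational records `x` and `y` look strongly related, whereas in records where someone set `x` at random (an intervention that a passive learner merely reads about) the relationship disappears.

```python
import numpy as np

def simulate(n, rng, intervene=False):
    """Toy system (assumed for illustration): z causes x and y; x does not cause y."""
    z = rng.random(n) < 0.5
    if intervene:
        x = rng.random(n) < 0.5                    # x set by a coin flip, ignoring z
    else:
        x = rng.random(n) < np.where(z, 0.9, 0.1)  # x follows the hidden cause z
    y = rng.random(n) < np.where(z, 0.8, 0.2)
    return x, y

def risk_difference(x, y):
    """Difference in the rate of y between records with x and records without x."""
    return y[x].mean() - y[~x].mean()

rng = np.random.default_rng(0)
observational_gap = risk_difference(*simulate(100_000, rng))
interventional_gap = risk_difference(*simulate(100_000, rng, intervene=True))
```

In this setup the observational gap is large even though `x` does nothing to `y`, while the interventional gap is close to zero. A learner that only ever sees the first kind of record cannot tell the two explanations apart; one that passively reads records of the second kind can.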
Alex: What is the main insight from this work?
Andrew Lampinen: The key insight is, well, I'd say there are two insights. One is just this distinction between the observational-interventional dimension and the passive-active dimension, and understanding how language models fit into this space because of the nature of the internet text that they're trained on.
But the other insight is actually to make sense of what systems can learn from this kind of data. And it turns out that it's possible to learn strategies that will allow a system to experiment and generalize in new situations just from passive training, which has interesting implications, not just for causality and language models, also for things like philosophy of agency.
Alex: What do you think could be the impact of this work on the real world?
Andrew Lampinen: Probably very little, other than understanding what language models can and can't do. But I think that more generally it's hopefully going to be useful as a fundamental science contribution to understand the nature of learning causality, what can be learned from certain kinds of data.
And I hope that that will help us to make better, more robust learning systems in the future and to understand how we can shape systems to generalize better, which is another aspect of what we studied in the work.
Alex: What's the most interesting causal paper you read last month?
Andrew Lampinen: I think that probably some of the most interesting causal papers I've read recently have actually been ones studying humans' world models and the extent to which we do or don't have them.
And there's some particularly interesting work looking at the kinds of situations that people's world models fail to capture fully faithfully and seeing how humans, when they try to mentally simulate something, will sometimes be unable to discriminate between physically possible and physically impossible situations in certain kinds of ways, suggesting that our world model is making some imperfect abstractions about the world.
Alex: Any particular title or authors you would like to mention?
Andrew Lampinen: For the paper that I was mentioning there, one of the main authors is Todd Gureckis, who's a cognitive scientist at NYU. I think that would be the place to go to look for it. I forget what the title is.
Ali Edalati: Hello. My name is Ali. I'm a researcher at Huawei, and I attended this workshop today. It was one of the most amazing and interesting workshops and talks that I have attended at this conference so far.
Abdelrahman Zayed: My name is Abdelrahman Zayed, or just Abdel. I'm a PhD candidate at Mila, which is the institute of artificial intelligence in Montreal. I'm here in Vancouver, and I really loved the workshop on causal inference. It was one of the best workshops I've ever attended in my life. Honestly, I loved the discussion on whether or not large language models can learn causal relations.
And I liked both points of view: each group was trying to put forward claims as to why they are not going to, or are going to, learn causal relations. So I see both points of view, and that's actually the best part of the workshop: when you have both points of view in your mind and you keep thinking about which one is the correct one.
But I also really liked the person at the end who said maybe it's too early to decide now, maybe we just need to wait a bit. That was one of the things that were said in the last five minutes of that discussion, which I really loved, because the person was saying, okay, language models of this really huge size have been around only for the last, let's say, five years.
So maybe it's too early to say whether they can learn causal relationships or not. So that's something I liked. But apart from this person, everyone else was saying whether they think they can learn causal relationships or not, and providing claims.
And I really love this kind of discussion. I also, of course, loved the talk by the godfather of causal inference, Judea Pearl. It was really a pleasure. I mean, just listening to him talking was really a pleasure. He talked about everything. I read his book, of course, The Book of Why. He went over it a bit, but he also added more things, because there is a theme for this workshop.
So the theme is whether or not large language models can learn causal relationships. So he provided his own insights and thoughts about this question. And I really appreciated all of the questions that were asked of him and all of the speakers. I love the workshop, really.
Alex: What would be the main insight from the workshop for you?
Abdelrahman Zayed: The main insight is something Judea Pearl said that I really loved. He said that if the only thing you have access to is just observations, then you would not be able to come up with a causal relationship, because you only see observations. But this theorem, or this statement, could be proven wrong in the case of LLMs, because LLMs have access to a huge amount of data, and this data already has some interventions.
So this kind of theorem, or this rule, doesn't really apply to them, because they have access to more data than human beings could actually access when they read a book or two books. So this statement might actually be proven wrong if that could actually happen, and they might actually have some causal understanding.
So I like the fact that a theorem could work in maybe all cases except maybe this specific case. I like the fact that we're now looking at LLMs as they actually are, because they're not us. They're not human beings. I don't train myself on the whole internet, so I shouldn't compare myself to a large language model.
So I like how he tailored the theorem so that he sees whether or not it will apply to a large language model. And this is basically how we started talking about his answer to this question about whether or not large language models can have causal understanding.
Alex: Great. What are you currently working on?
Abdelrahman Zayed: I work on fairness in language models as well, so anything related to bias: why language models can learn biases, why language models can even make the biases that we have in our society worse, how to mitigate these biases, how to measure these biases, anything related to biases.
Gender bias, religion bias, any bias. Yeah, this is my PhD.
Alex: If you could recommend to people one paper that you authored or co-authored, what paper would that be?
Abdelrahman Zayed: So we have a recent paper on pruning large language models. We basically say that there exist some parts in every language model that are responsible for bias.
If you find them and you remove them, the model will be less biased. And we prove that this is the case.