Causal Bandits Podcast

Causal Inference for Drug Repurposing & CausalLib | Ehud Karavani Ep 18 | CausalBanditsPodcast.com

Alex Molak Season 1 Episode 18

Send us a text

Was the Deep Learning Revolution Bad for Causal Inference?

Did the deep learning revolution slow down progress in causal research?

Can causality help in finding drug repurposing candidates?

What are the main challenges in using causal inference at scale?

Ehud Karavani, the author of the Causallib Python library and a researcher at IBM Research, shares his experiences and thoughts on these challenging questions.

Ehud believes in the power of good code, but for him code is not only about software development.

He sees coding as an inseparable part of modern-day research.

A powerful conversation for anyone interested in applied causal modeling.

In this episode we discuss:

  • Can causality help in finding drug repurposing candidates?
  • Challenges in data processing for causal inference at scale
  • Motivation behind the Python causal inference library Causallib
  • Working at IBM Research

Ready to dive in?

About The Guest
Ehud Karavani, MSc, is a Research Staff Member at IBM Research in the Causal Machine Learning for Healthcare & Life Sciences Group. He focuses on high-throughput causal inference for finding new indications for existing drugs using electronic health records and insurance claims data. He's the original author of Causallib, one of the first Python libraries specializing in causal inference.

Connect with Ehud:


About The Host
Aleksander (Alex) Molak is an independent machine learning researcher, educator, entrepreneur, and best-selling author in the area of causality.

Connect with Alex: Alex on the Internet

Links
Links for this episode can be found here

Video version of this episode can be found here

Support the show

Causal Bandits Podcast
Causal AI || Causal Machine Learning || Causal Inference & Discovery
Web: https://causalbanditspodcast.com

Connect on LinkedIn: https://www.linkedin.com/in/aleksandermolak/
Join Causal Python Weekly: https://causalpython.io
The Causal Book: https://amzn.to/3QhsRz4


Ehud Karavani: I can even go further; I will say that, in some ways, the deep learning revolution is the worst thing that happened to causal inference. In some sense, the code is God, because it doesn't matter what you describe in your methods section, or how well you describe it. The code is very definitive about what it does and how it does it.

They don't respect the data-generating process. It's ineffective. It's a waste of money to gather all that data but not to analyze it properly. And it's also, I think, disrespectful to the patients. So I think the hardest part for machine learning researchers going into causal inference is...

Marcus: Causal Bandits! Welcome to the Causal Bandits Podcast, the best podcast on causality and machine learning on the internet.

Jessie: Today we're traveling to Tel Aviv to meet our guest. Before starting his adventure with computation and causal inference, he used to be a bass player. He believes in the power of good code and loves to play music with his children. Creator of Causallib and Research Staff Member at IBM Research. Ladies and gentlemen, please welcome Mr. Ehud Karavani. Let me pass it to your host, Alex Molak.

Alex: Ladies and gentlemen, please welcome Mr. Ehud Karavani.

Ehud Karavani: Thank you. Thank you for having me. I'm grateful for that. 

Alex: Welcome to the podcast, Ehud. 

Ehud Karavani: Thank you for having me again. Yeah. 

Alex: Before you started your journey with causal inference, you used to be a bass player. What are one or two experiences you had, or skills you learned, during your music career that translated into something unique about you as a researcher in causality?

Ehud Karavani: That's an excellent question. So I think Ed Yong, the science journalist, once wrote a piece for The Atlantic about the relevance of the Nobel Prize, and he wrote that science is the teamiest of team sports. And I think there's something about being a musician that prepares you for that, especially if you're a bass player. That takes a certain character that, I guess, gives you an appreciation for good foundations: building a flow that will carry others, moving them forward, being a base they can build on. I mean, it all exists in music, being a part of a greater thing, letting others build on top of you and providing those foundations. I think the leap from that to science, to collaborating well in science, is quite small, because you know that to make good science you can no longer be a lone thinker.

Alex: What's good science?

Ehud Karavani: I think good science is trustworthy science. It's science that we know can generalize, and where we can look at its internals and tell whether it's made up of nothing or has some substance to it. Good science is science that is communicated properly and well to others. A scientific discovery is worth nothing if others don't know about it; a person could discover tomorrow what happened before the Big Bang, but if they don't tell anyone else, it's worthless. So communicating is really important for science. That's the presentation, but we don't want to put lipstick on a pig, right? The essence itself should also be trustworthy. And so, yeah, good science is open and transparent science.

Alex: When we think about causal models, especially when we talk about them with people who are new to causal modeling, who maybe come from more traditional machine learning or data science backgrounds, those people will often ask: well, but how can we evaluate those models? In Causallib, which you created, there is a module dedicated to model validation, or model evaluation.

What was the idea behind creating this module? What was the motivation behind it? 

Ehud Karavani: The biggest thing that differentiates machine learning prediction from causal inference, or at least one of them, is the fundamental problem of causal inference, right? The fact that you never have ground-truth labels. So you can never truly evaluate whether your model works, I guess.

Alex: In a direct way?

Ehud Karavani: In a direct way, yes. It's not like you can generate an area under the curve and say it's 0.8 or 0.7, so it's trustworthy or not. In order to convince people that your model works, you need to market it. It's not just a matter of showing some numbers. And if people are aware of machine learning evaluations, they will expect exactly that, so you need to counter it in advance. The evaluation module in Causallib really arose from that need: from us needing to convince stakeholders that the modeling process we did was okay. So first of all, it's graphical; there are many plots, with many flavors and colors. You need to make the presentation compelling in order to market it properly. Only later did we convert these graphical, visual insights into numeric values, to enable users to do automatic model selection based on those ideas.

Alex: What are some modules of this library? Give us a bird's-eye view.

Ehud Karavani: Yeah. So Causallib, generally, its main components are maybe an estimation module, which has a lot of the more common models that we work with, like IPW, or an S-learner, which was named Standardization at the time, and a T-learner, which is StratifiedStandardization. Because it goes way back, before the Künzel paper from 2019 with those naming conventions.

So I had to make those names up. And so it has this estimation module, and it also has a survival estimation module, for time-to-event analysis. And it has the evaluation module, which again is comprised of graphical evaluations. It basically tries to replicate the scikit-learn structure, so it has a metrics module with scores that are compatible in some way, which you can plug into other components, like a grid search or a halving search or other hyperparameter search objects.

It also has a feature selection module; again, it tries to mirror scikit-learn, so it has a feature selection module with confounder-selection-specific methods. And it also has a contrib module which, as in scikit-learn, has slightly more state-of-the-art models, or models that have been contributed by people outside the core team, so they might not be as well tested, but might be interesting for users to play with.
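A short sketch of what that scikit-learn-style interface looks like, based on Causallib's documented IPW example; exact signatures may differ between library versions, so treat this as approximate and check the current docs.

# Sketch based on Causallib's documented scikit-learn-style interface;
# exact signatures may differ between library versions.
from sklearn.linear_model import LogisticRegression
from causallib.estimation import IPW
from causallib.datasets import load_nhefs  # bundled example dataset

data = load_nhefs()  # data.X covariates, data.a treatment, data.y outcome

# Weight model: propensity scores estimated with any scikit-learn classifier.
ipw = IPW(LogisticRegression(max_iter=1000))
ipw.fit(data.X, data.a)

# Counterfactual population outcomes under "everyone treated" vs. "no one treated".
outcomes = ipw.estimate_population_outcome(data.X, data.a, data.y)
effect = ipw.estimate_effect(outcomes[1], outcomes[0])
print(effect)

# Swapping the method is mostly a one-line change, e.g. an outcome-model
# ("S-learner"-style) estimator instead of a weight model:
# from sklearn.linear_model import LinearRegression
# from causallib.estimation import Standardization
# std = Standardization(LinearRegression()).fit(data.X, data.a, data.y)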

Alex: You created at least the first version of Causallib essentially alone, which is, well, I think a great challenge and a great achievement. I really admire this.

Ehud Karavani: Thank you.

Alex: How was your journey with this library, and how did it start?

Ehud Karavani: It started

when I joined IBM Research. I joined the Machine Learning for Healthcare and Life Sciences group. When I joined, people already knew they needed causal inference methods in order to answer the questions that interested them.

So, you know, when I first opened my mailbox, I don't know if people know this, but when you start a position in a corporation and open the mailbox, there's lots of onboarding spam and also just regular spam. But there was one non-automated email that I received, from my mentor, Yishai Shimoni, with a single hyperlink in it, to the Hernán and Robins book. Today it's called "What If"; back then it didn't really have a name, it was just "the causal inference book." And when I joined, since the team already had some previous projects doing causal inference, there were some fragments of code. Each project had maybe what it needed: one project needed an IPW, another one needed an S-learner.

So it was scattered around, and when a new project came along, people would maybe copy-paste those files, maybe tweak them a little bit, but there was nothing coherent about it. And then I came along as a student, in a student position. And I don't know, maybe partly because of how I relate to coding, maybe because I had some basic, immature sense of how important it is to build tools for others to use, maybe it was just a vacuum I came in and filled, but I started organizing those fragments into one consistent, coherent Python package with a reasonable API. I broke down the code, because it was all tangled up together, into a nice scikit-learn-like API, and I combined fragments from all over the place to create a library with causal estimation methods that could be passed around between projects, or not even passed around, just installed like any other tool that we use.

And that was the internal beginning of Causallib, probably around early 2017. And then, after some internal tests and uses in projects that we had, I realized that this is something that could be useful to other people as well, to many more people. And those were the early days, again, 2017, so DoWhy looked very different from today's DoWhy, CausalML was non-existent, I think, and EconML was maybe around, but the papers for the most advanced methods EconML currently has weren't out yet back in 2017. So it probably looked quite different than it does now.

And so I had this idea that we would open-source it. It was a new idea, and it required some uphill battles, you know, fighting corporate bureaucracy and corporate politics. And I was a student; I was not acquainted with corporate politics and had nothing to do with it. Thankfully, my mentor, Yishai, had my back, and we went along with it, and we were able to publish it and make it available for everyone to use. And that's how it started.

Alex: How was your experience starting in the field so early, in 2017? As you mentioned, many of the papers for the methods that today are considered classics, like double machine learning and some other methods, were not even out there yet. And the idea of causal machine learning was still incredibly niche. Maybe a year or two later it started building more momentum, with The Book of Why as well, but that was the very, very beginning of modern causal inference and machine learning. So how was your experience? How did you feel back then?

Ehud Karavani: It was tough in the sense that, as you say, there were relatively few materials available. As I said, I got the Hernán and Robins book, but that's a hard book to start with. I don't know how other people learn, but that book is so full of, it's an excellent book, it's full of details. I regularly go back to it to relearn things and get a better understanding, but I don't think it's a good first book, in some sense. In my own learning I prefer starting from a bird's-eye view and then drilling down, not getting into the math and the details very early. It's easier to conceptualize abstractions when you look at the bigger picture before drilling down to the details. So resources were limited in that sense. You needed to work harder in order to learn, to get the intuition, to fully grasp not what the math looks like, but what it actually does: what IPW really does, what standardization really does, how it balances, how it allows us to obtain a causal effect.

And also, in terms of resources, they were not digitized, and so there was no code. I like code, because code is very direct about what it does; there are no second guesses, it does what it does. And there were very few, if any, available code resources that would let you see what's actually happening within those models, how things are actually being done, how they're actually being implemented. But I think nowadays it's much better. We have more software, we have more books written for varying audiences: probably more for computer science and machine learning people, more for econometricians, for epidemiologists. So we are definitely in better shape now. It was slightly harder at the start, I guess, but people had to struggle as well in order to make the resources that exist today.

Alex: What do you think we, as a community, should do today to make causal inference and causal machine learning, causal AI, even more accessible to a broader public than it is today?

Ehud Karavani: That's a good question. I mean, we just established that we already have the learning resources piling up, and we have excellent software by now. We have excellent software that, given a data matrix of covariates and treatment and outcome, will spit out causal effects, and it will do it nicely.

However, as every practitioner probably knows, getting data to look like a matrix of covariates and treatment and outcome is a complex process. It takes as much knowledge of causality to organize data so it's ready for a causal inference analysis as it does to develop and apply causal inference methods. You need to know what covariates to choose to adjust for. You need to take them at a specific point in time: they need to be from before the treatment happens, they can't be from after the treatment happens. For the outcome, you need to establish the follow-up time, you need to establish a time zero. All those things are quite complex to implement using, I don't know, database queries and data processing. And I think a solution that makes that easier would provide a great leap toward making causal inference easier: a solution that allows you to take a database, event-like databases, because no one curates data for research, unless you're a research lab that generates your own data. I worked with insurance claims data and electronic health records; marketing people work with events, website events and interactions; finance people work with banking databases. All those databases were not created with research in mind. They fulfill a relatively different need.

Alex: A different purpose in mind. 

Ehud Karavani: Yeah, a different purpose in mind, in the sense that they might be used for audits or for a different task entirely, and we repurpose them for research. And practitioners who do that, who use observational data, probably know how difficult it is to take the data that's out there and make it fit a causal analysis. So I think a solution that makes this process slightly more automated would be more beneficial than, say, another causal inference methodology package. Yes.

Alex: So what you're saying is that you feel we're actually missing tools for this preparation phase, where we can take data that was recorded with some other idea in mind, some other goal in mind, and repurpose or reshape this data so that it can be consumed by the causal inference machinery we already have.

Ehud Karavani: Yeah, definitely. That's the point. Yes.
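To make the missing preparation step concrete, here is a hypothetical sketch of turning an event-style table into the covariate/treatment/outcome matrix an estimator expects: pick a time zero per patient, take covariates only from before it, and measure the outcome only in the follow-up window after it. Every table and column name here is invented for illustration.

# Hypothetical sketch: from an event log to (X, a, y) with an explicit time zero.
# All column names (patient_id, event_time, code, value) are invented;
# event_time is assumed to be a datetime column.
import pandas as pd

def build_causal_matrix(events: pd.DataFrame, treatment_code: str,
                        outcome_code: str, followup_days: int = 365):
    """events: one row per (patient_id, event_time, code, value)."""
    events = events.sort_values("event_time")

    # Time zero: first prescription of the study drug; untreated patients need
    # their own index date (here, crudely, their first recorded event).
    first_treatment = (events[events["code"] == treatment_code]
                       .groupby("patient_id")["event_time"].min())
    first_event = events.groupby("patient_id")["event_time"].min()
    time_zero = first_treatment.reindex(first_event.index).fillna(first_event)

    a = first_treatment.reindex(first_event.index).notna().astype(int)  # treatment indicator

    merged = events.merge(time_zero.rename("time_zero").reset_index(), on="patient_id")

    # Covariates: only events strictly before time zero (no post-treatment adjustment).
    baseline = merged[merged["event_time"] < merged["time_zero"]]
    X = (baseline.pivot_table(index="patient_id", columns="code",
                              values="value", aggfunc="last")
         .reindex(first_event.index).fillna(0))

    # Outcome: did the outcome code occur within the follow-up window after time zero?
    window = merged[(merged["event_time"] >= merged["time_zero"]) &
                    (merged["event_time"] <= merged["time_zero"]
                     + pd.Timedelta(days=followup_days))]
    y = (window[window["code"] == outcome_code]
         .groupby("patient_id").size().gt(0)
         .reindex(first_event.index, fill_value=False).astype(int))

    return X, a, y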

Alex: You mentioned electronic health records, and health projects are something you worked on, practical applied projects around health at IBM Research. What are some of the main learnings, the main insights, you got from this stream of your work? Let me give you some context. Many people are fairly interested in understanding what is needed to run an actual, real-world causal inference or causal machine learning project. What are the intricacies? What should you pay attention to in the preparation phase of a project like this? Many people who are just starting with causality don't have much practical experience, and they're very curious about this: how do you look at the data, how do you communicate with the stakeholders, and so on. So if there are any insights you think could be valuable for the community, I'm pretty confident people will be very grateful for them.

Ehud Karavani: Yeah. So I think, first and foremost, what you need to understand is the domain. And for that you don't even need data. The work you need to do comes before you even take a look at the data; it's when you try to figure out what data you even need. And so it revolves around interviewing the domain experts, in our case probably doctors and physicians, to set up the problem, to know what questions they want to answer, what the problems are, what the complexities are in answering those questions. All of those can guide you to know what data you want to collect, what estimand you want to estimate.

And extracting that knowledge from domain experts is hard. You need to take their knowledge and somehow fit it into, I don't know, a directed acyclic graph, in some sense, right? You need to distill their knowledge into something sensible for a causal analysis. For example, I think after lots of trial and error we realized that physicians, especially, know the process of the treatment decision much better than they know the process of what determines an outcome, because they prescribe the drugs. So they know which patients would receive which drugs, relatively speaking, and they know where the fuzziness exists, which we need to take advantage of in modeling, because if there are strict guidelines that everyone adheres to, you have no room to operate within and make causal claims; that would be an overlap issue, right? You have no variance to...

Alex: To start with.

Ehud Karavani: To start with, yes, thank you. So we realized that modeling the decision to treat is much simpler than modeling whatever determines the outcome. And so it's a slow process of interviewing and figuring out the complexities, because they will start to tell you, oh, but for those kinds of patients we do something different. So, you know, that's a covariate you need to adjust for.

Alex: What is the best part of working with physicians and people who are so deeply ingrained in the domain?

Ehud Karavani: Their passion. They come to us because they have problems that they really seek to solve, and that's a strong driver for doing good work.

Alex: You worked on many interesting projects in your career. Some of them involved topics like child delivery; you also worked with people who are interested in causal effects related to surgical interventions, in the medical sense. What were the main lessons from those projects, from looking at something so close to the human body and human life?

Ehud Karavani: So I studied computational biology at university, because I had a vague knowledge about computers and I knew they can be a great tool, but I was not interested in the tool itself. I was interested in what it can do, and biology seemed like a very straightforward domain to do, quote unquote, good in. As my career evolved, I moved from biology to more applied biology, to medicine and human health more directly. And if you work on human health, it's a very direct way to improve people's lives. There can be lots of other ways to improve lives very indirectly; that's what basic research is always about, right? No one thought we would discover CRISPR; they just studied how bacteria fight against viruses. It's very basic science, but it ended up being so meaningful. I didn't have the patience; I wanted to do something slightly more direct. And so working directly in healthcare seemed like the right opportunity to do that.

Alex: What are the main challenges in working with practitioners, from your point of view as a scientist or researcher?

Ehud Karavani: First of all, as we said before, it's closing the gap. Trying to extract knowledge from them is a very communication-heavy, interaction-heavy process: you need to carefully listen and carefully convey, and then try to distill what they say in natural language into a DAG.

The second thing, I think, is the disappointment, because they are not statisticians and they are not causal inference experts. Their worldview is based on things they hear from pop media or popular science, in some sense. And they come to you, and they've heard that causal inference is important, that it can now be done, that it's so exciting, and they're hyped about it. They come to you and they want you to use the tool. They've heard about machine learning, they've heard about deep learning, they might even know PyTorch or something like that. And they come to you and say, oh, you have this Causallib tool, come on, let's fire it up and get some causal effects.

And then you sit down and you start to talk about estimands, and causal gaps, and the causal roadmap, and target trial emulation, and, I don't know, time zero and self-inflicted biases, and they are baffled. You can see how they become less engaged as you speak. They came for an answer, for a tool, and they get a teaching session, in some sense.

Alex: It's like in life in general. 

Ehud Karavani: Yeah. You want the easy solution and you get more homework, in some sense. And then they become upset because, I don't remember where I read it, but they realize that causal inference is a bait-and-switch scheme. Because we speak so highly about the importance of causality, and we say that counterfactual prediction is so important and that's the way we need to operate, and yet when we provide the tools, we provide a regression model. We provide regular machine learning models or regression models, and we tell them that if they think hard enough and carefully enough, it will become causal. And that sounds like snake oil. That sounds like a self-help book, the bad kind: the secret was in you this whole time.

Well, sometimes it is, I guess. Sometimes it is. I can even go further: I will say that, in some ways, the deep learning revolution is the worst thing that happened to causal inference. Because back in the good old days, when SVMs were all the rage, researchers mostly did feature engineering. They would sit and think about clever ways to represent the data, so that when they passed it through this simple SVM, they would get good results.

Alex: And this has changed.

Ehud Karavani: Yeah. And under that climate, I think it would have been much easier for causal inference practitioners to sell what we do in causal inference: to think carefully about the input data,

what variables you select, at what time you select them, the temporal relations between variables and how they interact. But then along came deep learning, the 2012 revolution, starting with AlexNet and so on, and the idea that you just input as much raw data as you have and let the model figure it out in a much more automated way. It really instilled in people the idea that you can just take the data and get something out of it. There is no deep learning framework for causation. Identification cannot be automated in such a way, or at least not fully automated in such a way; it always requires metadata. And so collaborators come with different expectations.

And when it hits them that I just need them to be more rigorous about their understanding of the underlying data-generating processes, they're slightly disappointed. But then it's up to you to come and sweep them up, encourage them, make them partners for this journey with you, and also show them that, well, we have something that is more than just a regular machine learning model, right? Something to de-bias the effect, for instance, and so on and so forth. And I think, like I said before, it's a bait-and-switch scheme in the sense that the underlying machinery under the hood is a regular regression model or machine learning models. But the good thing about causal inference software is that it makes the counterfactual prediction explicit. And so it allows you to conceptually grasp the difference between what the regression model would have done and what a counterfactual regression model does.
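A tiny illustration of what making the counterfactual prediction explicit means, using only a plain scikit-learn regressor (essentially the standardization / S-learner idea): the same fitted model predicts twice, once with the treatment forced to 1 and once forced to 0, and the contrast between those predictions is the estimated effect. The names are generic, not any particular library's API.

# Sketch: a plain regression model turned into an explicit counterfactual predictor.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def counterfactual_mean_difference(X: pd.DataFrame, a: pd.Series, y: pd.Series) -> float:
    """Fit y ~ (X, a), then predict under do(a=1) and do(a=0) for everyone."""
    model = GradientBoostingRegressor().fit(X.assign(treatment=a.values), y)

    y1 = model.predict(X.assign(treatment=1))  # everyone treated
    y0 = model.predict(X.assign(treatment=0))  # no one treated

    # Ordinary prediction would use the observed treatment; the causal contrast
    # is the difference between the two counterfactual predictions.
    return float(np.mean(y1) - np.mean(y0))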

Alex: One of the use cases for machine learning that's very exciting to the general public is drug discovery. But when we think about the drug discovery process, sometimes repurposing existing drugs might be a more efficient way to address certain diseases. You worked on projects like this, where you used causal machinery to repurpose existing drugs. Can you share a little with our audience about this experience and what those projects were about? What were the main challenges and the main lessons from those projects?

Ehud Karavani: Right. So first of all, just to briefly explain: drug discovery is the mission to find new drugs; drug repurposing is about taking existing drugs and finding new uses for them. Pharma companies love it because it shortcuts the funnel you usually have, from discovering a molecule, to testing it on animals, to testing it on humans, confirming at different levels that it's not toxic, that it's effective, and so on. Taking existing drugs bypasses most of this process, because the drug is already approved, so you know it's relatively safe and you just need to prove it's effective for the new disease.

And there are many approaches to drug repurposing, right? At the molecular level, the proteomics level, you could find molecules that bind to the target proteins and, I don't know, inhibit or excite their function. But more broadly, in the statistical sense, you can even shortcut the process of finding the molecular pathway, the mechanism itself, because you can take electronic health records, stratify on the people with some disease, and enumerate all the drugs they're taking. Then you can start comparing people who took a drug to people who didn't take the drug, and see whether it improves the outcomes related to the disease, which is fairly simple conceptually. But you need this comparison of those who took the drug and those who didn't to not be biased, to be deconfounded, right? To isolate the effect of the drug itself. And this is where causal inference can help us do this sort of modeling.

So here at IBM we set out to develop such a system, which takes a configuration file written at a very high, abstract level that physicians can even define, and then goes through some serious black-magic database querying that translates this configuration into matrices of confounders and treatment assignments and outcomes, which is really a tremendous feat that the team did here, and they did excellent work on it. That is then provided to, let's say, the causal inference engine, which estimates the causal effects, the survival difference or the cumulative incidence under each drug regime. And it also needs to be done in a high-throughput way, because, as I said, we enumerate all the drugs patients might have taken, and that can be hundreds of candidates.

So first of all, in technical terms, there is some Docker machinery that needs to run those things in parallel in order to be efficient. But in terms of the analysis, you also need to account for that multiplicity later. And doing high-throughput analysis, you can no longer tailor the design for each specific drug: each treatment-outcome pair probably has a slightly different structure, a different DAG around it, but you can't tailor a specific DAG to each treatment-outcome pair when you have hundreds of them. So you need to have some trust that your data is rich enough to capture some of the internal state of the patients, right? You have all their diagnoses, all their tests, all their previous prescriptions, so it's not that far-fetched that those capture something about the inherent health status of those patients. And together with some variable selection, and together with some proximal causal inference methodologies, you can gain slightly better trust in your estimates.

But we're also very down to earth: we know that candidates selected by the process, and this is a process that generates candidates, do not directly translate into good repurposing candidates. As in every observational analysis, the results need to be triangulated with other sources in order to gain more confidence in the evidence that these specific candidates are promising for the disease being tested.
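The overall shape of such a high-throughput screen can be sketched generically (this is an illustration, not the IBM system): one estimation routine reused per candidate drug, followed by a Benjamini-Hochberg correction for the multiplicity mentioned above. The estimate_effect_and_pvalue helper is assumed, standing in for whatever single-drug causal analysis you trust.

# Generic sketch of a high-throughput repurposing screen (not the IBM system).
# `estimate_effect_and_pvalue` is a hypothetical wrapper around a single
# drug-vs-no-drug causal analysis that returns (effect_estimate, p_value).
import pandas as pd
from statsmodels.stats.multitest import multipletests

def screen_candidates(cohort: pd.DataFrame, candidate_drugs: list,
                      estimate_effect_and_pvalue, alpha: float = 0.05) -> pd.DataFrame:
    rows = []
    for drug in candidate_drugs:
        effect, pval = estimate_effect_and_pvalue(cohort, drug)  # one analysis per drug
        rows.append({"drug": drug, "effect": effect, "p_value": pval})
    results = pd.DataFrame(rows)

    # Hundreds of candidates means hundreds of tests: control the false discovery
    # rate (Benjamini-Hochberg) before calling anything a "candidate".
    rejected, p_adjusted, _, _ = multipletests(results["p_value"], alpha=alpha,
                                               method="fdr_bh")
    results["p_adjusted"] = p_adjusted
    results["candidate"] = rejected
    return results.sort_values("p_adjusted")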

Alex: When we think about triangulation, do you believe technology like large language models can be helpful in this? For instance, in searching large databases of scientific research and finding candidate articles that we could use to confront our hypotheses with?

Ehud Karavani: Yes, definitely. I guess large language models that are slightly more grounded in knowledge bases, in some sense, but definitely. I mean, if I had a large language model trained on PubMed, on medical publications, and I find in my causal inference analysis that some drug with some active ingredient is beneficial, and then I go and query this large language model and it tells me there are some articles about that drug being effective in, I don't know, mice, then that's really strong evidence. Well, I don't know if it's strong; evidence is a spectrum. But it further reinforces your trust that this candidate might be promising, and then it should be moved further along for further testing, to see if it can really be beneficial.

Alex: When talking about Causallib, and then about the system you just described, you emphasized that it's important that those systems are efficient, that they are written in a certain way. Now, we met yesterday for dinner and you told me that you believe in good code. Why is that?

Ehud Karavani: If you can trust the code, you can trust the results.

In some sense, the code is God, because it doesn't matter what you describe in your methods section, or how well you describe it. The code is very definitive about what it does and how it does it. And today's research is inseparable from coding; software is inseparable from science, in some sense, because you cannot do empirical research without statistics, without computation, without writing code. And in order to gain confidence in the conclusions, you need to have confidence in the methods, so you need to have confidence in the code. And so testing code is, like, the most important thing. And I know it's easier said than done, because when we get down to it, people usually don't like to test their code. But it's not just that it's important; I think testing the code makes you design the code better in the first place. And if the code is designed better, if you can break it down into components, it means you can break down the question into components. Those two are really tightly connected, in some sense.

Alex: Those principles of modularity, is that something you also used when designing Causallib?

Ehud Karavani: Of course. Yes, definitely. Designing Causallib was, first of all, a lesson in figuring out how causal inference operates: knowing that you can estimate an effect, and you can also estimate counterfactual outcomes, and those don't necessarily have to be the same. Some models can estimate the average counterfactual outcomes, some estimate individual-level outcomes; some models model the treatment, some models model the outcome. Figuring those out, being able to draw a schematic around them, really deepens your understanding of causal methodology: the methods themselves, the theory behind them. And then, when you put it into code, it shines. You see it.

Alex: What are the practical implications of building a modular system like this, for causal inference specifically?

Ehud Karavani: I can speak for Causallib, but I think it allows two things. First of all, if it's clean enough and simple enough, then people who want to learn causal inference methods can look at the code and learn. And it also, which I think is one of the strengths Causallib has, allows you to scale very naturally and very easily, because switching between using a logistic regression and using gradient-boosted trees, or a cross-validation-based hyperparameter-tuned model for the gradient-boosted trees, is a matter of changing a single word in your code.

I'm a very strong proponent of modeling the treatment. Even if you want to end up with heterogeneous effects, where you need to model the outcome directly, it's still very important to model the treatment regardless, to figure out how the treatment groups behave: how separable they are, how comparable they are, whether you have overlap violations, that sort of thing. And if you already model the treatment and you also want to model the outcome, then why not combine them? Why not do a doubly robust model? And I think Causallib allows you to use and reuse those components quite easily, in a way that scales, that allows you to create complex models very easily, so that you leave very little residual confounding bias. Again, it's up to the user to close the gap between the causal estimand and the statistical estimand, but I think Causallib allows you to close the gap between the statistical estimand and the statistical estimator, to really make them tight and close.
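As a concrete illustration of "why not combine them", here is a compact, generic doubly robust (AIPW) estimator built from a scikit-learn propensity model and outcome model. It is a textbook sketch rather than Causallib's own doubly robust classes, so the function name is illustrative.

# Generic AIPW (doubly robust) sketch combining a treatment model and an outcome model.
# Illustrative code, not Causallib's own doubly robust implementation.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression

def aipw_ate(X: pd.DataFrame, a: pd.Series, y: pd.Series) -> float:
    a = a.to_numpy()
    y = y.to_numpy()

    # Treatment model: propensity scores, clipped away from 0/1 for stability.
    ps = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)

    # Outcome model: fit on covariates plus treatment, predict both counterfactuals.
    outcome = LinearRegression().fit(X.assign(treatment=a), y)
    mu1 = outcome.predict(X.assign(treatment=1))
    mu0 = outcome.predict(X.assign(treatment=0))

    # Augmented IPW: outcome-model prediction plus a weighted residual correction.
    # Consistent if either the propensity model or the outcome model is right.
    psi1 = mu1 + a * (y - mu1) / ps
    psi0 = mu0 + (1 - a) * (y - mu0) / (1 - ps)
    return float(np.mean(psi1 - psi0))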

Alex: Today you're wearing an arXiv t-shirt. I want to take an additional video of this so the viewers can see it, because I don't know if it's visible on the other cameras. What dictated this choice today?

Ehud Karavani: Yeah, so I knew it would be video-recorded on top of being a podcast; these podcasts are also video, I guess. So why not make a statement? If it's already out there, why not make use of it? That's the efficient thing to do, in some sense. And so, preach what's important to you. We spoke about open software, we spoke about trustworthy open science, and arXiv in some way encompasses that.

Alex: What's important for you in the idea of open source, of open science and open software?

Ehud Karavani: It's important because it can be checked, and it can be assessed whether it is trustworthy. And if you wrote the paper in a good way, and you wrote the code in a good way, then people can build on top of it much more easily, much more quickly, and we can move faster and discover new stuff much more quickly than we would have otherwise.

Alex: What are two books that changed your life?

Ehud Karavani: As a teen I really liked to read, I don't know, Douglas Adams and Kurt Vonnegut and Milan Kundera. Those really affected me. But if we stay slightly more on topic, can I have three books?

Alex: Yes.

Ehud Karavani: All right. So the first one would be Gödel, Escher, Bach, by Douglas Hofstadter.

Really, the most intelligent book I've ever read, I think. I read it during my bachelor's and it really made a mark. I remember concepts being used so beautifully there, and it's written so nicely, in such a smart way; it's really a delight to read. And especially in these days of large language models, it's nice to reflect back on how it's written, on the basics of language, you know, Chomsky's grammars and such, and the science behind it, not just the billion-parameter machinery that might be useful for it; the science behind it.

A second book, which we just mentioned between us, is maybe The Design of Everyday Things.

So when I joined IBM, my mentor, Yishai, suggested that book to me, and it's really an eye-opener for anything you interact with in your everyday life. Every interface you interact with, every window, whether it's a bus station or a fridge or a camera, it makes you so thoughtful about all the fine details and all the broad aspects and specifics. So it's a good book that can teach you about life, but it's also highly relevant if you write code and if you do research, which, again, as we said, are really inseparable. It really allows you to organize your thinking around, most basically, the API of code, but also how you organize research and how your research might interact with the world. Even the basic structure of a research paper, introduction, methods, results, discussion, that's a design choice. It doesn't have to be like that.

Alex: And the third book?

Ehud Karavani: Right, so I think the third book is slightly more directly related. Over the last ten years or so, I think I gained a very specific view about science and statistics and how to do inference properly and rigorously. And I picked it up little by little, fragment by fragment, reading books, hearing lectures, reading Twitter at the time. But then I encountered Richard McElreath's Statistical Rethinking, and he packs like 90 percent of the stuff I learned into a single book, so it really resonated with me. I think it's an excellent book, again, about a rigorous, honest, thoughtful way to apply statistics, which puts science first and methodology and analysis second, and breaks away from the usual machinery we so automatically go for when we try to analyze data.

I did some small consulting work with some hospitals and researchers, PIs. They do an experiment; they go and gather data; they spend millions of dollars gathering data, recruiting patients, recruiting healthy controls, taking measurements. Some of the measurements are done by doctors, by physicians; they bring people in for days and measure them; it's an entire operation. They capture all this data, and when the time comes, they treat it like junk.

Alex: Like junk?

Ehud Karavani: Yeah. They don't respect the data-generating process. They do a t-test, in some sense. Now, they were not taught anything better than that, but it's somehow lazy and it's ineffective. It's a waste of money to gather all that data and not analyze it properly.

Alex: Yeah, I agree.

Ehud Karavani: And it's also, I think, disrespectful to the patients who gave their time and their hope, thinking things might be better, that they will be the ones who help make things better, and then being used for a t-test even though there are repeated binary measures, in some sense. So I think Richard's book really pins down how to respect the underlying data and analyze it properly.

Alex: What would be your advice to people who would like to go into causal inference or causal machine learning research and/or practice?

Ehud Karavani: I think the hardest part for machine learning researchers coming to causal inference is identification: realizing that the kitchen-sink approach we talked about before, where you just let the model figure it out, does not exist. That's the first and the hardest thing to understand, because for people with a machine learning background it won't be very difficult to understand the algorithms and the models that do causal inference; but understanding that each problem has a structure that needs to be respected, that's a slightly bigger jump they need to make. And unfortunately, as I said, when I started learning causal inference there weren't a lot of books and resources, but nowadays there are: high-level stuff, low-level stuff, detailed stuff, less detailed stuff. Unfortunately, I'm not very proficient in them because I didn't grow up with them, but they exist, and I really do think it will be slightly easier for newcomers in that regard.

And I'm very happy about that. We need more people to jump aboard.

Alex: Who would you like to thank?

Ehud Karavani: My family, of course. It's probably a cliché, but first of all my parents: they instilled in me that learning is important, and they put their money where their mouth is, literally. When I was studying at university, if I needed some help, they provided it, help that allowed me to focus solely on studying. And I'm grateful; it really did help me study better and understand. More in the present, it's my partner, the mother of my kids. She's a bedrock. She's the most empathic and caring person, and she helps me with our everyday struggles. I mean, life is much more than just research, and she's there, and I'm thankful to her for that. But I guess, more professionally, right, it's like the Maslow hierarchy: we talked about life.

Now we can talk about work. So professionally, it would probably be Yishai Shimoni, whom I mentioned before. He was my mentor when I arrived at IBM, and he is my manager now. He's a great person, a very interesting person, and he taught me a lot. He taught me about the importance of coding, he taught me about design, and he really had an impact on how I perceive things, maybe research more generally.

Alex: Where can people learn more about you and your work?

Ehud Karavani: With the internet fracturing more and more, I opted for my own website recently. It's at ehud.co, E-H-U-D dot C-O. It has links to all the other places on the internet, like Twitter and Bluesky and GitHub. It also has some teaching materials I upload for causal inference that might be interesting for people, and my blog; most of that is on Medium, but some of it isn't, so that's also there.

Alex: What's your message for the causal Python community?

Ehud Karavani: Keep on rocking. It's been expanding so nicely, and the tooling is becoming more accessible and more trustworthy. They're doing a great job, and I'm so happy people can use it as a gateway to causality. So just keep doing what they're doing.

Alex: Amazing. Thank you so much. That was a pleasure.

Ehud Karavani: So grateful for having me. Thank you very much.

Alex: Thank you so much.