Causal Bandits Podcast

Causal AI, Modularity & Learning || Andrew Lawrence || Causal Bandits Ep. 002 (2023)

November 07, 2023 Alex Molak Season 1 Episode 2

`from causality import solution`
Recorded on Sep 04, 2023 in London, United Kingdom


A Python package that would allow us to address an arbitrary causal problem with a one-liner does not yet exist.

Fortunately, there are other ways to implement and deploy causal solutions at scale.

In this episode, Andrew shares his journey into causality and gives us a glimpse into the behind-the-scenes of his everyday work at causaLens.

We discuss new ideas that Andrew and his team use to enhance the capabilities of available open-source causal packages, and how they strive to build and maintain a highly modularized and open platform.

Finally, we talk about the importance of teamwork and what Andrew's parents did to make him feel nurtured & supported.

Ready?

About The Guest
Andrew Lawrence is the Director of Research at causaLens (https://causalens.com/) Connect with Andrew: 

About The Host
Aleksander (Alex) Molak is an independent ML researcher, educator, entrepreneur and a best-selling author in the area of causality.
Connect with Alex: 

Links

Should we build the Causal Experts Network?

Share your thoughts in the survey


Causal Bandits Podcast
Causal AI || Causal Machine Learning || Causal Inference & Discovery
Web: https://causalbanditspodcast.com

Connect on LinkedIn: https://www.linkedin.com/in/aleksandermolak/
Join Causal Python Weekly: https://causalpython.io
The Causal Book: https://amzn.to/3QhsRz4

Transcript

 You're essentially going to overfit. And then as soon as anything changes slightly, if you intervene, you're representing the completely wrong distribution. Everyone gives 110 percent every day. And I don't think as a company, we'd be as strong without that. Hey, causal bandits, welcome to the causal bandits podcast, the best  podcast on causality and machine learning on the internet. 

Today, we're traveling to London to meet our guest. At school, he loved math, but he hated chemistry. He wanted to become a writer, and he built a web browser version of Tamagotchi using Java. He's a runner, a fan of Paul Thomas Anderson's movies, and the Director of Research at causaLens. Ladies and gentlemen, Dr.

Andrew Lawrence.  Let me  pass it to your host, Alex Molak.  Ladies and gentlemen, please welcome Mr. Andrew.  Thank you. Thanks for having me, Alex. Welcome. It's a very unusually warm day in London today, isn't it? Absolutely. It's quite cold in the office though, hence the hoodie.  Aircon's blasting, but yeah. Andrew, how did your adventure with Causality start?

It started after my PhD from the University of Bath. My PhD was focused on Bayesian nonparametrics. So not directly tied to causality, but I think it gave me a good foundational knowledge to kind of pick up the topics quickly. I went and did my PhD quite late, so I worked in industry for a little over five years after my bachelor's, and decided I was kind of sick of industry.

I wanted to go back to school. Then post master's and PhD, I realized I was kind of sick of academia and wanted to go back to industry. So when I was finishing writing up my thesis at the end of 2019, I started searching for jobs in the London area and southwest England, really, and causaLens came up, and I applied and really liked everyone I spoke to, so I went through a few rounds of technical interviews.

It was kind of my first experience with any causality topics, and then I have been learning on the job for four years now. Regarding the topic of your PhD research, do you find any elements of Bayesian nonparametrics, and the way you applied those models to real world data, helpful in learning causality?

Absolutely, yeah. I always say, maybe to someone who hasn't learned it at all, that causality is kind of just figuring out the right way to factor the joint distribution that you're looking at, right? And I was doing generative modeling with Bayesian nonparametrics, so specifically Gaussian processes and Dirichlet processes.

And I was trying to learn a latent variable model for some high dimensional data. So, you know, you're factorizing it to the latent space, and then I was making certain distributional assumptions about the latent space there. So I find that that is quite applicable, because you can pass that on to Bayesian networks, but it's equally valid to define the joint of X and Y as X given Y times the prior of Y, and also Y given X times the prior of X.

But what we're trying to do with causality is really trying to capture what the true data generating process is. So if Y is a function of X, you want to factorize that in the way that the data is being generated. So what you're saying is that, in a sense, we look from the distributional point of view

at the direction of influence, or the information flow, between two or more variables. Yeah, the information flow. Exactly. What it sounds like to me is that the idea of thinking about structures was something that you were able to transfer from your original research to learning causality.
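To make this concrete, here is a minimal sketch with a hypothetical two-variable data generating process (the numbers are made up, not from the conversation): both factorizations of the joint describe the observed data equally well, but only p(y | x) p(x) mirrors how the data is actually generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generating process: X causes Y.
# Structural form: X ~ N(0, 1), Y := 2 * X + noise.
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)

# Both factorizations of the joint p(x, y) fit the observed data:
#   p(x, y) = p(y | x) p(x)   <- matches the generating direction
#   p(x, y) = p(x | y) p(y)   <- equally valid statistically, but anti-causal
# A causal model commits to the first one, because that is the factorization
# that stays valid when we intervene on X.
slope_y_on_x = np.polyfit(x, y, 1)[0]  # ~2.0, the structural coefficient
slope_x_on_y = np.polyfit(y, x, 1)[0]  # ~0.4, a purely associative quantity
print(slope_y_on_x, slope_x_on_y)
```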

Yep. Yeah. So I wasn't looking specifically at structural equation models or anything to that effect. I was specifically looking at conditional dependence, or conditional probabilities, really. So I, you know, had no exposure to the higher levels of Pearl's ladder of causation.

But I would say, like, okay, you know, you sample X from the prior, and that value of X is informing, say, the location parameter of your distribution of Y, if you're looking at, say, a Gaussian or something. So it was kind of the structure of the observed distribution you're seeing there, but nothing with structural equation models or anything like that.

Mm. It sounds like the idea of thinking about something that is beyond the data itself. Yeah, I mean, it came quite naturally. I thought for a lot of the concepts it was quite easy to see the motivation and understand the math behind them. Yeah, this sounds like something that is different from what today's data culture promotes, or what is inherent to today's data culture, where we sometimes just use complex architectures a little bit blindfolded, in the sense that we don't think about the data, where it came from, what's the meaning behind it, but we just try to apply different complex tools and move forward.

What do you think about this path and this direction? Yeah, it clearly is successful in some domains, right? So I think what you're referring to is some deep learning algorithm where the architecture is chosen in a way that you're going to minimize some in-sample error term, right?

And normally the error term, you're choosing is based off of the data type of the target, whatever you're trying to represent. So if it's a classification label, it's some cross entropy term. If it's a continuous one, it might be like a mean squared error or something to that effect, right?

And that's kind of at like a first pass, you know, the main thought going into it. But you just kind of try these different architectures until you minimize your in sample and you might end up overfitting quite a bit. But where that obviously fails is if you're choosing the wrong covariance to predict your target that aren't actually driving how the data is generated, you might fit really well in the training domain.

But as soon as you step outside to any out of sample data, you're going to kind of fall apart. So it's really better to try to find what's driving  your target of interest to actually be able to generalize to unseen data.  And I think, you know, any domain where  You know, you're always going to be predicting out of sample and making these decisions that are high value or safety critical.

So in, like, a healthcare domain, you really can't, you know, just be minimizing your in-sample error. We met yesterday for dinner, and I remember you mentioned that when you look at the distinction between causal and associative models, one of the vectors or one of the dimensions where you compare them is: is it a creative job, such as generating text or generating images, or is it something that is safety critical?

That's one of the dimensions when you see the differences between those two models. Yeah, absolutely. So, I mean, obviously with the rise of generative AI recently, with ChatGPT, but also Midjourney and DALL-E, you know, on the image side, that stuff is incredible. You use it and it's insane.

But if you use it where, you know, there's a real risk of errors... Like, there was that lawyer who generated some documentation. I don't know the exact use case, but it essentially hallucinated specific cases that they were citing that just didn't exist, right? And they didn't bother to actually check the case law to see.

And I believe they lost their job, right? I don't want to say anything, but this was a big failure. But if you're using something like GitHub Copilot to kind of start a class, I think it really helps with just getting past that blank page problem. And same thing on the image side, right? You can still see that Midjourney and stuff has issues representing people's hands and things.

But if you're just trying to get a starting point and then an artist could iterate, you know, maybe it reduces the development cycle. And same thing on the writing side. And even, like, you know, Facebook is quite impressive with their facial recognition stuff. You can upload photos and they'll recommend which of your friends they think are in it.

But if they get it wrong, they're not losing users per se, unless they get it wrong consistently. And there's no issue. Yeah. If it is wrong, the user just corrects it. But if you're telling some company, okay, this specific cohort, say 18 to 25 year olds, is really influenced by this type of advertising.

Let's have a multimillion dollar marketing campaign on it. And if it has zero effect, you know, they're going to instantly lose trust in your model and not want to use it in practice. Another dimension where causal and associative models fundamentally differ is the distinction between prediction and decision making.

Mm hmm. Those problems seem to be just structurally different. Yep. I wanted to ask you about your thoughts on this dimension. How do you think about this difference, and are there any specific ways that thinking about this difference is impacting your work at causaLens when you work with clients?

Sure, absolutely. So yeah, the question, just to maybe summarize, is blind prediction, which is maybe sitting at rung one of Pearl's causal hierarchy, and then higher order decision making, you're saying, right, where you might take some action. At causaLens, we, you know, kind of follow the Pearl school of causality, so we are working a lot with structural causal models, right?

So, maybe your first question is, let's say, blindly predicting. So, we're just passively observing some system, we have no control over it, and we assume the dynamics of the system are stationary over time, right? So an associative model can do quite well predicting that, right? But as soon as I go and intervene, do some action in that system, so say I rebalance our marketing budget, or we update a manufacturing pipeline, or we change something in our global supply chain, we've now intervened, and you've moved away from that observational distribution to an interventional distribution.

So the underlying data generating process has changed, right? So you would never expect your associative model to actually be able to predict there, because... yeah, your underlying distribution is now shifted. So, how we do it at causaLens is we've developed our own structural causal model, which we call CausalNet.
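As a toy illustration of that shift (hypothetical numbers, not CausalNet or a customer example): with a confounder Z driving both X and Y, a purely associative regression of Y on X predicts the observational data well but misstates what happens once X is set by intervention.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 10_000
z = rng.normal(size=n)                             # confounder driving both X and Y
x_obs = 2.0 * z + rng.normal(size=n)               # observationally, X is driven by Z
y_obs = 1.0 * x_obs + 3.0 * z + rng.normal(size=n)

assoc = LinearRegression().fit(x_obs.reshape(-1, 1), y_obs)
print(assoc.coef_[0])                              # ~2.2: fine for passive prediction

# Intervene: do(X = x) cuts the Z -> X edge. The structural effect of X on Y is
# 1.0, so the associative coefficient would badly mispredict the shift in Y.
x_int = rng.normal(size=n)                         # X set externally, independent of Z
y_int = 1.0 * x_int + 3.0 * z + rng.normal(size=n)
print(LinearRegression().fit(x_int.reshape(-1, 1), y_int).coef_[0])   # ~1.0
```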

We use CausalNet for, yeah, all the customer problems. So, a typical workflow is kind of iterating on the underlying graph with the customer, right? Sometimes they come with a lot of domain knowledge. They've studied this problem for years, right? And they might know maybe a subsection of the graph, or some domain knowledge on, say, the hierarchy of the variables.

So say, for example, you're looking at different target audiences for marketing, right? I keep using this example, but say you have age bracket, right? Obviously, there's nothing that can drive someone's age. So that will kind of be at the top of the topological sort of your graph, and then you might have the channel in which you can intervene.

So like Instagram, email, something else. Obviously, you can choose that. That will be further down. And then the response might be like click, click through rate or something. Right? So you can kind of give this ordering, which would help reduce the search space of possible graphs quite substantially. So that's kind of always the first step, right?

We'll iterate with them, get some of their knowledge, use a suite of causal discovery algorithms. So, you know, constraint-based, score-based, continuous-optimization-based, model-based like LiNGAM or something. We're not very tied to a single method; we kind of see what works well.

So some of the score-based exact search works quite well for all continuous data. But if you have mixed data, meaning categorical and continuous, you might want to choose a constraint-based method where you can choose, say, a conditional-mutual-information-based conditional independence test as a way to measure that.
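For purely continuous data, the plug-in conditional independence test that a constraint-based method consumes can be as simple as a Fisher-z test on partial correlations. The sketch below is illustrative only (hypothetical variable names); for mixed data you would swap in, say, a conditional-mutual-information test.

```python
import numpy as np
from scipy import stats

def fisher_z_ci_test(data, i, j, cond_set, alpha=0.05):
    """Test X_i independent of X_j given X_{cond_set}, via partial correlation."""
    n = data.shape[0]
    cols = [i, j] + list(cond_set)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.inv(corr)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])      # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond_set) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha    # True -> "conditionally independent" at level alpha

# Example on a hypothetical chain Z -> X -> Y:
rng = np.random.default_rng(2)
z = rng.normal(size=2000)
x = z + rng.normal(size=2000)
y = x + rng.normal(size=2000)
data = np.column_stack([z, x, y])

print(fisher_z_ci_test(data, 0, 2, cond_set=[]))    # Z vs Y marginally: dependent
print(fisher_z_ci_test(data, 0, 2, cond_set=[1]))   # Z vs Y given X: independent
```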

So yeah, and then it is quite iterative. Any method you get out, you can never recover more than a Markov equivalence class of graphs, right? So you're not getting a single DAG that you might build a structural causal model from. You're going to sample from that space and speak with the customer, see what they think is feasible and what actually matches the data, right? You could say, assuming it's this DAG and, say, it was a perfect linear SCM, what's the actual error of this graph fitting your observed data?

So then we fit a structural causal model, which we have a few different backends to do with. So the modeling language itself is agnostic to how we would train it.

So we have a PyTorch engine, so just stochastic gradient descent. We have a CVXPY one, so convex optimization, and that kind of limits some of the functional dependencies you can have on the edges. And then a Pyro engine, so we can do distributional forecasting, or, not forecasting, but, yeah, inference on the nodes.

So these engines are used, as I understand, to learn the functional forms of the connections between two different variables. Exactly. So say you have a small graph with X and Z driving Y, and you say that there's some linear association between all of them, you'd be learning essentially the weights from X to Y and from Z to Y.
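A toy version of fitting one node of such an SCM with a gradient-based engine might look like the following; this is an illustrative sketch with made-up coefficients, not the CausalNet API.

```python
import torch

torch.manual_seed(0)
n = 2000
x = torch.randn(n)
z = torch.randn(n)
y = 1.8 * x - 0.6 * z + 0.3 * torch.randn(n)   # hypothetical ground truth for Y

w_xy = torch.zeros(1, requires_grad=True)      # edge weight X -> Y
w_zy = torch.zeros(1, requires_grad=True)      # edge weight Z -> Y
bias = torch.zeros(1, requires_grad=True)      # "sum with bias" aggregation at Y

opt = torch.optim.Adam([w_xy, w_zy, bias], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    y_hat = w_xy * x + w_zy * z + bias          # node aggregation: weighted sum
    loss = torch.mean((y - y_hat) ** 2)
    loss.backward()
    opt.step()

print(w_xy.item(), w_zy.item())                 # should approach 1.8 and -0.6
```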

We also have different aggregations at the node. So right, that might be assuming a sum, but you could have a sum with bias, you could have higher order interaction terms, so you can kind of make it like a polynomial-type regression. But the specific edge functions and aggregations we try to make as agnostic to the choice of engine as possible, right?

With all three of them, you could have a linear dependency. It's just the type of free parameters you might be learning are different. Obviously with the Pyro engine, you want to be estimating a, a distribution. So you're kind of learning you know, variational parameters is kind of what we're doing.

So we're doing stochastic variational inference, and then from that, we can estimate the posterior at the node. One of the things that you do at causaLens is that you take well known algorithms, or even open source implementations, and you modify them in a certain way that serves your clients better.

Are there any improvements that you build into this implementation of the SCM, for instance? Yeah, so for the SCM, what we were talking about before, right, going from, let's say, the observational distribution to the interventional one. A very common technique for unbiased effect estimators is double ML, or debiased machine learning, right?

But that's normally very specific: you have a known treatment variable and a known outcome, and you want to try to learn a model for what's the effect of that T on my Y, say. So the common example is you have some backdoor paths, so you have some confounding between treatment and outcome.

And what you do is you kind of split the data when you're training, right? So you learn to predict the treatment given your confounding. You get the residuals from that. You predict your outcome given your confounding, get the residuals of that. And then you do regression on the out of sample. So you split it in half and then you're doing a residual on residual regression, right?
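In code, the residual-on-residual idea for a single treatment T, outcome Y and confounder W can be sketched roughly like this (variable names and models are hypothetical; real implementations cross-fit over several folds and swap the halves):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 4000
w = rng.normal(size=(n, 1))                        # confounder
t = 0.8 * w[:, 0] + rng.normal(size=n)             # treatment depends on W
y = 1.5 * t + 2.0 * w[:, 0] + rng.normal(size=n)   # true effect of T on Y is 1.5

# Split the data in half; fit nuisance models on one half, residualise the other.
idx = rng.permutation(n)
a, b = idx[: n // 2], idx[n // 2 :]

t_hat = RandomForestRegressor(n_estimators=100).fit(w[a], t[a]).predict(w[b])
y_hat = RandomForestRegressor(n_estimators=100).fit(w[a], y[a]).predict(w[b])

t_res = t[b] - t_hat     # treatment with the confounder's influence removed
y_res = y[b] - y_hat     # outcome with the confounder's influence removed

effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
print(effect)            # close to 1.5, the (de-biased) treatment effect
```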

Yeah, which helps us get rid of potential bias that we might produce as an artifact of the method that we were using. Yep, exactly. So obviously we'd want to do that at the SCM level as well, right? You have each node in your SCM and its functional dependency on its parents, right?

So if you just naively iterate through each node, training that function independently for each one, right? So, like, I train the function for X given its parents, and I go and train the function of Y given its parents. If you have observations of everything, right, you're not necessarily looking at how that information is flowing through.

So, in the case of if I just did that. You know, in the simple case where I'm trying to predict my treatment from the covariate and then I try to predict my outcome given my, not covariate, confounder, sorry, confounder and, and treatment, the effect that you'd end up getting between the treatment and the outcome is going to be biased because you still have the information from the confounding  essentially in your observation of the treatment.

So we kind of developed a means to train an SCM using the motivation behind double ML. So as opposed to just iterating naively across all of the nodes, training them independently, we build a training graph, which defines the order in which you have to train the nodes and with which data split to do it.

The only problem with this is it forces some of the dependencies to be linear. So in the original DML paper, the edge between your treatment and your outcome has to be linear, because that's how you end up removing the bias: they're orthogonal, essentially. There are some methods to extend it with kernel methods, but we haven't looked into that yet.

I think it also works for two of the engines, so it's not fully agnostic to the actual training engine under the hood. It should kind of be independent of our CausalNet, though. Really, it's just giving you the order in which you should train the nodes, given a DAG, and what portion of the data to use.

Mm hmm. Earlier, we talked a little bit about your background and your PhD in Bayesian nonparametrics. For many people, Bayesian inference is the go-to set of methods when it comes to uncertainty quantification, but recently we have another very hot and very popular framework, which is conformal prediction.

And there's some literature recently showing that conformal prediction can be successfully merged with causal methods, improving uncertainty quantification and giving valid intervals. Is this a direction that you have also explored somehow? Yeah, a little bit. So we've used MAPIE, which is an open source library for conformal prediction.

We don't have it specifically on our SCM, but we've integrated it into a causal impact package. You know, causal impact is historically this Google paper for a Bayesian state space model. So all you have for your observation is pre-intervention and then post. So you only know what happened after you intervened, and you want to know what the effect of the intervention is.

And what you do is you look at similar series. The common cases are maybe countries, right? So some country enacted some tobacco rule or changed minimum wage, and you might want to look at similar countries or cities, based off their demographics, that didn't do it, and use them to build a synthetic control.

So we have multiple methods in our causal impact package, which again is focusing on time series where you have a mix of interventional and observational data. One is based on the SCM. So you train an SCM pre-intervention, then essentially force, you know, apply the do-operator. So you're going to break any dependencies going into it, set it, and then predict the post-intervention period.
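A rough sketch of that pre/post workflow, with a split-conformal-style band computed from held-out pre-period residuals (hypothetical data, and a plain linear model standing in for the SCM or synthetic control):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
t_pre, t_post = 200, 50
controls = rng.normal(size=(t_pre + t_post, 3)).cumsum(axis=0)    # similar, untreated series
target = controls @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.5, size=t_pre + t_post)
target[t_pre:] += 3.0                                             # true post-intervention lift

# Fit on the earlier pre-period, keep the rest of the pre-period for calibration.
model = LinearRegression().fit(controls[:150], target[:150])
calib_resid = np.abs(target[150:t_pre] - model.predict(controls[150:t_pre]))
band = np.quantile(calib_resid, 0.9)               # split-conformal-style 90% band

counterfactual = model.predict(controls[t_pre:])   # the "do nothing" prediction
effect = target[t_pre:] - counterfactual
print(effect.mean(), band)                         # estimated lift (~3.0) and band width
```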

For that we've used a few different methods. We have a quantile regression, just a simple bootstrapping, and then also conformal prediction to get uncertainty estimations. But that's kind of all we've looked at so far. Mm hmm, and how does this method work compared to other uncertainty quantification methods?

Do you see a visible improvement, a significant improvement?   Hmm, I think,  It's hard to say because we're using it in a time series domain where I think the original implementation is focused more on IID. So there is some recent work on conformal prediction for time series. So like where it's non stationary potentially.

And I don't think we've explored that enough. So, speed-wise and stuff, it's quite good, but accuracy, I'm not sure. So, you mentioned the synthetic control framework, and before we discussed Pearlian graphical models. These two different lines of thinking can be complementary.

But also I can imagine that for somebody who is just starting, this might be a little bit confusing. So I imagine if we put all together like quasi experimental methods and SCM based methods all together in one package. Well, if the API is not very, very clear and it doesn't make distinctions, maybe between those different approaches, it could be confusing.

And so one of the ways that comes to my mind how to solve it would be to modularize the package. Mm hmm. And from what I know, this is also something that you're doing with your team. Yeah. So the product itself is quite modular. So maybe just to give kind of a background, how it's presented to the user is essentially like a cloud platform that can be deployed on Azure, GCP, or AWS.

We also have done on-site deployments, right? So, some customers, yeah, have their own cloud infrastructure. And essentially it allows data scientists to launch different jobs. So maybe the most standard entry point is kind of like a JupyterLab session that you would launch that gives access to our various packages.

Andrew, we talked about fitting different functions for SCMs. Mm hmm. And even when we are in the causal domain, in the domain of causal models, we might face some challenges here. So, for instance, imagine that we have some data in a certain range, but then in the production case we expect that we might go beyond this range.

What are your thoughts about situations like this? Yeah, absolutely. I mean, I think that that's probably a pretty standard case, actually, right? It's highly unlikely that your training domain is going to be exactly what you see, you know, when you're in production.

One of the things we do is, as I mentioned before, we have these different edge functions that we try to keep as agnostic to the underlying training engine as possible. And we have a suite, a family of them, that I would say is kind of like shape-constrained edges. So, obviously with a linear edge, you know exactly how it's going to extrapolate as you move from negative infinity to positive infinity, right?

From negative infinity to positive infinity, right? But we have some more complex ones like, so say like a, just a monotonic edge. So it's forcing you to increase monotonically from the training domain so you know it's not going to say like, you know, loop back. It's always going to, as the input's increasing, you're going to increase agnosium, right?

We have a monotonic-with-saturation edge, right? So you're coming up, and then as you increase you're just marginally increasing; you're still increasing, but you're not going to blow up to infinity. We have some piecewise linear edges, right, where the transition points are normally fit within the training domain.

And then, you know, as you move away, it's just going to be a linear association both negatively and positively away from the training domain. So we find that that actually builds a bit of trust in the models as well. Right. So you can fit some really deep neural network, and it behaves super well; say, for classification, the decision boundary is great within the training domain.

But you have no idea what it's going to look like as you move further away, right? It can loop back on itself. Wherever there's no data, you're kind of up to the whims of how it's initialized. It's essentially random, right? Yeah, exactly. So this, yeah, is quite helpful.

It's also nice to inspect what's happening at the edges. So, you know, customers might have some intuition of what they think the relationship should look like between, say, two of the covariates, and being able to inspect down at that level of the model is quite helpful. And they know, if there's some black swan event, say interest rates or the unemployment rate is one of your covariates and that jumps multiple standard deviations, you know how the model is going to behave.
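To make the shape-constrained edge idea concrete, here is an illustrative sketch of edge functions whose extrapolation behaviour is known by construction (hypothetical functions and parameters, not the CausalNet implementation):

```python
import numpy as np

def linear_edge(x, weight=1.5, bias=0.0):
    # Extrapolates linearly everywhere, by construction.
    return weight * np.asarray(x, dtype=float) + bias

def monotonic_saturating_edge(x, scale=2.0, rate=0.7):
    # Monotonically increasing, but flattens out: never "blows up to infinity".
    return scale * (1.0 - np.exp(-rate * np.maximum(np.asarray(x, dtype=float), 0.0)))

def piecewise_linear_edge(x, knots=(-1.0, 1.0), slopes=(0.2, 1.0, 0.2)):
    # Transition points (knots) are fit inside the training domain; outside it,
    # the first and last slopes give a known, linear extrapolation in both tails.
    x = np.asarray(x, dtype=float)
    return np.where(
        x < knots[0], slopes[0] * (x - knots[0]),
        np.where(
            x > knots[1],
            slopes[1] * (knots[1] - knots[0]) + slopes[2] * (x - knots[1]),
            slopes[1] * (x - knots[0]),
        ),
    )

# Even for inputs far outside anything seen in training, behaviour is predictable:
x_extreme = np.array([-100.0, 0.0, 100.0])
print(monotonic_saturating_edge(x_extreme))  # bounded above by `scale`
print(piecewise_linear_edge(x_extreme))      # linear in both tails
```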

It sounds like a way of incorporating expert knowledge in a little bit less restricted way. So you are not saying that this will take a particular value, or that there's a particular edge between the nodes, but you say, hey, this relationship should not be like this, for sure.

Right. So we say that, at least. Yeah, exactly. It's kind of a way to make a parsimonious model, so to simplify it a bit. You know, for a specific node you could, say, make it some multilayer perceptron of all of its parents, right? But then it's really hard to know what it's learning.

So, I think it's a bit better to maybe take away some of the expressiveness of the model to have some functional forms along the edges that the user can trust. And we actually do, you know, speak to the customer about that. Right. So I mentioned before kind of this iterative causal discovery approach until, say, we settle on some directed acyclic graph, or DAG.

I might have used that acronym before without defining it. We also do it at the modeling level, right? So just finding a causal graph that people will accept is, you know, just the first step if you want to build an SCM; there's the whole functional dependency of a node given its parents, essentially. You mentioned causal discovery. How many iteration steps do you do typically?

You mentioned causal discovery How many iteration steps do you do typically? In practice when you work with causal discovery algorithms and, and human experts.  I think it's really dependent on  kind of,  it's dependent on like on each business, to be honest with you. So a lot of customers we get some are like quite deep down the causality rabbit hole, right?

They realized that this is something they need to do, and they might've also done some of their own homework ahead of time. Right. So sometimes we come in with full graphs of what they think it is, right? And then we're just validating it given the data: how much do the multiple algorithms agree with that?

Others, we kind of iterate on: we come with, say, maybe no domain knowledge, and we try to tease that out, and, you know, that takes a few more iterations. So I mentioned before kind of the hierarchy; we find that that's quite an easy way to restrict the search space of possible graphs without actually encoding too much bias or error, right?

If you go down into the "oh, I know for a fact there's no edge here" or "there's this edge here", you know, if you're wrong, that's going to propagate through the whole graph, right? But this kind of hierarchical knowledge, that these nodes can affect these nodes, which can affect these, that's kind of the stuff we use more.

And I mentioned before, we don't just use a single method. So we do have quite a family of causal discovery methods, so you get a bit of the iteration from that. We might try to present different results and see. That's very interesting. Many people who are just starting with causality have this fundamental fear.

They might come up with a certain DAG, but they might be very unsure if this DAG is correct. And then they ask themselves a question: what if this model is wrong? Looking from the perspective of the process that you have just described, what would be your advice to those people?

I mean, I guess that's true with any model, right? You normally would do some validation phase and some testing, right? So you can almost backtest it, to use a term from maybe the more forecasting-oriented domain. So I wouldn't just take all of the data we have, fit it, and then have nothing, right?

So, like, you should really do a training, validation, and maybe test split, right? So you can look at, say, predictive information. If you just care about rung one, you can use that DAG and the SCM you might build from it and see how well you can predict various nodes within the graph given their parents.

If you have any interventional data as well, you can see if the estimated effect you're getting from it matches what you might've measured before. This was kind of a point I meant to make when we were discussing the double ML inspired training routine we have for the SCM. Normally you care about the treatment on this specific outcome.

We wanted to learn these functional dependencies such that you could pick any pair within the SCM and say, what's the effect of Z on Y, of X on Y, anything, and then, you know, you'd have an unbiased estimate. It sounds like learning a full SCM where you can basically change your query and say, I want to understand the effect of this one on this one, and then another one on another one.

And you have this flexibility to ask any question about any effect in the model. Absolutely. So you lose a bit of predictive accuracy, right? So if you definitely want to just predict this one target variable as well as possible, you're going to lose some, because one, you're looking at less data, doing the data splitting technique.

And two, yeah, you're not going to have an optimal prediction, say, of each node, but you know that if you were to break anything going into it, you're going to have an unbiased estimate of that treatment effect. It sounds like finding a balance between causal bias and statistical estimation bias.

Yeah, that's a good way to put it. Just to circle back a bit on your question about maybe the fear of not finding the perfect DAG, right, it really depends on what you want to estimate as well. So you could have 50 variables, but if you care specifically about how, I don't know, this one advertising color or language affects the click-through rate on a banner ad or something, potentially a lot of the structure could be completely insignificant to that treatment effect, right?

 So some of it is like, you don't necessarily always need to find a full DAG for everything, right? It really depends on what you're, what you're trying to do.  That's a great point. I had a conversation recently with Naftali Weinberger, he's a philosopher of science focused on causality, the intersection of causality and dynamical systems.

And  he has this perspective that causality is scale, is a scale specific phenomenon. So at one time scale, you might have one model. At another time scale, you can have another model. And this also translates to,  to spatial scales. Huh.  And...  So one of the examples that he gave in the podcast was  that we might have a variable that is actually cyclic, right? 

But it might be the case that the cycle is at such a time scale that we actually don't care for the interventions that we are doing. I guess actually we can even model it, per se, if it is cyclic. So, in the Elements of Causal Inference book, I think one of the chapters at the end is about time series causal graphs.

So as opposed to having a single node for, say, some variable Y, you can have the observations at different points in time, right? So like X in the past could drive Y  more recently in the past and back and forth, right? So if you were to compress it in time, it would look like there's a cycle, but again, it's all the resolution you're measuring at it, like you're saying.

I remember reading, I think, the PCMCI paper by Jakob Runge, and he did an extension of it called PCMCI+, which allows for instantaneous effects. So in the original paper, you can only have lagged variables driving, you know, the current time. You might not be able to measure at that resolution, right?

Like, obviously if you could get down to the nanosecond or faster, everything is going to be lagged. But if you're measuring something at daily, weekly, monthly resolution, the effect of that treatment is going to be observed in your observation essentially instantaneously. And I think you can always model that, really, right?

I don't see an issue. Even if there is some really slow cycle, you can actually put it down into an SCM; you could just build it. The underlying framework for it would be one of these time series causal graphs, where you might know that at the weekly scale you are seeing some cyclic effect, but when you just care about predicting the next day, you're not going to worry about that.

You did some research and published some work in time series causal discovery. Can you share a little bit about this with our audience? Yeah, so what motivated that was trying to break some of the underlying assumptions of the methods. So, we didn't propose any new methodology or anything.

We just proposed a means to generate synthetic data to validate how sensitive some of the methods were to breaking their underlying assumptions. So we looked at a vector autoregressive version of LiNGAM, at NOTEARS, and we looked at PCMCI using different conditional independence testers. And then we're just comparing the Hamming distance, which is a means to measure how far your estimated graph is from the true one.

So that's a means to measure how far. Your estimated graph is from, from the true one for different scenarios. So,  you know, we would break a linearity assumption. We would break like an assumption on the Gaussianity  of the noise, things like that. And then, yeah, we open source that software just for people to be able to quickly generate time series that, that might, you know, both agree and invalidate some of the assumptions and see how sensitive it is for the algorithms.

A lot of the stuff we were doing prior, and still now really, is applying or slightly modifying these methods to work in real world scenarios, where, you know, you might not have theoretical guarantees per se, right? So the PC algorithm famously assumes no hidden confounding, but I'm sure for every dataset you're going to have, there's always going to be some variable you're not able to measure that's having an effect on some of your covariates.

It also assumes a perfect oracle conditional independence tester. But that's never the case. You always have small sample effects. You're never going to be able to measure with 100 percent certainty whether or not two variables are conditionally independent given some other set. So, yeah, we were just doing internal benchmarking, essentially, which is what motivated that workshop paper, and we were just trying to test how sensitive some of the methods were to breaking their assumptions.

And we're just trying to test you how sensitive some of the methods were to, to breaking their assumptions.  I believe that this is a very important area of work in causality today,  especially after the papers from Alex Reza and also from your former colleagues your former colleague and your, and our CTO and your CDO,  uh, about, about as I said, as suitability of no tears for causal discovery and, and, and showing that. 

It was more in Alex's paper, showing that there might be certain properties in synthetic data that make the work of the algorithms much easier, and the challenge is that those properties are not necessarily present in data outside of the synthetic data generating process.

Yeah, their paper, just for the listeners and viewers, was called "Beware of the Simulated DAG", I think. Unfortunately, I guess our NeurIPS workshop paper was just a bunch of simulated DAGs as well, but yeah, we were trying more to break things, to see how far we'd have to move away from the ideal case that the algorithms were written for.

 Yeah, I, I quite like their paper and it brings a good point, right? And it's actually somewhat of a problem in general with  the causal discovery literature is there's just not that much ground truth for real world data, right? I think I remember the PCMCI paper looks at climate climate data. So it's time series about like large scale climate events.

But everything is still kind of built by domain experts, right? So there's not that many real world datasets that we can provide. So a lot of the stuff is just with, like, Erdős–Rényi random graphs, and then generating random functional dependencies, random noise, and that's what the stuff's benchmarked against.

I think the CLeaR conference, this past year, they had, I don't know if it was a dataset track or something, specifically for trying to get some more datasets. Yes, it was a call for datasets, yeah. And I think that's great, that we have initiatives like this.

Mm-hmm. Definitely benchmarking in causal discovery specifically, but in causality in general, is a very important topic. And if we think back to 2014-15, to the so-called ImageNet moment, one of the conditions that made this moment possible, and all those breakthroughs in computer vision possible, was the fact that we had some gold standard datasets, right?

I think there's still errors in that, though, right? Yeah, of course. Yes. And we learned with time that they were not as perfect maybe as we wished, but still there was something to compare to. The only thing I would add, I guess, is on the paper on the unsuitability of NOTEARS for causal discovery.

I guess the original NOTEARS paper actually doesn't specify that it's for causal discovery. It's just for Bayesian networks. And then people took it and applied it to causal discovery. So the original authors never made some claim that it was a perfect causal discovery algorithm, right? Yeah, that's true.

I think also the authors of the Beware of the Simulated DAG paper, they also don't say that the original authors made some claims, but  this doesn't change the fact that. It was a very useful paper, eye opening paper. Now there's another paper that is a, a follow up. We'll link to both papers in the, in the show notes, so you can, you can read them if you're interested in this.

Interestingly, it turned out that in causal inference we encounter similar challenges. There's a recent paper by Alicia Curth from the van der Schaar lab showing that synthetic data used for benchmarking causal inference algorithms might have similar effects, similar in spirit, let's say. So that's another challenge.

You also worked on a very interesting paper that uses the so-called A* algorithm in the causal context. Could you share a little bit about this work? Yeah, so it was a follow-on from a couple of recent papers using the A*-like pathfinding algorithm, which is quite a historical algorithm, to find causal graphs for purely continuous data; that's kind of how it works.

So, there was a paper where the algorithm was called Triplet A*, and then they quickly followed on with ones called, I think, A* superstructure and local A* to make it a bit faster. And at the time when we were working on it, I think it was maybe one of the ones that they claimed was, you know, scalable to hundreds of nodes and would converge, you know, before the heat death of the universe.

So how it works is, it's an exact search, score-based method, and the score is the Bayesian information criterion for each node. So you have a node and a set of potential parents. Say it's a really small graph with three nodes.

You have X1,  you have X2, or you have X1 and X2. And all three of those are essentially like a  branching graph, right? You can start the null set X1, X2, and then the combination. Obviously,  The size of potential of the parent graph for each node scales immensely as you have a large number of nodes. And you're trying to find the shortest path through that, meaning the path based off of the score.

So how well X3 can be predicted given nothing, given X1, given X2, or given X1 and X2. And you optimize that parent graph across all the nodes simultaneously, essentially. And it assumes linearity and Gaussian additive noise. And yeah, you have to use the BIC, and that's where the theoretical guarantee is that you'll converge to the correct graph if those conditions are met.
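The per-node scoring that such an exact search works over can be sketched as follows, assuming a linear-Gaussian model (illustrative only, not the paper's implementation):

```python
import numpy as np
from itertools import combinations

def bic_score(data, child, parents):
    """BIC of regressing `child` on `parents` (column indices) with Gaussian noise."""
    n = data.shape[0]
    y = data[:, child]
    X = np.column_stack([np.ones(n), data[:, list(parents)]]) if parents else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid.var()
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = X.shape[1] + 1                    # regression coefficients + noise variance
    return -2.0 * log_lik + k * np.log(n)  # lower is better

# Enumerate candidate parent sets for X3 (index 2) in a toy 3-variable problem:
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = 1.5 * x1 - 0.8 * x2 + rng.normal(size=500)    # ground truth: {X1, X2} -> X3
data = np.column_stack([x1, x2, x3])

candidates = [()] + [c for r in (1, 2) for c in combinations((0, 1), r)]
for parents in candidates:
    print(parents, round(bic_score(data, child=2, parents=parents), 1))
# The true parent set {X1, X2} should get the lowest (best) BIC; background
# knowledge such as tiers simply removes candidates from this enumeration.
```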

There's also no unobserved confounding allowed, so you have to have causal sufficiency. So, the superstructure paper and the local one: the local one allows you to not do it simultaneously, so you can parallelize it a bit, and the superstructure one uses the graphical lasso as an initial set, so it seeds the potential parent graphs.

In, let's say, the naive case or the baseline case, each node can be a function of any of the other ones. But you use the graphical lasso as an initial filtering step to get a sparse estimate of these parent sets. So what the graphical lasso gives you is an estimate of the precision matrix, assuming all of your data is sampled from a centered multivariate Gaussian.

Can you tell our audience a little bit more about what the graphical lasso is? Yeah, sure. So the A* superstructure method uses the graphical lasso as a means to seed the set of potential parents. What the graphical lasso is trying to find is a sparse representation of the precision matrix, assuming that your continuous variables, so your big matrix of variables X, are a sample from a zero-mean multivariate Gaussian.

So this means that this is a method which, in its essence, is a regularization method over the space of edges? Yeah, you can kind of think of it as a way to get a sparse skeleton, so an undirected graph, as a starting point of a causal discovery method.

But it's assuming linearity and Gaussianity, of course, and, again, causal sufficiency, that all the variables interacting in the system are observed. So, yeah, the authors of the most recent A* paper, which I think was from 2021 NeurIPS, but I can give you the link for that as well.

They propose using the graphical lasso as an initial step. Beforehand, in the original paper, the naive method is that every node can be a function of all the other nodes, so the parent graphs for all of them are quite big. Using the graphical lasso as an initial step, you are pruning the parent graph a priori, before running the A* search algorithm, essentially.
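A small sketch of that seeding step using scikit-learn's GraphicalLasso (illustrative; the paper's and causaLens's pipelines may differ in the details):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
n = 1000
z = rng.normal(size=n)
x = 0.9 * z + rng.normal(scale=0.5, size=n)
y = 0.7 * x + rng.normal(scale=0.5, size=n)    # hypothetical chain: Z -> X -> Y
data = np.column_stack([z, x, y])

model = GraphicalLasso(alpha=0.05).fit(data)
precision = model.precision_

# Non-zero off-diagonal entries of the sparse precision matrix are the candidate
# undirected edges; zeros mean the pair looks conditionally independent given the
# rest, so it can be pruned from every parent set before the exact search runs.
skeleton = (np.abs(precision) > 1e-3) & ~np.eye(3, dtype=bool)
print(np.round(precision, 2))
print(skeleton.astype(int))
```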

So what our paper did was extend that a little bit more, but with domain knowledge from the problem. And we looked at a couple of different versions of prior knowledge you can have in causal discovery. There was another nice paper, not quite a literature survey, but it defined a table of different types of prior knowledge. So obvious stuff like: we know there's an edge there, but we don't know the direction; we know there's a directed edge.

We know it's a forbidden edge. We know that a node is more likely to be a source node, so it might have fewer parents, or it's more likely to have more parents. And it follows some causal hierarchy, stuff like that. So we wanted to see how to encode that into the A* method. And how we ended up doing it was essentially being able to further filter this parent graph.

So, you know, you're not required to use the graphical lasso as an initial step. But we just looked at what's the speed-up you get using the A* method if you add in one known edge, or if you add in a forbidden edge. For a forbidden edge, you can imagine you're essentially removing one of the candidate nodes from the parent graph for that specific node.

One of the cooler things we found, and just to shout out my colleague Steven, he is the first author on that paper, and it was, you know, the majority of his work, so he did a great job on that. We found that if you just define three tiers. So, in your full dataset you say one node is guaranteed to be a source node.

So again, I gave an example earlier about age. If age is in your dataset, nothing's driving that. You can't affect that, right? And one tier is your sink nodes. We'll say click-through rate in the advertising case, right? That might be your target, and you're not measuring anything else downstream.

Then you put every other node in the middle tier, and that reduces the number of score computations, I mentioned before that it uses the BIC, by half, which is quite substantial, right? So it's not like a big-O change, it's still on the same order of magnitude, but it's quite a big speed-up, and it's fairly easy to encode this information into the parent graph. Practically speaking, it's a major change, especially if we think about causal discovery algorithms that might be a little bit slower with bigger graphs.

Before, we also discussed this idea of the fundamental fear of having the wrong DAG, and I have a feeling that those two threads are connecting here a little bit in terms of the meaning of additional information. So even if we don't have full information, we can just exclude or include one edge or ten edges.

This might significantly reduce the size of the search space. Absolutely, yeah, right. So, in general, I think discovering a Bayesian network is an NP-hard problem, right? That's been proven, and the space of graphs grows super-exponentially with the number of nodes. So any domain knowledge you can bring into the problem to reduce the search space helps significantly.

Another thing is, you know, maybe a coarse-to-fine approach, to use some language from computer vision. You can kind of look at clusters of variables, find how those might interact, and then drill in at a higher resolution. Or, as I mentioned before, for some problems you really don't care about having the whole graph, right?

So you just care about some local structure. One paper I read somewhat recently, I think the method was called DAG-FOCI, and it was finding what the true drivers are of your target variable of interest, so what the parents are. And it starts with this initial method called FOCI, F-O-C-I, and that's trying to find the Markov boundary of your target variable.

The Markov boundary is defined as the minimum-size Markov blanket. So, I don't know if you're familiar with the Markov blanket, but what it is is the parents of the node, the children of the node, and the other parents of the children. That's defined as the Markov blanket. And if you condition on all of those nodes, your variable is conditionally independent of everything else.
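As a quick sketch of that definition on a small hypothetical graph (using networkx):

```python
import networkx as nx

# Hypothetical DAG around a target node T.
G = nx.DiGraph()
G.add_edges_from([
    ("A", "T"), ("B", "T"),   # A, B are parents of T
    ("T", "C"), ("T", "D"),   # C, D are children of T
    ("E", "C"),               # E is a co-parent (spouse) via child C
    ("F", "A"),               # F lies outside the blanket
])

def markov_blanket(g, node):
    parents = set(g.predecessors(node))
    children = set(g.successors(node))
    spouses = {p for c in children for p in g.predecessors(c)} - {node}
    return parents | children | spouses

print(markov_blanket(G, "T"))   # {'A', 'B', 'C', 'D', 'E'} -- F is excluded
```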

So it kind of has all the information you would need to predict. And having the parents is really what you would want if there's some ordering, right? If you have your children, but it's a delayed effect, you can't use that to predict the other thing, right?  So  if you're really, you know, just care about finding like the robust set of features to predict your target and you know that like this graph structure isn't changing in time, but maybe, you know,  some of your variables are a bit non stationary.

So the distribution driving them is changing. You can use some methods like that. So, I mean, I would say, like, you know, it's never a silver bullet. We're not just looking for a single DAG and a single,  like, model for every problem. You really want to use the tools that are applicable for what you're trying to do.

Mm hmm. Some people have an impression that the formalism of DAGs, directed acyclic graphs, might be very limiting. Especially given that, in more complex cases, like marketing or any social phenomenon that we might be interested in modeling, we cannot really exclude potential hidden confounding.

And for many people who are only familiar with DAGs, basically, but not other types of graphs, this seems like a major blocker. But we know that identifiability in causal terms is not necessarily limited to DAGs, or directed acyclic graphs. What is your view on that, and how do you see this in practical terms?

Yeah, sure. So yeah, I think I mentioned earlier that any causal discovery method can resolve up to a Markov equivalence class of graphs, so you normally can encode that into a different mixed graph type. Like a CPDAG, a completed partially directed acyclic graph. There's also maximal ancestral graphs and partial ancestral graphs.

The partial ancestral graph is the normal output of FCI, which is fast causal inference, actually a constraint-based causal discovery method. And yeah, you can perform inference on these mixed graph types directly, right? You don't need to resolve down to a single DAG.

You can, yeah, figure out: is the effect of T on Y identifiable from this, and what's a reasonable adjustment set you might need? So yeah, that's also another avenue, right? You don't need to resolve down to a single DAG, depending on what you're trying to estimate. Yeah, sometimes we might be just lucky and go on with a structure that looks not very promising in the beginning.

But it turns out that from the formal point of view, the calculus point of view, we can actually resolve it.  Yeah, the other thing I would, I would just mention is a lot of the stuff we do, maybe we're in,  we care about being able to identify effects of, of multiple things. So it's not just,  you know, what's the effect if I change T from  true to false, right?

You, you kind of want to take it a next level. So you have, these are the five levers I can pull. What's the optimal intervention? Or set of interventions to perform to say reverse an unfavorable decision. So that's kind of the, the algorithmic recourse problem, right? You have a set of  intervenable variables and you want to know what's the, the set of interventions to essentially get over the decision boundary.

So you're rejected for a mortgage: what do you need to do to get over the boundary? But you can also look at other things, right? Not just trying to reverse an unfavorable decision, but, say, maximize some revenue or something: I can pull these three levers, which ones to pull, and at which level to pull them.

You worked with many different use cases when it comes to applying causality to real world scenarios: complex things like supply chain management, or something completely different, like manufacturing. What were the main challenges that appeared in those different cases? And how were they different between the scenarios?

Yeah, so for supply chain, obviously, you know, you're never going to have sufficiency in that regard, right? So maybe the biggest challenge is really that this is a complex system, and you're never going to be able to capture all the variables interacting in it. It's also highly non-stationary.

So we had looked at the supply chain of an electronics component manufacturer that was quite high dimensional and highly non-stationary. And we were trying to figure out what was driving one of their KPIs. So, say, throughput of components sold, or lead times, stuff like that.

And so, one, the effects you're seeing aren't always on the same scale. The data is highly non-stationary, so we were looking at ways to stationarize the data to then apply maybe more standard causal discovery techniques. So you're bringing in some tools from time series, like fractional differencing, essentially.

So if the data is non-stationary, you can kind of think of looking at the change of it, right? So a one-step difference is kind of like the first derivative. But sometimes you want it not to be exactly just the prior value, but some fractional weighting of the prior values. So we were doing a lot of pre-processing there to then try to figure out what was actually driving their KPI.
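A sketch of fractional differencing as such a stationarising step (the parameter choices here are hypothetical, not from the project):

```python
import numpy as np

def frac_diff_weights(d, n_weights):
    # Binomial-expansion weights: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series, d, n_weights=20):
    # Weighted sum of the last n_weights observations, most recent first.
    w = frac_diff_weights(d, n_weights)
    out = np.full(len(series), np.nan)
    for t in range(n_weights - 1, len(series)):
        window = series[t - n_weights + 1 : t + 1][::-1]
        out[t] = np.dot(w, window)
    return out

# d = 1 recovers the ordinary one-step difference; 0 < d < 1 removes the trend
# more gently, keeping more of the series' memory for the discovery step.
ts = np.cumsum(np.random.default_rng(4).normal(size=200))   # a random walk
print(np.nanstd(frac_diff(ts, d=0.5)), np.nanstd(frac_diff(ts, d=1.0)))
```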

On the manufacturing side, that worked quite well out of the box for a lot of things, actually. In general, we found that things with an underlying physical system, like a factory, are very constrained in terms of which component goes into which other component. That's also true in supply chain, but there it's at a huge scale, right?

Mm-hmm. Not every part can interact with every other; there's still some spatio-temporal dependency, but manufacturing is much more constrained. Imagine an assembly line: you really know the order of things, so you can model the information flow quite well. For the supply chain, a lot of the stuff we've been looking at is a bit more predictive, a bit more about what interventions we can do that have some marginal gain.

Whereas on the manufacturing side, a lot of the use cases have been more root cause analysis. So, manufacturing maybe some bigger box items, or manufacturing specific components that go into another product: what's leading to this anomalous outcome, like really bad yield rates for some component?

What step in the pipeline is causing that? That's some of the stuff we looked at. It seems from what you're saying that there are many different methods, coming from many different areas or sub-areas of engineering, computer science, and so on, that you are using in your everyday work. How do you feel, to what extent, the fact that you did your PhD

is helpful for you today? For my PhD specifically, or? No, the experience of doing a PhD: how does it translate to your work? Do you feel that it's something substantial to the way you work today and the way you see problems? Yeah, I think it's helpful, not necessarily for the concepts, but for learning how to

quickly distill information. I think that's more the skill I got from my PhD, and it's what I see with the team as well. Maybe taking a step back, I can give you what our philosophy was in building up the team, because obviously the stuff I've spoken about here, it's not just my work, right?

It's a big group of people. I'm in a team with machine learning engineers who are a bit stronger in the raw software development stuff but still understand all the concepts, and then we also have research scientists to whom you could say, hey, this looks like a promising new topic,

here's this paper, does it seem like something that would be beneficial to add to our toolkit? That's where the PhD really helped me: being able to see a new paper, look at the abstract, the method, the results, without reading it top to bottom, and make a judgment call on both the level of effort to implement it and the potential uplift you're going to see from it.

And that's what I've seen in people on the team. So yeah, that's the skill we were looking for when we were building the team up. We actually weren't even looking for people with a purely causality background. It was more: do they understand the fundamentals of machine learning? Can they explain how the underlying methods work, how the math works underneath? And can they quickly prototype some of this stuff?

So, can you see a paper and turn it into a prototype implementation pretty quickly so we can test it? I think that's the skill you get from the PhD. You worked in a bunch of technical roles and now you're a Director of Research. How was the transition for you from the technical

role to one that involves more social interaction, working with people, managing people? It's still very technical. So yes, there's a line management component, right? Career progression, one-to-ones. But there's also a lot of helping with the product roadmap.

How I see my job now is kind of like an unblocker of people. Beforehand, I might have had a specific project where we wanted to implement a particular method, add it to one of our packages, and get it into the monthly release. Now it's a bit more like, okay, I might own a few items from the roadmap,

but I'm not the individual contributor on each one. I'm still connected enough to know, okay, this person hit a wall looking at this methodology, what do we need to do here? Oh, this person might have come up with something. So I feel like if I'm able to ensure three other people on the team are moving quickly and unimpeded, right,

that's better than if I'm singly focused on one thing. But it's still quite technical. I mean, when you came this morning, I was finishing addressing PR comments, pull requests, for those who aren't too familiar. We're doing a release candidate today. We have a monthly release cadence, and I still had to

get one of the pull requests in and then release a new tag, a new version of the package that would go into the release candidate. So it's not fully hands-off. Yeah, when I joined, the company was quite flat. Everyone was essentially a data scientist, and then as we grew, we moved into specific roles.

So initially there was a team for modeling, it was called modeling R&D, and then it morphed into a general R&D team. Then you had a specific wing for data scientists, and now we have pre-sales data scientists or customer success data scientists,

depending on where they sit in the customer value chain.

You told me that as a child you really liked math and physics, and science in general. No chemistry though, you said. No chemistry. We know it.

What was it that you found so interesting? What motivated you to go in this direction? I don't know, that's a deep question. It's just seeing stuff work. I think I mentioned at the dinner that I initially did my undergrad in engineering and worked a bit as an electrical engineer, then transitioned to a software developer in my previous job before going back to academia.

Maybe one of my regrets is that I wish I'd done my undergrad in computer science instead of engineering, but at the time I didn't have much experience with computer science or really see that it was a viable path. Everyone just said, oh, you should be an engineer because you're good at calculus, which I never quite understood.

I don't know. I think what I liked about it... I still remember as a kid doing science fairs, like you see in the sitcoms in the U.S., right? The volcano is the trope, right? But I made chalk, something you can kind of grow yourself, and did some other experiments.

I don't know, math always kind of made sense to me. Even algebra and stuff early on, just being able to solve for x. It was very intuitive, and if you followed the rules defined within that system, you'd come to the answer. Yeah, I think I'm just more left-brain than right-brain, to use that stereotype.

So how did it feel when you were able to find a solution? I guess rewarding, right? The task is to, I don't know, solve for x when you're in middle school, and then it's some crazy path integral when you're in college or something. And I guess it's just rewarding, right?

Whenever you can see something through to completion. And that's how it feels in the job too, right? Just getting your pull request merged isn't the end of the story. We'll do internal user testing; that's what the release candidate is for, right? It's for internal testing.

Let's see if this stuff works. If it does, it goes into the stable release and gets shipped to customers. So the end of the line is actually seeing them use it. We work hand in hand with a lot of customers, and it's nice knowing that their internal data scientists are seeing value in what you're delivering and that it's working for their problem.

Who would you like to thank? Well, everyone on my team: Francesca, Matteo, and Steven, also Marcin, Ilya, and Max Elliott. Everyone gives 110 percent every day, and I don't think we'd be as strong as a company without them. I really appreciate everything they do. Also Max and Darko, for their guidance.

I've been working quite closely with Max, our CTO, very recently, so it's great being able to have his ear and have him providing input on a lot of different aspects of things. But everyone at the company, really, all the different teams are great. And obviously my parents, if we're going to go way back. They're the ones who nurtured my education and academic success through the 37-plus years I've been around.

What was one particular thing your parents did to make you feel supported or nurtured, as you called it? I still remember, ever since I was a little kid, my mom reading to me every night. I think that helped a lot with consistency, having a fixed routine, and with a thirst for knowledge, always trying to learn more. From when I was really little, around three, up into my preteens, I remember every night we would read something, even for just 20 or 30 minutes,

just to have that routine and keep learning. And still now, I always have some fiction book on the go; that's my preferred type. Yeah. It sounds like something that also builds a lot of closeness. Absolutely.

Yeah, my family were always quite close, actually. My dad would be home for dinner every day. My mom was a teacher, so she had quite a nice schedule; she was finished by early afternoon. When I got older, she transitioned to an administrative role at the school,

so she worked a bit later, but we always had family meals together. Quite a supportive family, really, the whole time. Is this something that translated to your own relationship as well? Yeah, of course. You asked who I would like to thank, and obviously the person I'd like to thank the most is my wife, Leanne.

She's very supportive. I actually moved out here to be with her, so we've been together for quite a while now; I think we've known each other for more than 10 years. We got married right before the pandemic, which was lucky, because we had no wedding insurance or anything. I didn't even think there was much risk.

All of our family and friends flew over from the US, and we got married in, what, March of 2019. If it had been a year later, everyone would have been stuck out here. But yeah, she's great. She was supportive of me going back to school; we weren't earning very much while I was doing my master's and my PhD, and she helped support me through furthering my education.

We moved from Bath to London together, so she kind of switched her job. She's a consultant, so she can work from wherever, but she obviously needed to be at a customer that was London-based rather than West England-based. So yeah, she's always been there, supportive of anything I wanted to do for my career, my personal life, anything.

What would be your advice to people who are just starting out? Maybe they want to go into causality, or maybe they're just starting with machine learning in general, and they feel a little bit overwhelmed because there's so much to learn to make these things work. I guess you don't really need to be an expert on everything. There's a nice metaphor I've seen at university, and not just for people getting a PhD, where they say: this circle is the sum of all knowledge in the universe, right?

And your PhD is a small bump in some hyper-focused area that just pushes the boundary of human knowledge a little further. And I don't understand everything in ML or in causality; the field is too big. I think you look at what you're interested in. One of the most general books is Christopher Bishop's Pattern Recognition and Machine Learning.

I think that one's quite good; you can really just start at the beginning. It's very fundamental: it even goes through properties of the Gaussian distribution, right? It doesn't start in the deep end. Then see what you're interested in. I also think going use-case specific helps: pick what area you want to focus on and go that way.

The group I did my PhD in, at the University of Bath, was very computer vision focused, so a lot of people were looking at graphics applications and things like that. And maybe tying it back to the beginning of our talk: some of these associational, correlation-based methods work quite well in non-safety-critical systems.

Some of the stuff being done by NVIDIA and Epic on the generation of graphics or visual elements is very impressive, and if you're really interested in video games or movies, you want to focus on that type of stuff. If you're really interested in healthcare, right,

there are specific techniques you want to focus on for that, thinking drug discovery and things like that. It's really about what motivates you, and there's no need to become an expert on everything to get where you're going. What resources would you recommend to people who are just starting with causality?

Sure. Brady Neal has a really good tutorial, or rather an online course: a set of lectures and slides that are very approachable. I mentioned it earlier, but a little more advanced is the Elements of Causal Inference book. That one is quite deep and good.

The Book of Why by Judea Pearl is a bit more narrative, but it puts things in context quite well. Let's see, what else? Well, I think you have a book, right? Maybe people can start with that. But yeah, there are tons of resources online, actually. There are a lot of good summer school lectures.

I can't remember specifics, but I know that for new joiners in causality we have a list of recommended reading, which includes some previous lectures, some papers, some textbooks. Yeah, there's a lot of stuff out there. What question would you like to ask me? In all of your work, where have you seen the greatest success with causal methods?

In terms of? Like, outcomes for business. Outcomes for business. The problem is that I cannot speak about certain things. Broad strokes, then, maybe the use case. So I think one of the use cases where you can see a lot of value coming from applying causal methods is marketing.

I would say that's definitely one of the big ones. And is that just because of the freedom to intervene and run experiments? That's one thing, though it's not always the case. What I think is a more fundamental driver of value is that people in marketing, often by education, don't look into the structure too much.

So even if you build a structure that is only roughly okay, even if it's not perfect, it might improve the outcomes really significantly. In general, do you find presenting someone with a causal graph helps them understand their problem better? Is it more intuitive, easier to understand how the information is flowing through their system, or not necessarily? Definitely.

So I understand your question is whether presenting a graph representing the relationships in a system is helpful for someone. Yeah, I think DAGs are a great tool for conveying the vision and translating some of the modeling assumptions to people who are not necessarily technical, and even to technical people who are just not familiar with causal thinking.

Because if you throw something into a big, fully connected feed-forward network, right, you don't know how all those features are interacting. And really the DAG is saying: if you build a structural causal model, you're limiting how certain features can interact, and some you're not allowing to interact at all in the model, right?
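A tiny, hypothetical illustration of that point: in a structural causal model, each variable is a function of its parents in the DAG only, so the graph itself rules out interactions that a fully connected predictor would happily learn. The graph, coefficients, and noise below are all made up for the example.

```python
# Hypothetical example: a linear SCM where each node depends only on its
# parents in the DAG, unlike a fully connected model over all features.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

# Marketing-flavoured toy DAG: spend -> traffic -> conversions <- price
dag = nx.DiGraph([
    ("spend", "traffic"),
    ("traffic", "conversions"),
    ("price", "conversions"),
])

# One linear structural equation per node, using its parents only.
coef = {
    ("spend", "traffic"): 0.8,
    ("traffic", "conversions"): 0.5,
    ("price", "conversions"): -0.3,
}

def sample(n: int) -> dict:
    data = {}
    for node in nx.topological_sort(dag):     # parents are always sampled first
        noise = rng.normal(size=n)
        parents = list(dag.predecessors(node))
        data[node] = noise + sum(coef[(p, node)] * data[p] for p in parents)
    return data

samples = sample(1000)
# "spend" is allowed to affect "conversions" only *through* "traffic";
# a fully connected model would impose no such restriction.
print({k: round(float(np.mean(v)), 3) for k, v in samples.items()})
```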

Yeah, that's a great point. I love one thing that Naftali Weinberger told me. He said that sometimes people see a more complex model and a less complex model, and they assume that the model that looks less complex also has fewer assumptions.

But this is not necessarily the case. That's maybe another side of the same coin, where sometimes somebody says, hey, we put everything into this black-box model, we don't make any assumptions. But this is actually not true. You are making very strong assumptions about the structure of the data-generating process;

you just make them implicitly. And I think that's one of the huge challenges here. Many people tend to think that if something is data-driven, it's assumption-free, but in fact we might implicitly be making very, very strong assumptions. Mm-hmm.

Assumptions that will impact the way the conditional independence landscape is represented by the model. And another problem is that we are limiting our ability to learn, because if this model fails, what most organizations will do, in my experience, is either retrain the model, or say, let's retrain more often, or try to find a more complex architecture.

Increase the capacity of the model, essentially. Yes. And the thinking here is, hey, maybe it was not expressive enough to capture all the details of the problem, so let's find something more expressive. And this can be very detrimental, especially in the long run.

Of course, you're essentially going to overfit, and then as soon as anything changes slightly, or if you intervene, you're representing a completely wrong distribution, right? Yeah, so MLOps can be helpful, but sometimes, especially in a decision-making scenario, maybe the approach called CausalOps is better.

There's an excellent paper on this by Robert Meyer. When we think about open-source packages today, we have a significant ecosystem of causal packages in Python. They might not always be very easy to learn, given the differences in APIs and all that, but people can generally build pretty advanced systems with them.

What do you feel are the most significant or important contributions that you and your team have brought to this landscape when it comes to your product? Yeah, absolutely. First, maybe I'll start by saying that I think the open-source stuff is great. To give the viewers and listeners a brief overview, our product is really a cloud platform for decision making.

What users are presented with when they start is the ability to, say, launch a JupyterLab session. They can use all of our packages, but there's nothing stopping anyone from using open source. I spoke a lot earlier about various causal discovery techniques we have implemented and some modifications we made.

There's nothing stopping anyone from using, say, Huawei's gCastle, right, and then plugging it into other parts of our pipeline. We even have a template showing users how to discover a graph using gCastle and then take that graph into our own structural causal modeling.
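A minimal sketch of that kind of hand-off, assuming gCastle's documented API (here its PC implementation) and using NetworkX as the neutral graph format; the downstream SCM step is left abstract, since the causaLens packages themselves are proprietary.

```python
# Sketch: discover a graph with the open-source gCastle package, then hand the
# result over as a NetworkX DiGraph for whatever SCM tooling comes next.
# Assumes gCastle's PC class exposes .learn() and .causal_matrix as documented.
import numpy as np
import networkx as nx
from castle.algorithms import PC

rng = np.random.default_rng(1)
n = 2000

# Toy chain X -> Y -> Z.
X = rng.normal(size=n)
Y = 0.9 * X + rng.normal(size=n)
Z = 0.7 * Y + rng.normal(size=n)
data = np.column_stack([X, Y, Z])

pc = PC()
pc.learn(data)                                  # fills pc.causal_matrix (0/1 adjacency)

graph = nx.from_numpy_array(np.asarray(pc.causal_matrix), create_using=nx.DiGraph)
graph = nx.relabel_nodes(graph, {0: "X", 1: "Y", 2: "Z"})
print(sorted(graph.edges()))                    # discovered (directed) edges
```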

On the product side, we try to make our packages as modular and use-case agnostic as possible. So there's not just one big `from causality import solution`; it's very much modular across a lot of different techniques. What we try to do is understand the breadth of the landscape, see what looks promising, and offer an implementation of it that is well documented, well tested, maintained, and updated on a fast schedule.

I think I mentioned before that we have a monthly release cadence, so we're really trying to add new functionality, deal with feature requests from customers, and get improvements and bug fixes in quickly. We also make sure our packages support Python 3.8 to 3.11; the main images on the platform are 3.11, which gives you a bit of a speed increase. But on the modularity side, some of what we ship is, for example, a package of what we call decision intelligence engines that take one of our trained structural causal models and do something else with it. One is on algorithmic recourse, which I mentioned before, another on root cause analysis, so there are a few different

techniques we have to try to find the root cause, whether it's a source node or an intermediate node within the graph. So it's modular, but it's also plug and play. There's a shared API between everything, so if you discovered some model earlier in your workflow, you can plug it into one of the other components.

But there's really nothing stopping a user from bringing in open-source stuff. We have a set of metrics that are more prediction-based, so think of your general error terms, but also graph-based ones, like structural intervention distance, SHD, stuff like that. Then we also have some uplift metrics, and we use scikit-uplift directly, right?

A great open-source package for the Qini curve and things like that, which people use for uplift modeling, essentially. So yeah, we're definitely not trying to compete with open source or anything. It's really a platform for people to use the tools they're used to, and to use the ones that we provide as well.
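For concreteness, a small sketch of the two kinds of metrics mentioned, using only open-source pieces: a hand-rolled structural Hamming distance over adjacency matrices, and scikit-uplift's Qini AUC, assuming its documented `qini_auc_score(y_true, uplift, treatment)` signature. None of this is the causaLens metrics API.

```python
# Sketch of graph-based and uplift metrics with open-source tools only.
import numpy as np
from sklift.metrics import qini_auc_score   # scikit-uplift; assumed installed

def shd(a_true: np.ndarray, a_pred: np.ndarray) -> int:
    """Structural Hamming distance over 0/1 adjacency matrices."""
    diff = (a_true != a_pred)
    # A reversed edge flips two entries; here it counts as one mistake
    # (conventions differ between libraries).
    reversals = np.logical_and(diff, diff.T).sum() // 2
    return int(diff.sum() - reversals)

true_adj = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]])
pred_adj = np.array([[0, 0, 0],
                     [1, 0, 1],
                     [0, 0, 0]])       # X->Y reversed, rest correct
print("SHD:", shd(true_adj, pred_adj))

# Toy uplift evaluation: Qini AUC from outcomes, predicted uplift, treatment flag.
rng = np.random.default_rng(0)
treatment = rng.integers(0, 2, 1000)
uplift_scores = rng.normal(size=1000)
y = rng.binomial(1, 0.3 + 0.1 * treatment * (uplift_scores > 0), 1000)
print("Qini AUC:", qini_auc_score(y_true=y, uplift=uplift_scores, treatment=treatment))
```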

It's really a melting pot for whatever data scientists want to use. Beautiful. So it sounds like there are two sources of flexibility: one is modularity, and the other is that you also allow plugging anything from the open-source domain into the system. People can build a workflow with something they're familiar with, a method or a package API they really like, and seamlessly integrate it into the workflow.

Yep, definitely. If you have a graph already built with NetworkX, you can construct an instance of the graph we use to build a structural causal model directly from that. If you already have a graphical causal model from DoWhy, it would really just be a matter of writing a small wrapper. And on the platform, if you're in JupyterLab, you can open a terminal window and just pip install whatever's out there, right?

So anything you're used to, you can do. We have pre-baked images that come with the latest pandas, I think 2.1, NumPy, NetworkX, a lot of the stuff we use to develop our packages, PyTorch, Pyro, everything I mentioned earlier. But if you want to install something new, it's just a pip install away, basically.
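As an open-source stand-in for that kind of workflow (explicitly not the causaLens API), here is a sketch using DoWhy's `gcm` module, assuming its documented functions: take a NetworkX DAG, fit a graphical causal model on data, and draw interventional samples from it.

```python
# Sketch with open-source DoWhy (dowhy.gcm), standing in for a proprietary SCM API.
# Assumes DoWhy's documented gcm functions: StructuralCausalModel, auto mechanism
# assignment, fit, and interventional_samples.
import networkx as nx
import numpy as np
import pandas as pd
from dowhy import gcm

rng = np.random.default_rng(0)
n = 2000

# Toy data consistent with the DAG Z -> T -> Y, Z -> Y.
Z = rng.normal(size=n)
T = 0.8 * Z + rng.normal(size=n)
Y = 1.5 * T + 0.5 * Z + rng.normal(size=n)
data = pd.DataFrame({"Z": Z, "T": T, "Y": Y})

graph = nx.DiGraph([("Z", "T"), ("T", "Y"), ("Z", "Y")])   # e.g. built elsewhere
scm = gcm.StructuralCausalModel(graph)
gcm.auto.assign_causal_mechanisms(scm, data)               # pick a mechanism per node
gcm.fit(scm, data)

# Simulate do(T := 2): in this linear toy model, Y's mean should land near 1.5 * 2 = 3.
intervened = gcm.interventional_samples(scm, {"T": lambda t: 2.0},
                                        num_samples_to_draw=1000)
print(intervened["Y"].mean())
```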

Andrew, what would be your message to the causal Python community? It's a big community, and I think we should continue to communicate. I think you're doing a great job putting this podcast out there, and I'm always happy to talk to people. We were at NeurIPS this past year, and we also attended CLeaR, and it's just great for brainstorming and being able to meet up with everyone.

One of the things I think I spoke to you about at the dinner was this latent hierarchical causal discovery paper that used an idea called rank deficiency: trying to estimate how many latent confounders are really in the data by looking at the deficiency of the rank of the covariance matrix of your observed variables.

This obviously makes some linearity assumptions, but I had never seen it applied before, and we took it and put it into another part of our product. That kind of stuff just comes up naturally when you're speaking with other practitioners, researchers, and developers.
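A rough, self-contained illustration of the rank idea, not the paper's actual algorithm: if two groups of observed variables are connected only through k latent factors, their cross-covariance matrix has rank at most k even in the presence of independent noise, so its numerical rank hints at how many latents are shared.

```python
# Toy illustration of rank deficiency from shared latent factors (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 2                   # samples, number of shared latent factors

L = rng.normal(size=(n, k))         # latent confounders (never observed)

# Two groups of observed variables, each a noisy linear mixture of the latents.
A = rng.normal(size=(k, 5))
B = rng.normal(size=(k, 6))
X = L @ A + 0.5 * rng.normal(size=(n, 5))
Y = L @ B + 0.5 * rng.normal(size=(n, 6))

# Cross-covariance between the two groups: the noise is independent across
# groups, so Cov(X, Y) = A' Cov(L) B has rank at most k.
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
cross_cov = Xc.T @ Yc / (n - 1)

singular_values = np.linalg.svd(cross_cov, compute_uv=False)
print(np.round(singular_values, 3))               # ~k values well above the noise floor
print("estimated shared latents:", int((singular_values > 0.1).sum()))   # eyeballed cutoff
```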

And I think constant communication is always a great way to spur new ideas. Is the future causal? Absolutely. Yeah. I think anywhere you need to make a decision, figure out an action, or choose an optimal treatment in a safety-critical or high-value system, where errors are detrimental to life or to revenue and the business,

I think it has to be causal. For entertainment value, or less risk-heavy use cases, there are tons of methods out there that are totally valid. But when we really need to apply this stuff in scenarios that are make-or-break, life-and-death, it has to be causal. Where can people find out more about you, your team, and causaLens in general?

Yeah, so we have a research page on our website, causalens.com, where we publish our papers. Our main KPIs are more product-driven, so we maybe don't share everything, but we've attended some conferences, and we've put some talks and some

learning videos up on YouTube as well. I think we're going to do some more of that. We recently open-sourced our front-end package, Dara, so we're moving a bit more into open source and want to collaborate with the community. Just to give a brief introduction, Dara is our front-end app-building framework.

Data scientists will build stuff, maybe directly in code, and then want to build an app around it, to ingest some of those results but also for decision making. As part of that, we also open-sourced our causal graph library, because one of the bigger Dara components is a graph viewer. So those are available on our GitHub; I think it's just github.com/causalens. The code from the time series causal discovery papers is there, and the code to reproduce the results from Marcus's papers, on suitability and NOTEARS, is there too. So yeah, we're slowly adding more there as well. These are great resources.

We'll link to them in the show notes. Cool, Andrew, it was a pleasure. For me as well. Thank you for having me. Thank you for your time. I really enjoyed the discussion, and I'm confident that the community will love this as well. Cool, yeah, thank you. It was great meeting you. Cheers. Thank you for staying with us till the end, and see you in the next episode.

Who should we interview next? Let us know in the comments below or email us at hello@causalpython.io. Stay causal.
