# Machine Learning Papers Accepted to ICML 2020

The 37th International Conference on Machine Learning is an annual event taking place virtually this week. Professor David Blei is the General Chair of the conference for the larger machine learning research community.

Professor Elias Bareinboim presented a tutorial entitled “Towards Causal Reinforcement Learning,” where he discussed a new approach for decision-making under uncertainty in sequential settings. The tutorial is based on his observation, in his own words, “Causal inference (CI) and reinforcement learning (RL) are two disciplines that have evolved independently and achieved great successes in the past decades, but, in reality, are much closer than one would expect. Many new learning problems emerge whenever we understand and formally acknowledge their connection. I believe this understanding will make them closer and more applicable to real-world settings.”

In particular, Bareinboim explained that CI provides a set of tools and principles that allows one to combine structural invariances about the environment with data to reason about questions of counterfactual nature — i.e., what would have happened had reality been different, even when no data about this imagined reality is available. On the other hand, RL is concerned with efficiently finding a policy that optimizes a specific function (e.g., reward, regret) in interactive environments under sampling constraints, i.e., regarding the agent’s number of interactions with the environment. These two disciplines have evolved with virtually no interaction between them. Bareinboim noticed that even though they operate under different aspects of reality (an “environment,” as usually said in RL jargon), they are two sides of the same coin, namely, counterfactual relations.

His group at Columbia, the Causal Artificial Intelligence Lab, has been investigating in the last years the implications of this observation for decision-making in partially observable environments. In the first part of his tutorial, Bareinboim introduced the basics of CI and RL, and then explained how they fit the larger picture through a CRL lens. In particular, he used a newly proved result (joint with collaborators at Stanford University), the Causal Hierarchy Theorem (CHT), to discuss the different types of “interactions” an agent could have with the environment it is deployed, namely, “seeing”, “doing”, and “imagining.” He then puts the literature of RL — on- and off-policy learning — as well as the CI type of learning — called do-calculus learning — under the same formal umbrella. After being able to formalize CI and RL in a shared notation, in the second part of the tutorial, he notices that CRL opens up a new family of learning problems that are both real and pervasive, but weren’t acknowledged before. He discusses many of these problems, including the combination of online & offline learning (called generalized policy learning), when and where the agent should intervene in a system, counterfactual decision-making, generalizability across environments, causal imitation learning, to cite a few.

Bareinboim’s program is to develop a principled framework for designing causal AI systems that integrate observational, experimental, counterfactual data, modes of reasoning, and knowledge, which he believes, will lead to “a natural treatment to human-like explainability and rational decision-making.” If you are interested in joining the effort, he is actively recruiting graduate students, postdoctoral scholars, and trying to find collaborators who believe in Causal AI. The tutorial’s material and further resources about CRL can be found here: https://crl.causalai.net.

Below are links to the accepted papers and brief research descriptions.

Causal Effect Identifiability under Partial-Observability

Sanghack Lee *Columbia University,* Elias Bareinboim *Columbia University*

One of the central challenges in the data sciences is to compute the effect of a novel (never experience before) intervention from a combination of observational and experimental distributions, under the assumptions encoded in the form of a causal graph. Formally speaking, this problem appears under the rubric of causal effect identification and has been studied under various conditions in the theory of data-fusion (Survey: Bareinboim & Pearl, PNAS 2016; Video: Columbia Talk). Most of this literature, however, implicitly assumes that every variable modeled in the graph is measured throughout the datasets. This is a very stringent condition. In practice, the data collected in different studies are usually not consistent, i.e., each dataset measures a different set of variables.

For concreteness, consider a health care researcher who wants to estimate the effect of exercise (X) on cardiac stroke (Y) from two existing datasets. The first dataset is from an experimental study of the effect of exercise (X) on blood pressure (B) for different age groups (A). The second dataset is from an observational study that investigated the association between BMI (C), blood pressure (B), and stroke (Y). Currently, no algorithm can take these two datasets over {X, Y, A, B, C} and systematically combine them so as to identify the targeted causal effect — how exercise (X) affects cardiac stroke (Y).

In this paper, the researchers studied the causal effect identifiability problem when the available distributions encompass different sets of variables, which they refer to as causal effect identification under partial-observability (GID-PO). They derived properties of the factors that comprise a causal effect under various levels of abstraction (known as projections), and then characterized the relationship between them with respect to their status relative to the identification of a targeted intervention. They establish a sufficient graphical criterion for determining whether the effects are identifiable from partially-observed distributions. Finally, building on these graphical properties, they developed an algorithm that returns a formula for a causal effect in terms of the available distributions, whenever this effect is found identifiable.

One metaphor used throughout the paper is of a jigsaw puzzle, where the targeted effect is comprised of pieces and each dataset contains chunks, where each chunk is one or more pieces merged together. The task is to combine the chunks so as to solve the puzzle, even though not all puzzles are solvable. They infer that the identifiability problem under partial-observability is NP-complete.

“This conjecture is particularly surprising since many existing classes of identifiability problems have been entirely solved in polynomial time, the decision version of the problem, not an estimation,” said Sanghack Lee, an associate research scientist in the CausalAI Lab. “The class of time complexity of an identification problem may rely crucially on the lack of variables in the data sets, not the number of available data sets nor the size of a causal graph.”

Efficient Identification in Linear Structural Causal Models with Auxiliary Cutsets

Daniel Kumor *Purdue University*, Carlos Cinelli *University of California, Los Angeles*, Elias Bareinboim *Columbia University*

Linear regression and its generalizations have been the go-to tool of a generation of data scientists trying to understand the relationships among variables in multivariate systems, and the workhorse behind countless breakthroughs in science. Modern statistics and machine learning techniques allow one to scale regression up to thousands of variables simultaneously. Despite the power entailed by this family of methods, it only explains the association (or correlation) between variables, while it remains silent with respect to causation. In practice, a health scientist may be interested in knowing the effect of a new treatment on the survival of its patients, while an economist may attempt to understand the unintended consequences of a new policy on a nation’s gross domestic product. If regression analysis doesn’t allow scientists to answer their more pressing questions, which technique or framework could legitimize such inferences?

The answer lies in the tools of causal inference, specifically in the methods developed to solve the problem of “identification”. In the context of linear systems, formally speaking, identification asks whether a specific causal coefficient can be uniquely computed from the observational covariance matrix and the causal model. There exists a rich literature on linear identification, which includes a general algebraic solution (based on Groebner bases) that is doubly-exponential. In practice, and given this hardness result, the most common approach in the field is known as the method of instrumental variables (IVs), which looks for a particular substructure within the covariance matrix.

The research develops a new polynomial-time algorithm for linear identification that generalizes IV and all known efficient methods. The machinery developed underlying this algorithm unifies several disparate methods developed in the literature for the last two decades. Perhaps surprisingly, it turns out that this new method is also able to subsume identification approaches based on conditioning, some of which have high identification power but undetermined computational complexity, with certain variants shown to be NP-Hard.

Finally, this paper also allows one to efficiently solve for total effects based on a new decomposition strategy of the covariance matrix. It is still an open problem whether there exists an algorithm capable of identifying all causal coefficients in polynomial time.

Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach

Junzhe Zhang *Columbia University*, Elias Bareinboim *Columbia University*

This paper investigates the problem of learning decision rules that dictate what action should be taken at each state given an evolving set of covariates (features) and actions, which is usually called Dynamic Treatment Regimes (DTRs). DTRs have been increasingly popular in decision-making settings in healthcare.

Consider the treatment of alcohol dependence. Based on the condition of alcohol-dependent patients (S1), the physician may prescribe medication or behavioral therapy (X1). Patients are then classified as responders or non-responders (S2) based on their level of heavy drinking within the next two months. The physician then must decide whether to continue the initial treatment or to switch to an augmented plan combining both medication and behavioral therapy (X2). The primary outcome Y is the percentage of abstinent days over a 12-month period. A DTR here is a sequence of decision rules x1 = f1(s1), x2 = f2(x1, s1, s2) that decides the treatment X so as to optimize the expected outcome Y.

In this paper, the researchers studied how to find the optimal DTR in the context of online reinforcement learning. It has been shown that finding the optimal DTR requires Ω(|X ⋃ S|) trials, where |X ⋃ S| is the size of the treatment and state variables. In many applications, the domains of X and S could be significantly large, which means that the time required to ensure appropriate learning may be unattainable. They develop a novel strategy that leverages the causal diagram of the underlying environment to improve this sample complexity, leading to online algorithms that require an exponentially smaller number of trials. They also introduce novel methods to derive bounds over the underlying system dynamics (i.e., the causal effects) from the combination of the causal diagram and the observational (nonexperimental) data (e.g., from the physicians’ prior decisions). The derived bounds are then incorporated into proposed online decision-making algorithms.

“This result is surprising since most of the existing methods would suggest discarding the observational data when it is plagued with the unobserved confounding, and the causal effect is not uniquely identifiable,” said Junzhe Zhang, a research assistant in the CausalAI lab. “We show that the performance of online learners could be consistently improved by leveraging abundant, yet biased observational data.”

Stochastic Flows and Geometric Optimization on the Orthogonal Group

Krzysztof Choromanski *Google Brain Robotics, *Valerii Likhosherstov* University of Cambridge, *Jared Q Davis* Google Research, *David Cheikhi *Columbia University, *Achille Nazaret* Columbia University, *Xingyou Song* Google Brain, *Achraf Bahamou* Columbia University, *Jack Parker-Holder* University of Oxford, *Mrugank Akarte* Columbia University, *Yuan Gao* Columbia University, *Jacob Bergquist* Columbia University, *Aldo Pacchiano* UC Berkeley, *Vikas Sindhwani* Google, *Tamas Sarlos* Google, *Adrian Weller* University of Cambridge, Alan Turing Institute*

This paper presents a new class of stochastic, geometrically driven optimization algorithm for optimization problem defined on orthogonal group. The objective is to maximize a function F(x), where F takes arguments from matrix manifold. The paper solves this problem by computing and discretizing the on-manifold gradient flows and showed an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory.

The paper compares FLOPS needed for classification model on MISNT and CIFAR-10 dataset for our algorithm with state-of-the-art algorithms. The algorithm had substantial gains in time complexity as compared to other methods without much loss in accuracy. The paper also shows effectiveness of the algorithm for variety of continuous Reinforcement learning tasks like Humanoid from OpenAI gym.