Modelling acute myeloid leukaemia in a continuum of differentiation states.

Here we present a mathematical model of movement in an abstract space representing states of cellular differentiation. We motivate this work with recent examples that demonstrate a continuum of cellular differentiation using single cell RNA sequencing data to characterize cellular states in a high-dimensional space, which is then mapped into ℝ 2 or ℝ 2 with dimension reduction techniques. We represent trajectories in the differentiation space as a graph, and model directed and random movement on the graph with partial differential equations. We hypothesize that flow in this space can be used to model normal and abnormal differentiation processes. We present a mathematical model of hematopoeisis parameterized with publicly available single cell RNA-Seq data and use it to simulate the pathogenesis of acute myeloid leukemia (AML). The model predicts the emergence of cells in novel intermediate states of differentiation consistent with immunophenotypic characterizations of a mouse model of AML.


Introduction
The recent advance of single-cell RNA-sequencing (scRNA-Seq) technologies has enabled a new, high-dimensional definition of cell states. In contrast to conventional gene expression analyses based on measuring the average levels in a tissue or given cell population, single-cell analysis captures the cellular heterogeneity and provides resolution at the level of individual cells within the tissue or cell population. This level of resolution coupled with genome-wide gene expression provides an unprecedented opportunity to quantitatively probe cellular behaviour, cellular variation and dynamics in a wide range of biological contexts.
There are on the order of 20,000 protein encoding genes that compose the transcriptome, which constitute a R 20,000 dimensional space. Therefore, the configuration of the transcriptome at a point in time can be represented as a coordinate vector in space. When a cell expresses genes, it 'moves' in this high-dimensional gene expression phenotype space. Over time, the sequence of locations in the space of a given cell creates a trajectory. Dimension reduction techniques are commonly used to map the larger space into a lower dimensional space, for instance, R 2 or R 3 , at which point the cells are clustered based on a similarity metric and recategorized. This process has revealed a continuum of cell phenotypes, with intermediate states connecting canonical cell states. The most prominent example of this process is in haematopoietic cell differentiation.
Normal haematopoiesis is long thought to occur through stepwise differentiation of haematopoietic stem cells following a tree-like hierarchy of discrete multipotent, oligopotent and then unipotent lineage-restricted progenitors ( Figure 1A). The classical model of haematopoiesis considers differentiation as a stepwise process of binary branching decisions, famously represented as a potential landscape by Waddington (1957). However, this model is based on bulk characterization of prospectively purified immunophenotypic cell populations. Recent advances in scRNA-Seq technologies now allow resolution of single-cell heterogeneity and reconstruction of differentiation trajectories which have been applied to a number of different cellular systems, from haematopoiesis to breast endothelial cell differentiation (Bach et al., 2017;Hamey, Nestorowa, Wilson, & Göttgens, 2016;Nestorowa et al., 2016;Velten et al., 2017). These efforts have led to the new view that haematopoietic lineage differentiation occurs as a continuous process, which can be mapped into a continuum of cellular and molecular phenotypes ( Figure 1B). Haematopoietic malignancies such as acute myeloid leukaemia (AML) arise from dysregulated differentiation and proliferation of haematopoietic stem cells and progenitor cells upon accumulation of oncogenic genetic mutations and/or epigenetic alterations. Therefore, characterizing disordered haematopoiesis based on discretely defined phenotypic populations can be challenging. Moreover, 'discrete' phenotypic cell populations are in fact highly heterogeneous in terms of functional capacity and gene expression profiles. It is now possible to view pathologic haematopoiesis through a continuum of cellular and molecular phenotypes and capture the heterogeneity, differentiation plasticity and dysregulated gene expression associated with malignant transformation.
This new view of biology forces us to reconsider the mathematical approaches we use to model cell states and behaviours. Instead of building mathematical models which identify discrete cell populations and assign mathematical rules for their evolution and interactions, we may now consider a continuum of cellular states and model movement between these states in aggregate as a flow of mass on a structured graph. Modelling differentiation in this manner reduces the number of parameters and thus the complexity of the mathematical model by representing many cell populations and states in a single variable. At the same time, this increases biological resolution of the system by characterizing an infinite number of sub-states in a continuum representation. Here we consider a model of haematopoietic cell differentiation and associated disorders as a flow and transport process in a continuous differentiation space as a test system for a more general approach of modelling the temporal evolution of a continuum of cell states.
This article is structured as follows: first, we review the state of the art of dimension reduction methods that are used to construct and define haematopoietic differentiation spaces that can be represented as graphs, including a review of Schienbinger et al. 's method for modelling transport on a graph from reduced dimension gene expression data (2017). Then we introduce our partial differential equation (PDE) model of flow and transport on a graph, and illustrate the model on simple 'Y'-shaped graph with symmetric and asymmetric differentiation. We then calibrate our model to a graph constructed from publicly available scRNA-Seq data of normal haematopoiesis. Finally, we use our model to simulate abnormal haematopoietic cell differentiation processes observed during the pathogenesis of AML, a form of aggressive hematologic malignancy. We conclude with a brief discussion of prior literature on modelling differentiation as a continuum, and the limitations and potential future applications of this modelling approach.

Construction of a differentiation continuum
In order to describe the entire modelling process, in this section, we briefly describe methods for reducing the dimension of high-dimensional scRNA-data, before reviewing pseudotime reconstruction techniques, and conclude this section by examining a technique from Schiebinger et al. (2017) for construction of a directed graph that represents haematopoietic differentiation space. While the focus of this paper is not dimension reduction techniques or pseudotime reconstruction, we summarize some of these techniques that are most relevant to our modelling approach, without advocating for one over another. We should emphasize that this is a review of already existing algorithms; the novel work begins in Section 3. The relationship between time and pseudotime within a mathematical model of cell differentiation is analogous to the relationship between age-structured and stage-structured models in ecology. Cell differentiation data yield information about cells at various stages of differentiation, but generally do not provide time-specific data. A pseudotime model is one that considers the differentiation stage of a cell population instead of the time in which a cell is in a certain state.
In Figure 2, we lay out the steps required for going from high-dimensional data to construction of the PDE model. Section 2.1 will review various dimension reduction techniques, including a more thorough discussion of the technique used in our application, diffusion mappings. Section 2.2 summarizes techniques such as Wishbone and Wanderlust that are available for pseudotime reconstruction given dimension reduced data. Finally, Section 2.3 will give an overview of the technique presented in Schiebinger et al. (2017) for construction of a directed graph that indicates how cell populations evolve in pseudotime. Figure 2. Flowchart of our modelling process: this chart organizes the steps taken toward constructing the PDE model. First, high-dimensional data such as single-cell RNA-sequencing (scRNA-Seq) are represented in two-or three-dimensional space through one of many dimension reduction techniques. Then, temporal events (pseudotime trajectories) are inferred from the dimension reduced data. We then use the reduced dimension representation and pseudotime trajectories to model flow and transport in the reduced space. Data is from Nestorowa et al. (2016).

Dimension reduction techniques
A broad range of techniques have been developed to provide insight into interpretation of high-dimensional biological data. These techniques provide a first step in our approach to modelling the evolution of cell states in a continuum and play a critical role in characterizing differentiation dynamics. We note that the application of different data reduction techniques, clustering methods and pseudotime ordering on the same data set will produce different differentiation spaces on which to build a dynamic model. We will use one particular dimension reduction approach as an example, but our framework allows one to select from a variety of approaches. In this section, we provide a brief description of a subset of such techniques to give the reader a sense of the field.
Several techniques have been developed to interpret the high-dimensional differentiation space, including principal component analysis (PCA), diffusion maps and t-distributed stochastic neighbour embedding (t-SNE). Each of these methods maps highdimensional data into a lower dimensional space. As discussed in this section, different techniques produce different shapes and differentiation spaces, and so some techniques are better suited to certain data sets than others. For instance, one commonly used dimension reduction technique is PCA, a linear projection of the data. While PCA is computationally simple to implement, the limitation of this approach lies in its linearity -the data will always be projected onto a linear subspace of the original measurement space. If the data show a trend that does not lie in a linear subspace -for instance, if the data lie on an embedding of a lower dimensional manifold in Euclidean space that is not a linear subspace -then this trend will not be efficiently captured with PCA (Khalid, Khalil, & Nasreen, 2014).
In contrast, diffusion mapping and t-SNE , as well as a variant of t-SNE known as hierarchical stochastic neighbour embedding (HSNE), are nonlinear dimension reduction techniques. t-SNE, introduced by van der Maaten and Hinton (2008), is a machine learning dimension reduction technique that is particularly good at mapping high-dimensional data into a two-or three-dimensional space, allowing for the data to be visualized in a scatter plot.
Given a data set in R n : X = {x 1 , x 2 , . . . x n }, we can transform the Euclidean distances between two points into a probability distribution. Intuitively, this distribution gives the probability that data point x j is a neighbour of point x i , where the probability of being a neighbour of x i has a Gaussian distribution (van der Maaten & Hinton, 2008): The t-SNE algorithm aims to find a map from the data set to two-or three-dimensional Euclidean space that minimizes the Kullback-Leibler divergence between the probability distributions in the original and reduced space. This optimization problem is often solved using gradient descent methods.
In van Unen et al. (2017), a new technique for examining high-dimensional mass cytometry data, known as HSNE is presented. Mass cytometry allows for the examination of several cellular markers on samples made up of vast quantities of cells. These data sets are truly 'big' in the sense that they are very large (a sample for each cell) as well has highdimensional. Therefore, pre-existing dimension reduction techniques are not optimal for mass cytometry data. HSNE, as suggested by its name, is hierarchical by nature, allowing for refinement in the level of detail. HSNE ultimately constructs a hierarchy of subsets of the dataset X: The hierarchy begins with the data set itself (X = L 1 ). A weighted k-nearest neighbour (kNN) graph is constructed on the data set, and individual points, or 'landmarks', are selected from each node on the graph to represent the data set at the next, coarser, level, L 2 . This process is repeated as desired. These subsets can each be embedded in lower dimensional space. This hierarchical embedding scheme allows the user to view the data at different resolutions, from a broad overview (level L n ) to a more refined understanding of cell types associated with markers (intermediate levels). Starting with a certain subset O ⊂ L s , the user is able to 'drill in' to the data by selecting a subset G ⊂ L s−1 . Thus, HSNE is an approach that is useful for data that require different levels of detail at different scales. An illuminating graphical representation of the HSNE process can be found in van Unen et al. (2017) (Figure 1).
Diffusion maps work by taking advantage of the relationship between heat diffusion and random walk Markov chains. Let X be a data set of size n. The diffusion map algorithm begins by considering a kernel function on pairs of data points; this function must be symmetric and nonnegative. The Gaussian kernel k(x, y) = e − x−y 2 is one popular choice. Similar to the conditional probability defined in Equation (1), the kernel k(x, y) is used to specify the probability of going from x to y in one step of a random walk on the data, found by normalizing the kernel to ensure the random walk probabilities integrate to 1: .
By letting the number of steps in this random walk go to infinity, we can consider the stationary distribution p t of the Markov chain. This stationary distribution is used to formulate a new metric on the data space, known as the diffusion distance: Intuitively speaking, the diffusion distance between two points will be low if there are many paths in the random walk that connect them, and high if there are few. Because it is computationally expensive to repeatedly compute the diffusion distance between each pair of points, it is easier to map data points to a new Euclidean space using the function φ : X → R n defined as The Euclidean distance in this space, known as the diffusion space, is then equivalent to the diffusion distance in the data space. It can be demonstrated that the linearly independent eigenvectors of the diffusion matrix (the transition matrix associated with the aforementioned Markov Chain) form a basis for the diffusion space. Therefore, by opting to keep the k-eigenvectors corresponding to the k largest eigenvalues, we obtain a map from the original data to a k-dimensional subspace of the diffusion space that most efficiently captures the structure of the data; this map is called the diffusion map. A more in-depth explanation can be found in Coifman et al. (2005). Each of these dimension reduction methods has strengths and weaknesses depending on the question(s) being asked of the data. Moreover, each method will produce a distinctly different shape in the lower dimensional representation. Therefore, the choice of dimension reduction technique is a critical step in analysing any high-dimensional data set. For the purpose of analysing cell transition probabilities and inferring trajectories within the reduced space, Nestorowa et al. (2016) and others have chosen to use diffusion mapping to analyse cell differentiation.

Pseudotime ordering of differentiation states
For data without temporal information, pseudotime methods are available to infer a sequence of biological states from single time point data. Diffusion mapping can be used to infer a 'diffusion pseudotime' (Haghverdi, Büttner, Wolf, Buettner, & Theis, 2016;Nestorowa et al., 2016). In particular, Haghverdi et al. (2016) develop an efficient diffusion pseudotime approach by rescaling the diffusion components by a weighted distance in terms of the eigenvalues, derived by considering a random walk according to a transition matrix that specifies the probability of transitioning from any single cell to another in an infinitesimal amount of time. Alternative pseudotime approaches include Wishbone (Setty et al., 2016) that uses shortest paths in a kNN graph constructed in diffusion component space to construct an initial ordering of cells, TASIC (Rashid, Kotton, & Bar-Joseph, 2017) that is able to incorporate time information and identify branches and incorporate time information in single-cell expression data by considering it as developmental processes emitting expression profiles from a finite number of states, and Monocle (Qiu, Hill, et al. 2017;Qiu, Mao, et al. 2017) that fits a principal graph (Mao, Wang, Goodison, & Sun, 2015) and uses a reversed graph embedding technique which simultaneously learns a low-dimensional embedding of the data and a graphical structure spanning the dataset.
Finally, when the data are collected at multiple time points, the transition rates between the nodes can be obtained after partitioning the cell data. For instance, Schiebinger et al. (2017) employ graph clustering (Levine et al., 2015;Shekhar et al., 2016) and optimal transport (OT) methods to understand the dynamics in the reduced space of cell data. We describe the OT method in an effort to provide a clear distinction between the OT algorithm and our modelling approach. Schiebinger et al. (2017) have proposed a model and algorithm for constructing a directed graph oriented in pseudotime given temporal data. The OT algorithm itself is a classical problem studied in the mathematical area of Transportation Theory, which aims to optimally transport and allocate resources given certain cost functions. Schiebinger et al. (2017) apply this theory to a time series of reduced dimension single-cell gene expression profiles.

Optimal transport
The time series is made up of a sequence of samples {S 1 , . . . , S n }, at different times t i for i ∈ {1, . . . , n}. Suppose that each sample consists of points in R m . A distributionP t i is defined by each sample S i . For each set A ⊂ R m : where δ x represents a Delta Distribution centred at x: Together, as a sequence, these inferred distributions {P t i } form what is known as an 'empirical developmental process'. The goal is then to determine, as closely as possible, what the true underlying Markov developmental process P t is by examining what are known as transport maps between pairsP t i−1 andP t i . A transport map for a pair ( Thus, given a function c(x, y) that represents the cost to transport some unit mass from x to y, the goal is to minimize Schienbinger et al. further refine this algorithm by including a growth term in their transport plan to allow for cellular proliferation between time points. This differs from the classical OT algorithm in that the classical OT algorithm is formulated with conservation of mass in mind. OT can thus be used to estimate the ancestors and descendants of a set of cells. Cells are clustered using the Louvain-Jaccard community detection algorithm on the reduced dimension gene expression data in 20-dimensional space. Schienbinger et al. thus identified 33 cell nodes, which were then used as starting populations from which developmental trajectories could be analysed. These can be thought of as nodes on a graph visualized with force-directed layout embedding, and edges represent the motion in pseudotime.
In the following section, we assume that the first two steps in Figure 2 have been completed by one of the methods described above. In other words, we start with samples in high-dimensional space, we map the data to a lower dimensional space and then we produce pseudotime trajectories in this lower dimensional space. In the final step, we model the differentiation process in continuous (pseudo)-time and (reduced-dimensional) space using PDEs.

Modelling on the differentiation continuum
To illustrate our modelling technique, we assume that we have constructed a simple branched manifold or graph situated in the differentiation space. This graph is not a set of discrete nodes, rather, the graph and its edges represent a continuum of canonical states and intermediate states of differentiation. Assuming that the graph and the temporal evolution on the graph is obtained by any one of the various data analysis techniques summarized in Section 2 including OT (Schiebinger et al., 2017), diffusion pseudotime methods (Haghverdi et al., 2016;Nestorowa et al., 2016), Wishbone (Setty et al., 2016), TASIC (Rashid et al., 2017) and Monocle (Qiu, Hill, et al. 2017;Qiu, Mao, et al. 2017), we develop a PDE model that describes the dynamics in this differentiation continuum. Cell differentiation models in the continuous space have been developed in Gwiazda, Grzegorz, and Marciniak-czochra (2012) and Doumic, Marciniak-Czochra, Perthame, and Zubelli (2011) that extend the discrete multicompartment models Lo et al., 2008;Marciniak-Czochra, Stiehl, Ho, Jäger, & Wagner, 2009;Stiehl & Marciniak-Czochra, 2011).

PDE model on a graph
Let us define the graph G obtained in the differentiation continuum space. We comment that although we can consider a cell distribution on the actual reduced space, we further reduce our model on a graph that makes it convenient to employ the biological insights from the classical discrete models. The node set of G is denoted as {v k } n v k=1 , where n v is the total number of nodes, and the edge of the graph connecting in the direction from the ith to the jth node as e ij . We also introduce an alternate description of the graph with respect to the edge that is more convenient for describing the PDE model. If the total number of nontrivial edges is n e , we take {e k } n e k=1 with the index mapping I : J → {1, . . . , n e } on the set of nontrivial edges (i, j) ∈ J , and the end points in the direction of cell transition as {a k } n e k=1 and {b k } n e k=1 , respectively. We remark that n e k=1 {a k , b k } = {v k } n v k=1 . We denote u(x, t) as the cell distribution on the graph G at the differentiation continuum space location x ∈ G and time (or pseudotime) t. Thus, we follow the dynamics of the cell density at x ∈ G. We annotate the cell distribution on each edge e k as u k (x, t) such that u(x, t) = {u k (x, t)} n e k=1 , and model the cell density by an advection-diffusion-reaction equation (Evans, 2010) as where x is a one-dimensional variable parameterized on each edge e k from a k to b k . The advection coefficient V k (x) models the cell differentiation and the transition between the different cell types, that is, the nodes. The transition rate per unit time (e.g. day −1 ) or pseudotime can be taken as V k (x) computed using the periods of cell differentiation. For instance, V k (x) can be computed by smoothly interpolating the speed of cell differentiation from the multicompartment discrete models as where c n is the differentiation rate of cell type v n and φ is an interpolation function. 1 Cell proliferation and apoptosis can be modelled by the reaction coefficient R k (x). Similar to V k , if only the proliferation at the discrete cell types are available, we interpolate as where r n is the growth rate at node v n . In addition to natural proliferation and apoptosis, this term can also model abnormal tumorous cell growth or the effect of targeted therapy by localized Gaussian or Dirac-delta functions centred at the location of the corresponding cell type on the graph.
The diffusion term represents the instability on the phenotypic landscape of the cells that should be taken account into when modelling the macroscopic cell density. In particular, we consider the diffusion term in Equation (2) in such form that is appropriate to model the dynamics on a graph that is reduced from a higher dimensional narrow domain. It involves two parameters D k (x) and w k (x) describing the magnitude of cell fluctuation and the width of the narrow domain around the edge, respectively. Considering the phenotypic fluctuation of the cell density as a random process subject to Brownian motion with magnitude σ , the diffusion term becomes D k = σ 2 and w k = 1 (Evans, 2010). In addition, the width or the area of the cross section of the narrow domain that is vertical to the projecting edge can be taken as w k (x), which is called Fick-Jacobs equation for deterministic PDEs (Valdes & Guzman, 2014;Zwanzig, 1992) and can be similarly derived for stochastic PDEs (Cerrai & Freidlin, 2017;Freidlin & Hu, 2013). w k (x) can be measured as the length of maximal fluctuation in the vertical direction along the graph.
In addition to the governing equation on the edges, the boundary condition at the nodes are critical when describing the dynamics on the graph. The boundary condition on the cell fate PDE model can be classified into three types, the initial nodes that do not g. the most differentiated cells, and the intermediate nodes, On the intermediate nodes v n ∈ N T , mixed boundary conditions can be imposed to balance the cell inflow and outflow as where j] , and b I [i,n] is the right end point of the edge between nodes i and n. Similarly, a I [n,j] is the left end point of the edge between nodes n and j. In addition, continuity conditions are taken as Dirichlet boundary conditions as follows: for a fixed n. The cell outflow boundary conditions on the final nodes v n ∈ N F are imposed as reflecting boundary conditions [j,n] ) for all (i, n) and (j, n) in J . Similarly this can be imposed on the initial nodes v n ∈ N I as (∂/∂x)u(a I[n,j] ) = α n , (n, j) ∈ J or u(a I[n,j] ) = α n , (n, j) ∈ J , depending on whether the model describes the cell inflow flux or a prescribed density. In particular, the prescribed value when u(a I [n,j] ) represents the density of stem cells, one can model the discrete stem cell state as a separate ordinary differential equation (ODE) and impose its solution as the boundary condition at a I [n,j] (Doumic et al., 2011;Gwiazda et al., 2012). This approach makes it possible to distinguish stem cell proliferation into the division that remains as stem cell and the one that differentiates to a matured cell.

Example on a Y-shaped graph
To illustrate our approach, we apply the PDE model given in Equation (2) to a simple Y-shaped graph. This example is motivated by cell differentiation data that reveal multiple branching procedures in the continuous space (Haghverdi, Buettner, & Theis, 2015;Moris, Pina, & Arias, 2016;Rizvi et al., 2017;Velten et al., 2017), therefore we assume the simplest case that the differentiated cells have two different cell fates with one branching. For instance, assume that the cell data projected onto the first two diffusion components, DC1 and DC2, are as in Figure 3A and the temporal direction of cell differentiation is from left to right, as indicated by the arrows in the Figure. We define the Y-shaped graph with four nodes v 1 = (−1, 0), v 2 = (0, 0), v 3 = (1, 1) and v 4 = (1, −1), and three edges e 1 = e I(1,2) = v 1 v 2 , e 2 = e I(2,3) = v 2 v 3 and e 3 = e I(2,4) = v 2 v 4 . This corresponds to the set of nontrivial edges J = {(1, 2), (2, 3), (2, 4)} and index mapping I on J as I(1, 2) = 1, I(2, 3) = 2 and I(2, 4) = 3 that yields the end points of the edges a k and b k as v 1 = a 1 , v 2 = b 1 = a 2 = a 3 , v 3 = b 2 and v 4 = b 3 . For simplicity, we assume that the edges are straight lines and parameterize the variables on each edge as e 1 (x) = (x − 1, 0), e 2 (x) = (x, x) and e 3 (x) = (x, −x), so that x ∈ [0, 1]. When there is possibility for confusion, we use subscripts on the x-variables to specify which edge is parameterized. So, for example, x 2 parameterizes e 2 . Then, the PDE model on each parameterized edge e k can be written as We consider the case that the cells transfer from v 1 to v 2 in n T = 5 unit time, differentiate into each cell type with proportion p and 1−p, and accumulate at DC1 = 1, where the cells are fully differentiated. 2 Here, we simplify the differentiation rate to be constants assuming that the single branching Y graph lies locally and close enough in the differentiation space that the differentiation rate does not change. Then, where V 2 and V 3 reflect the accumulation at cell types v 3 and v 4 (x = 1). Also, we assume that the differentiation process is subject to fluctuations such as trans-differentiation (crosslineage) and de-differentiation (stem state reversion) that is modelled as Brownian motion with a constant variance σ so that D k = σ 2 = 0.02. Also, the maximal fluctuation in the vertical direction of the edge is assumed to be a constant that is independent of x and w 1 = 2w 2 = 2w 3 so that the fluctuation in the vertical direction reduces by half in e 2 and e 3 . w k cancels out in the diffusion term in Equation (4). Figure 3 plots the two examples of symmetric differentiation p = 0.5 and asymmetric differentiation p = 0.25. In this example, to demonstrate our model focusing on the cell differentiation and branching, we assume that the proliferation is zero as R k = 0 (see Appendix for the detail of modelling R k ). The boundary type of the nodes are classified, according to our description above, as N I = {v 1 }, N T = {v 2 } and N F = {v 3 , v 4 }. Thus, we impose the gluing boundary condition, as in Equation (3) at v 2 , as with continuity conditions u 1 (b 1 ) = u 2 (a 2 ) = u 3 (a 3 ). In addition, an inflow boundary condition is imposed at v 1 , and reflecting boundary conditions at the end nodes v 3 and v 4 as u 1 (a 1 , t) = (1/ √ 0.08π) exp[−(−(1/n T )t) 2 /0.08], ∂u 2 (b 2 )/∂x 2 = 0 and ∂u 3 (b 3 )/∂x 3 = 0. The Dirichlet condition of u 1 (a 1 , t) is given to resemble the transition of a certain cell state to fully differentiated cells from the initial distribution Simulations of this simple model are shown in Figure 4, where densities on edges e 2 and e 3 are plotted in different colours. We see that an initial cell distribution concentrated near the cell state v 1 moves to the right as the cells differentiate, branches at v 2 and becomes absorbed at the fully differentiated cell states v 3 and v 4 . In the symmetric case, when p = 0.5, the density is the same on each of the two branches to the right of v 2 , so that the two curves are plotted on top of each other. When p = 0.25, the density profile is not symmetric: more cells move along the upper branch than on the lower branch. This provides a simple illustration of the mathematical details of our modelling framework, which we apply to more complicated graph structure derived from data as follows.

Simulation results
In this section, we employ the framework developed in Section 3.1 to mouse haematopoietic stem and progenitor cell (HSPC) data in Nestorowa et al. (2016). See Appendix for details, including the model parameters and simulation codes.

Model of normal adult haematopoiesis
To calibrate our model, we first apply it to normal haematopoietic cell differentiation trajectories identified in Nestorowa et al. (2016). Nestorowa et al. characterize early stages in haematopoiesis with 12 cell types, shown in Table 1 and Figure 4, including E-SLAM (CD48−CD150 +CD45+EPCR+), long-term HSCs (LT-HSCs), short-term HSCs (ST-HSCs), lymphoid-primed multipotent progenitors (LMPPs), multipotent progenitors (MPPs), megakaryocyte-erythroid progenitors (MEPs), common myeloid progenitors (CMPs), and granulocyte-macrophage progenitors (GMPs). We consider these 12 cell types as the 12 nodes, v k , in our graph, and add 51 edges to model the haematopoietic   cell hierarchy (see Figure 1A) and pseudotime computed in Nestorowa et al. (2016) (see Figure 5A). This graph represents a continuum of canonical and intermediate states of haematopoietic differentiation with nodes and edges, respectively. The spatial variable in our PDE model represents the differentiation state of the cell. The coloured and labelled clustered cell data and the corresponding graph are shown in Figure 2. The location of the nodes on the graph is not chosen to be identical to the data, but for an illustrative purpose to represent DC2 and DC1/DC3. The edges are chosen according to the pseudotime progression from the E-SLAM and HSCs (nodes 1-3) to the progenitor cells (nodes 9-12).
The parameters of the PDE model of cell differentiation under normal conditions are chosen to reproduce the distribution of cell types from Nestorowa et al. (2016) at the initial and final pseudotime ( Figure 5C). Considering the data in Nestorowa et al. (2016) grouped by sorting gate of LT-HSC, HSPC, and progenitor cells, we denote the subsets of nodes that are representative of each group as I 1 = {1, 2, 3} for HSC, I 2 = {4, . . . , 8} for HSPC and I 3 = {9, . . . , 12} for progenitor cells, where we also take N I = I 1 , N T = I 2 and N F = I 3 . The reference distribution is computed by counting the relative number of cells in each cluster at the initial and final pseudotime. The initial and final cell distribution is concentrated on nodes 1-3 of I 1 and 9-12 of I 3 , respectively.
The distribution of cells in the remaining states, represented by nodes 4-8 of I 2 , goes from 0 at time t = 0 to positive at time t = 2, and reduces at t = 4. We remark that the ratios of the number of cells in each node are used to compute the advection coefficients V k in Equation (A3), where we take the drift V I [i,j] from cell type i to another cell type j to be proportional to the ratio plotted in Figure 5C. For instance, the outflow from v 5 to nodes 9-12 is taken to be proportional to the reference distribution at pseudotime t = 4. With the ratios fixed, we assume a constant parameter that represents the differentiation rate on each node, and find the values that reproduce the given cell data by simple root-finding algorithms such as the secant method. The range of the values are 0 ≤ V k ≤ 3. The detailed procedure is explained in Appendix.
The diffusion coefficient is taken as D k = D I(i,j) = 10 −2 within either subsets of nodes i, j ∈ I 1 or i, j ∈ I 3 , and D k = 10 −3 on the other edges. The magnitude D k = 10 −2 corresponds to the phenotypic fluctuation of 2.5456 × 10 −2 in the diffusion space and D k = 10 −3 takes into account of the increased average distance between the nodes that yields smaller diffusion coefficient due to relatively smaller fluctuation. We assume that the proliferation of the progenitor cell nodes are a constant as r n = 1.3648 at t ≤ 2 and r n = 0.4 at t > 2 for n ∈ I 2 ∪ I 3 , where the proliferation rate reflects the increment of cell number from HSCs to progenitor cells in the data. Also, the proliferation at the HSC nodes is assumed to be negligible compared to progenitor cells as r n = r n∈I 2 ∪I 3 × 10 −2 for n ∈ I 1 (Passegué, Wagers, Giuriato, Anderson, & Weissman, 2005). See Appendix for the model parameters and detailed discussion.
For the implementation, we discretize the system using a fourth-order finite difference method and 100 grid points on each edge, and a third-order Runge-Kutta method in time with time step 10 −4 . Figure 5C compares the solution to the PDE in the normal condition to the reference distribution. The initial condition of the PDE is taken as the initial reference distribution, and we compute the solution up to time t = 5. The solution at t = 4 is similar to the reference distribution at final pseudotime. Also, the solution at t = 2 is close to the distribution of the remaining cells excluding the initial and final cells. Figure 5B shows the cell distribution on the graph from time t = 0 to t = 5. We observe that the cell density shifts from the initial nodes 1-3 representing HSCs, to nodes 9-12 representing progenitor cells.

Acute myeloid leukaemia
AML results from aberrant differentiation and proliferation of transformed leukaemiainitiating cells and abnormal progenitor cells. Parallel to the hierarchy of normal haematopoiesis, malignant haematopoiesis has also been considered to follow a hierarchy of cells at various differentiation states although with certain levels of plasticity (see Figure 6). Given the aberrant differentiation and plasticity associated with the pathology of AML, modelling in a continuous differentiation space offers the advantage over discrete models that all pathological and intermediate cell states can be captured. With our model calibrated to data obtained from normal haematopoietic differentiation trajectories, we now model the progression of AML using a genetic knock-in mouse model that recapitulates somatic acquisition of a chromosomal rearrangement, inv(16)(p13q22)  (16)-driven AML, the proliferation R k (x) connecting the nodes 3, 5 and 11 is increased and the flow toward the node 11, V k (x) and D k (x) for k = I(i, 11) is blocked. (Liu et al., 1993,9), commonly found in approximately 12 % of AML cases. Inv(16) rearrangement results in expression of a leukaemogenic fusion protein CBFβ-SMMHC, which impairs differentiation of multiple haematopoietic lineages at various stages (Castilla et al., 1999;Kuo, Gerstein, & Castilla, 2008;Kuo et al., 2006).
Our previous studies using the inv(16) AML mouse model demonstrate that expression of CBFβ-SMMHC leukaemogenic fusion protein results in expansion of preleukaemic haematopoietic stem and progenitor populations susceptible to transformation into leukaemia-initiating cells which can initiate and propagate AML. Most notable was the increase in abnormal myeloid progenitors, which had an MEP-like immunophenotype and a CMP-like differentiation potential (Kuo et al., 2006). Further separation of MEPs with additional phenotypic markers (Pronk et al., 2007) show a predominant increase in pre-megakaryocyte/erythroid (Pre-Meg/E) population (ranging from 5 to 12 fold) accompanied by impaired erythroid lineage differentiation ( Figure 6A) (Cai et al., 2016). This refined phenotypic Pre-Meg/E population consists partly of the CMP and MEP populations using conventional markers (Akashi, Traver, Miyamoto, & Weissman, 2000;Nestorowa et al., 2016).
The simulation of inv(16)-initiated AML pathogenesis is motivated by our previous observations that AML is preceded by expansion of preleukaemic myeloid progenitor cells, particularly the Pre-Meg/E and MEP-like populations with impaired differentiation. These abnormal progenitors are predisposed to subsequent cooperating events necessary to transform to overt AML (Cai et al., 2016;Castilla et al., 1999;Kuo et al., 2006). To simulate AML pathogenesis, we increase the proliferation rate of MEP (node 11) by 10 times, that is, r I [i,11] = 10r normal , to reflect the abnormal expansion of MEP-like cells (ranging from 5 to 12 fold based on previous data) (Cai et al., 2016;Kuo et al., 2006). Here, r normal is the value that is used in the normal condition in Section 3.1. Thus, the proliferation is assumed to be maximal at the MEP node, R k (v 11 ) and the proliferation of intermediate cells that are phenotypically close to MEP, that is, R I [i,11] (x) near x = v 11 , also increase. Also, the flow to the MEP is blocked by taking zero advection coefficient on the edge that is connected to v 11 , i.e. V I(i,11) (x) = 0. We also lower diffusion by 10 as D I(i,11) (x) = D normal /10 to model the phenotypic fluctuations and imperfect differentiation block involved in AML pathogenesis. The differentiation block is imperfect because there is a continuum of leukaemic cell phenotypes (states).
In addition, the proliferation rate of LT-HSC and ST-HSC (nodes 3 and 5), that is, r 3 and r 5 , is increased by 2.5 times as 2.5r normal ( Figure 6B). We model the induction of the leukaemogenic fusion protein CBFβ-SMMHC resulting from the chromosome inversion inv(16) (p13q22) as the 'start of AML'. In this murine model of AML, inv(16) is the initial founder event that results in differentiation block and expansion of abnormal progenitors, which are predisposed to subsequent cooperating events necessary to transform to overt AML (Cai et al., 2016;Castilla et al., 1999;Kuo et al., 2006). The approach used here directly models the sequence of events observed during leukaemia initiation. Finally, we denote t AML as the effective time that the aforementioned 10-fold proliferation change in MEP and other abnormal differentiation and proliferations due to AML are observed. The other parameters except the ones described in this section follow the ones from Section 4.1. Figure 7 shows the total number of cells in each cell type in the normal and AML conditions starting at t = 4. In the normal condition, the CMP, MEP and LMPP cells dominate the population after t ≥ 4. However, in the AML case, the MEP cells increase up to 100 times of the normal condition after a single psuedotime and dominate the population. Figure 7C plots the number of cells in each cell type separately, where we can observe the increasing number of cells not only in MEP, but also in the intermediate cell types, 4-8. Figure 8 compares the cell distribution on the graph between the normal and the AML case. In the AML case, the peak is shown on the edges near MEP cells.
The continuum of intermediate cell types, represented as numbers of cells along the edges of the graph are plotted in Figure 9. The cell distribution in the normal case at t = 1 and t = 3 shows the cell population moving on the edges from HSCs to progenitors states. Under normal haematopoiesis, we observe the flow of cells along the continuum from a stem cell like state to a progenitor state, with an even distribution of all types of progenitor cells. However in the AML case, we predict the emergence of novel intermediate cell types, including a mixed CMP-MPP3 and CMP-MEP cell type. These indeterminate cells may exhibit phenotypic and/or functional properties of both cell types on either side of the edge (node i and/or node j). This cell state may be unstable, phenotypically plastic, may be in an abnormal state or process of differentiation, or perhaps even undergoing a selection pressure to induce transformation. Of note, this prediction of a mixed CMP-MEP cell type echoes the biological observation that abnormal myeloid progenitors seen during AML progression exhibit an MEP-like immunophenotype with a CMP-like functional readout (Kuo et al., 2006). This mixed identity/functionality coincides with a strong differentiation block toward erythrocyte and megakaryocytes (Cai et al., 2016).
This highlights the advantage of modelling pathologic conditions in a continuum of cell states as the phenotypic properties and the differentiation processes are often abnormal during pathogenesis. This approach also circumvents the limitations of varying phenotypic definitions used in different studies in the literature (e.g. MEP vs. Pre-Meg/E) and the varying degree of heterogeneity within phenotypically defined cell populations in health and in disease.
We also simulated AML starting at different time points from t = 1 to t = 6. Since our initial condition assumes that the cells have not yet developed to MEP, the total number of cells is maximized when the AML occurs after a critical amount of cells have differentiated into an MEP state. Figure 10 shows the results of model simulations, where we observe that the number of cells is maximal at later times when AML is started at t = 3. From these simulations, we infer that the short-and long-term evolution of AML may depend on the state and composition of the haematopoietic landscape at the time of AML initiation.

Discussion
We present a mathematical model of movement in an abstract space representing states of cellular differentiation. We represent trajectories in the differentiation space as a graph and model directed and random movement on the graph with PDEs. We demonstrate our modelling approach on a simple graph and then apply our model to haematopoiesis with publicly available scRNA-Seq data. We calibrate the PDE model to pseudotime trajectories in the diffusion map space and use the model to predict the early stages of pathogenesis of AML.
A more traditional approach for modelling the process of cell differentiation is to use a discrete collection of ODEs that describe dynamics of cells at n different maturation stages and the transition between those stages (cf. Lander et al., 2009;Lo et al., 2008;Marciniak-Czochra et al., 2009;Stiehl & Marciniak-Czochra, 2011). These discrete models are also referred to as 'multicompartmental models', and are based on the biological assumption that in each lineage of cell precursors there are discrete steps in the maturation process that are followed sequentially (cf. Lord, 1997;Uchida, Fleming, Alpern, & Weissman, 1993).
This view of the differentiation process being discrete does not capture biological observations that indicate that cell differentiation is more likely a continuous process, and that Figure 9. The continuum of cell states can be visualized as the density of cells along the 51 edges of the graph (rows) connecting node i (left) to node j (right) for all nodes i,j. Cell distribution (log 10 scale) on the edge comparing the normal condition and AML. In addition to an accumulation of MEP cells, novel intermediate cell states emerge resulting from the differentiation block and increased proliferation rate resulting from AML. These novel cell states are indicated with white arrows and generally fall between the CMP, MPP and MEP canonical cell states. The presented edges in the first row (t < 4) are lexicographically ordered with respect to the left end (a n ) to visualize which nodes are the differentiating cells departing from and with respect to the right end (b n ) in the second row (t > 4) to visualize which nodes are the arriving cells differentiated into. maturation may, in fact, even be decoupled from cell division (cf. Dontu, Al-Hajj, Abdallah, Clarke, & Wicha, 2003;Doumic et al., 2011). A number of mathematical models have been created that aim to capture the continuous process of cell differentiation (Adimy, Crauste, & Ruan, 2005;Alarcon, Getto, Marciniak-Czochra, & Vivanco, 2011;Bélair, Mackey, & Mahaffy, 1995;Colijn & Mackey, 2005;Doumic et al., 2011;Doumic-Jauffret, Kim, & Perthame, 2010;Gwiazda et al., 2012;Pujo-Menjouet, Crauste, & Adimy, 2004).
For example, in Doumic et al. (2011), the authors present a model of cell differentiation that assumes that the dynamics of differentiated precursors can be approximated by a continuous maturation model. The model is created by extending the multicompartment discrete system of Marciniak-Czochra et al. (2009). The authors provide a careful comparison that shows that the continuous structured population model is not a mathematical limit of the discrete multicompartment model. In particular, the dynamics of the continuous model allow commitment and maturation of cell progenitors to be a continuous process that can take place between cell divisions. They do show, however, that there is overlap in model dynamics with a particular choice of maturation rate. In Gwiazda et al. (2012), the authors subsequently developed a continuous model that can be viewed as a generalization that admits both the continuous model of Doumic et al. (2011) and the discrete model of Marciniak-Czochra et al. (2009) as special cases. In Prokharau, Vermolen, and García-Aznar (2014), the authors develop a PDE-based continuous model of cell differentiation that allows cells to differentiate into an arbitrary number of cell types. A particular differentiation trajectory can be determined by any number of parameters, such as biochemical factors, the current differentiation state or just by a random variable, so their approach allows differentiation to be either a deterministic or a stochastic process.
The modelling approach we present differs from previous cell differentiation models in that it is centred on capturing cell differentiation dynamics that take place within a space that has been created via a dimension reduction transformation of high-dimensional data. Within that reduced space, our model assumes that maturation and differentiation take place along a continuous trajectory. (The dimension reduction outcomes on the data sets we tested indicate that the trajectory will, in fact, be continuous.) Cells can differentiate along an arbitrary number of paths with an arbitrary number of end states, all of which are determined by the data set and dimension reduction technique employed. Thus, the reduced differentiation space is not predetermined, but is generated as a function of the dimension reduction technique and the data set of interest.
Although methods exist to characterize differentiation trajectories, such as OT (Schiebinger et al., 2017) and diffusion pseudotime methods (Haghverdi et al., 2016), an advantage of our approach is the ability to use a mathematical model to predict the outcomes of abnormal trajectories and to perturb the system mathematically with the model. We use this advantage of the mathematical model to simulate and explore AML pathogenesis based on immunophenotypic characterization of a mouse model for inv(16) AML. Our simulation results are consistent with the evolution of inv(16)-driven AML and predict dynamics in canonical cell populations as well as cells in novel, intermediate states of differentiation. The intermediate cell states such as CMP-MEP seen in our simulation is consistent with previous observations that CBFβ-SMMHC expressing phenotypic MEP cells confer CMP-like progenitor cell activity (Kuo et al., 2006). Given the phenotypic plasticity and aberrant differentiation occurring during leukaemia evolution, it is particularly informative to model cell dynamics in a continuum of differentiation space.
The novelty and power of this modelling approach is the ability to capture and predict dynamics of many interconnected cell types. We now consider a continuum of cellular states, and model movement between these states in aggregate by representing many cell populations and states in a single variable. This approach increases biological resolution of the system by characterizing an infinite number of sub-states in a continuum representation and allows us to make predictions with one equation and very few model parameters, which can be directly calibrated to experimental data, for example, with time-series cell differentiation experiments. These data could be used in place of the inferred pseudotime methods to construct more realistic differentiation trajectories, as well as estimate parameters such as the transport rates between locations in the differentiation space. We note that this is not equivalent to rates of cellular differentiation, since this allows inference of transition between intermediate states of differentiation which may not be directly calculated from differentiation assays which rely on specific lineage markers.
A limitation of our approach is that it does not include physical properties of the living biological system, such as the cellular microenvironment, which is known to play a critical role in the transformation of cell state and function. Furthermore, we recognize and acknowledge that cellular state transition dynamics as represented as a projection in a lowdimensional space is an approximation of the dynamics in the original high-dimensional space. Moreover, the dynamics observed and predicted in the lower dimensional space critically depend on the method of dimension reduction. This logic motivates our use of diffusion maps as the method to construct the differentiation space.
In addition, our current model assumes that the cell properties of the intermediate cell types change linearly between the node cell types. Although it is reasonable to assume that the overall cell properties in the macro scale changes linearly depending on the distance in the phenotypic space when no other information in between is given, our future work involves using the expression levels of the intermediate cells that are related to cell dynamics, e.g. cell cycle, differentiation and proliferation, to develop more appropriate models for the intermediate cells. A limitation of the Nestorowa et al. (2016) data set is that it includes only stem and committed progenitor cells, and lacks a population of fully differentiated cells (e.g. erythrocytes, platelets, B-cells, T-cells, etc.), which yields an incomplete differentiation trajectory. Although we note that the stem and progenitor cell populations are the leukaemia-initiating cell populations most immediately relevant to the pathogenesis of inv(16)-driven AML (Cai et al., 2016). Data sets covering the full spectrum of differentiation trajectory during normal and abnormal (AML) haematopoiesis will enable modelling of differentiation blocks occurring at later stages of differentiation.
However, despite these limitations, we contend that this kind of analysis is a critical and valuable first step toward understanding the evolution of the higher dimensional system, and that low-dimensional approximations have value, particularly when predictions in the lower dimensional space can be experimentally validated. We postulate that when dynamics in low-dimensional representations are sufficiently characterized, they may eventually be used as a surrogate for high-dimensional data, thus reverting the trend of 'big data' back down to more informative 'small data'.
We note that our modelling approach can be applied to any data set or manifold shape. As more normal and abnormal cellular state transitions are characterized at single-cell resolution, we may apply similar computational and modelling methods to those systems. We emphasize our modelling approach is general and is not tailored or adapted to haematopoiesis in particular. Future applications of this approach may be useful to model the effects of therapies which target specific states of differentiation or the differentiation process itself, including other hematologic malignancies.

Notes
1. The interpolation function can be taken, for instance, as a linear function φ(c i , c j ) = (c j − c i )(x − a k )/(b k − a k ) + c i , where k = I(i, j). This assumes that the cell property changes linearly in terms of the distance in the diffusion component space (Doumic et al., 2011;Gwiazda et al., 2012). In addition, the values of V I(n,j) (x) near x = v n will take into account of the ratio of cells that branch out to different cell types v j , while the values of V I(i,n) (x) consider the ratio of cells that are flowing in from different cell types v i . 2. Using the notation in Appendix, γ 3 = p and γ 4 = 1 − p.
x ∈ e k , k=1,...,51 , ( A 1 ) where u k is the solution projected on the edge e k as u k (x, t) . = u(x, t)| x∈e k and {e k } 51 k=1 are the 51 edges connecting the 12 nodes {v n } 12 n=1 as in Figure 2B. We assume that the edges are unit length as e k = [a k , b k ] = [0, 1] and find the coefficients in Equation (A1) that are scaled to the unit length edge.
The total number of cells can be computed as ρ(t) . = 51 k=1 e k u k (x, t) dx, and we compute the number of cells in the nth cluster as Alternatively, since the boundary of the cell types are not distinctive, one can compute it as a weighted sum along the edges adjacent to the node n with linear weight functions such as ω(x) = −x + 1 and 1 − ω(x) along the entire edge.
To obtain the transfer rate between the cell nodes, we assume three discrete psuedotimes at those three sorted groups starting from LT-HSC to HSPC, and finally to progenitor cells. As remarked in Section 4.1, we consider subsets of nodes I 1 = {1, 2, 3} as HSC, I 2 = {4, . . . , 8} as HSPC and I 3 = {9, . . . , 12} as progenitor cell group. This follows the cell data in Nestorowa et al. (2016) that is classified with ComBat from the SVA package using the sorting gate of LT-HSC, HSPC and progenitor, and then processed with diffusion mapping initialized from a subpopulation of LT-HSC to the progenitor cells of different lineage of erythroid, granulocyte-macrophage and lymphoid. Accordingly, we consider three discrete psuedotimes considering LT-HSC (t 0 ), HSPC (t 1 ) and progenitor (t 2 ) and compute the number of cells in each node that is summarized in Table A1. We comment that diffusion pseudotime is not a physical time unit (i.e. days) and that the differentiation process is modelled based on the inferred pseudotime trajectories with the following mapping of pseudotimes t 0 = 0, t 1 = 2 and t 2 = 4. The time mapping procedure can be refined with time-series differentiation assay data. The transfer rates between the nodes are taken from the ratios at each psuedotime. Pietras, Warr, & Passegué, 2011). Moreover, the abnormal proliferation of cancerous cells with cell cycle λ and apoptosis of the differentiated cells with rate d at expression level x * can be modelled with a localized Gaussian function with variance as R k (x) = (ln(2)/λ) exp[−(x − x * ) 2 / ] and R k (x) = −d exp[−(x − x * ) 2 / ], respectively. The choice of localized Gaussian function assumes that the centre x * is the location in the diffusion space that most closely resembles the 'prototypical', or 'ideal' cell type identity.
With this choice, the total number of cells in each node ρ n (t 0 ) computed as in Equation (A2) is similar to the given ratios γ 0 n . The boundary condition defined as in Equation (3) around the node Table A2. Summary of the required data and corresponding parameters.
Biological meaning and parameters V k (x) Cell differentiation rate c k , branching ratio γ k R k (x) Growth rate r k D k (x) Phenotypic fluctuation σ k , w k Note: In our simulation, V k and R k are estimated fromρ k in Table A1. Figure A11. Change in the total number of cells ρ(t) in percentage with respect to the model parameters, diffusion D k , advection V k and reaction R k . We test the cases where the coefficients change their values by −10%, −1%, 1% and 10%. The results are sensitive to the reaction and advection coefficients particularly in the AML condition. On the other hand, the results are less dependent on the diffusion coefficient. Figure A12. Number of intermediate cells with respect to the model parameters, diffusion D k , advection V k and reaction R k . The results are computed by varying the coefficients by −10%, −1%, 1% and 10%. Although the result varies from the reference case (0%), the overall trend of the cell dynamics is observed to be similar.