Creating a Chain Event Graph

Example 1: Using a Stratified Dataset

This example builds a Chain Event Graph (CEG) from a discrete dataset of results from a medical experiment. The dataset is rectangular, so the event tree built from it is symmetric. CEGs built from such datasets are known as stratified in the literature.

The Agglomerative Hierarchical Clustering (AHC) algorithm is used to maximise the log marginal likelihood score of the staged tree/CEG model and so determine its stages. The package works within a Bayesian framework, and priors can be supplied to the AHC algorithm to override the default settings.
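To make the scoring idea concrete, the sketch below computes the log marginal likelihood of a single stage under a Dirichlet prior, and the change in score when two stages are merged, which is the quantity an agglomerative search tries to increase at each step. It is an illustrative sketch only, not cegpy's implementation: the function names, the prior of 1 per edge, the convention of summing priors when stages merge, and the edge counts are all assumptions.

```python
from math import lgamma


def stage_log_marginal_likelihood(prior, counts):
    """Dirichlet-multinomial log marginal likelihood of one stage."""
    posterior = [a + n for a, n in zip(prior, counts)]
    score = lgamma(sum(prior)) - lgamma(sum(posterior))
    score += sum(lgamma(p) - lgamma(a) for a, p in zip(prior, posterior))
    return score


def merge_gain(prior_1, counts_1, prior_2, counts_2):
    """Change in log marginal likelihood if two stages are merged."""
    # Assumption: the merged stage sums both the priors and the counts.
    merged_prior = [a + b for a, b in zip(prior_1, prior_2)]
    merged_counts = [m + n for m, n in zip(counts_1, counts_2)]
    separate = (stage_log_marginal_likelihood(prior_1, counts_1)
                + stage_log_marginal_likelihood(prior_2, counts_2))
    merged = stage_log_marginal_likelihood(merged_prior, merged_counts)
    return merged - separate


# Hypothetical edge counts for nodes with two outgoing edges each.
similar = merge_gain([1, 1], [40, 60], [1, 1], [42, 58])
different = merge_gain([1, 1], [40, 60], [1, 1], [90, 10])
print(similar > 0, different < 0)
```

With these counts, merging the two nodes with similar edge proportions raises the score, while merging the dissimilar pair lowers it, so a greedy search would accept only the first merge.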

The example medical.xlsx dataset contains 4 categorical variables: Classification, Group, Difficulty, and Response.

Each individual is given a binary classification: Blast or Non-blast. Each group is rated on its experience level: Experienced, Inexperienced, or Novice. The classification task they are given has a difficulty rating of Easy or Hard. Finally, their response is recorded: Blast or Non-blast.

First, the data source is loaded into a pandas DataFrame. A staged tree object is later created from it, and the AHC transitions are calculated.

from cegpy import StagedTree, ChainEventGraph
import pandas as pd

dataframe = pd.read_excel("medical.xlsx")
dataframe
       Classification        Group  Difficulty   Response
0               Blast  Experienced        Easy      Blast
1           Non-blast  Experienced        Easy  Non-blast
2           Non-blast  Experienced        Hard      Blast
3           Non-blast  Experienced        Hard  Non-blast
4               Blast  Experienced        Easy      Blast
...               ...          ...         ...        ...
10979           Blast       Novice        Easy  Non-blast
10980           Blast       Novice        Easy      Blast
10981       Non-blast       Novice        Easy      Blast
10982           Blast       Novice        Easy  Non-blast
10983       Non-blast       Novice        Hard  Non-blast

10984 rows × 4 columns

# Descriptive statistics for the dataset 
dataframe.describe()
        Classification   Group  Difficulty  Response
count            10984   10984       10984     10984
unique               2       3           2         2
top          Non-blast  Novice        Easy     Blast
freq              5493    7389        5494      5863

The AHC algorithm is executed on the event tree, and nodes found to be in the same stage are assigned the same colour. Note that the calculate_AHC_transitions method is only available from the StagedTree class and not the EventTree class.

Effectively, nodes in the same stage share the same parameter set; in other words, the immediate future of these nodes is identical. Note that singleton stages are not coloured in the staged tree and its corresponding CEG to prevent visual cluttering.
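The idea of a shared parameter set can be sketched as follows: the edge counts of all nodes in a stage are pooled, and every node in that stage receives the same posterior mean edge probabilities. This is a minimal illustration rather than cegpy's code; the node names, counts, staging, and the prior of 1 per edge per node are hypothetical.

```python
# Hypothetical edge counts for three nodes with the same outgoing edge labels.
node_counts = {
    "s3": {"Easy": 40, "Hard": 60},
    "s4": {"Easy": 45, "Hard": 55},
    "s5": {"Easy": 90, "Hard": 10},
}

# Hypothetical staging: s3 and s4 share a stage, s5 is a singleton stage.
stages = [["s3", "s4"], ["s5"]]


def stage_probabilities(stage, counts, prior=1.0):
    """Posterior mean edge probabilities shared by every node in a stage."""
    labels = list(counts[stage[0]])
    # Assumption: each node contributes `prior` pseudo-counts per edge.
    pooled = {
        label: prior * len(stage) + sum(counts[node][label] for node in stage)
        for label in labels
    }
    total = sum(pooled.values())
    return {label: value / total for label, value in pooled.items()}


# Every node inherits the probabilities of the stage it belongs to.
shared = {
    node: stage_probabilities(stage, node_counts)
    for stage in stages
    for node in stage
}
print(shared["s3"] == shared["s4"])
```

Nodes s3 and s4 end up with identical probabilities because they are estimated from the pooled counts of their stage, whereas s5 keeps its own estimate.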

When the CEG is created, equivalent nodes (precisely, those whose complete future is identical) in a stage will be combined to compress the graph.
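The distinction between sharing a stage and being combined in the CEG can be sketched with a small recursive check: two nodes are grouped only when they are in the same stage and their children along identically labelled edges are themselves grouped. This is an assumed, simplified illustration, not the routine cegpy uses; the tree, stage labels, and the find_positions helper are hypothetical.

```python
def find_positions(children, stage_of):
    """Group nodes whose stage and whole downstream structure coincide."""
    signature_ids = {}
    position_of = {}

    def visit(node):
        if node in position_of:
            return position_of[node]
        edges = children.get(node, {})
        if not edges:
            # All leaves collapse into a single sink.
            signature = "sink"
        else:
            signature = (
                stage_of[node],
                tuple(sorted((label, visit(child)) for label, child in edges.items())),
            )
        position_of[node] = signature_ids.setdefault(signature, len(signature_ids))
        return position_of[node]

    for node in children:
        visit(node)
    return position_of


# Hypothetical staged tree: node -> {edge label: child}.
children = {
    "s0": {"A": "s1", "B": "s2"},
    "s1": {"Easy": "s3", "Hard": "s4"},
    "s2": {"Easy": "s5", "Hard": "s6"},
    "s3": {"X": "l1", "Y": "l2"},
    "s4": {"X": "l3", "Y": "l4"},
    "s5": {"X": "l5", "Y": "l6"},
    "s6": {"X": "l7", "Y": "l8"},
}
# s1 and s2 share a stage, as do s3 and s5; s4 and s6 do not.
stage_of = {"s0": "u0", "s1": "u1", "s2": "u1",
            "s3": "u2", "s5": "u2", "s4": "u3", "s6": "u4"}

positions = find_positions(children, stage_of)
print(positions["s3"] == positions["s5"], positions["s1"] == positions["s2"])
```

Here s3 and s5 are combined because their futures match completely, while s1 and s2 stay separate despite sharing a stage, since their 'Hard' edges lead to nodes in different stages.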

staged_tree = StagedTree(dataframe)
staged_tree.calculate_AHC_transitions();

Once the AHC algorithm has been run to identify the stages, a CEG can be created by passing the StagedTree object into the ChainEventGraph class. When the ChainEventGraph is created, it automatically generates the CEG from the StagedTree object. The process of generation compares nodes that are in the same stage to determine if they are logically compatible with one another. Once the graph has been constructed, and nodes combined, the probabilities of passing down any given edge are displayed.

Like the StagedTree, the graph can be displayed using the create_figure method as shown below.

from IPython.display import Image

chain_event_graph = ChainEventGraph(staged_tree)
chain_event_graph.create_figure()
../_images/c78d8f69d4bb6313cd74747bfc1088b76e7eff384dfa7f222e175096a474b2aa.png

The tree has now been compressed into a Chain Event Graph. The graph represents the system encoded in the data. All paths start at the root node w0 (which represents an individual entering the system) and terminate at the sink node w∞ (which represents the point at which an individual exits the system).
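Because every path runs from the root to the sink, the probability of a complete path is the product of the edge probabilities along it, and the path probabilities sum to 1. The sketch below illustrates this on a small hand-written graph; the node names, edge labels, probabilities, and the root_to_sink_paths helper are hypothetical and unrelated to the figure above or to cegpy's internals.

```python
# Hypothetical CEG: node -> list of (edge label, probability, next node).
ceg_edges = {
    "w0": [("Blast", 0.5, "w1"), ("Non-blast", 0.5, "w2")],
    "w1": [("Easy", 0.4, "w3"), ("Hard", 0.6, "w3")],
    "w2": [("Easy", 0.7, "w3"), ("Hard", 0.3, "w4")],
    "w3": [("Blast", 0.8, "w_inf"), ("Non-blast", 0.2, "w_inf")],
    "w4": [("Blast", 0.1, "w_inf"), ("Non-blast", 0.9, "w_inf")],
}


def root_to_sink_paths(edges, node="w0", sink="w_inf"):
    """Yield (edge labels, probability) for every path from node to sink."""
    if node == sink:
        yield (), 1.0
        return
    for label, probability, child in edges[node]:
        for labels, tail_probability in root_to_sink_paths(edges, child, sink):
            yield (label,) + labels, probability * tail_probability


paths = dict(root_to_sink_paths(ceg_edges))
for labels, probability in paths.items():
    print(" -> ".join(labels), round(probability, 3))
```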

Example 2: Chain Event Graph from Non-Stratified Dataset

This example builds a Chain Event Graph (CEG) from an asymmetric dataset. Put simply, a dataset is asymmetric when the event tree describing the dataset is not symmetric around its root. The class of CEGs built from asymmetric event trees is said to be non-stratified. Note that, technically, a CEG is also said to be non-stratified when the order of events along its different paths is not the same, even though its event tree might be symmetric. Whilst such processes can also be easily modelled with the cegpy package, this example focuses on non-stratified CEGs that are built from asymmetric event trees/datasets.

Asymmetry in a dataset arises when it has structural zeros or structural missing values in certain rows; in other words, the sample space of a variable is different or empty respectively, depending on its ancestral variables. So logically, certain values of the variable will never be observed for certain configurations of its ancestral variables, irrespective of the sample size.
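The effect of structural missing values on the tree can be sketched in plain Python: when a missing value is skipped rather than treated as a category, the affected rows trace shorter root-to-leaf paths, and the tree becomes asymmetric. The rows, variable names, and the count_paths helper below are hypothetical, and this is not how cegpy builds its event tree internally.

```python
from collections import Counter

# Hypothetical rows (Assessment, Treatment, Fall); None marks a structural
# missing value, i.e. the variable has no sample space for that row.
rows = [
    ("Assessed", "Treated", "Fall"),
    ("Assessed", "Not Treated", "Don't Fall"),
    ("Not Assessed", None, "Fall"),
    ("Not Assessed", None, "Don't Fall"),
    ("Not Assessed", None, "Don't Fall"),
]


def count_paths(rows):
    """Count every root-to-node path prefix, skipping missing values."""
    counts = Counter()
    for row in rows:
        path = tuple(value for value in row if value is not None)
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts


path_counts = count_paths(rows)
for path, count in sorted(path_counts.items()):
    print(" -> ".join(path), count)
```

Paths through 'Assessed' have three edges, while paths through 'Not Assessed' have only two, because the missing variable contributes no edge.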

In this example, we consider the falls.xlsx dataset. Here, by design of the intervention, individuals who are not assessed are not offered referral or treatment. Non-assessed individuals in the dataset would therefore go down the ‘Not Referred & Not Treated’ path with probability 1. Such an edge carries no information, so we choose to condense the tree and remove it. The zero observations for non-assessed individuals in the categories ‘Referred & Treated’ and ‘Not Referred & Treated’ are both structural zeros.

from cegpy import EventTree
import pandas as pd

dataframe = pd.read_excel("falls.xlsx")
dataframe
            HousingAssessment       Risk  Treatment        Fall
0      Community Not Assessed   Low Risk        NaN        Fall
1      Community Not Assessed  High Risk        NaN        Fall
2      Community Not Assessed   Low Risk        NaN  Don't Fall
3      Community Not Assessed   Low Risk        NaN  Don't Fall
4      Community Not Assessed   Low Risk        NaN        Fall
...                       ...        ...        ...         ...
49995  Community Not Assessed   Low Risk        NaN  Don't Fall
49996  Community Not Assessed   Low Risk        NaN  Don't Fall
49997  Community Not Assessed   Low Risk        NaN  Don't Fall
49998  Community Not Assessed   Low Risk        NaN        Fall
49999  Community Not Assessed   Low Risk        NaN        Fall

50000 rows × 4 columns

Note: In the description of the dataset, the count for the Treatment column is lower than the counts for the other columns. This signals that the dataset is non-stratified. Extreme care must be taken to ensure that the dataset really is non-stratified, and does not simply contain sampling zeros or sampling missing values. The package has no way of distinguishing these on its own unless the user specifies them.

dataframe.describe()
             HousingAssessment      Risk                   Treatment        Fall
count                    50000     50000                        3250       50000
unique                       4         2                           3           2
top     Community Not Assessed  Low Risk  Not Referred & Not Treated  Don't Fall
freq                     45211     42505                        1768       34737
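One way to probe whether a lower count reflects structural missingness is to check how the missing values line up with an ancestral variable. The sketch below uses a small stand-in DataFrame rather than falls.xlsx, and its column names and values are hypothetical. An all-or-nothing pattern is consistent with structural missing values but does not prove it; that judgement still rests on knowledge of how the data were generated.

```python
import pandas as pd

# Hypothetical stand-in for a dataset with one partly empty column.
df = pd.DataFrame({
    "Assessment": ["Assessed", "Assessed", "Not Assessed", "Not Assessed"],
    "Treatment": ["Treated", "Not Treated", None, None],
    "Fall": ["Fall", "Don't Fall", "Fall", "Don't Fall"],
})

# Share of missing Treatment values within each Assessment category.
missing_share = df["Treatment"].isna().groupby(df["Assessment"]).mean()
print(missing_share)

# Structural missingness should be all-or-nothing within a category;
# a share strictly between 0 and 1 suggests sampling missing values.
all_or_nothing = bool(missing_share.isin([0.0, 1.0]).all())
print(all_or_nothing)
```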

The end result of this is that in the EventTree shown below, paths such as S0 -> S2 -> S7 -> S18 skip the Treatment variable.

event_tree = EventTree(dataframe)
event_tree.create_figure()
../_images/9dfad13dbb1d3217bd0d08c913ed69d028f7829af85e4af2304fc0a32eb0b184.png

As in the stratified medical example, after initial checks on the dataset, and confirmation that the EventTree looks as expected, the next step is to identify the stages. For this, we use the StagedTree class, which first creates the EventTree internally, ready for the user to run a clustering algorithm on it. In this example we use the .calculate_AHC_transitions() method, which executes the agglomerative hierarchical clustering (AHC) algorithm on the EventTree. The package functions under a Bayesian framework and priors can be supplied to the AHC algorithm to override the default settings.

The resultant CEG has been reduced from the tree representation to a more compact graph.

from cegpy import ChainEventGraph, StagedTree

st = StagedTree(dataframe)
st.calculate_AHC_transitions()

ceg = ChainEventGraph(st)
ceg.create_figure()
../_images/e94e83032fc47ff652e7cbd097f84ae38d00e0fefdbf9d6890d6862cf0b962e9.png

As a CEG is a probabilistic model of a series of events, it may be desirable to view a CEG sub-graph when some or all of the variables are known. This can be especially true for graphs with many variables, which can balloon in size. In cegpy, this is done by using the ChainEventGraphReducer, which is covered on the next page.
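The underlying idea of conditioning on known events can be sketched independently of the package: keep only the root-to-sink paths that are consistent with what has been observed, then renormalise their probabilities. The sketch below does not use or reproduce the ChainEventGraphReducer API; the paths, probabilities, and the condition_on helper are hypothetical.

```python
# Hypothetical root-to-sink paths of a small CEG with their probabilities.
path_probabilities = {
    ("Assessed", "Treated", "Fall"): 0.02,
    ("Assessed", "Treated", "Don't Fall"): 0.08,
    ("Assessed", "Not Treated", "Fall"): 0.03,
    ("Assessed", "Not Treated", "Don't Fall"): 0.07,
    ("Not Assessed", "Fall"): 0.24,
    ("Not Assessed", "Don't Fall"): 0.56,
}


def condition_on(paths, observed):
    """Keep paths containing every observed edge label and renormalise."""
    kept = {
        path: probability
        for path, probability in paths.items()
        if set(observed) <= set(path)
    }
    total = sum(kept.values())
    if total == 0:
        raise ValueError("No path is consistent with the observed events.")
    return {path: probability / total for path, probability in kept.items()}


reduced = condition_on(path_probabilities, {"Assessed"})
for path, probability in reduced.items():
    print(" -> ".join(path), round(probability, 3))
```

Only the four paths through 'Assessed' remain, and their probabilities are rescaled so that they again sum to 1.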