Creation of a Staged Tree#

EventTree Class#

The starting point in constructing a Chain Event Graph (CEG) is to create an event tree describing the process being studied. An event tree is a directed tree graph with a single root node. The nodes with no emanating edges are called leaves, and the non-leaf nodes are called situations.

In this example we work with a data set which contains four categorical variables: Classification, Group, Difficulty, and Response.

Each individual is given a binary classification: Blast or Non-blast. Each group is rated on their experience level: Experienced, Inexperienced, or Novice. The classification task they are given has a difficulty rating of Easy or Hard. Finally, their response is shown: Blast or Non-blast.

We begin by importing the data set and initializing the EventTree object, as shown below:

from cegpy import EventTree
import pandas as pd

df = pd.read_excel('../../data/medical_dm_modified.xlsx')
print(df.head())

#initialize the event tree
et = EventTree(df)
  Classification        Group Difficulty   Response
0          Blast  Experienced       Easy      Blast
1      Non-blast  Experienced       Easy  Non-blast
2      Non-blast  Experienced       Hard      Blast
3      Non-blast  Experienced       Hard  Non-blast
4          Blast  Experienced       Easy      Blast

In order to display the EventTree, we can use the method create_figure(). The numbers above the edges of the event tree represent the number of individuals who passed through the given edge.

et.create_figure()
../_images/11d6dbfc9e521b5cd4ebc4bfef581bd3342905f2b16bf65ee009b7d950344653.png
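
The edge counts can also be reproduced by hand, which makes clear what the figure reports. The following is a minimal sketch using only the standard library, not cegpy itself: the `rows` list copies the five records printed by `df.head()` above, and `edge_counts` is an illustrative name rather than part of the cegpy API.

```python
from collections import Counter

# The five records printed by df.head() above, in variable order:
# Classification, Group, Difficulty, Response.
rows = [
    ("Blast", "Experienced", "Easy", "Blast"),
    ("Non-blast", "Experienced", "Easy", "Non-blast"),
    ("Non-blast", "Experienced", "Hard", "Blast"),
    ("Non-blast", "Experienced", "Hard", "Non-blast"),
    ("Blast", "Experienced", "Easy", "Blast"),
]

# An edge is identified by the path from the root up to and including it,
# so every prefix of a record contributes one count to exactly one edge.
edge_counts = Counter()
for row in rows:
    for depth in range(1, len(row) + 1):
        edge_counts[row[:depth]] += 1

for path, count in sorted(edge_counts.items()):
    print(" -> ".join(path), count)
```

Each key is a root-to-edge path, so the count for `("Non-blast",)` is the number of individuals whose first recorded value is Non-blast.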

StagedTree Class#

In an event tree, each situation is associated with a transition parameter vector which indicates the conditional probability of an individual, who has arrived at the situation, going along one of its edges. In order to create a CEG, we first need to elicit a staged tree. This is done by first partitioning situations into stages, which are collections of situations in the event tree whose immediate evolutions, i.e. their associated conditional transition parameter vectors, are equivalent. To indicate this symmetry, all situations in the same stage are assigned a single colour.
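
To make the notion of a transition parameter vector concrete, the sketch below estimates one from edge counts as simple relative frequencies and compares two situations. The counts are hypothetical and `transition_vector` is an illustrative helper, not a cegpy function. A model selection algorithm such as AHC does not rely on an exact equality check like this one; the comparison here only illustrates what "equivalent immediate evolutions" means.

```python
import math


def transition_vector(edge_counts):
    """Estimate transition probabilities as relative frequencies of edge counts."""
    total = sum(edge_counts.values())
    return {label: count / total for label, count in edge_counts.items()}


# Hypothetical edge counts for two situations with the same edge labels.
counts_a = {"Easy": 30, "Hard": 70}
counts_b = {"Easy": 60, "Hard": 140}

theta_a = transition_vector(counts_a)
theta_b = transition_vector(counts_b)

# Two situations are candidates for the same stage when their vectors agree.
same_stage_candidate = all(
    math.isclose(theta_a[label], theta_b[label]) for label in theta_a
)
print(theta_a, theta_b, same_stage_candidate)
```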

Identification of the stages in the event tree can be done using any suitable model selection algorithm. Currently, the only available selection algorithm in cegpy is the Agglomerative Hierarchical Clustering (AHC) algorithm (Freeman and Smith, 2011).

In order to create a staged tree in cegpy we first initialize a StagedTree object from the dataset and then run the AHC algorithm using the calculate_AHC_transitions method, as displayed below. The output of the AHC algorithm is a dictionary containing the following information:

  • Merged Situations - a list of tuples representing the partition of the nodes into stages

  • Log Likelihood - the log likelihood of the data under the model selected by AHC

from cegpy import StagedTree

st = StagedTree(df)
st.calculate_AHC_transitions()
{'Merged Situations': [('s1', 's2'),
  ('s20', 's18'),
  ('s10', 's12'),
  ('s7', 's4', 's3', 's5', 's8', 's6'),
  ('s9', 's11'),
  ('s17', 's19', 's16'),
  ('s0',),
  ('s13',),
  ('s14',),
  ('s15',)],
 'Log Likelihood': -30091.353114865367}
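
The returned dictionary is plain Python, so it can be inspected directly. The sketch below copies the output shown above into a literal (so that it runs without cegpy) and splits the partition into merged stages and singletons; the variable names are illustrative only.

```python
# Output of calculate_AHC_transitions() as printed above.
ahc_output = {
    "Merged Situations": [
        ("s1", "s2"),
        ("s20", "s18"),
        ("s10", "s12"),
        ("s7", "s4", "s3", "s5", "s8", "s6"),
        ("s9", "s11"),
        ("s17", "s19", "s16"),
        ("s0",),
        ("s13",),
        ("s14",),
        ("s15",),
    ],
    "Log Likelihood": -30091.353114865367,
}

stages = ahc_output["Merged Situations"]

# Stages with more than one situation are the ones AHC actually merged.
merged_stages = [stage for stage in stages if len(stage) > 1]
singletons = [stage[0] for stage in stages if len(stage) == 1]

# Lookup from situation name to the index of its stage in the partition.
situation_to_stage = {
    situation: index for index, stage in enumerate(stages) for situation in stage
}

print(len(merged_stages), "merged stages;", "singletons:", singletons)
```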

Within cegpy, singleton stages, i.e. stages containing a single situation, are coloured white, while leaves and their corresponding sink node are coloured light grey. Running AHC on our data set results in the following staged tree.

st.create_figure()
../_images/06b024b289ece89bc51da4e7f0a53b4b94f2f4b09fdb464e9a642e47503cdf86.png

Custom Hyperstages#

cegpy allows the user to specify which situations are allowed to be merged by the AHC algorithm. This is done by specifying a hyperstage (Collazo et al., 2017), which is a collection of sets such that two situations cannot be in the same stage unless they belong to the same set in the hyperstage. Under the default setting in cegpy, all situations which have the same number of outgoing edges and the same set of edge labels are in the same set within the hyperstage. The default hyperstage of a given tree can be displayed by accessing the hyperstage property, which returns a list of lists, where each sublist contains the situations belonging to the same set of the hyperstage.

st.hyperstage
[['s0',
  's9',
  's10',
  's11',
  's12',
  's13',
  's14',
  's15',
  's16',
  's17',
  's18',
  's19',
  's20'],
 ['s1', 's2'],
 ['s3', 's4', 's5', 's6', 's7', 's8']]

In this example, situations \(s_1\) and \(s_2\) belong to the same set of the hyperstage. Each of them has three emanating edges with labels Experienced, Inexperienced, and Novice. However, situations \(s_6\) and \(s_{15}\) belong to different sets. They both have two emanating edges, yet with different labels: Easy, Hard for \(s_6\) and Blast, Non-blast for \(s_{15}\).
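
The default rule can be sketched as grouping situations by their set of outgoing edge labels. This is an illustration of the rule as described above, not cegpy's internal implementation. The `outgoing_labels` mapping is written out by hand for a few situations: the labels for \(s_1\), \(s_2\), \(s_6\) and \(s_{15}\) are those stated in the text, and the labels for the root \(s_0\) are assumed to be those of the Classification variable.

```python
from collections import defaultdict

# Hand-written edge labels for a handful of situations (illustrative subset).
outgoing_labels = {
    "s0": ("Blast", "Non-blast"),  # assumed: root splits on Classification
    "s1": ("Experienced", "Inexperienced", "Novice"),
    "s2": ("Experienced", "Inexperienced", "Novice"),
    "s6": ("Easy", "Hard"),
    "s15": ("Blast", "Non-blast"),
}

# Situations with an identical set of edge labels (and hence the same
# number of outgoing edges) fall into the same set of the hyperstage.
groups = defaultdict(list)
for situation, labels in outgoing_labels.items():
    groups[frozenset(labels)].append(situation)

default_hyperstage = list(groups.values())
print(default_hyperstage)
```

Under this rule \(s_0\) and \(s_{15}\) share a set because their edge labels coincide, which agrees with the first sublist printed by `st.hyperstage`.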

We can specify a different hyperstage at the point of running the AHC algorithm by passing a list defining the hyperstage partition as a parameter to the calculate_AHC_transitions method, for example:

new_hyperstage = [
    ['s0'], 
    ['s3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 
    's13', 's14', 's15', 's16', 's17', 's18', 's19','s20'],
    ['s1', 's2'],
]
st.calculate_AHC_transitions(hyperstage=new_hyperstage)
{'Merged Situations': [('s1', 's2'),
  ('s20', 's18'),
  ('s10', 's12'),
  ('s7', 's4', 's3', 's5', 's8', 's6'),
  ('s9', 's11'),
  ('s17', 's19', 's16'),
  ('s0',),
  ('s13',),
  ('s14',),
  ('s15',)],
 'Log Likelihood': -30091.353114865367}
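
Before passing a custom hyperstage, it can be worth checking that it is a proper partition of the situations. The helper below is a hypothetical convenience, not part of cegpy, and whether cegpy performs similar validation itself is not assumed here. The situation names `s0` to `s20` are taken from the output above.

```python
def check_hyperstage(hyperstage, situations):
    """Return (duplicates, missing, unknown) for a proposed hyperstage."""
    listed = [s for group in hyperstage for s in group]
    duplicates = {s for s in listed if listed.count(s) > 1}
    missing = set(situations) - set(listed)
    unknown = set(listed) - set(situations)
    return duplicates, missing, unknown


# The 21 situations of the staged tree, s0 to s20.
situations = [f"s{i}" for i in range(21)]

new_hyperstage = [
    ["s0"],
    ["s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10", "s11", "s12",
     "s13", "s14", "s15", "s16", "s17", "s18", "s19", "s20"],
    ["s1", "s2"],
]

duplicates, missing, unknown = check_hyperstage(new_hyperstage, situations)
print(duplicates, missing, unknown)
```

All three sets are empty for `new_hyperstage`, so every situation appears in exactly one set.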

Structural and sampling zeros / missing values#

The package, by default, treats all blank and NaN cells as structural missing values, i.e. data that is missing for a logical reason. However, such cells sometimes arise from sampling limitations instead, in which case they are sampling missing values. Similarly, a certain value of a variable (given its ancestral variables) may be absent from the data set not because it is a structural zero but because of sampling limitations, in which case we are dealing with a sampling zero.

Consider the following example of the falls.xlsx data set which provides information concerning adults over the age of 65, and includes four categorical variables as given below with their state spaces:

  • Housing Assessment: Living situation and whether they have been assessed, state space: {"Communal Assessed", "Communal Not Assessed", "Community Assessed", "Community Not Assessed"};

  • Risk: Risk of a future fall, state space: {"High Risk", "Low Risk"};

  • Treatment: Referral and treatment status, state space: {"Not Referred & Not Treated", "Not Referred & Treated", "Referred & Treated"};

  • Fall: the outcome, state space: {"Fall", "Don't Fall"}.

df = pd.read_excel('../../data/Falls_Data.xlsx')
df.head()
        HousingAssessment       Risk  Treatment        Fall
0  Community Not Assessed   Low Risk        NaN        Fall
1  Community Not Assessed  High Risk        NaN        Fall
2  Community Not Assessed   Low Risk        NaN  Don't Fall
3  Community Not Assessed   Low Risk        NaN  Don't Fall
4  Community Not Assessed   Low Risk        NaN        Fall
et = EventTree(df)
et.create_figure()
../_images/9dfad13dbb1d3217bd0d08c913ed69d028f7829af85e4af2304fc0a32eb0b184.png

Observe that this process has structural asymmetries. None of the individuals assessed to be low risk are referred to the falls clinic and thus, for this group, the count associated with the "Referred & Treated" category is a structural zero:

df[df.Risk == "Low Risk"]['Treatment'].value_counts()
Treatment
Not Referred & Not Treated    1396
Not Referred & Treated         170
Name: count, dtype: int64

Moreover, for individuals who are not assessed, their responses are structurally missing for the Treatment variable:

# Missing values in each column
print(df.isna().sum())

# Missing values for Treatment are structural, 
# they are missing due to the lack of assessment:
df[df.HousingAssessment.isin([
    'Community Not Assessed', 'Communal Not Assessed'
])]['Treatment'].isna().sum()
HousingAssessment        0
Risk                     0
Treatment            46750
Fall                     0
dtype: int64
46750

In cegpy, any paths that should logically be in the event tree description of the process but are absent from the dataset due to sampling limitations need to be added manually by the user, using the sampling_zero_paths argument when initialising the EventTree object. Further, not all missing values in the dataset are necessarily structurally missing.

How to distinguish between structural and sampling missing values?#

e.g. Falls example: Suppose that some individuals in communal establishments who are not formally assessed but are known to be high risk were actually either "Not Referred & Treated" or "Not Referred & Not Treated" but that these observations were missing in the falls.xlsx dataset due to sampling limitations. All the other blank/NaN cells are structurally missing.

idx = (df.HousingAssessment == 'Communal Not Assessed') & (df.Risk == 'High Risk')
df[idx]
           HousingAssessment       Risk  Treatment        Fall
   67  Communal Not Assessed  High Risk        NaN        Fall
   72  Communal Not Assessed  High Risk        NaN        Fall
   95  Communal Not Assessed  High Risk        NaN        Fall
  102  Communal Not Assessed  High Risk        NaN        Fall
  132  Communal Not Assessed  High Risk        NaN        Fall
  ...                    ...        ...        ...         ...
49065  Communal Not Assessed  High Risk        NaN        Fall
49087  Communal Not Assessed  High Risk        NaN        Fall
49135  Communal Not Assessed  High Risk        NaN  Don't Fall
49461  Communal Not Assessed  High Risk        NaN        Fall
49905  Communal Not Assessed  High Risk        NaN        Fall

436 rows × 4 columns

To demarcate the difference between structural and sampling missing values, a user can give different labels to the structural and sampling missing values in the dataset and provide these labels to the struct_missing_label and missing_label arguments respectively when initialising the EventTree or StagedTree object.

In our example, we can replace the NaN values for the Treatment variable among the considered subset of data with a new label, e.g. samp_miss:

df.loc[idx, 'Treatment'] = 'samp_miss'

The next step is to pass these labels to the EventTree or StagedTree object, as shown below. This generates a new path along ('Communal Not Assessed', 'High Risk', 'missing'):

et2 = EventTree(df,
    missing_label='samp_miss',
)
et2.create_figure()
../_images/a80bc6a84667d37317ee29e3abb93e34acc5b301791a6f62aa4cb819e133b28c.png

How to add sampling zeros?#

e.g. Falls example: Suppose that some individuals in the community who were assessed and high risk were referred and not treated. Suppose that our observations are still the same as in the falls.xlsx dataset. Here, by design, this was allowed, but was not observed in the dataset. So we need to add this value in manually as a path ("Community Assessed", "High Risk", "Referred & Not Treated"). We also need to add in the values that follow it: i.e. ("Community Assessed", "High Risk", "Referred & Not Treated", "Fall") and ("Community Assessed", "High Risk", "Referred & Not Treated", "Don't Fall").

In cegpy, any paths that should logically be in the event tree description of the process but are absent from the dataset due to sampling limitations need to be added manually by the user, using the sampling_zero_paths argument when initialising the EventTree or StagedTree object. No changes need to be made to the dataset, as shown below:

st2 = StagedTree(df,
    sampling_zero_paths=[
        ('Community Assessed', 'High Risk', 'Referred & Not Treated'),
        ('Community Assessed', 'High Risk', 'Referred & Not Treated', 'Fall'),
        ('Community Assessed', 'High Risk', 'Referred & Not Treated', "Don't Fall")
])
st2.calculate_AHC_transitions()
st2.create_figure()
../_images/76efbf044b45fbfd1bf2e09ae7849dbe8116f46a596764c5b60457be03aca088.png
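
Since an unobserved path must be listed together with every path that continues from it, writing the tuples by hand becomes error-prone for longer trees. The helper below is a hypothetical convenience, not part of cegpy, that builds the list for one unobserved edge and the values of the variable that follows it.

```python
def expand_sampling_zero(prefix, continuations):
    """Return the unobserved path followed by one path per continuation."""
    prefix = tuple(prefix)
    paths = [prefix]
    paths.extend(prefix + (value,) for value in continuations)
    return paths


sampling_zero_paths = expand_sampling_zero(
    ("Community Assessed", "High Risk", "Referred & Not Treated"),
    ("Fall", "Don't Fall"),
)

for path in sampling_zero_paths:
    print(path)
```

The resulting list matches the one written out by hand above and can be passed as the sampling_zero_paths argument in the same way.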