Formal Verification
Formalization of the Conversion: Tree Models to Hierarchical Coloured Petri Nets (HCPN)
This section documents the mathematical foundation used by pyRuleAnalyzer to translate machine learning models (DT, RF, GBDT) into formal representations amenable to verification and structural analysis. The formalization employs Hierarchical Coloured Petri Nets (HCPN), which extend flat CPNs with module composition, substitution transitions, port-socket bindings, and fusion places. This enables a natural, layered representation where individual decision trees are reusable leaf modules composed into larger ensemble structures.
1. Formal Definitions
1.1 Coloured Petri Net
A Coloured Petri Net (CPN) is the standard 9-tuple:

\[CPN = (\Sigma, P, T, A, N, C, G, E, I)\]
| Element | Description |
|---|---|
| \(\Sigma\) | Colour sets. The data types carried by tokens: \(\text{VEC} = \mathbb{R}^d\) (feature vectors), \(\text{LABEL}\) (finite set of class labels), \(\text{PROB} = [0,1]^{|C|}\) (probability vectors, used by RF), \(\text{SCORE} = \mathbb{R}\) (real-valued scores, used by GBDT), and \(\text{STAGE} = \{1, 2, \dots, K{+}1\} \subset \mathbb{N}\) (stage counter, used by GBDT to enforce sequential accumulation). |
| \(P\) | Places. The set of places within the net. |
| \(T\) | Transitions. The set of transitions within the net. |
| \(A\) | Arcs. Directed connections between places and transitions. |
| \(N\) | Node function. Maps each arc to its source and destination nodes, defining the net topology. |
| \(C\) | Colour function. Assigns colour sets to places (e.g. \(C(P_{in}) = \text{VEC}\), \(C(P_{out}) = \text{LABEL}\)). |
| \(G\) | Guards. Boolean expressions on transitions: \(G(t_i) = \bigwedge_{j \in C_i} (x_{feat} \text{ op } \theta_j)\). |
| \(E\) | Arc expressions. Define how tokens flow. Input arcs: \(E(P_{in} \to t_i) = x\). Output arcs: \(E(t_i \to P_{out}) = v_i\). |
| \(I\) | Initialization. The initial marking of the net's places. |
1.2 CPN Module
A CPN module extends a flat CPN with an interface for hierarchical composition:

\[CPN_{mod} = (CPN, T_{sub}, P_{port}, PT)\]

where:
| Element | Description |
|---|---|
| \(CPN\) | A non-hierarchical Coloured Petri Net \(CPN = (\Sigma, P, T, A, N, C, G, E, I)\) as defined in Section 1.1. |
| \(T_{sub} \subseteq T\) | Substitution transitions. A subset of transitions that are placeholders for entire sub-modules. When a substitution transition is "refined", it is replaced by the internal net of the referenced sub-module. |
| \(P_{port} \subseteq P\) | Port places. A subset of places that serve as the module's interface with its parent. Tokens cross module boundaries exclusively through port places. |
| \(PT : P_{port} \to \{In, Out, In/Out\}\) | Port-type function. Assigns a direction to each port place: In (tokens flow into the module), Out (tokens flow out), or In/Out (bidirectional). |
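The two subset constraints and the totality of \(PT\) on \(P_{port}\) can be checked mechanically. The sketch below (hypothetical structures, not pyRuleAnalyzer's API) encodes the 4-tuple and validates those constraints:

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass
class CPNModule:
    places: Set[str]
    transitions: Set[str]
    t_sub: Set[str]             # substitution transitions, T_sub
    p_port: Set[str]            # port places, P_port
    port_type: Dict[str, str]   # PT : P_port -> {"In", "Out", "In/Out"}

    def validate(self) -> None:
        # The definitional constraints from Section 1.2.
        assert self.t_sub <= self.transitions, "T_sub must be a subset of T"
        assert self.p_port <= self.places, "P_port must be a subset of P"
        assert set(self.port_type) == self.p_port, "PT must be total on P_port"
        assert set(self.port_type.values()) <= {"In", "Out", "In/Out"}

# A leaf decision-tree module: one In port, one Out port, no submodules.
dt = CPNModule(
    places={"P_in", "P_out"},
    transitions={"t1", "t2"},
    t_sub=set(),
    p_port={"P_in", "P_out"},
    port_type={"P_in": "In", "P_out": "Out"},
)
dt.validate()
```

A leaf \(s_{DT}\) module has an empty \(T_{sub}\); only the ensemble-level modules introduce substitution transitions.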
1.3 Hierarchical Coloured Petri Net
An HCPN composes multiple CPN modules into a hierarchical structure:

\[HCPN = (S, SM, PS, FS)\]

where:
| Element | Description |
|---|---|
| \(S\) | Modules. A finite set of CPN modules. Each module is a self-contained net with its own places, transitions, and port interface. |
| \(SM : T_{sub} \to S\) | Submodule function. Maps each substitution transition (across all modules) to the module it represents. The resulting module hierarchy must be acyclic (i.e. it forms a directed acyclic graph of module references). |
| \(PS\) | Port-socket relation. For each substitution transition \(t \in T_{sub}\) and each port place \(p\) of \(SM(t)\), \(PS\) assigns a socket place \(p'\) in the parent module such that \(C(p) = C(p')\). Sockets are the "attachment points" through which the parent communicates with the sub-module. |
| \(FS \subseteq 2^P\) | Fusion sets. Each element of \(FS\) is a nonempty set of places (possibly from different modules) that are identified as a single logical place. All places in a fusion set share the same colour set and marking at all times. This allows multiple modules to share a common place without explicit arcs between them. |
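The acyclicity requirement on \(SM\) amounts to cycle detection over the module-reference graph. The following sketch (data shapes are assumptions for illustration) checks it with a standard three-colour depth-first search:

```python
from typing import Dict, List

def hierarchy_is_acyclic(sm_edges: Dict[str, List[str]]) -> bool:
    """sm_edges maps each module to the modules referenced by its
    substitution transitions (the image of SM restricted to that module)."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[str, int] = {}

    def visit(m: str) -> bool:
        colour[m] = GREY
        for child in sm_edges.get(m, []):
            state = colour.get(child, WHITE)
            if state == GREY:
                return False  # back edge: a module cycle
            if state == WHITE and not visit(child):
                return False
        colour[m] = BLACK
        return True

    return all(visit(m) for m in sm_edges if colour.get(m, WHITE) == WHITE)

# GBDT-style hierarchy: top module -> channel module -> leaf tree modules.
gbdt = {"s_GBDT": ["s_channel_1"], "s_channel_1": ["s_DT_1", "s_DT_2"]}
print(hierarchy_is_acyclic(gbdt))                      # True
print(hierarchy_is_acyclic({"a": ["b"], "b": ["a"]}))  # False
```

For the models considered here the hierarchy is at most three levels deep (\(s_{GBDT} \to s_{channel} \to s_{DT}\)), so the check is cheap, but it guards against malformed module wiring.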
2. Model-Specific Formalizations
Each model type has its own dedicated page with the complete HCPN formalization and a numerical induction example demonstrating correctness:
3. Rule Extraction Algorithm
The conversion follows the flow implemented in `rule_classifier.py`, which constructs the HCPN module hierarchy:

1. **Prior Extraction (GBDT only):** Queries scikit-learn's prior estimator to compute the initial score \(s_0\) (log-odds for binary classification, centred log-priors for multiclass). This value becomes the initial token in the channel module's \(P_{accum}\) place.
2. **Traversal:** Recursively traverses each tree structure from root to each leaf node. At every internal node, the left child receives a `<=` predicate and the right child receives a `>` predicate over the same feature and threshold. Each complete root-to-leaf path produces a rule that becomes a guarded transition within an \(s_{DT}\) leaf module.
3. **Predicate Construction:** Concatenates the accumulated split decisions along the path into a boolean conjunction that becomes the transition guard \(G(t_i)\).
4. **Module Assembly:** Maps the extracted rules into the HCPN hierarchy:
   - **DT:** A single \(s_{DT}\) module with all rules as transitions.
   - **RF:** A top module \(s_{RF}\) with \(N\) substitution transitions, each bound to an \(s_{DT}\) instance. The fusion set shares \(P_{in}\).
   - **GBDT:** A three-level hierarchy \(s_{GBDT} \to s_{channel} \to s_{DT}\), with \(|C|\) channel instances and \(K\) tree instances per channel.
5. **Rule Encapsulation:** Maps each path into a Rule object. For DT/RF, the rule stores the class label and class distribution (sample counts at the leaf). For GBDT, the rule additionally stores the raw `leaf_value`, the `learning_rate`, and the `class_group`, from which the effective contribution \(\eta \cdot v\) is computed.
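The traversal and predicate-construction steps can be sketched as follows. This is a simplified stand-in, not the `rule_classifier.py` implementation: it uses the same array layout as scikit-learn's fitted `tree_` object (`children_left`, `children_right`, `feature`, `threshold`, with `-1` marking leaves), but the toy tree and the string form of the predicates are assumptions for illustration.

```python
from typing import List, Tuple

def extract_rules(children_left: List[int], children_right: List[int],
                  feature: List[int], threshold: List[float],
                  value: List[str]) -> List[Tuple[List[str], str]]:
    """Walk root-to-leaf; left gets '<=', right gets '>' on the same split."""
    rules: List[Tuple[List[str], str]] = []

    def walk(node: int, path: List[str]) -> None:
        if children_left[node] == -1:          # leaf (sklearn convention)
            rules.append((path, value[node]))  # path conjunction -> G(t_i)
            return
        f, th = feature[node], threshold[node]
        walk(children_left[node],  path + [f"x[{f}] <= {th}"])
        walk(children_right[node], path + [f"x[{f}] > {th}"])

    walk(0, [])
    return rules

# Toy tree: root splits on x[0] <= 0.5; left child is leaf "A";
# right child splits on x[1] <= 2.0 into leaves "B" and "C".
rules = extract_rules(
    children_left=[1, -1, 3, -1, -1],
    children_right=[2, -1, 4, -1, -1],
    feature=[0, -2, 1, -2, -2],
    threshold=[0.5, -2.0, 2.0, -2.0, -2.0],
    value=["", "A", "", "B", "C"],
)
for guard, label in rules:
    print(" AND ".join(guard), "->", label)
```

Each returned pair corresponds to one guarded transition of an \(s_{DT}\) module: the list of predicates is the conjunction \(G(t_i)\) and the outcome is the token \(v_i\) on the output arc.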
Note
By converting ML models into an HCPN representation, it becomes possible to apply state-space analysis methods at each hierarchical level — detecting unreachable rules within leaf modules, verifying synchronization properties across ensemble compositions, and transforming the model’s “black box” into a transparent, formally verifiable logical structure.
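As a concrete instance of one such analysis, a rule (guarded transition) is unreachable exactly when its guard conjunction is unsatisfiable, and for axis-aligned threshold predicates this reduces to intersecting one interval per feature. The helper below is a hedged sketch of that idea, not the pyRuleAnalyzer API:

```python
from typing import Dict, List, Tuple

def rule_is_reachable(predicates: List[Tuple[int, str, float]]) -> bool:
    """predicates: (feature index, '<=' or '>', threshold).
    Satisfiable iff every feature's interval (lo, hi] is nonempty."""
    lo: Dict[int, float] = {}   # strict lower bounds per feature (from '>')
    hi: Dict[int, float] = {}   # inclusive upper bounds per feature (from '<=')
    for f, op, th in predicates:
        if op == "<=":
            hi[f] = min(hi.get(f, float("inf")), th)
        else:  # '>'
            lo[f] = max(lo.get(f, float("-inf")), th)
    return all(lo.get(f, float("-inf")) < hi.get(f, float("inf"))
               for f in set(lo) | set(hi))

print(rule_is_reachable([(0, "<=", 0.5), (1, ">", 2.0)]))  # True
print(rule_is_reachable([(0, "<=", 0.5), (0, ">", 0.7)]))  # False: x0 > 0.7 and x0 <= 0.5
```

Within a single decision tree every path is satisfiable by construction, so this check becomes interesting after simplification or when composing guards across ensemble levels.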