RuleClassifier
- class pyruleanalyzer.RuleClassifier(rules, algorithm_type='Decision Tree')
- adjust_and_remove_rules(method)
Adjusts and removes duplicated rules from the rule set based on the specified method.
This method analyzes the current rule set to identify duplicates. It creates generalized rules by merging sibling nodes (soft) or representative rules for inter-tree duplicates (hard).
- static calculate_structural_complexity(rules: List[Rule], n_features_total: int) Dict[str, Any]
Computes a normalized complexity score and other interpretability metrics for a rule set.
This method calculates a comprehensive set of metrics, including:
- A novel ‘Structural Complexity Score’ based on depth balance and attribute usage.
- Traditional metrics such as rule counts and depth statistics.
The primary score combines two dimensions:
1. Depth Balance (D_bal): the ratio of mean rule depth to max depth.
D_bal = mean_rule_depth / max_depth (values closer to 1 indicate a balanced tree structure).
2. Normalized Attribute Usage (A_norm): measures feature diversity across rules.
A_norm = sum(unique_attributes_per_rule) / (total_rules * n_features_total) (higher values indicate the model leverages a wider range of features).
The final ‘complexity_score’ (D_bal * A_norm) rewards models that are both structurally balanced and diverse in feature usage.
- Parameters:
rules (List[Rule]) – A list of Rule objects to analyze.
n_features_total (int) – The total number of available features in the dataset.
- Returns:
A dictionary containing detailed complexity metrics.
- Return type:
Dict[str, Any]
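The two dimensions above can be made concrete with a small sketch. It assumes each rule is represented by its depth (number of conditions) and the set of attributes it tests; the function and argument names are illustrative, not the library's internals.

```python
# Illustrative reimplementation of the documented score, assuming each rule
# is summarized by a depth and a set of tested attributes.

def structural_complexity(rule_depths, rule_attrs, n_features_total):
    max_depth = max(rule_depths)
    mean_depth = sum(rule_depths) / len(rule_depths)
    d_bal = mean_depth / max_depth  # closer to 1 => balanced structure
    a_norm = sum(len(a) for a in rule_attrs) / (len(rule_depths) * n_features_total)
    return {"D_bal": d_bal, "A_norm": a_norm, "complexity_score": d_bal * a_norm}

# Three rules of depths 2, 3, 3 over a 4-feature dataset:
m = structural_complexity(
    [2, 3, 3],
    [{"v1", "v2"}, {"v1", "v3", "v4"}, {"v2", "v3", "v4"}],
    n_features_total=4,
)
```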
- classify(data, final=False)
Classifies a single data instance using extracted rules.
This method delegates the classification logic. If ‘final’ is False (using initial rules) and the native function is compiled, it uses the high-performance in-memory function. Otherwise, it falls back to iterative evaluation.
- Parameters:
data (Dict[str, float]) – The instance to classify.
final (bool) – If True, uses final_rules (post-analysis).
- Returns:
Predicted class label (int).
List of votes (Random Forest only).
Class probabilities (Random Forest only).
- Return type:
Tuple[int, List[int]|None, List[float]|None]
- static classify_dt(data, rules)
Classifies a single data instance using extracted rules from a decision tree. Iterates through the list until the first rule is satisfied.
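First-match evaluation can be sketched as follows, assuming rules are (conditions, label) pairs with conditions as (variable, operator, value) tuples; the real library uses Rule objects, so this is only a shape illustration.

```python
# Minimal first-match rule evaluation sketch; data layout is assumed.
import operator

_OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}

def classify_dt_sketch(data, rules):
    for conditions, label in rules:
        if all(_OPS[op](data[var], val) for var, op, val in conditions):
            return label  # first satisfied rule wins
    return None  # no rule matched

rules = [
    ([("v1", "<=", 0.5)], 0),
    ([("v1", ">", 0.5), ("v2", "<=", 2.0)], 1),
]
pred = classify_dt_sketch({"v1": 0.7, "v2": 1.0}, rules)
```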
- static classify_gbdt(data, rules, init_scores, is_binary, classes)
Classifies a single data instance using GBDT additive scoring.
For each class group, the method sums the init score plus the contribution of the first matching rule in each tree. Binary classification uses sigmoid; multiclass uses argmax.
- Parameters:
data (Dict[str, float]) – Instance data.
rules (List[Rule]) – All GBDT Rule objects (init + tree rules).
init_scores (Dict[str, float]) – Init scores per class group.
is_binary (bool) – Whether this is binary classification.
classes (List[str]) – List of class label strings.
- Returns:
Predicted class label (int).
List of matched rules (one per tree per class group).
None (kept for API consistency).
- Return type:
Tuple[int, List[Rule], None]
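The additive scoring described above can be sketched in a few lines: per class group, the raw score is init_score plus one leaf contribution per tree, and the decision is sigmoid thresholding (binary) or argmax (multiclass). Function names and data layout here are assumptions, not the library's internals.

```python
# Hedged sketch of GBDT additive scoring and the final decision rule.
import math

def gbdt_score(init_score, tree_contribs):
    # init score plus the first matching leaf contribution of each tree
    return init_score + sum(tree_contribs)

def gbdt_predict(scores_per_class, is_binary):
    if is_binary:
        # single raw score for the positive class
        p = 1.0 / (1.0 + math.exp(-scores_per_class[0]))
        return int(p >= 0.5)
    return max(range(len(scores_per_class)), key=scores_per_class.__getitem__)

binary_pred = gbdt_predict([gbdt_score(-0.4, [0.3, 0.5])], is_binary=True)
multi_pred = gbdt_predict([0.1, 1.2, -0.3], is_binary=False)
```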
- static classify_rf(data, rules)
Classifies a single data instance using extracted rules from a Random Forest. Uses soft probability averaging (like sklearn) when class distributions are available, otherwise falls back to hard majority voting.
- Parameters:
data (Dict[str, float]) – Instance data.
rules (List[Rule]) – List of rule instances.
- Returns:
(Predicted Label, Votes List, Proba List, Matched Rules List)
- Return type:
Tuple
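The two aggregation modes above can be illustrated with a small sketch, assuming each matched leaf exposes either a class-probability vector (soft) or a single vote (hard); this mirrors sklearn-style soft averaging rather than the library's actual code.

```python
# Soft probability averaging vs. hard majority voting, data layout assumed.

def rf_soft_predict(leaf_probas):
    n_classes = len(leaf_probas[0])
    avg = [sum(p[c] for p in leaf_probas) / len(leaf_probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

def rf_hard_predict(votes):
    return max(set(votes), key=votes.count)

label, proba = rf_soft_predict([[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]])
hard_label = rf_hard_predict([0, 1, 1])
```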
- compare_initial_final_results(file_path)
Compares the classification performance of the initial and final rule sets.
Delegates to DTAnalyzer, RFAnalyzer, or GBDTAnalyzer. If execute_rule_analysis was called earlier in the same session, the cached _analyzer is reused; otherwise a fresh one is created.
- Parameters:
file_path (str) – Path to the CSV file used for evaluation.
- compare_initial_final_results_dt(df_test=None, target_column_name=None, file_path=None)
Delegates to DTAnalyzer. Kept for backward compatibility.
- Parameters:
df_test – Ignored (kept for signature compatibility).
target_column_name – Ignored (kept for signature compatibility).
file_path (str, optional) – Path to the CSV test file.
- compare_initial_final_results_rf(df_test=None, target_column_name=None, file_path=None)
Delegates to RFAnalyzer. Kept for backward compatibility.
- Parameters:
df_test – Ignored (kept for signature compatibility).
target_column_name – Ignored (kept for signature compatibility).
file_path (str, optional) – Path to the CSV test file.
- compile_tree_arrays(rules: list | None = None, feature_names: list | None = None) None
Compile rules into numpy tree arrays for vectorized prediction.
After calling this method, predict_batch(X) becomes available. This is called automatically by update_native_model but can also be called explicitly after rule removal to refresh the arrays.
- Parameters:
rules (List[Rule], optional) – Rules to compile. Defaults to final_rules if available, else initial_rules.
feature_names (List[str], optional) – Ordered feature names matching the columns of X that will be passed to predict_batch. If None, they are inferred from the rules.
- custom_rule_removal(rules)
Placeholder for custom rule removal logic.
By default, this method performs no operations and returns the rule set unchanged. It is intended to be overridden via set_custom_rule_removal.
- static display_metrics(y_true, y_pred, correct, total, file=None, class_names=None)
Computes and displays classification performance metrics.
Calculates accuracy, precision, recall, F1 score, specificity, and displays the confusion matrix (using class names if available).
- Parameters:
y_true (List[int]) – True class labels.
y_pred (List[int]) – Predicted class labels.
correct (int) – Count of correct predictions.
total (int) – Total count of predictions.
file (Optional[TextIO]) – Output file object.
class_names (Optional[List[str]]) – List of class names for display.
- edit_rules()
Starts an interactive prompt in the terminal to allow manual editing of rules.
The user can list, select, and modify the conditions or the class of a rule. CRITICAL: Updates the native compiled model immediately upon saving changes.
- execute_rule_analysis(file_path, remove_duplicates='none', remove_below_n_classifications=-1)
Executes a full rule evaluation and pruning process on a given dataset.
This method:
- Applies optional duplicate rule removal (iteratively until convergence).
- Recompiles the native Python model for speed.
- Runs evaluation using the appropriate algorithm via DTAnalyzer/RFAnalyzer/GBDTAnalyzer.
- Optionally removes rules whose usage count is at or below a given threshold.
- Tracks and prints redundancy metrics by type.
- Parameters:
file_path (str) – Path to the CSV file containing data for evaluation.
remove_duplicates (str) – Strategy (“soft”, “hard”, “custom”, “none”).
remove_below_n_classifications (int) – Threshold for pruning low-usage rules.
- execute_rule_analysis_dt(file_path, remove_below_n_classifications=-1)
Delegates to DTAnalyzer. Kept for backward compatibility.
- Parameters:
file_path (str) – Path to the CSV file.
remove_below_n_classifications (int) – Minimum usage count threshold.
- execute_rule_analysis_rf(file_path, remove_below_n_classifications=-1)
Delegates to RFAnalyzer. Kept for backward compatibility.
- Parameters:
file_path (str) – Path to the CSV file.
remove_below_n_classifications (int) – Minimum usage count threshold.
- export_to_binary(filepath: str = 'model.bin') None
Export the compiled tree arrays to a compact binary file.
- File format (all little-endian):
- Header:
4 bytes magic: b'PYRA'
1 byte version: 1
1 byte algorithm_type: 0=DT, 1=RF, 2=GBDT
2 bytes n_features (uint16)
2 bytes n_classes (uint16)
2 bytes n_trees (uint16)
4 bytes default_class (int32)
- For GBDT additionally:
1 byte is_binary (bool)
2 bytes n_gbdt_classes (uint16)
For each gbdt_class: 4 bytes class_label (int32)
For each gbdt_class: 8 bytes init_score (float64)
- Feature names, for each feature:
2 bytes name_len (uint16)
N bytes name (utf-8)
- For each tree:
4 bytes n_nodes (int32)
For GBDT: 1 byte is_init (bool); 4 bytes class_group (int32, only if not init); if is_init: 8 bytes init_score (float64), then continue to the next tree.
feature_idx: n_nodes * 4 bytes (int32)
threshold: n_nodes * 8 bytes (float64)
children_left: n_nodes * 4 bytes (int32)
children_right: n_nodes * 4 bytes (int32)
value: DT: n_nodes * 4 bytes (int32); RF: n_nodes * n_classes * 8 bytes (float64); GBDT: n_nodes * 8 bytes (float64)
- Parameters:
filepath (str) – Output file path.
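To make the header byte offsets concrete, the fixed-size portion can be packed with the standard struct module. Field order follows the documented format; this is a sketch, not the library's writer.

```python
# Pack the 16-byte little-endian header described in the format above.
import struct

def pack_header(algorithm_type, n_features, n_classes, n_trees, default_class):
    # '<' disables padding: 4s + B + B + H + H + H + i = 16 bytes
    return struct.pack("<4sBBHHHi", b"PYRA", 1, algorithm_type,
                       n_features, n_classes, n_trees, default_class)

header = pack_header(algorithm_type=1, n_features=4, n_classes=3,
                     n_trees=10, default_class=0)
```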
- export_to_c_header(filepath: str = 'model.h', guard_name: str = 'PYRULEANALYZER_MODEL_H') None
Export the compiled tree arrays as a standalone C header file.
The generated header contains const arrays suitable for Arduino / embedded targets. It includes a predict(const float *features) function that traverses the trees and returns the predicted class.
- Parameters:
filepath (str) – Output file path.
guard_name (str) – Include-guard macro name.
- export_to_native_python(feature_names=None, filename='examples/files/fast_classifier.py')
Generates a standalone Python file with the decision logic.
For Decision Trees, it exports a single nested if/else function. For Random Forests, it exports multiple functions (one per tree) and a voting aggregator.
- Parameters:
feature_names (List[str], optional) – Kept for compatibility.
filename (str) – Output filename.
- find_duplicated_rules(type='soft')
Identifies nearly identical rules within the same decision tree context.
This method searches for rule pairs that:
- Have the same class label.
- Share all conditions except the last one (same path parent).
- Differ only in the final condition boundary (e.g., v1 <= 5 vs v1 > 5).
Such pairs are considered duplicates because they imply the split at that boundary was unnecessary for the final classification outcome.
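The sibling-pair check can be sketched as below, assuming rules are (conditions, label) pairs with (variable, operator, value) condition tuples; the predicate names are illustrative.

```python
# Two rules are merge candidates when they predict the same class, share
# every condition but the last, and the last conditions are complementary
# boundaries on the same variable and value.

def is_sibling_duplicate(rule_a, rule_b):
    (conds_a, label_a), (conds_b, label_b) = rule_a, rule_b
    if label_a != label_b or len(conds_a) != len(conds_b):
        return False
    if conds_a[:-1] != conds_b[:-1]:  # same path parent
        return False
    (va, opa, xa), (vb, opb, xb) = conds_a[-1], conds_b[-1]
    return va == vb and xa == xb and {opa, opb} == {"<=", ">"}

r1 = ([("v2", ">", 1.0), ("v1", "<=", 5.0)], 1)
r2 = ([("v2", ">", 1.0), ("v1", ">", 5.0)], 1)
```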
- find_duplicated_rules_between_trees()
Identifies semantically similar rules between different trees in the forest.
This method compares rules across the full rule set to find groups that:
- Use the same set of variables, values, and logical operators.
- Belong to the same target class.
Optimization: Uses cached ‘parsed_conditions’ and sorting to ensure condition order doesn’t affect detection (A and B == B and A).
- Returns:
A list of groups, where each group is a list of similar rules.
- Return type:
List[List[Rule]]
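The order-insensitive grouping can be sketched with a sorted-condition signature, so "A and B" lands in the same bucket as "B and A". Rule layout is assumed as (conditions, label) pairs.

```python
# Bucket rules by (sorted conditions, class); buckets with more than one
# member are inter-tree duplicate groups.
from collections import defaultdict

def group_inter_tree_duplicates(rules):
    buckets = defaultdict(list)
    for conditions, label in rules:
        key = (tuple(sorted(conditions)), label)
        buckets[key].append((conditions, label))
    return [grp for grp in buckets.values() if len(grp) > 1]

rules = [
    ([("v1", "<=", 0.5), ("v2", ">", 1.0)], 0),  # tree 1
    ([("v2", ">", 1.0), ("v1", "<=", 0.5)], 0),  # tree 2, reordered conditions
    ([("v3", ">", 2.0)], 1),
]
groups = group_inter_tree_duplicates(rules)
```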
- static generate_classifier_model(rules, class_names_map, algorithm_type='Random Forest')
Instantiates a RuleClassifier from extracted rules and saves it.
Optimization: This method now passes the Rule objects directly to the constructor, avoiding the intermediate string serialization that caused floating-point precision loss.
- Parameters:
rules (List[Rule]) – The extracted Rule objects.
class_names_map – Mapping from class indices to class name strings.
algorithm_type (str) – Type of model, e.g. ‘Decision Tree’ or ‘Random Forest’.
- Returns:
The initialized classifier.
- Return type:
RuleClassifier
- static get_gbdt_rules(model, feature_names, class_names)
Extracts rules from a trained GradientBoostingClassifier into standard Rule objects.
Each leaf in each boosting tree becomes a Rule with GBDT-specific metadata (leaf_value, learning_rate, contribution, class_group). An additional init rule (no conditions) is created per class group to represent the initial score from the prior estimator.
- Parameters:
model (GradientBoostingClassifier) – A fitted sklearn GBDT model.
feature_names (List[str]) – Feature names used during training.
class_names (List[str]) – Class label strings.
- Returns:
List of all Rule objects (init rules + tree rules).
Dict mapping class_group -> init_score.
Whether the model is binary classification.
List of class label strings.
- Return type:
Tuple[List[Rule], Dict[str, float], bool, List[str]]
- static get_rules(tree, feature_names, class_names)
Extracts human-readable decision rules from a scikit-learn DecisionTreeClassifier.
This method traverses the tree structure to generate logical condition paths from root to leaf.
Optimization: It builds both the string representation (for display) and the parsed tuple representation (for calculation) simultaneously. This avoids re-parsing strings later and preserves exact floating-point precision.
- Parameters:
tree (DecisionTreeClassifier) – A trained scikit-learn decision tree model.
feature_names (List[str]) – A list of feature names corresponding to the tree input features.
class_names (List[str]) – A list of class names corresponding to output labels.
- Returns:
A list of extracted Rule objects.
- Return type:
List[Rule]
- static get_tree_rules(model, feature_names, class_names, algorithm_type='Random Forest')
Extracts rules from a trained scikit-learn model (Decision Tree or Random Forest).
For Decision Trees, this returns one list of rules. For Random Forests, it returns a list of lists (one list of rules per estimator).
- Parameters:
model (Union[DecisionTreeClassifier, RandomForestClassifier]) – The trained model.
feature_names (List[str]) – List of feature names.
class_names (List[str]) – List of class names.
algorithm_type (str) – Type of model; either ‘Decision Tree’ or ‘Random Forest’.
- Returns:
The extracted rules.
- Return type:
Union[List[Rule], List[List[Rule]]]
- static load(path)
Loads a saved RuleClassifier model from a pickle (.pkl) file.
- Parameters:
path (str) – Path to the .pkl file.
- Returns:
The loaded classifier instance.
- Return type:
RuleClassifier
- classmethod load_binary(filepath: str) RuleClassifier
Load a RuleClassifier from a binary file created by export_to_binary().
The returned object supports predict_batch() but NOT rule-level operations (no Rule objects are reconstructed).
- Parameters:
filepath (str) – Path to the .bin file.
- Returns:
A classifier ready for predict_batch().
- Return type:
RuleClassifier
- static new_classifier(train_path, test_path, model_parameters, model_path=None, algorithm_type='Random Forest')
Orchestrates the creation of a new classifier from scratch.
Pipeline:
1. Load and process data (using the robust CSV loader).
2. Train (or load) the scikit-learn model.
3. Evaluate the scikit-learn model (benchmark).
4. Extract rules from the tree(s).
5. Generate the RuleClassifier.
- Parameters:
train_path (str) – Path to training CSV.
test_path (str) – Path to testing CSV.
model_parameters (dict) – Arguments for the sklearn classifier.
model_path (Optional[str]) – Path to existing .pkl model to skip training.
algorithm_type (str) – ‘Random Forest’, ‘Decision Tree’, or ‘Gradient Boosting Decision Trees’.
- Returns:
The final rule-based model.
- Return type:
RuleClassifier
- static parse_conditions_static(conditions)
Parses condition strings into structured tuples (variable, operator, value).
- Parameters:
conditions (List[str]) – List of condition strings (e.g., “v1 <= 0.5”).
- Returns:
A list of parsed conditions.
- Return type:
List[Tuple[str, str, float]]
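A plausible parser for such condition strings is sketched below. The regex and function name are assumptions; only the output shape, List[Tuple[str, str, float]], matches the documented contract.

```python
# Parse strings like "v1 <= 0.5" into (variable, operator, value) tuples.
import re

_COND_RE = re.compile(
    r"\s*(\w+)\s*(<=|>=|<|>|==)\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*"
)

def parse_conditions(conditions):
    parsed = []
    for cond in conditions:
        m = _COND_RE.fullmatch(cond)
        if not m:
            raise ValueError(f"unparsable condition: {cond!r}")
        var, op, val = m.groups()
        parsed.append((var, op, float(val)))
    return parsed

parsed = parse_conditions(["v1 <= 0.5", "v7 > 1e-3"])
```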
- parse_rules(rules, algorithm_type)
Parses the input rules into a list of Rule objects.
- Parameters:
rules – The raw rules input (list, dict, or string).
algorithm_type (str) – The algorithm type.
- Returns:
A list of parsed Rule objects.
- Return type:
List[Rule]
- predict_batch(X: ndarray, feature_names: list | None = None, use_final: bool = True) ndarray
Vectorized batch prediction over a numpy array.
This is the high-performance prediction method. It uses the pre-compiled tree arrays (from compile_tree_arrays) and numpy vectorized traversal.
- Parameters:
X (np.ndarray) – Input data, shape (n_samples, n_features). Columns must be ordered to match the feature_names used during compile_tree_arrays (or the feature_names argument here).
feature_names (list, optional) – Feature names corresponding to columns of X. If provided and different from compiled order, X columns are reordered accordingly.
use_final (bool) – If True, uses arrays compiled from final_rules. If False, uses arrays compiled from initial_rules.
- Returns:
Predicted class labels, shape (n_samples,), dtype int.
- Return type:
np.ndarray
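Vectorized traversal over the compiled arrays can be sketched as below: all samples step down the tree in lockstep until every active node is a leaf. Array names follow the binary-format section (with -1 marking a leaf child); the function itself is illustrative, not the library's implementation.

```python
# Vectorized single-tree traversal; returns the leaf index per sample.
import numpy as np

def traverse_tree(X, feature_idx, threshold, children_left, children_right):
    node = np.zeros(len(X), dtype=np.int64)   # every sample starts at the root
    while True:
        internal = children_left[node] != -1  # -1 marks a leaf
        if not internal.any():
            return node
        idx = np.where(internal)[0]
        f = feature_idx[node[idx]]
        go_left = X[idx, f] <= threshold[node[idx]]
        node[idx] = np.where(go_left,
                             children_left[node[idx]],
                             children_right[node[idx]])

# Stump: node 0 splits on feature 0 at 0.5; nodes 1 and 2 are leaves.
leaves = traverse_tree(np.array([[0.2], [0.9]]),
                       np.array([0, -1, -1]), np.array([0.5, 0.0, 0.0]),
                       np.array([1, -1, -1]), np.array([2, -1, -1]))
```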
- predict_batch_proba(X: ndarray, feature_names: list | None = None) ndarray
Vectorized batch probability prediction.
- Parameters:
X (np.ndarray) – Input data, shape (n_samples, n_features).
feature_names (list, optional) – Feature names for column reordering.
- Returns:
- Predicted probabilities, shape (n_samples, n_classes).
For DT, this is a one-hot array. For RF, this is the averaged probability across trees. For GBDT binary, shape (n_samples, 2) with sigmoid probabilities. For GBDT multiclass, shape (n_samples, n_classes) with softmax probabilities.
- Return type:
np.ndarray
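The multiclass GBDT probabilities mentioned above come from a softmax over per-class raw scores; a numerically stable sketch (not the library's code):

```python
# Numerically stable softmax over rows of raw class scores.
import numpy as np

def softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)  # shift for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

proba = softmax(np.array([[0.1, 1.2, -0.3]]))
```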
- static process_data(train_path, test_path, is_test_only=False)
Loads and preprocesses training and testing datasets from CSV files.
Handles header detection automatically. If no header is found, columns are named v1, v2, …, class. Performs Label Encoding on categorical features to ensure all data is numeric for the classifier.
- Parameters:
train_path (str) – Path to training CSV (ignored if is_test_only=True).
test_path (str) – Path to testing CSV.
is_test_only (bool) – If True, skips loading training data and returns None for train artifacts.
- Returns:
Tuple containing (X_train, y_train, X_test, y_test, class_names, target_column, feature_names).
- set_custom_rule_removal(custom_function)
Allows the user to override the rule removal logic by employing their own implementation.
This enables the injection of external logic to handle rule pruning or duplicate detection according to specific domain needs.
- Parameters:
custom_function (Callable[[List[Rule]], Tuple[List[Rule], List[Tuple[Rule, Rule]]]]) – A callback function that takes a list of Rule instances as an argument and returns a tuple containing: 1. A new list of rules after processing (filtered). 2. A list of pairs of rules identified as duplicates/removed.
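An example of the documented callback contract: take a list of rules, return (kept_rules, removed_pairs). The toy filter below drops rules deeper than a threshold; (conditions, label) pairs stand in for Rule objects, and the usage line is hypothetical.

```python
# Toy custom-removal callback matching the documented signature shape.

def drop_deep_rules(rules, max_depth=3):
    kept, removed = [], []
    for rule in rules:
        conditions, _label = rule
        if len(conditions) <= max_depth:
            kept.append(rule)
        else:
            removed.append((rule, rule))  # pair format kept for API symmetry
    return kept, removed

# classifier.set_custom_rule_removal(drop_deep_rules)  # hypothetical usage
kept, removed = drop_deep_rules([
    ([("v1", "<=", 1.0)], 0),
    ([("v1", ">", 1.0)] * 5, 1),
])
```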
- update_native_model(rules_to_compile)
Compiles a list of Rules into an optimized in-memory Python function.
This method rebuilds the decision tree structure from linear rules and uses exec() to create a predict(sample) function that runs at native Python speed, bypassing slow Rule object iteration.
- Parameters:
rules_to_compile (List[Rule]) – The rules to be compiled.
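The exec()-compilation idea can be sketched as follows: render the rules as a chain of ifs, compile once, and call the resulting function directly, avoiding per-rule object iteration. The code generation details are illustrative, not the library's generator.

```python
# Compile (conditions, label) rules into a single predict(sample) function.

def compile_rules(rules):
    lines = ["def predict(sample):"]
    for conditions, label in rules:
        test = " and ".join(
            f"sample[{var!r}] {op} {val!r}" for var, op, val in conditions
        ) or "True"
        lines.append(f"    if {test}:")
        lines.append(f"        return {label!r}")
    lines.append("    return None")
    namespace = {}
    exec("\n".join(lines), namespace)  # build the function once, up front
    return namespace["predict"]

predict = compile_rules([([("v1", "<=", 0.5)], 0), ([("v1", ">", 0.5)], 1)])
```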