DDoS

This guide walks through the process of pruning a random forest model using data from the DDoS evaluation dataset (CIC-DDoS2019) repository.

Prerequisites

For this example we will use the UDPLag.csv dataset, a large file containing realistic traffic data with multiclass targets. It is included in CSV-03-11.zip, downloadable from the repository above.

To start off, follow the instructions specified in the prerequisites section of the usage guide.

Prepare Your Dataset

As specified in the usage guide, the RuleClassifier new_classifier method expects a dataset split into two files: train.csv and test.csv. The dataset must also be formatted with the following characteristics (a quick sanity check for these requirements is sketched after the preparation script below):

  • Each row represents a single sample.

  • The last column is the target class label.

  • All other columns are feature values.

  • All values and classes must be finite numbers, so be sure to include an encoder in your pipeline if you have string data.

We can use the following script to apply an encoder to the string columns, remove infinite values, and split the data. Be sure to adapt it to your current pipeline as needed:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Load the dataset
df = pd.read_csv("UDPLag.csv")

# String columns that will be encoded into integers
string_cols = ['Flow ID', ' Source IP', ' Destination IP', ' Timestamp', 'SimillarHTTP']

# Dropping entries with infinite values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)

# Encode the string feature columns into integers
feature_encoder = OrdinalEncoder().fit(df[string_cols].astype('str'))
df[string_cols] = feature_encoder.transform(df[string_cols].astype('str'))

# Encode the target labels into integers
label_encoder = LabelEncoder().fit(df[' Label'])
df[' Label'] = label_encoder.transform(df[' Label'])

# Split into features and target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split into training and testing sets (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Save to CSV files without index
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
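
With train.csv and test.csv written, a quick sanity check can confirm the format requirements listed earlier. This is a minimal sketch, not part of the pyruleanalyzer API:

import numpy as np
import pandas as pd

# Load the freshly written training split
train_df = pd.read_csv("train.csv")

# Every value, including the class label, must be a finite number;
# to_numpy(dtype=float) will raise if any strings slipped through
assert np.isfinite(train_df.to_numpy(dtype=float)).all()

# The last column is the target, all preceding columns are features
print(f"{train_df.shape[1] - 1} features, target column: {train_df.columns[-1]!r}")

Keep feature_encoder and label_encoder in scope: they are reused below to encode new samples and to decode predictions.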

Training the Forest and Extracting Its Rules

The new_classifier method from RuleClassifier trains a scikit-learn model, extracts its rules, and creates a new RuleClassifier instance. It expects the paths to the newly created CSV files, an algorithm type (either “Decision Tree” or “Random Forest”), and the model parameters to be passed to the respective scikit-learn model.

from pyruleanalyzer import RuleClassifier

# Define the model parameters
model_params = {"max_depth": 5, "n_estimators": 100}

# Create a RuleClassifier instance
classifier = RuleClassifier.new_classifier(
    train_path="train.csv",
    test_path="test.csv",
    model_parameters=model_params,
    algorithm_type="Random Forest"
)
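
The same call also works for a single decision tree by switching the algorithm type. A sketch, dropping n_estimators since it only applies to forests:

# Single-tree variant of the call above
dt_classifier = RuleClassifier.new_classifier(
    train_path="train.csv",
    test_path="test.csv",
    model_parameters={"max_depth": 5},
    algorithm_type="Decision Tree"
)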

Pruning

With the RuleClassifier instance in hand, we can now run a rule analysis with the execute_rule_analysis method, which refines the forest by removing duplicate rules. This method expects the path to the test.csv file and a duplicate removal method, which can be one of:

  • “soft”: removes duplicate rules within a single tree.

  • “hard”: deletes duplicate rules across distinct trees; only applicable to random forest models.

  • “custom”: uses a custom function previously defined with the set_custom_rule_removal method.

  • “none”: does not remove any rules.

You may also optionally remove rules based on classification count with the remove_below_n_classifications parameter, which deletes rules that classify n or fewer entries (disabled by default).

classifier.execute_rule_analysis(
    file_path="test.csv",
    remove_duplicates="soft"
)
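
The optional threshold can be combined with any removal method. A sketch; the value 5 is purely illustrative:

# Additionally remove rules that classify 5 or fewer entries
classifier.execute_rule_analysis(
    file_path="test.csv",
    remove_duplicates="soft",
    remove_below_n_classifications=5
)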

Since this is a large dataset and the algorithm goes through many iterative steps to ensure no new duplicate rules are accidentally created during pruning, the analysis may take a long time to complete, especially with the “hard” removal method.

Using the Model

To classify new entries with the refined model, use the classify method with the final parameter set to True. This forces the RuleClassifier instance we just trained to use the rule set generated after pruning. If your dataset didn’t include a header row, you must name your features as “v{column}”, where column is the column index in the CSV.
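
For a headerless dataset, the sample keys would follow that naming scheme instead. A sketch with placeholder values:

# Hypothetical headerless sample: keys use the "v{column}" scheme
sample = {"v0": 0.0, "v1": 3.0, "v2": 1.0}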

# Replace with actual values of your dataset
sample = {"Flow ID": "172.16.0.5-192.168.50.4-35468-49856-17", " Source IP": "172.16.0.5", ..., " Inbound": 1}

# Encode only the string columns, reusing the encoder fit during preparation
sample_df = pd.DataFrame([sample])
sample_df[string_cols] = feature_encoder.transform(sample_df[string_cols].astype('str'))
encoded_sample = sample_df.iloc[0].to_dict()

predicted_class, votes, probabilities = classifier.classify(encoded_sample, final=True)

# Map the numeric prediction back to its original label
# (inverse_transform expects array-like input, so wrap the prediction in a list)
actual_class = label_encoder.inverse_transform([predicted_class])[0]