Augmenting Zeek-Based Detection with Machine Learning to Identify Beaconing and C2 Recon Traffic¶

Command-and-control (C2) traffic is a hallmark of many cyberattacks, enabling remote adversaries to maintain persistence and control over compromised systems. In this study, we propose a supervised machine learning (ML) approach that augments Zeek network traffic logs to detect C2-related behavior at the connection level, using labeled reconnaissance traffic as a proxy for malicious C2 activity. This can be used to automate the recognition and handling of suspicious connections without requiring manual intervention.

Dataset¶

The data consists of ~2 million MITRE ATT&CK-labeled network connection records produced by Zeek, an open-source network security monitoring (NSM) tool used for packet inspection and analysis.

The Zeek log data used in this analysis comes from UWF (University of West Florida); the dataset and subsequent work on it can be viewed and downloaded here: https://datasets.uwf.edu/

The 'label' column reflects whether a Zeek log entry was associated with beaconing/C2 behavior or with normal activity. We apply machine learning to detect such traffic from Zeek's native connection fields (duration, bytes, conn_state, service, etc.). While this analysis uses labeled data for supervised learning, in a real-world SOC the same approach could flag suspicious connections for analyst review or feed into SIEM enrichment, as sketched below.
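A minimal sketch of that workflow, assuming a model already trained on the connection-level features described below (the function name, feature_columns argument, and 0.9 threshold are illustrative assumptions, not part of this notebook):

import pandas as pd

def flag_suspicious(conn_df: pd.DataFrame, trained_model, feature_columns, threshold=0.9):
    # Hypothetical helper: score each Zeek connection record with a trained classifier
    # and keep only the rows whose predicted recon/C2 probability exceeds the threshold.
    scores = trained_model.predict_proba(conn_df[feature_columns])[:, 1]
    flagged = conn_df.assign(c2_score=scores)
    return flagged[flagged['c2_score'] >= threshold].sort_values('c2_score', ascending=False)

The flagged rows, together with their scores, could then be routed to an analyst queue or attached to SIEM events as enrichment.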

Column Explanations (Relevant for EDA & Modeling)¶

| Column | Meaning | Usage |
| --- | --- | --- |
| duration | Duration of the connection (seconds) | Used in features list |
| orig_bytes | Bytes sent from source to destination | Used in features list |
| resp_bytes | Bytes sent from destination to source | Used in features list |
| orig_pkts | Packets sent from source to destination | Used in features list |
| resp_pkts | Packets sent from destination to source | Used in features list |
| src_ip, dest_ip | IPs involved in the connection | Not used in features list |
| src_port, dest_port | Network ports used | Used in features list |
| service | Application-layer protocol identified (e.g., HTTP, FTP) | Used in features list (encoded) |
| protocol | Transport-layer protocol (TCP/UDP) | Used in features list (encoded) |
| conn_state | Zeek connection state (S0, SF, REJ, etc.) | Used in features list (encoded) |
| history | Sequence of connection flags (SYNs, ACKs, etc.) | Not used in features list |
| mitre_attack_tactics | 'none' for benign, 'Reconnaissance' for attack | Used to create 'label' |
| datetime, ts | Timestamp information | Not used in features list |
| label | 0 for benign (mitre_attack_tactics = 'none'), 1 for attack (mitre_attack_tactics = 'Reconnaissance') | Target (y) variable |

Methods & Results Discussion¶

We used this labeled Zeek log dataset containing benign and reconnaissance traffic to build a supervised classification pipeline. After cleaning the data and extracting features such as packet and byte counts, duration, ports, and protocol indicators, we split the data into training and test sets using stratified sampling. Numerical features were scaled where necessary, and categorical network metadata (e.g., protocol, service, connection state) was one-hot encoded. Three models were trained: Logistic Regression, Random Forest, and XGBoost, each configured to handle class imbalance. Random Forest performed best, achieving a ROC AUC of 0.9999, followed closely by XGBoost at 0.9996, while Logistic Regression achieved 0.9910.

The Random Forest also attained near-perfect precision, recall, and F1-score, indicating its robustness in classifying early-stage C2 behavior and demonstrating the feasibility of integrating ML into network intrusion detection workflows.

Further research could target network traffic at the host level, aggregating connection entries by individual source IP address; a rough sketch of that aggregation follows.
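A hypothetical sketch of what such host-level aggregation could look like, assuming the same column names used in the connection-level data below:

host_features = (
    df.groupby('src_ip')
      .agg(
          n_conns=('uid', 'count'),               # connection volume per source
          n_dest_ips=('dest_ip', 'nunique'),      # fan-out across destination hosts
          n_dest_ports=('dest_port', 'nunique'),  # fan-out across destination ports
          mean_duration=('duration', 'mean'),
          mean_orig_bytes=('orig_bytes', 'mean'),
          recon_rate=('label', 'mean'),           # share of a host's connections labeled recon
      )
      .reset_index()
)

Each row would then describe a source host rather than a single connection, which is closer to how an analyst reasons about a potentially compromised machine.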

In [70]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.metrics import roc_curve

import warnings
warnings.filterwarnings('ignore')

# filenames
f1 = 'part-00000-5b4f5c3f-e8a9-4020-8fa1-e8985f7c27f3-c000.csv'
f2 = 'part-00000-95e0a460-e7c5-4b35-8367-c2e6fbbcf9e1-c000.csv'

df = pd.concat([pd.read_csv(f1), pd.read_csv(f2)], ignore_index=True)
print(df.shape)  
df.columns
(2000981, 23)
Out[70]:
Index(['resp_pkts', 'service', 'orig_ip_bytes', 'local_resp', 'missed_bytes',
       'protocol', 'duration', 'conn_state', 'dest_ip', 'orig_pkts',
       'community_id', 'resp_ip_bytes', 'dest_port', 'orig_bytes',
       'local_orig', 'datetime', 'history', 'resp_bytes', 'uid', 'src_port',
       'ts', 'src_ip', 'mitre_attack_tactics'],
      dtype='object')
In [3]:
df.isnull().sum()
Out[3]:
resp_pkts                    0
service                  48335
orig_ip_bytes                0
local_resp                   0
missed_bytes                 0
protocol                     0
duration                132372
conn_state                   0
dest_ip                      0
orig_pkts                    0
community_id                 0
resp_ip_bytes                0
dest_port                    0
orig_bytes              132372
local_orig                   0
datetime                     0
history                  18267
resp_bytes              132372
uid                          0
src_port                     0
ts                           0
src_ip                       0
mitre_attack_tactics         0
dtype: int64
In [4]:
df.replace([float('inf'), float('-inf')], pd.NA, inplace=True)
df.dropna(inplace=True)
In [5]:
# Binary label: 1 = attack, 0 = benign
df['label'] = df['mitre_attack_tactics'].apply(lambda x: 1 if x == 'Reconnaissance' else 0)

# Preview distribution
print(df['mitre_attack_tactics'].unique())
print(df['label'].value_counts())
['Reconnaissance' 'none']
label
0    1432660
1     414720
Name: count, dtype: int64

"Reconnaissance" refers to a phase in the cyber attack cycle where the attackers are gathering information about a network. The Dataset includes about a 3 to 1 split of benign vs attack or 'recon' logs.

Reconnaissance typically includes:¶

- Port scanning (e.g., using Nmap) to identify open ports and services on a host
- IP sweeps to check which hosts are alive on a network
- Service banner grabbing and probing for software versions or OS fingerprints
- DNS queries or reverse lookups to map domain names and internal assets

Zeek Logs Indicating Recon Activity in this dataset:¶

- Many short-lived connections from a single IP
- Sequential attempts across multiple ports or hosts
- Unusual connection states (conn_state values like S0, REJ)
- Frequent failed connections or very short durations
- Small payloads at high frequency

A quick per-source sketch of these signals is shown below.
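These indicators can be surfaced directly from the fields above. An illustrative pandas sketch (not part of the modeling pipeline below; thresholds and column groupings are assumptions):

scan_indicators = (
    df.groupby('src_ip')
      .agg(
          distinct_ports=('dest_port', 'nunique'),                             # ports probed per source
          distinct_hosts=('dest_ip', 'nunique'),                               # hosts touched per source
          failed_frac=('conn_state', lambda s: s.isin(['S0', 'REJ']).mean()),  # unanswered/rejected attempts
          median_duration=('duration', 'median'),                              # short-lived connections
      )
      .sort_values('distinct_ports', ascending=False)
)
print(scan_indicators.head(10))  # sources probing many ports/hosts with mostly failed, short connections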

In [7]:
# Visualize durations (log scale) by label
sns.histplot(data=df, x='duration', hue='label', log_scale=True)
plt.show()

# Byte distributions by label
sns.boxplot(data=df, x='label', y='orig_bytes')
plt.show()
sns.boxplot(data=df, x='label', y='resp_bytes')
plt.show()

# Protocol usage by label
print(df.groupby('protocol')['label'].value_counts(normalize=True).unstack())
label            0         1
protocol                    
tcp       1.000000       NaN
udp       0.775509  0.224491
[Figure: duration histogram and byte-count boxplots by label]

Benign traffic (label 0) spans a wide range of resp_bytes, with a large concentration in the 500–30,000 byte range and notable spikes around ~15k and ~25k bytes. This suggests that normal user activity often involves substantial data being returned (e.g., web pages, downloads).

Reconnaissance traffic (label 1) shows a much tighter cluster: mostly below 25,000 bytes, more narrowly distributed, and often with very small payloads.

Recon traffic typically sends probes or requests but does not receive large responses, consistent with port scans, pings, or failed requests. Benign traffic produces richer responses, with larger and more varied resp_bytes. This indicates resp_bytes is a discriminative feature that can help the model separate recon from normal activity; a quick per-label summary is sketched below.
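A quick numeric check of that claim (a small summary not included in the original outputs):

# Per-label summary of response bytes (label 0 = benign, 1 = recon)
print(df.groupby('label')['resp_bytes'].describe(percentiles=[0.5, 0.9]))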

In [57]:
# Numerical features to inspect (assumed list; this definition was missing from the
# notebook and is inferred from the five distributions plotted below)
numerical_features = ['duration', 'orig_bytes', 'resp_bytes', 'orig_pkts', 'resp_pkts']

# Add log-transformed versions of skewed features
for feature in numerical_features:
    df[f'log1p_{feature}'] = np.log1p(df[feature])

# Plot violin plots for log-transformed features
log_features = [f'log1p_{f}' for f in numerical_features]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 12))
axes = axes.flatten()

for i, feature in enumerate(log_features):
    sns.violinplot(data=df, x='label', y=feature, ax=axes[i], inner='quartile', linewidth=1)
    axes[i].set_title(f'Log Distribution of {feature[6:]} by Label')  # Strip 'log1p_' prefix

# Remove empty subplot if needed
fig.delaxes(axes[-1])
plt.tight_layout()
plt.show()
[Figure: violin plots of log1p-transformed duration, orig_bytes, resp_bytes, orig_pkts, and resp_pkts by label]

duration, resp_bytes, and resp_pkts show heavier tails in normal traffic (label = 0), while malicious flows tend to be more compact.

orig_bytes and orig_pkts are more tightly centered in the malicious class (label = 1), suggesting attackers may be sending more consistent packet sizes and counts, which is typical of beaconing/C2; a rough per-source consistency check is sketched below.
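A hypothetical way to check that consistency per source host (the minimum connection count of 20 and the coefficient-of-variation threshold of 0.1 are arbitrary illustrative values):

# Coefficient of variation of outbound bytes per source IP: low values suggest
# very uniform request sizes, one hallmark of beacon-like behavior.
per_src = df.groupby('src_ip')['orig_bytes'].agg(['count', 'mean', 'std'])
per_src['cv'] = per_src['std'] / per_src['mean'].replace(0, np.nan)
candidates = per_src[(per_src['count'] >= 20) & (per_src['cv'] < 0.1)]
print(candidates.sort_values('count', ascending=False).head())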

In [10]:
# Protocol usage by label
protocol_dist = df.groupby('protocol')['label'].value_counts(normalize=True).unstack()
protocol_dist.plot(kind='bar', stacked=True, figsize=(8, 4))
plt.title('Protocol Usage by Label')
plt.ylabel('Proportion')
plt.xlabel('Protocol')
plt.legend(title='Label (0=Benign, 1=Recon)')
plt.tight_layout()
plt.show()

# Connection state usage by label
conn_state_dist = df.groupby('conn_state')['label'].value_counts(normalize=True).unstack()
conn_state_dist.plot(kind='bar', stacked=True, figsize=(8, 4))
plt.title('Connection State Usage by Label')
plt.ylabel('Proportion')
plt.xlabel('Connection State')
plt.legend(title='Label (0=Benign, 1=Recon)')
plt.tight_layout()
plt.show()

# Network services by label
top_services = df['service'].value_counts().head(10).index
service_dist = df[df['service'].isin(top_services)].groupby('service')['label'].value_counts(normalize=True).unstack()
service_dist.plot(kind='bar', stacked=True, figsize=(8, 4))
plt.title('Network Services by Label')
plt.ylabel('Proportion')
plt.xlabel('Service')
plt.legend(title='Label (0=Benign, 1=Recon)')
plt.tight_layout()
plt.show()
[Figures: stacked bar charts of protocol, connection state, and top-10 service usage by label]

Protocol Usage:¶

TCP (Transmission Control Protocol): Reliable, connection-oriented, ensures packet delivery. Used in most web traffic (e.g., HTTP, HTTPS, SSL). Less common for beaconing/C2 due to overhead and visibility in session-based monitoring.

UDP (User Datagram Protocol): Connectionless, fast, minimal overhead. Used by services like DNS, NTP, and DHCP. More associated with malicious traffic in the data because attackers often exploit lightweight UDP for covert channels, port scanning, and beaconing due to its lack of session tracking and ease of spoofing.

Connection States:¶

These are from Zeek logs, reflecting the outcome of a connection attempt:

SF (State: normal establishment and termination): Indicates a successfully completed session. Frequent in both benign and malicious traffic; attackers that complete connections for C2 or data exfiltration will also show up as SF.

SHR (State: responder sent a SYN-ACK followed by a FIN, with no SYN seen from the originator): A half-open exchange in which Zeek never observed the originator's side of the handshake. In this dataset it is highly associated with reconnaissance activity such as scanning, where many probes are sent but few full connections are ever established.

S0 (State: connection attempt seen, no reply): The originator's SYN goes unanswered. Common in scanning and failed connection attempts, and strongly associated with reconnaissance where the target silently ignores the probe rather than responding.

Services:¶

These refer to the application-layer protocols detected:

DHCP (Dynamic Host Configuration Protocol): Assigns IP addresses in a network. Not usually attacker-controlled but may show up in anomalies if rogue devices request or manipulate DHCP responses to blend in or perform network mapping.

DNS (Domain Name System): Resolves domain names to IPs. Top malicious service in the data — commonly used in DNS tunneling, beaconing, and C2. Attackers abuse it to exfiltrate data via DNS queries or maintain stealthy communication.

NTP (Network Time Protocol): Synchronizes system clocks. Malicious use can arise when bots check in with precise timing or use NTP as a timing channel.

SSL (Secure Sockets Layer): Encrypted TCP sessions (e.g., HTTPS). Often benign, but can hide C2 traffic, making it harder to inspect. In C2 activity, attackers may use SSL to encrypt payloads.

In [11]:
corr = df[numerical_features + ['label']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=False, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()
[Figure: correlation heatmap of numerical features and label]
In [12]:
# One-hot encode protocol, conn_state, service
df = pd.get_dummies(df, columns=['protocol', 'conn_state', 'service'], drop_first=True)

features = [
    'duration', 'orig_bytes', 'resp_bytes',
    'orig_pkts', 'resp_pkts', 'src_port', 'dest_port',
    'protocol_udp', 'conn_state_SF', 'conn_state_SHR',
    'service_dns', 'service_ntp', 'service_ssl'
]
In [67]:
# Feature matrix and target
X = df[features]
y = df['label']

# Split the data (stratified to preserve the benign/recon ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Scale for models that need it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models
models = {
    "Logistic Regression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Random Forest": RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42),
    "XGBoost": xgb.XGBClassifier(scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(), use_label_encoder=False, eval_metric='logloss'),

}

# Train and evaluate
results = []
best_model_name = None
best_auc = 0
best_y_prob = None

for name, model in models.items():
    print(f"\n- {name} -")
    
    # Use scaled data for models that require it
    if name in ["Logistic Regression", "SVM", "KNN"]:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]

    print(classification_report(y_test, y_pred))
    auc = roc_auc_score(y_test, y_prob)
    if auc > best_auc:
        best_auc = auc
        best_model_name = name
        best_y_prob = y_prob
    print(f"ROC AUC: {auc:.4f}")

    results.append({
        "Model": name,
        "ROC AUC": auc,
        "Precision": classification_report(y_test, y_pred, output_dict=True)['1']['precision'],
        "Recall": classification_report(y_test, y_pred, output_dict=True)['1']['recall'],
        "F1 Score": classification_report(y_test, y_pred, output_dict=True)['1']['f1-score']
    })

# Show summary comparison table
comparison_df = pd.DataFrame(results).sort_values(by="ROC AUC", ascending=False)
print("\n Model Comparison:")
print(comparison_df)

fpr, tpr, _ = roc_curve(y_test, best_y_prob)
plt.plot(fpr, tpr, label=f'Best Model ROC ({best_model_name})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Best Performing Model')
plt.legend()
plt.show()
- Logistic Regression -
              precision    recall  f1-score   support

           0       1.00      0.95      0.98    358165
           1       0.86      1.00      0.92    103680

    accuracy                           0.96    461845
   macro avg       0.93      0.98      0.95    461845
weighted avg       0.97      0.96      0.96    461845

ROC AUC: 0.9910

- Random Forest -
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    358165
           1       1.00      1.00      1.00    103680

    accuracy                           1.00    461845
   macro avg       1.00      1.00      1.00    461845
weighted avg       1.00      1.00      1.00    461845

ROC AUC: 0.9999

- XGBoost -
              precision    recall  f1-score   support

           0       1.00      0.99      1.00    358165
           1       0.98      1.00      0.99    103680

    accuracy                           0.99    461845
   macro avg       0.99      1.00      0.99    461845
weighted avg       0.99      0.99      0.99    461845

ROC AUC: 0.9996

 Model Comparison:
                 Model   ROC AUC  Precision  Recall  F1 Score
1        Random Forest  0.999942   0.999085     1.0  0.999542
2              XGBoost  0.999575   0.976519     1.0  0.988120
0  Logistic Regression  0.991035   0.858555     1.0  0.923895
[Figure: ROC curve for the best performing model (Random Forest)]

Summary:¶

The results show that all three models performed well, with Random Forest achieving the highest ROC AUC. Specifically, Random Forest reached a ROC AUC of 0.9999 with near-perfect precision, recall, and F1-score for both benign and reconnaissance traffic. XGBoost followed closely with a ROC AUC of 0.9996 and strong classification metrics. Logistic Regression, while slightly less precise, still delivered a ROC AUC of 0.9910 and perfect recall on the reconnaissance class. Overall, the models, especially Random Forest, demonstrated high effectiveness in detecting reconnaissance traffic in Zeek logs.
