Detecting Insider Threats Using Unsupervised Learning on User Behavior Logs

Project Objective:

Insider threats can be a serious vulnerability within a large company. With hundreds to thousands of employees, all with varying access, there should be monitoring at the user level to identify and flag possible internal threats.

This project uses unsupervised anomaly detection (Isolation Forest) to flag user behaviors that deviate significantly from the norm and may indicate insider threats.

Isolation Forest assigns an anomaly score to each data point based on how easily it can be isolated in a random forest of binary trees. Each tree splits features at random values, and points that are far from clusters (i.e., anomalies) tend to be isolated quickly — meaning they require fewer splits (shallower tree depth). In contrast, normal points lie within dense regions and require more splits (deeper tree depth) to isolate. Thus, the average path length over all trees is used to compute an anomaly score, where shorter paths indicate higher anomaly.
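As a rough illustration of how average path length becomes a score, the sketch below follows the normalization usually given for Isolation Forest: the mean path length is divided by the expected path length of an unsuccessful binary-search-tree lookup over n points, then mapped through 2^(-x). The function names are hypothetical, and scikit-learn reports scores on a different sign convention (higher means more normal), so treat this as a sketch of the idea rather than the library's exact output.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate harmonic numbers


def average_path_length(n):
    """Expected path length c(n) of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) is approximately ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Score in (0, 1): near 1 means easily isolated, near 0.5 means unremarkable."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # subsample size per tree (an assumed value for this illustration)
short_path = anomaly_score(2.0, n)    # isolated after very few splits
long_path = anomaly_score(12.0, n)    # needed many splits
typical = anomaly_score(average_path_length(n), n)
print(short_path, long_path, typical)
```

A point whose mean path length equals c(n) scores exactly 0.5, and shorter paths push the score toward 1, which matches the "shorter paths indicate higher anomaly" reading above.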

We also run DBSCAN and KMeans for comparison with the Isolation Forest.

Data

The Insider Threat Test Dataset from the Software Engineering Institute is a collection of synthetic insider threat test datasets that provide both background and malicious-actor data: https://insights.sei.cmu.edu/library/insider-threat-test-dataset/

Lindauer, Brian (2020). Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1

Downloaded from Kaggle: https://www.kaggle.com/datasets/mrajaxnp/cert-insider-threat-detection-research/data?select=users.csv

Here’s a brief overview of each file used:

| File | Description |
| --- | --- |
| logon.csv | User login/logout events (useful for detecting off-hours access or multiple logins) |
| file.csv | File access activities (can reveal unusual copying or exfiltration attempts) |
| decoy_file.csv | Accesses to honeypot files that normal users shouldn't open often (strong threat indicator) |
| device.csv | Use of USB/storage devices (critical for identifying potential data exfiltration) |
| users.csv | Static metadata for each user (department, role, etc.) for context or profiling |

Finding Insider Threats Within the Data

Possible examples of insider threats include employees who excessively access a large variety of files, interact heavily with removable media, or access decoy files planted to detect malicious behavior. Unusual spikes in file activity, USB usage, and decoy interactions from a single user can signal data exfiltration or reconnaissance attempts.

Based on the above assumptions, we engineered the following features:

| Feature | Meaning | Why it may be suspicious |
| --- | --- | --- |
| logon_count | Total logon events | Excessive logons could indicate scanning behavior or persistence attempts. |
| failed_logons | Failed logon attempts | Brute force or privilege escalation attempts. |
| unique_pcs | Number of unique PCs accessed | Wide movement across machines could indicate lateral movement. |
| file_access_count | Number of files accessed | Sudden spike might indicate data collection. |
| removable_write_count | Files written to removable drives | Strong exfiltration signal. |
| removable_read_count | Files read from removable drives | Preparatory behavior for exfiltration. |
| unique_files | Unique files accessed | Diversity might indicate broad searching. |
| usb_inserts / usb_removals | USB usage events | Physical exfiltration risk. |
| unique_devices | Unique devices used | More devices = more suspicious. |
| decoy_file_accesses | Accesses to honeypot files | Highly suspicious, often direct signal of insider threat. |

Methods & Results

This project leveraged the CERT Insider Threat Dataset (r6.2) to detect anomalous user behavior through unsupervised machine learning. We used five core logs — logon.csv, file.csv, device.csv, decoy_file.csv, and users.csv — to engineer behavioral features per user. These included logon frequency, failed logons, device usage, file access patterns, decoy file interactions, and removable media activity. The data was aggregated to one row per user, with missing values imputed as zero. Feature scaling was applied using StandardScaler.
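The aggregation step described above can be sketched on a tiny made-up example. The three toy frames below are hypothetical stand-ins for logon.csv, device.csv, and users.csv; the standardization is written out by hand with the population standard deviation, which is the same quantity StandardScaler uses, so the pattern is visible without the library.

```python
import pandas as pd

# Hypothetical toy logs: user C appears in users but has no events at all.
logon = pd.DataFrame({"user": ["A", "A", "B"], "pc": ["PC1", "PC2", "PC1"]})
device = pd.DataFrame({"user": ["A"], "activity": ["Connect"]})
users = pd.DataFrame({"user": ["A", "B", "C"]})

# One row per user for each log.
logon_features = logon.groupby("user").agg(
    logon_count=("pc", "count"),
    unique_pcs=("pc", "nunique"),
)
device_features = device.groupby("user").agg(
    usb_inserts=("activity", lambda s: (s == "Connect").sum()),
)

# Left-join onto the full user list so users with no events are kept,
# then treat "no events recorded" as zero.
features = (
    users.set_index("user")
    .join(logon_features)
    .join(device_features)
    .fillna(0)
)

# Standardize each column: zero mean, unit (population) standard deviation.
mean = features.mean()
std = features.std(ddof=0).replace(0, 1.0)  # leave constant columns unscaled
scaled = (features - mean) / std
print(features)
print(scaled.round(3))
```

Imputing zero is a modeling choice: it assumes a missing row means no activity rather than missing data, which seems reasonable for event logs but is worth stating.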

We evaluated three unsupervised anomaly detection methods:

Isolation Forest: Trained twice with contamination=0.05, and again without specifying contamination. In the final run, anomaly scores were thresholded using a statistical rule: mean + 2.5 standard deviations. This flagged 165 anomalous users. Isolation Forest measures how easily an instance is isolated in a tree structure; fewer splits imply higher anomaly.

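KMeans Clustering: We ran KMeans with k=4, chosen after reviewing the elbow and silhouette plots (note that in the printed results the silhouette score peaks at k=2 and is lowest at k=4, so the silhouette criterion alone does not favor this choice). Distances from each user to the assigned centroid were calculated, and users exceeding mean + 2.5σ distance were marked anomalous, resulting in 203 flagged users.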

DBSCAN: Used with eps=1.5 and min_samples=5, DBSCAN identified 51 users in noise cluster -1 as anomalies. While effective in identifying outliers in sparse regions, its sensitivity to density parameters limited its detection breadth.
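One common heuristic for probing DBSCAN's sensitivity to eps is a sorted k-distance curve: compute each point's distance to its k-th nearest neighbor and look for the knee. This notebook did not run that step; the sketch below is only an illustration on synthetic data, and `k_distances` is a hypothetical helper. It builds a full pairwise distance matrix, which is fine for a few thousand users but would need a neighbor index for much larger data.

```python
import numpy as np


def k_distances(X, k):
    """Distance from each row of X to its k-th nearest other row."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distances via the Gram matrix; clip tiny negatives from rounding.
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None)
    np.fill_diagonal(d2, 0.0)
    # After sorting each row, column 0 is the self-distance, so column k is the k-th neighbor.
    return np.sqrt(np.sort(d2, axis=1)[:, k])


# Synthetic example: one tight cluster plus a single far-away point.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
X_toy = np.vstack([cluster, [[5.0, 5.0]]])

kd = k_distances(X_toy, k=4)
print(np.sort(kd)[-3:])  # the largest k-distances; the isolated point dominates
```

On real data, an eps chosen near the knee of `np.sort(kd)` is a starting point to test, not a guarantee of good clusters.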

Among these, Isolation Forest provided the most interpretable and consistent results by integrating all features and dynamically identifying outliers based on data structure rather than arbitrary thresholds. The most anomalous users exhibited unusually high decoy file access, removable media usage, and file activity. All findings were visualized using t-SNE projections and centroid-distance plots, confirming spatial separation of anomalies from typical users. We measured the most anomalous user's features based on their z-scores, which found significant evidence of suspect behaviors (e.g., z-score = 10 for total number of unique files accessed).
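One simple way to check how consistent the three methods are with each other is to compare the sets of users they flag. The sketch below uses Jaccard overlap on hypothetical user IDs; the IDs and the `flagged` dictionary are invented for illustration and are not results from this analysis.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical flagged-user sets, one per method.
flagged = {
    "isolation_forest": {"U001", "U002", "U003"},
    "kmeans": {"U002", "U003", "U004"},
    "dbscan": {"U003"},
}

# Pairwise agreement between methods.
overlap = {
    (m1, m2): jaccard(flagged[m1], flagged[m2])
    for m1, m2 in combinations(flagged, 2)
}

# Users flagged by every method are natural first candidates for review.
consensus = set.intersection(*flagged.values())

for pair, score in overlap.items():
    print(pair, round(score, 3))
print("Flagged by all methods:", sorted(consensus))
```

In the notebook, the same function could be applied to `set(iso_anomalies.index)`, `set(kmeans_anomalies.index)`, and the index of rows with `dbscan_label == -1`.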

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from scipy.stats import zscore


# Load datasets (CSVs are extracted and in notebook directory)
logon = pd.read_csv("logon.csv")
file_access = pd.read_csv("file.csv")
device = pd.read_csv("device.csv")
decoy = pd.read_csv("decoy_file.csv")
users = pd.read_csv("users.csv")


# Rename for consistency
users = users.rename(columns={'user_id': 'user'})
decoy = decoy.rename(columns={'pc': 'decoy_pc'})

# ─── Logon features ───────────────────────────────────────
logon['datetime'] = pd.to_datetime(logon['date'])
logon_features = logon.groupby('user').agg(
    logon_count=('id', 'count'),
    failed_logons=('activity', lambda x: (x == 'Logon-Failure').sum()),
    unique_pcs=('pc', 'nunique')
)

# ─── File access features ─────────────────────────────────
file_features = file_access.groupby('user').agg(
    file_access_count=('filename', 'count'),
    removable_write_count=('to_removable_media', 'sum'),
    removable_read_count=('from_removable_media', 'sum'),
    unique_files=('filename', 'nunique')
)

# ─── Device usage features ────────────────────────────────
device_features = device.groupby('user').agg(
    usb_inserts=('activity', lambda x: (x == 'Connect').sum()),
    usb_removals=('activity', lambda x: (x == 'Disconnect').sum()),
    unique_devices=('file_tree', 'nunique')
)

# ─── Decoy access features ────────────────────────────────
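# No 'user' column, so we need to infer from matching PC.
# NOTE: logon has one row per event, so this merge repeats each decoy row once
# per logon event on that PC; the count below therefore scales with logon volume.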
decoy_access = decoy.merge(logon[['user', 'pc']], left_on='decoy_pc', right_on='pc', how='left')
decoy_features = decoy_access.groupby('user').agg(
    decoy_file_accesses=('decoy_filename', 'count')
)

# ─── Combine all ───────────────────────────────────────────
df = users[['user']].drop_duplicates().set_index('user')

df = df.join([logon_features, file_features, device_features, decoy_features])
df.fillna(0, inplace=True)
In [4]:
# relevant features for visualization
eda_features = [
    'logon_count', 'failed_logons', 'unique_pcs',
    'file_access_count', 'unique_files',
    'removable_write_count', 'removable_read_count',
    'usb_inserts', 'usb_removals', 'unique_devices',
    'decoy_file_accesses'
]

# Set up the figure
fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(16, 24))
axes = axes.flatten()

for i, feature in enumerate(eda_features):
    sns.violinplot(data=df, y=feature, ax=axes[i], inner='quartile', linewidth=1)
    axes[i].set_title(f'Distribution of {feature}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel(feature)

# Hide any unused subplots
for j in range(len(eda_features), len(axes)):
    axes[j].set_visible(False)

plt.tight_layout()
plt.show()

These violin plots show the distribution of each behavioral feature across all users

These EDA visuals help confirm that outlier detection methods (like Isolation Forest, KMeans distances, or DBSCAN) are appropriate, as these feature distributions reflect classic insider threat patterns — rare, high-severity behaviors: decoy_file_accesses, removable_read/write_count, usb_inserts, and usb_removals all have extreme outliers, which are strong indicators for anomaly detection.

Below we run the Isolation Forest with contamination = 5%; later we set our own anomaly threshold using a z-score cutoff.

In [6]:
# ─── Step 1: Scale the data ────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# ─── Step 2: Fit Isolation Forest ──────────────────────────
model = IsolationForest(contamination=0.05, random_state=42)
df['anomaly'] = model.fit_predict(X_scaled)

# Anomaly = -1 for outliers, 1 for normal
df['anomaly'] = df['anomaly'].map({1: 0, -1: 1})

# ─── Step 3: Visualize with t-SNE ──────────────────────────
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
df['x'] = X_tsne[:, 0]
df['y'] = X_tsne[:, 1]

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x', y='y', hue='anomaly', palette={0: 'green', 1: 'red'})
plt.title('t-SNE Projection of Users — Anomalies Highlighted')
plt.xlabel('t-SNE X')
plt.ylabel('t-SNE Y')
plt.legend(title='Anomaly')
plt.show()

# Show top users with highest anomalies for investigation
suspicious_users = df[df['anomaly'] == 1].sort_values(by=df.columns.difference(['x', 'y', 'anomaly']).tolist(), ascending=False)
In [7]:
print(df.shape[0])
print(suspicious_users.shape[0])
4000
200

Out of 4000 total users, 200 anomalies were detected (contamination = 5%)

In [9]:
# Fit Isolation Forest
# NOTE: df is unscaled here and still holds the 'anomaly', 'x', and 'y' columns
# added in the previous cells, so the model also sees them as features.
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(df)

# decision_function returns higher = less anomalous, so negate it
anomaly_scores = model.decision_function(df)  # higher = less anomalous
outlier_scores = -anomaly_scores              # higher = more anomalous

# Optional: Add to dataframe
df['anomaly_score'] = outlier_scores

# Z-score normalization for interpretability
# (constant columns have zero variance and come out as NaN)
z_scores = df.drop(columns='anomaly_score').apply(zscore)

# Get the most anomalous user
most_anomalous_index = outlier_scores.argmax()
top_user_id = df.index[most_anomalous_index]
top_user_z = z_scores.loc[top_user_id]

# Print ranked features by z-score (most suspicious behaviors first)
print(f"Top anomalous user: {top_user_id}")
print(top_user_z.sort_values(ascending=False))

# Extract top user's z-scores and drop NaNs
top_user_z = z_scores.loc[top_user_id].dropna()

# Sort by z-score for better visual order
top_user_z_sorted = top_user_z.sort_values()

# Plot
plt.figure(figsize=(10, 6))
top_user_z_sorted.plot(kind='barh', color='crimson')
plt.title(f"Anomalous Behavior Profile: User {top_user_id}")
plt.xlabel("Z-score")
plt.axvline(x=0, color='black', linewidth=0.8)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Top anomalous user: JDM1042
unique_files             10.498718
file_access_count         8.051325
removable_read_count      7.934316
removable_write_count     7.053283
usb_removals              6.411179
usb_inserts               6.405038
anomaly                   4.358899
decoy_file_accesses       2.206429
x                         2.060832
unique_devices            2.018946
logon_count               1.800654
unique_pcs               -0.115865
y                        -0.594173
failed_logons                  NaN
Name: JDM1042, dtype: float64

The most anomalous user, JDM1042, exhibits multiple correlated behaviors consistent with insider threat patterns: excessive file access, heavy use of removable media, frequent USB activity, and access to honeypot files, all well beyond the statistical norms of peers (for example, z ≈ 10.5 for unique_files and z ≈ 8.1 for file_access_count).

This kind of z-score explanation translates a black-box anomaly detection into a clear, security-relevant story, which is what makes it compelling for cybersecurity analysts.
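The z-score explanation can be wrapped in a small reusable helper so any flagged user can be profiled the same way. The sketch below is an assumption-laden illustration: `zscore_profile` is a hypothetical function, the toy numbers are invented, and it uses the population standard deviation (the default convention of scipy's `zscore`). Constant columns are skipped rather than reported as NaN.

```python
import pandas as pd


def zscore_profile(features, user):
    """Per-feature z-scores for one user, largest first; constant features are dropped."""
    mean = features.mean()
    std = features.std(ddof=0)
    # Where a column has zero variance, divide by NaN so the feature drops out below.
    z = (features.loc[user] - mean) / std.where(std > 0)
    return z.dropna().sort_values(ascending=False)


# Invented toy data: user D touches far more unique files than the others,
# and failed_logons is constant across all users.
toy = pd.DataFrame(
    {
        "unique_files": [10, 12, 11, 90],
        "failed_logons": [0, 0, 0, 0],
        "logon_count": [50, 52, 49, 51],
    },
    index=["A", "B", "C", "D"],
)

profile = zscore_profile(toy, "D")
print(profile)
```

Passing only the engineered feature columns (not derived columns such as plot coordinates or labels) keeps the profile focused on behavior an analyst can act on.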

In [11]:
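# ---- Feature preparation ----
# Drop the earlier Isolation Forest label if present.
# NOTE: 'x', 'y', and 'anomaly_score' from the previous cells are still in df,
# so they are scaled and clustered along with the behavioral features.
features = df.drop(columns=['anomaly'], errors='ignore')
X = features.fillna(0).copy()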
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ---- Elbow method & silhouette score ----
inertias = []
silhouettes = []

print("Evaluating k for KMeans...")
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    sil_score = silhouette_score(X_scaled, km.labels_)
    silhouettes.append(sil_score)
    print(f"k={k}, inertia={km.inertia_:.2f}, silhouette={sil_score:.4f}")

# plot of elbow and silhouette
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(range(2, 8), inertias, marker='o')
plt.title('KMeans Elbow Method')
plt.xlabel('k')
plt.ylabel('Inertia')

plt.subplot(1, 2, 2)
plt.plot(range(2, 8), silhouettes, marker='o', color='green')
plt.title('Silhouette Score by k')
plt.xlabel('k')
plt.ylabel('Score')
plt.tight_layout()
plt.show()

# ---- Fit KMeans with chosen k (e.g., 4) ----
kmeans = KMeans(n_clusters=4, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# ---- Distance to cluster center (anomaly scoring) ----
centroids = kmeans.cluster_centers_
distances = np.linalg.norm(X_scaled - centroids[df['cluster']], axis=1)
df['distance_to_centroid'] = distances

# ---- Top anomalies ----
top_anomalies = df.sort_values('distance_to_centroid', ascending=False).head(10)
print("Top 10 anomalous users by distance to centroid:")
print(top_anomalies[['distance_to_centroid', 'cluster']])

# ---- t-SNE Visualization ----
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_result = tsne.fit_transform(X_scaled)
df['x'] = tsne_result[:, 0]
df['y'] = tsne_result[:, 1]

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x', y='y', hue='cluster', palette='tab10')
plt.title("t-SNE Clustering View")
plt.legend()
plt.show()

# ---- DBSCAN comparison ----
db = DBSCAN(eps=1.5, min_samples=5)
df['dbscan_label'] = db.fit_predict(X_scaled)
print("DBSCAN label distribution:")
print(df['dbscan_label'].value_counts())
Evaluating k for KMeans...
k=2, inertia=32513.22, silhouette=0.5765
k=3, inertia=22667.46, silhouette=0.5524
k=4, inertia=18734.14, silhouette=0.3591
k=5, inertia=12406.75, silhouette=0.3990
k=6, inertia=9597.74, silhouette=0.4206
k=7, inertia=8383.62, silhouette=0.4151
Top 10 anomalous users by distance to centroid:
         distance_to_centroid  cluster
user                                  
SDH2394             11.966077        2
TAM3048             11.566825        1
DNS1758             11.446726        1
MTS0465             10.156047        1
RCF0044              9.814102        1
BPD2437              9.763080        1
CCB3055              9.212245        1
EIR1046              9.168279        1
HBB1759              9.167169        1
FZB3046              9.115166        1
DBSCAN label distribution:
dbscan_label
 1    3133
 2     579
 0      73
 3      70
-1      51
 4      50
 6      21
 5      16
 7       7
Name: count, dtype: int64
In [12]:
# DBSCAN – label -1 is anomaly
dbscan_anomalies = (df['dbscan_label'] == -1).sum()
print(f"DBSCAN anomalies (label -1): {dbscan_anomalies}")
DBSCAN anomalies (label -1): 51
In [13]:
# Compute distances to assigned cluster centroid
distances = np.linalg.norm(X_scaled - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Define anomaly threshold based on statistics (e.g., mean + 2.5 * std)
threshold = distances.mean() + 2.5 * distances.std()

# Flag anomalies
kmeans_anomalies = df[distances > threshold]
print(f"KMeans anomalies detected (2.5 SD from mean): {len(kmeans_anomalies)}")
KMeans anomalies detected (2.5 SD from mean): 203
In [14]:
# Fit Isolation Forest without contamination
iso = IsolationForest(random_state=42)
iso.fit(X_scaled)

# decision_function returns lower = more anomalous;
# flip the sign so that higher = more anomalous
scores = -iso.decision_function(X_scaled)

# Compute threshold: mean + 2.5 * std
threshold = scores.mean() + 2.5 * scores.std()

# Flag anomalies
iso_anomalies = df[scores > threshold]
print(f"Isolation Forest anomalies (2.5σ): {len(iso_anomalies)}")

Isolation Forest anomalies (2.5σ): 165

For comparison with the other models, we treated the Isolation Forest like a statistical model, applying a standard-deviation cutoff to its unsupervised anomaly scores rather than fixing the contamination rate at 5%. We chose 2.5 instead of 3 standard deviations to err on the side of caution, since the lower cutoff flags more users for review.
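The effect of choosing 2.5 rather than 3 standard deviations can be sketched with a small helper. `sigma_flags` is a hypothetical function, and the scores here are synthetic draws from a normal distribution, not the notebook's anomaly scores. For a normal distribution, a one-sided 2.5σ cutoff flags roughly 0.6% of points and a 3σ cutoff roughly 0.13%; the 165 of 4000 users flagged above (about 4.1%) suggests the real score distribution has a much heavier right tail than that.

```python
import numpy as np


def sigma_flags(scores, k=2.5):
    """Flag scores above mean + k * std; returns (boolean mask, threshold)."""
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + k * scores.std()
    return scores > threshold, threshold


# Synthetic scores for illustration only (higher = more anomalous).
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=10_000)

flags_25, t25 = sigma_flags(toy_scores, k=2.5)
flags_3, t3 = sigma_flags(toy_scores, k=3.0)

print(f"2.5 sigma: threshold={t25:.3f}, flagged={flags_25.sum()}")
print(f"3.0 sigma: threshold={t3:.3f}, flagged={flags_3.sum()}")
```

Because the 3σ threshold is always at least as high, every point it flags is also flagged at 2.5σ, so lowering k only ever widens the review list.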
