Detecting Insider Threats Using Unsupervised Learning on User Behavior Logs¶
Project Objective:¶
Insider threats are a serious vulnerability within a large company. With hundreds to thousands of employees, all with varying levels of access, monitoring is needed at the user level to identify and flag possible internal threats.
This project uses unsupervised anomaly detection (Isolation Forest) to flag user behaviors that significantly deviate from the norm and may indicate insider threats.
Isolation Forest assigns an anomaly score to each data point based on how easily it can be isolated by an ensemble of randomly built binary trees. Each tree splits features at random values, and points that lie far from dense clusters (i.e., anomalies) tend to be isolated quickly, requiring fewer splits (shallower tree depth). In contrast, normal points lie within dense regions and require more splits (deeper tree depth) to isolate. The average path length over all trees is therefore used to compute the anomaly score, where shorter paths indicate a more anomalous point.
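For intuition, the scoring rule from the original Isolation Forest paper (Liu et al., 2008) can be sketched directly: s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of point x across the trees and c(n) is the average path length of an unsuccessful binary search tree lookup over n points. The helper below is illustrative only; scikit-learn computes this internally, and the path lengths passed in here are hypothetical.
import numpy as np
def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + np.euler_gamma  # approximates H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n
def anomaly_score(avg_path_length, n):
    """Score in (0, 1]; values near 1 indicate anomalies, lower values normal points."""
    return 2.0 ** (-avg_path_length / c(n))
# A point isolated after ~3 splits in a 256-point sample scores high (~0.82),
# while one needing ~12 splits scores low (~0.44):
print(anomaly_score(3.0, 256), anomaly_score(12.0, 256))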
We will also run DBSCAN and KMeans to compare with the Isolation Forest.
Data¶
The Insider Threat Test Dataset by the Software Engineering Institute is a collection of synthetic insider threat test datasets that provide both background and malicious-actor synthetic data. - https://insights.sei.cmu.edu/library/insider-threat-test-dataset/
Lindauer, Brian (2020). Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1
Downloaded from Kaggle here: https://www.kaggle.com/datasets/mrajaxnp/cert-insider-threat-detection-research/data?select=users.csv
Here’s a brief overview of each file used:
File | Description |
---|---|
logon.csv | User login/logout events (useful for detecting off-hours access or multiple logins) |
file.csv | File access activities (can reveal unusual copying or exfiltration attempts) |
decoy_file.csv | Accesses to honeypot files that normal users shouldn't open often (strong threat indicator) |
device.csv | Use of USB/storage devices (critical for identifying potential data exfiltration) |
users.csv | Static metadata for each user (department, role, etc.) for context or profiling |
Finding Insider Threats Within the Data¶
Possible examples of insider threats include employees who excessively access a large variety of files, interact heavily with removable media, or access decoy files planted to detect malicious behavior. Unusual spikes in file activity, USB usage, and decoy interactions from a single user can signal data exfiltration or reconnaissance attempts.
Based on the above assumptions, we engineered the following features:
Feature | Meaning | Why it may be suspicious |
---|---|---|
logon_count | Total logon events | Excessive logons could indicate scanning behavior or persistence attempts. |
failed_logons | Failed logon attempts | Brute force or privilege escalation attempts. |
unique_pcs | Number of unique PCs accessed | Wide movement across machines could indicate lateral movement. |
file_access_count | Number of files accessed | Sudden spike might indicate data collection. |
removable_write_count | Files written to removable drives | Strong exfiltration signal. |
removable_read_count | Files read from removable drives | Preparatory behavior for exfiltration. |
unique_files | Unique files accessed | Diversity might indicate broad searching. |
usb_inserts / usb_removals | USB usage events | Physical exfiltration risk. |
unique_devices | Unique devices used | More devices = more suspicious. |
decoy_file_accesses | Accesses to honeypot files | Highly suspicious, often direct signal of insider threat. |
Methods & Results¶
This project leveraged the CERT Insider Threat Dataset (r6.2) to detect anomalous user behavior through unsupervised machine learning. We used five core logs — logon.csv, file.csv, device.csv, decoy_file.csv, and users.csv — to engineer behavioral features per user. These included logon frequency, failed logons, device usage, file access patterns, decoy file interactions, and removable media activity. The data was aggregated to one row per user, with missing values imputed as zero. Feature scaling was applied using StandardScaler.
We evaluated three unsupervised anomaly detection methods:
Isolation Forest: Trained once with contamination=0.05, and again without specifying contamination. Anomaly scores were thresholded using a statistical rule: mean + 2.5 standard deviations. This flagged 165 anomalous users. Isolation Forest measures how easily an instance is isolated in a tree structure; fewer splits imply a higher anomaly score.
KMeans Clustering: We ran KMeans with k=4 (chosen via silhouette score and elbow method). Distances to centroids were calculated and users exceeding mean + 2.5σ distance were marked anomalous, resulting in 203 flagged users.
DBSCAN: Used with eps=1.5 and min_samples=5, DBSCAN identified 51 users in noise cluster -1 as anomalies. While effective in identifying outliers in sparse regions, its sensitivity to density parameters limited its detection breadth.
Among these, Isolation Forest provided the most interpretable and consistent results, integrating all features and identifying outliers based on data structure rather than arbitrary thresholds. The most anomalous users exhibited unusually high decoy file access, removable media usage, and file activity. All findings were visualized using t-SNE projections and centroid-distance plots, confirming spatial separation of anomalies from typical users. We profiled the most anomalous user's features via their z-scores, which showed significant evidence of suspect behaviors (e.g., a z-score of roughly 10.5 for the number of unique files accessed).
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from scipy.stats import zscore
# Load datasets (CSVs are extracted and in notebook directory)
logon = pd.read_csv("logon.csv")
file_access = pd.read_csv("file.csv")
device = pd.read_csv("device.csv")
decoy = pd.read_csv("decoy_file.csv")
users = pd.read_csv("users.csv")
# Rename for consistency
users = users.rename(columns={'user_id': 'user'})
decoy = decoy.rename(columns={'pc': 'decoy_pc'})
# ─── Logon features ───────────────────────────────────────
logon['datetime'] = pd.to_datetime(logon['date'])  # parse timestamps (not used in the final feature set)
logon_features = logon.groupby('user').agg(
logon_count=('id', 'count'),
failed_logons=('activity', lambda x: (x == 'Logon-Failure').sum()),
unique_pcs=('pc', 'nunique')
)
# ─── File access features ─────────────────────────────────
file_features = file_access.groupby('user').agg(
file_access_count=('filename', 'count'),
removable_write_count=('to_removable_media', 'sum'),
removable_read_count=('from_removable_media', 'sum'),
unique_files=('filename', 'nunique')
)
# ─── Device usage features ────────────────────────────────
device_features = device.groupby('user').agg(
usb_inserts=('activity', lambda x: (x == 'Connect').sum()),
usb_removals=('activity', lambda x: (x == 'Disconnect').sum()),
unique_devices=('file_tree', 'nunique')
)
# ─── Decoy access features ────────────────────────────────
# decoy_file.csv has no 'user' column, so we infer users by matching each decoy PC
# against logon records; note each decoy row joins to every logon event on that PC,
# so the resulting counts are (decoy file, logon event) pairs per user
decoy_access = decoy.merge(logon[['user', 'pc']], left_on='decoy_pc', right_on='pc', how='left')
decoy_features = decoy_access.groupby('user').agg(
decoy_file_accesses=('decoy_filename', 'count')
)
# ─── Combine all ───────────────────────────────────────────
df = users[['user']].drop_duplicates().set_index('user')
df = df.join([logon_features, file_features, device_features, decoy_features])
df.fillna(0, inplace=True)
# relevant features for visualization
eda_features = [
'logon_count', 'failed_logons', 'unique_pcs',
'file_access_count', 'unique_files',
'removable_write_count', 'removable_read_count',
'usb_inserts', 'usb_removals', 'unique_devices',
'decoy_file_accesses'
]
# Set up the figure
fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(16, 24))
axes = axes.flatten()
for i, feature in enumerate(eda_features):
sns.violinplot(data=df, y=feature, ax=axes[i], inner='quartile', linewidth=1)
axes[i].set_title(f'Distribution of {feature}')
axes[i].set_xlabel('')
axes[i].set_ylabel(feature)
# Hide any unused subplots
for j in range(len(eda_features), len(axes)):
axes[j].set_visible(False)
plt.tight_layout()
plt.show()
These violin plots show the distribution of each behavioral feature across all users¶
These EDA visuals help confirm that outlier detection methods (Isolation Forest, KMeans centroid distances, DBSCAN) are appropriate, as the feature distributions reflect classic insider threat patterns of rare, high-severity behaviors. In particular, decoy_file_accesses, removable_read_count/removable_write_count, usb_inserts, and usb_removals all have extreme outliers, which are strong signals for anomaly detection.
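As a quick sanity check on that claim, we can quantify the heavy tails. The sketch below (assuming df and eda_features from the cells above) prints the share of users more than 3 standard deviations above the mean for each feature.
# Share of users > 3 standard deviations above the mean, per feature
z = (df[eda_features] - df[eda_features].mean()) / df[eda_features].std()
print((z > 3).mean().sort_values(ascending=False))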
Below we will run the Isolation Forest at contamination = 5% (and later set our own anomaly threshold using a z-score cutoff).
# ─── Step 1: Scale the data ────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# ─── Step 2: Fit Isolation Forest ──────────────────────────
model = IsolationForest(contamination=0.05, random_state=42)
df['anomaly'] = model.fit_predict(X_scaled)
# Anomaly = -1 for outliers, 1 for normal
df['anomaly'] = df['anomaly'].map({1: 0, -1: 1})
# ─── Step 3: Visualize with t-SNE ──────────────────────────
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
df['x'] = X_tsne[:, 0]
df['y'] = X_tsne[:, 1]
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x', y='y', hue='anomaly', palette={0: 'green', 1: 'red'})
plt.title('t-SNE Projection of Users — Anomalies Highlighted')
plt.xlabel('t-SNE X')
plt.ylabel('t-SNE Y')
plt.legend(title='Anomaly')
plt.show()
# Show flagged users for investigation, sorted by their feature values
suspicious_users = df[df['anomaly'] == 1].sort_values(
    by=df.columns.difference(['x', 'y', 'anomaly']).tolist(), ascending=False
)
print(df.shape[0])
print(suspicious_users.shape[0])
4000
200
Out of 4000 total users, 200 anomalies were detected (contamination = 5%)¶
# Fit Isolation Forest again to extract continuous anomaly scores
# (note: df now also contains the 'anomaly' flag and t-SNE coordinates x/y
# from the cells above, so those columns feed into this fit as well)
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(df)
# Compute anomaly scores (lower = more anomalous)
anomaly_scores = model.decision_function(df) # higher = less anomalous
outlier_scores = -anomaly_scores # higher = more anomalous
# Optional: Add to dataframe
df['anomaly_score'] = outlier_scores
# Z-score normalization for interpretability
# (zscore returns NaN for zero-variance columns such as failed_logons)
z_scores = df.drop(columns='anomaly_score').apply(zscore)
# Get the most anomalous user
most_anomalous_index = outlier_scores.argmax()
top_user_id = df.index[most_anomalous_index]
top_user_z = z_scores.loc[top_user_id]
# Print ranked features by z-score (most suspicious behaviors first)
print(f"Top anomalous user: {top_user_id}")
print(top_user_z.sort_values(ascending=False))
# Extract the top user's z-scores and drop NaNs (zero-variance features)
top_user_z = z_scores.loc[top_user_id].dropna()
# Sort by z-score for better visual order
top_user_z_sorted = top_user_z.sort_values()
# Plot
plt.figure(figsize=(10, 6))
top_user_z_sorted.plot(kind='barh', color='crimson')
plt.title("Anomalous Behavior Profile: User JDM1042")
plt.xlabel("Z-score")
plt.axvline(x=0, color='black', linewidth=0.8)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Top anomalous user: JDM1042
unique_files             10.498718
file_access_count         8.051325
removable_read_count      7.934316
removable_write_count     7.053283
usb_removals              6.411179
usb_inserts               6.405038
anomaly                   4.358899
decoy_file_accesses       2.206429
x                         2.060832
unique_devices            2.018946
logon_count               1.800654
unique_pcs               -0.115865
y                        -0.594173
failed_logons                  NaN
Name: JDM1042, dtype: float64
The biggest anomaly, JDM1042, exhibits multiple correlated behaviors consistent with insider threat patterns: excessive file access, heavy removable media use, access to honeypot files, and extensive USB activity, all far beyond the statistical norms of their peers.
This kind of z-score explanation translates black-box anomaly detection into a clear, security-relevant story, which is what makes it compelling for cybersecurity analysts.
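The same explanation can be wrapped into a small helper for analysts. explain_user below is a hypothetical convenience function (assuming z_scores from above), not part of the original pipeline.
# Reusable z-score explanation for any flagged user
def explain_user(user_id, top_n=5):
    """Print the top_n most extreme behavioral z-scores for one user."""
    profile = z_scores.loc[user_id].dropna().sort_values(ascending=False)
    print(f"Most extreme behaviors for {user_id}:")
    print(profile.head(top_n))
explain_user('JDM1042')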
# ---- Feature preparation ----
# drop the Isolation Forest flag; anomaly_score, x, and y from earlier cells remain
features = df.drop(columns=['anomaly'], errors='ignore')
X = features.fillna(0).copy()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ---- Elbow method & silhouette score ----
inertias = []
silhouettes = []
print("Evaluating k for KMeans...")
for k in range(2, 8):
km = KMeans(n_clusters=k, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
sil_score = silhouette_score(X_scaled, km.labels_)
silhouettes.append(sil_score)
print(f"k={k}, inertia={km.inertia_:.2f}, silhouette={sil_score:.4f}")
# plot of elbow and silhouette
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(range(2, 8), inertias, marker='o')
plt.title('KMeans Elbow Method')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.subplot(1, 2, 2)
plt.plot(range(2, 8), silhouettes, marker='o', color='green')
plt.title('Silhouette Score by k')
plt.xlabel('k')
plt.ylabel('Score')
plt.tight_layout()
plt.show()
# ---- Fit KMeans with chosen k (e.g., 4) ----
kmeans = KMeans(n_clusters=4, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# ---- Distance to cluster center (anomaly scoring) ----
centroids = kmeans.cluster_centers_
distances = np.linalg.norm(X_scaled - centroids[df['cluster']], axis=1)
df['distance_to_centroid'] = distances
# ---- Top anomalies ----
top_anomalies = df.sort_values('distance_to_centroid', ascending=False).head(10)
print("Top 10 anomalous users by distance to centroid:")
print(top_anomalies[['distance_to_centroid', 'cluster']])
# ---- t-SNE Visualization ----
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_result = tsne.fit_transform(X_scaled)
df['x'] = tsne_result[:, 0]
df['y'] = tsne_result[:, 1]
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x', y='y', hue='cluster', palette='tab10')
plt.title("t-SNE Clustering View")
plt.legend()
plt.show()
# ---- DBSCAN comparison ----
db = DBSCAN(eps=1.5, min_samples=5)
df['dbscan_label'] = db.fit_predict(X_scaled)
print("DBSCAN label distribution:")
print(df['dbscan_label'].value_counts())
Evaluating k for KMeans...
k=2, inertia=32513.22, silhouette=0.5765
k=3, inertia=22667.46, silhouette=0.5524
k=4, inertia=18734.14, silhouette=0.3591
k=5, inertia=12406.75, silhouette=0.3990
k=6, inertia=9597.74, silhouette=0.4206
k=7, inertia=8383.62, silhouette=0.4151
Top 10 anomalous users by distance to centroid:
         distance_to_centroid  cluster
user
SDH2394             11.966077        2
TAM3048             11.566825        1
DNS1758             11.446726        1
MTS0465             10.156047        1
RCF0044              9.814102        1
BPD2437              9.763080        1
CCB3055              9.212245        1
EIR1046              9.168279        1
HBB1759              9.167169        1
FZB3046              9.115166        1
DBSCAN label distribution:
dbscan_label
 1    3133
 2     579
 0      73
 3      70
-1      51
 4      50
 6      21
 5      16
 7       7
Name: count, dtype: int64
# DBSCAN – label -1 is anomaly
dbscan_anomalies = (df['dbscan_label'] == -1).sum()
print(f"DBSCAN anomalies (label -1): {dbscan_anomalies}")
DBSCAN anomalies (label -1): 51
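Since DBSCAN is sensitive to its density parameters, one common diagnostic (not part of the original analysis) is a k-distance plot: sort each point's distance to its min_samples-th closest point and look for the "knee", which suggests a reasonable eps. A sketch, assuming X_scaled from above:
from sklearn.neighbors import NearestNeighbors
# Sorted distance to each user's 5th-closest point (k chosen = min_samples;
# the nearest "neighbor" returned is the point itself at distance 0)
nn = NearestNeighbors(n_neighbors=5).fit(X_scaled)
dists, _ = nn.kneighbors(X_scaled)
plt.figure(figsize=(8, 4))
plt.plot(np.sort(dists[:, -1]))
plt.title('k-distance Plot (k = 5)')
plt.xlabel('Users (sorted by distance)')
plt.ylabel('Distance to 5th-closest point')
plt.show()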
# Compute distances to assigned cluster centroid
distances = np.linalg.norm(X_scaled - kmeans.cluster_centers_[kmeans.labels_], axis=1)
# Define anomaly threshold based on statistics (e.g., mean + 2.5 * std)
threshold = distances.mean() + 2.5 * distances.std()
# Flag anomalies
kmeans_anomalies = df[distances > threshold]
print(f"KMeans anomalies detected (2.5 SD from mean): {len(kmeans_anomalies)}")
KMeans anomalies detected (2.5 SD from mean): 203
# Fit Isolation Forest without contamination
iso = IsolationForest(random_state=42)
iso.fit(X_scaled)
# Get raw anomaly scores (lower = more anomalous)
scores = -iso.decision_function(X_scaled) # flip sign for consistency (higher = more anomalous)
# Compute threshold: mean + 2.5 * std
threshold = scores.mean() + 2.5 * scores.std()
# Flag anomalies
iso_anomalies = df[scores > threshold]
print(f"Isolation Forest anomalies (2.5σ): {len(iso_anomalies)}")
Isolation Forest anomalies (2.5σ): 165
For comparison with the other models, we applied the Isolation Forest like a statistical model here, using a standard-deviation cutoff on its unsupervised anomaly scores rather than fixing the contamination rate at 5% as above. We chose 2.5 rather than 3 standard deviations to err on the side of caution.
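To see how much that cutoff choice matters, a quick sensitivity check (assuming scores from the cell above is still in scope) compares the flag counts at a few cutoffs:
# Flag counts at 2, 2.5, and 3 standard deviations above the mean score
for k in (2.0, 2.5, 3.0):
    cutoff = scores.mean() + k * scores.std()
    print(f"{k} SD cutoff -> {(scores > cutoff).sum()} users flagged")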