Spotting Network Threats with Machine Learning

Finding the Threats Hiding in the Noise

A network quietly records everything that connects to it — millions of connections a day. Hidden in that flood are the threats that matter: scans, probes, and "command-and-control" traffic where a compromised machine phones home. Finding them by hand isn't realistic.

The goal was to train a model to read network connection logs and automatically flag the suspicious ones, so a team gets a short list to review instead of a firehose. The data was roughly 2 million network connections from Zeek (a standard network monitoring tool), labeled against the MITRE ATT&CK framework as benign or malicious — each carrying its native details: bytes sent and received, duration, ports, protocol, and packet counts.
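To make the "native details" concrete, here is a minimal sketch of turning Zeek connection records into numeric features. It assumes Zeek's default tab-separated conn.log layout, trimmed to a handful of columns. The sample records, the helper name parse_conn_log, and the choice to treat unset values as zero are illustrative assumptions, not the project's actual pipeline, and the benign/malicious labels are not shown.

```python
# Columns kept for this sketch; a real conn.log carries more fields.
FIELDS = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
          "proto", "duration", "orig_bytes", "resp_bytes", "orig_pkts", "resp_pkts"]

# Hypothetical excerpt of a Zeek conn.log; addresses and values are made up.
SAMPLE_CONN_LOG = "\n".join([
    "#separator \\x09",
    "#fields\t" + "\t".join(FIELDS),
    "\t".join(["1700000000.000000", "Cabc1", "10.0.0.5", "51514", "10.0.0.9",
               "443", "tcp", "1.250000", "820", "15400", "12", "18"]),
    "\t".join(["1700000001.000000", "Cabc2", "10.0.0.66", "40000", "10.0.0.9",
               "22", "tcp", "-", "-", "-", "1", "1"]),
])

NUMERIC_FIELDS = ("duration", "orig_bytes", "resp_bytes", "orig_pkts", "resp_pkts")


def parse_conn_log(text):
    """Yield one feature dict per connection from Zeek TSV text."""
    fields = None
    for line in text.splitlines():
        if line.startswith("#fields"):
            # The header names the columns; skip the "#fields" token itself.
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            continue  # other metadata lines such as #separator or #types
        if fields is None:
            raise ValueError("no #fields header before the first record")
        row = dict(zip(fields, line.split("\t")))
        features = {"proto": row["proto"], "resp_port": int(row["id.resp_p"])}
        for name in NUMERIC_FIELDS:
            # Zeek writes "-" for unset values; mapping them to 0 is an assumption.
            features[name] = 0.0 if row[name] == "-" else float(row[name])
        yield features


rows = list(parse_conn_log(SAMPLE_CONN_LOG))
for r in rows:
    print(r)
```

The second sample record has no duration or byte counts, which is how a connection attempt that never exchanged data tends to appear in the log.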

Three models were trained and compared — Logistic Regression, Random Forest, and XGBoost — with adjustments so the rare malicious traffic wasn't drowned out by normal traffic.
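The write-up does not say which adjustment was used, so the following is only a sketch of two common ones: inverse-frequency class weights (the heuristic behind scikit-learn's class_weight="balanced") and the negative-to-positive ratio that XGBoost's documentation suggests for scale_pos_weight. The function names and the 990-to-10 label split are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def positive_weight_ratio(labels, positive=1):
    """Ratio of negative to positive examples."""
    pos = sum(1 for y in labels if y == positive)
    if pos == 0:
        raise ValueError("no positive examples in labels")
    return (len(labels) - pos) / pos


# Hypothetical split: 990 benign (0) and 10 malicious (1) connections.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
ratio = positive_weight_ratio(labels)

print(weights)  # the rare class gets the much larger weight
print(ratio)
```

With weights like these, a mistake on one malicious connection costs the model far more than a mistake on one benign connection, so the rare class is not ignored during training.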

Figure: Detection accuracy by model (ROC AUC, where a perfect score is 1.0). Random Forest: 0.9999; XGBoost: 0.9996; Logistic Regression: 0.9910.

Random Forest was near-flawless on the test data, with a ROC AUC of 0.9999 and essentially perfect precision and recall; XGBoost followed at a ROC AUC of 0.9996. Just as useful, the model was explainable: the size of the response to a connection was the biggest tell. Malicious scans get small responses, while normal traffic returns much larger ones. That's a pattern a security analyst can understand and trust.
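For readers unfamiliar with the metrics, ROC AUC is a ranking score: the chance that a randomly chosen malicious connection receives a higher score than a randomly chosen benign one. Precision and recall, by contrast, depend on where the alert threshold is set. The sketch below computes both from scratch on six made-up scores; the function names and the 0.5 threshold are illustrative assumptions, not the project's evaluation code.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Mann-Whitney U statistic, normalised by the number of pos/neg pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are flagged."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: 1 = malicious, 0 = benign.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores, threshold=0.5)
print(auc, precision, recall)
```

In this toy case one benign connection outscores one malicious connection, so the AUC is 7 of 8 pairs ranked correctly, while precision at the chosen threshold is 2 of 3 flagged connections.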

What this means for your business: threats hide inside ordinary-looking activity — far too much volume for a person to watch. A trained model scores every connection and surfaces only the suspicious ones for review. The same approach can feed a security dashboard or give a small team early warning, without the cost of a full security operations center — and the principle carries over to any "find the rare bad event in a huge stream" problem.
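The "short list instead of a firehose" step can be sketched in a few lines. This assumes the model has already produced a score between 0 and 1 for each connection; the connection IDs, the 0.9 threshold, and the list size are hypothetical and would need tuning against how many alerts a team can actually review.

```python
import heapq


def shortlist(connection_ids, scores, threshold=0.9, limit=10):
    """Return up to `limit` (id, score) pairs at or above threshold, highest first."""
    flagged = [(s, c) for c, s in zip(connection_ids, scores) if s >= threshold]
    top = heapq.nlargest(limit, flagged, key=lambda pair: pair[0])
    return [(c, s) for s, c in top]


# Hypothetical model output for five connections.
connection_ids = ["C1", "C2", "C3", "C4", "C5"]
scores = [0.02, 0.97, 0.40, 0.99, 0.91]

review_queue = shortlist(connection_ids, scores, threshold=0.9, limit=2)
print(review_queue)
```

Capping the list as well as thresholding it keeps the review queue a fixed size even on a noisy day, at the cost of possibly leaving some lower-scored alerts for the next pass.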
