Finding the Threats Hiding in the Noise
A network quietly records everything that connects to it — millions of connections a day. Hidden in that flood are the threats that matter: scans, probes, and "command-and-control" traffic where a compromised machine phones home. Finding them by hand isn't realistic.
The goal was to train a model to read network connection logs and automatically flag the suspicious ones, so a team gets a short list to review instead of a firehose. The data was roughly 2 million network connections from Zeek (a standard network monitoring tool), labeled against the MITRE ATT&CK framework as benign or malicious — each carrying its native details: bytes sent and received, duration, ports, protocol, and packet counts.
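To make the "native details" concrete, here is a minimal sketch of turning one connection record into a numeric feature vector. The field names follow Zeek's conn.log (`duration`, `orig_bytes`, `resp_bytes`, `orig_pkts`, `resp_pkts`, `id.resp_p`, `proto`). Which fields to keep, the treatment of unset values as zero, and the helper names `to_float` and `conn_to_features` are assumptions for illustration, not the project's actual pipeline.

```python
# Field names follow Zeek's conn.log; the selection here is an assumption.
NUMERIC_FIELDS = ["duration", "orig_bytes", "resp_bytes",
                  "orig_pkts", "resp_pkts", "id.resp_p"]
PROTOCOLS = ["tcp", "udp", "icmp"]


def to_float(value):
    """Zeek writes '-' for unset fields; this sketch maps those to 0.0."""
    if value is None or value in ("-", ""):
        return 0.0
    return float(value)


def conn_to_features(record):
    """Convert one conn.log record (a dict of strings) to a list of floats."""
    features = [to_float(record.get(name)) for name in NUMERIC_FIELDS]
    # One-hot encode the transport protocol.
    proto = record.get("proto", "")
    features.extend(1.0 if proto == p else 0.0 for p in PROTOCOLS)
    return features


# Hypothetical record: a short probe of port 445 that drew no response.
example = {
    "duration": "0.002", "orig_bytes": "0", "resp_bytes": "-",
    "orig_pkts": "1", "resp_pkts": "0", "id.resp_p": "445", "proto": "tcp",
}
features = conn_to_features(example)
```

Mapping unset values to zero is the simplest choice. A real pipeline might prefer an explicit "missing" indicator so the model can tell "no bytes" from "not recorded".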
Three models were trained and compared: Logistic Regression, Random Forest, and XGBoost. Training was adjusted for class imbalance so the rare malicious traffic wasn't drowned out by the far more common normal traffic.
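The write-up doesn't say which adjustment was used. One common option is inverse-frequency class weighting, the same heuristic scikit-learn applies for `class_weight="balanced"`: each class gets weight `n_samples / (n_classes * class_count)`. The sketch below computes those weights in plain Python; the function name `balanced_class_weights` and the 98-to-2 split are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical split: 98 benign (0) connections for every 2 malicious (1).
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
```

With these weights, the 2 malicious examples carry the same total weight as the 98 benign ones, so a model can't score well by ignoring the rare class.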
Random Forest was near-flawless on the test data, with a ROC AUC of 0.9999 and essentially perfect precision and recall. XGBoost followed at 99.96% accuracy. Just as useful, the model was explainable: the size of the response to a connection was the strongest signal. Malicious scans tend to draw small responses, while normal traffic returns much larger ones. That's a pattern a security analyst can understand and trust.
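For readers unfamiliar with the metrics quoted above, here is a small sketch of how precision, recall, and ROC AUC are computed from labels and model scores. The data is made up for illustration and isn't from the project. The AUC function uses the pairwise definition (the chance that a randomly chosen malicious connection scores higher than a randomly chosen benign one), which is easy to read but too slow for millions of rows.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (malicious = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Fraction of (malicious, benign) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: four benign connections, two malicious.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

In this toy case one benign connection scores above the 0.5 cutoff, so recall is perfect but precision drops to two in three, and the AUC is 0.875.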
What this means for your business: threats hide inside ordinary-looking activity — far too much volume for a person to watch. A trained model scores every connection and surfaces only the suspicious ones for review. The same approach can feed a security dashboard or give a small team early warning, without the cost of a full security operations center — and the principle carries over to any "find the rare bad event in a huge stream" problem.
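The "score everything, surface only the suspicious" step can be sketched in a few lines. This assumes some trained model has already produced a score between 0 and 1 for each connection. The connection IDs, the 0.5 threshold, the review limit, and the function name `shortlist` are all hypothetical choices for illustration.

```python
import heapq


def shortlist(scored_connections, threshold=0.5, limit=10):
    """Return up to `limit` connections scoring at or above `threshold`,
    highest score first. Each item is a (connection_id, score) pair."""
    flagged = [item for item in scored_connections if item[1] >= threshold]
    return heapq.nlargest(limit, flagged, key=lambda item: item[1])


# Hypothetical model output for five connections.
scored = [("C1", 0.02), ("C2", 0.97), ("C3", 0.40), ("C4", 0.81), ("C5", 0.55)]
review_queue = shortlist(scored, threshold=0.5, limit=2)
```

Here the review queue holds only C2 and C4. In practice, the threshold and limit would be tuned to how many alerts the team can realistically review each day.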