Tools & Dataset

UPDATE (18/11/2022): For the most recent version of CICIDS2017 (improved ground-truth labelling and additional features) as well as a fixed version of CSECICIDS2018, please check out our latest work here.

When using the fixed CICFlowMeter tool, the improved regenerated CICIDS2017 dataset and/or our labelling and benchmarking code, please cite our paper:

@inproceedings{engelen2021troubleshooting,
title={Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study},
author={Engelen, Gints and Rimmer, Vera and Joosen, Wouter},
booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
pages={7--12},
year={2021},
organization={IEEE}
}

CICFlowMeter tool

Our fixed version of the CICFlowMeter tool can be found at https://github.com/GintsEngelen/CICFlowMeter.

Important changes

A TCP flow is no longer terminated after a single FIN packet. It now terminates after mutual exchange of FIN packets, which is more in line with the TCP specification.
An RST packet is no longer ignored. Instead, the RST packet also terminates a TCP flow.

Note that a flow can still terminate by timing out (the total duration of the flow exceeds X seconds). This has been left unchanged.
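The termination rules above can be illustrated with a minimal Python sketch. This is not the CICFlowMeter implementation (which is written in Java and handles more detail, such as the packets that follow the FIN exchange); the names `FlowState` and `should_terminate` are hypothetical, and the timeout is left as a caller-supplied parameter because its value is not stated here.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    """Hypothetical per-flow bookkeeping, for illustration only."""
    start_time: float
    fwd_fin_seen: bool = False
    bwd_fin_seen: bool = False


def should_terminate(flow, timestamp, is_forward, fin, rst, timeout_seconds):
    """Record one packet and return True if the flow should end."""
    # Timeout rule: total flow duration exceeds the configured limit.
    if timestamp - flow.start_time > timeout_seconds:
        return True
    # An RST packet terminates the flow immediately.
    if rst:
        return True
    # A single FIN is only recorded; it does not end the flow by itself.
    if fin:
        if is_forward:
            flow.fwd_fin_seen = True
        else:
            flow.bwd_fin_seen = True
    # The flow ends once both directions have sent a FIN.
    return flow.fwd_fin_seen and flow.bwd_fin_seen
```

In this sketch a FIN from one side leaves the flow open until the other side also sends a FIN, an RST arrives, or the timeout is exceeded.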

Improved CICIDS2017 dataset for flow-based network intrusion detection

The latest version of our improved CICIDS2017 flow-based dataset can be downloaded here.

20/10/2021 UPDATE: Dataset files reuploaded (fixed error in Idle Time features). Note that the fixed version of the CICFlowMeter tool is not affected.
22/10/2021 UPDATE: CICFlowMeter fixed as per this GitHub pull request (Fwd and Bwd Bulk features affected). Dataset CSV files regenerated and reuploaded.
24/11/2021 UPDATE: CICFlowMeter fixed as per this and this GitHub pull requests (Down/Up ratio fixed and several features related to flow length affected). Dataset CSV files regenerated and reuploaded.

Important changes

X - Attempted label: Most attack classes present in CICIDS2017 require the transmission of a payload in order to be effective. Any flow that belongs to a payload-reliant attack class X but does not contain a payload is given the label X - Attempted, where X is the original attack class.
Flow construction: naturally, all modifications made to the original CICFlowMeter tool (described above) affect this regenerated dataset. The impact of these changes is thoroughly discussed in the paper.
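The "X - Attempted" rule can be sketched in Python as follows. This is a simplified illustration, not the actual labelling code: the function name, the `payload_bytes` argument and the set of classes exempt from the rule are assumptions. The exempt set is inferred from the classes listed as "N/A" in the dataset composition table; the authoritative per-class logic is described in the Extended Documentation.

```python
# Assumption: classes shown as "N/A" in the "Attempted" column of the
# dataset composition table are not subject to the Attempted rule.
NOT_PAYLOAD_RELIANT = {"Benign", "Portscan"}


def apply_attempted_label(label: str, payload_bytes: int) -> str:
    """Return 'X - Attempted' for a payload-reliant flow without payload."""
    if label in NOT_PAYLOAD_RELIANT:
        return label
    if payload_bytes == 0:
        # The attack could not have been effective without a payload.
        return f"{label} - Attempted"
    return label
```

For example, under these assumptions a "DoS Hulk" flow carrying no payload would become "DoS Hulk - Attempted", while a "Portscan" flow would keep its label regardless of payload.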

Dataset composition

The final regenerated dataset is composed of the following flows:

Label	Effective flow count	"Attempted" flow count
Benign	1657069	N/A
FTP-Patator	3973	11
SSH-Patator	2980	8
DoS GoldenEye	7567	80
DoS Hulk	158469	579
DoS SlowHttpTest	1742	3367
DoS Slowloris	4001	1706
Heartbleed	11	0
Web Attack - Brute Force	151	1214
Web Attack - XSS	27	652
Web Attack - SQL Injection	12	0
Infiltration	32	16
Bot	738	1470
Portscan	159023	N/A
DDoS	95123	0

These flow counts (as well as the numbers reported in the paper) were obtained after removing all corrupted entries as well as all entries whose numerical features contained NaN values.
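A comparable NaN filter could look like the sketch below, which uses only the Python standard library. It is an assumption about one way to reproduce the cleaning step, not the code used for the paper: the function name is hypothetical, and treating rows with unparsable numeric fields as corrupted is a guess at what "corrupted" means here.

```python
import math


def drop_nan_rows(rows, numeric_columns):
    """Keep only rows whose listed columns all parse to non-NaN floats."""
    kept = []
    for row in rows:
        try:
            values = [float(row[col]) for col in numeric_columns]
        except (KeyError, TypeError, ValueError):
            # Assumption: missing or unparsable fields mark a corrupted row.
            continue
        if not any(math.isnan(v) for v in values):
            kept.append(row)
    return kept
```

The `rows` argument is expected to be a list of dictionaries, such as those produced by `csv.DictReader` from one of the dataset CSV files.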

Note that the table in the paper contains an error: the correct total number of Attempted labels is 9103, and the correct number of Benign flows for both the Intermediate and Final dataset versions is 1657069.
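The Attempted total of 9103 can be cross-checked by summing the "Attempted" column of the table above. The short Python snippet below does this; the variable names are illustrative, and the classes listed as "N/A" (Benign and Portscan) are left out.

```python
# "Attempted" flow counts copied from the dataset composition table.
attempted_counts = {
    "FTP-Patator": 11,
    "SSH-Patator": 8,
    "DoS GoldenEye": 80,
    "DoS Hulk": 579,
    "DoS SlowHttpTest": 3367,
    "DoS Slowloris": 1706,
    "Heartbleed": 0,
    "Web Attack - Brute Force": 1214,
    "Web Attack - XSS": 652,
    "Web Attack - SQL Injection": 0,
    "Infiltration": 16,
    "Bot": 1470,
    "DDoS": 0,
}

total_attempted = sum(attempted_counts.values())
print(total_attempted)  # 9103
```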

Labelling and Benchmarking code

We describe the labelling logic in the Extended Documentation and the ML experiments in the paper. Our labelling and benchmarking code can be found in the GitHub repository.