Tools & Dataset
UPDATE (18/11/2022): For the most recent version of CICIDS2017 (with improved ground-truth labelling and additional features), as well as a fixed version of CSE-CIC-IDS2018, please see our latest work here.
When using the fixed CICFlowMeter tool, the improved regenerated CICIDS2017 dataset and/or our labelling and benchmarking code, please cite our paper:
title={Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study},
author={Engelen, Gints and Rimmer, Vera and Joosen, Wouter},
booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
pages={7--12},
year={2021},
organization={IEEE}
}
CICFlowMeter tool
Our fixed version of the CICFlowMeter tool can be found at https://github.com/GintsEngelen/CICFlowMeter.
Important changes
- A TCP flow is no longer terminated after a single FIN packet. It now terminates after a mutual exchange of FIN packets, which is more in line with the TCP specification.
- An RST packet is no longer ignored. Instead, an RST packet also terminates a TCP flow.
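The two termination rules above can be sketched as a small state machine. This is a minimal illustration and not the tool's actual Java implementation: the class name `TcpFlowState` and the method `observe` are hypothetical, and the sketch ignores flow timeouts and the final ACK of the closing handshake.

```python
from dataclasses import dataclass

# TCP flag bit masks as defined in the TCP header.
FIN = 0x01
RST = 0x04


@dataclass
class TcpFlowState:
    """Hypothetical helper: tracks which directions of a flow have sent a FIN."""
    fwd_fin_seen: bool = False
    bwd_fin_seen: bool = False
    terminated: bool = False

    def observe(self, tcp_flags: int, is_forward: bool) -> bool:
        """Update the state with one packet; return True once the flow has ended."""
        if self.terminated:
            return True
        if tcp_flags & RST:
            # An RST from either side ends the flow immediately.
            self.terminated = True
        elif tcp_flags & FIN:
            if is_forward:
                self.fwd_fin_seen = True
            else:
                self.bwd_fin_seen = True
            # A single FIN is not enough: both directions must have sent one.
            self.terminated = self.fwd_fin_seen and self.bwd_fin_seen
        return self.terminated
```

With this sketch, a forward FIN alone leaves the flow open, while a subsequent backward FIN (or an RST from either side) closes it.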
Improved CICIDS2017 dataset for flow-based network intrusion detection
The latest release of our improved CICIDS2017 flow-based dataset can be downloaded here.
20/10/2021 UPDATE: Dataset files reuploaded (fixed error in Idle Time features). Note that the fixed version of the CICFlowMeter tool is not affected.
22/10/2021 UPDATE: CICFlowMeter fixed as per this GitHub pull request (Fwd and Bwd Bulk features affected). Dataset CSV files regenerated and reuploaded.
24/11/2021 UPDATE: CICFlowMeter fixed as per this and this GitHub pull request (Down/Up ratio fixed and several features related to flow length affected). Dataset CSV files regenerated and reuploaded.
Important changes
- "X - Attempted" label: Most attack classes in CICIDS2017 require the transmission of a payload to be effective. Any flow that belongs to a payload-reliant attack class X but contains no payload is labelled "X - Attempted", where X is the original attack class.
- Flow construction: All modifications made to the original CICFlowMeter tool (described above) affect this regenerated dataset. The impact of these changes is discussed in detail in the paper.
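The "Attempted" rule can be illustrated with a short sketch. It rests on two assumptions that are not taken from the labelling code: "contains no payload" is approximated as a total payload size of zero bytes, and Benign and Portscan are treated as not payload-reliant because the dataset composition table lists their "Attempted" count as N/A. The function `relabel` and its parameters are hypothetical; the Extended Documentation describes the actual logic.

```python
# Assumption: classes whose "Attempted" count is N/A in the composition table
# are not payload-reliant and therefore never receive the suffix.
NON_PAYLOAD_CLASSES = {"Benign", "Portscan"}


def relabel(label: str, payload_bytes: int) -> str:
    """Return 'X - Attempted' for a payload-reliant class X with an empty payload."""
    if label in NON_PAYLOAD_CLASSES:
        return label
    if payload_bytes == 0:
        # Assumption: zero payload bytes stands in for "contains no payload".
        return f"{label} - Attempted"
    return label
```

For example, `relabel("DoS Hulk", 0)` yields `"DoS Hulk - Attempted"`, whereas `relabel("Portscan", 0)` leaves the label unchanged.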
Dataset composition
The final regenerated dataset is composed of the following flows:
Label | Effective flow count | "Attempted" flow count |
---|---|---|
Benign | 1657069 | N/A |
FTP-Patator | 3973 | 11 |
SSH-Patator | 2980 | 8 |
DoS GoldenEye | 7567 | 80 |
DoS Hulk | 158469 | 579 |
DoS SlowHttpTest | 1742 | 3367 |
DoS Slowloris | 4001 | 1706 |
Heartbleed | 11 | 0 |
Web Attack - Brute Force | 151 | 1214 |
Web Attack - XSS | 27 | 652 |
Web Attack - SQL Injection | 12 | 0 |
Infiltration | 32 | 16 |
Bot | 738 | 1470 |
Portscan | 159023 | N/A |
DDoS | 95123 | 0 |
These flow counts (as well as the numbers reported in the paper) were obtained after removing all corrupted entries as well as all entries whose numerical features contained NaN values.
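A filtering step of this kind might look like the sketch below. It is not the authors' preprocessing code: the function `clean_rows` is hypothetical, "corrupted" is assumed to mean a row with the wrong number of fields or a non-numeric value in a numeric column, and the caller supplies the list of numeric column names.

```python
import csv
import io
import math


def clean_rows(csv_text: str, numeric_columns: list[str]) -> list[dict]:
    """Keep only rows whose numeric columns all parse to non-NaN numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        # DictReader stores surplus fields under the key None and fills
        # missing fields with the value None; both indicate a malformed row.
        if None in row or any(value is None for value in row.values()):
            continue
        try:
            values = [float(row[column]) for column in numeric_columns]
        except ValueError:
            continue  # non-numeric text in a numeric column
        if any(math.isnan(value) for value in values):
            continue  # NaN in a numerical feature
        kept.append(row)
    return kept
```

Counting the surviving rows per label would then give per-class flow counts comparable in form to the table above.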
Note that the table in the paper contains an error: the correct total number of Attempted labels is 9103, and the correct number of Benign flows for both the Intermediate and Final dataset versions is 1657069.
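The corrected Attempted total can be checked directly against the table above. The snippet below simply sums the "Attempted" column, leaving out Benign and Portscan, whose entries are N/A.

```python
# "Attempted" flow counts copied from the dataset composition table.
attempted = {
    "FTP-Patator": 11,
    "SSH-Patator": 8,
    "DoS GoldenEye": 80,
    "DoS Hulk": 579,
    "DoS SlowHttpTest": 3367,
    "DoS Slowloris": 1706,
    "Heartbleed": 0,
    "Web Attack - Brute Force": 1214,
    "Web Attack - XSS": 652,
    "Web Attack - SQL Injection": 0,
    "Infiltration": 16,
    "Bot": 1470,
    "DDoS": 0,
}

total_attempted = sum(attempted.values())
print(total_attempted)  # 9103
```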
Labelling and Benchmarking code
We describe the labelling logic in the Extended Documentation and the ML experiments in the paper. Our labelling and benchmarking code can be found in the GitHub repository.