This repository is the official implementation of the research described in the chapter "An Empirical Analysis of Image-Based Learning Techniques for Malware Classification" of the book "Malware Analysis Using Artificial Intelligence and Deep Learning".
The book or chapters can be purchased from: https://link.springer.com/chapter/10.1007/978-3-030-62582-5_16
The arXiv eprint is at: https://arxiv.org/abs/2103.13827
In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role—specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.
- Classic ML-based approaches tried: K-NN, Random Forest, and XGBoost
- Deep learning-based approaches tried: ANN, CNN, LSTM, and GRU
- The implementation uses sklearn, numpy, pandas, and pytorch.
- MS Windows executable binary files are used as data.
- Features
  * Classic ML-based approaches: PE file features are extracted and used
  * Deep learning-based approaches: (1) opcodes, (2) executables converted into gray-scale images
- This project is an extension of https://github.com/pratikpv/malware_classification
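To illustrate the gray-scale image feature listed above, here is a minimal sketch of one common way to turn an executable's raw bytes into an image: each byte becomes one pixel (0 to 255), and the byte stream is wrapped into rows of a fixed width. The function names, the default width of 256, the zero-padding of the last row, and the PGM output format are all assumptions made for this example; the repository's `--bin_to_img` step may work differently.

```python
import math
from pathlib import Path
from typing import List


def bytes_to_gray_rows(data: bytes, width: int = 256) -> List[bytes]:
    """Split raw bytes into fixed-width rows; each byte is one 0-255 pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so that the last row is full width (an assumption).
    padded = data.ljust(height * width, b"\x00")
    return [padded[r * width:(r + 1) * width] for r in range(height)]


def write_pgm(rows: List[bytes], out_path: Path) -> None:
    """Write rows as a binary PGM (P5) file, a simple gray-scale format."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    out_path.write_bytes(header + b"".join(rows))


# Tiny in-memory example: 10 bytes at width 4 give 3 rows, the last one padded.
rows = bytes_to_gray_rows(bytes(range(10)), width=4)
print(len(rows), rows[-1])
```

For a real sample, something like `write_pgm(bytes_to_gray_rows(Path(sample).read_bytes()), Path("sample.pgm"))` would produce a viewable image, where `sample` is a hypothetical path to one executable.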
- Install the pefile Python package, e.g.
  conda install pefile
- Install PyTorch and other libraries, e.g.
  conda install -c pytorch torchtext
  All other common dependencies should be covered by the Anaconda distribution.
- Install objdump on Ubuntu. (This code is developed and tested in an Ubuntu-based development environment.)
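Before running anything, it can help to confirm that the dependencies named above are actually visible from the active environment. The following is a small sketch of such a check, not part of the repository; the helper name `missing_dependencies` is hypothetical, and the module list simply mirrors the packages mentioned in this README (`torch` and `sklearn` being the import names of PyTorch and scikit-learn).

```python
import importlib.util
import shutil


def missing_dependencies(modules, tools):
    """Return the names of Python modules and CLI tools that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [t for t in tools if shutil.which(t) is None]
    return missing


# Module and tool names taken from the install notes in this README.
missing = missing_dependencies(
    modules=["pefile", "torch", "torchtext", "sklearn", "numpy", "pandas"],
    tools=["objdump"],
)
print("missing:", ", ".join(missing) if missing else "none")
```

The check only looks for top-level modules and executables on `PATH`; it does not verify versions.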
* Copy the malware samples to <project_dir>/data/exec_files/exec_files. You can reach out to me for the samples used in this research. The overall directory structure should look like this:

    ├── config.py
    ├── data
    │   ├── exec_files
    │   │   └── exec_files
    │   │       ├── adload
    │   │       ├── agent
    │   │       ├── alureon
    │   │       ├── bho
    │   │       ├── ceeinject
    │   │       ├── cycbot
    │   │       ├── delfinject
    │   │       └── fakerean
    ├── data_preprocess.py
    ├── data_utils
    .
    .

Execute data_preprocess.py with the options below to preprocess the data.
python data_preprocess.py --extract_pe_features
python data_preprocess.py --bin_to_img
python data_preprocess.py --extract_opcodes
python data_preprocess.py --split_opcodes
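Since objdump is listed as a requirement, the opcode extraction step presumably works from disassembly text. As a rough sketch of that idea, the function below pulls instruction mnemonics out of text laid out like `objdump -d` output, where instruction lines carry three tab-separated fields (address, hex bytes, instruction). The function name and the hand-written `SAMPLE` text are assumptions for illustration; this is not the repository's actual `--extract_opcodes` code.

```python
from typing import List


def opcodes_from_objdump(disassembly: str) -> List[str]:
    """Extract instruction mnemonics from `objdump -d` style text."""
    opcodes = []
    for line in disassembly.splitlines():
        # Instruction lines have three tab-separated fields:
        # "  address:", hex bytes, and "mnemonic operands".
        fields = line.split("\t")
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue  # section headers, blank lines, byte-only continuations
        parts = fields[2].split()
        if parts:
            opcodes.append(parts[0])
    return opcodes


# Hand-written text in the style of objdump output (not real tool output).
SAMPLE = (
    "00401000 <.text>:\n"
    "  401000:\t55                   \tpush   %ebp\n"
    "  401001:\t89 e5                \tmov    %esp,%ebp\n"
    "  401003:\tc7 45 fc 00 00 00 00 \tmovl   $0x0,-0x4(%ebp)\n"
    "  40100a:\tc3                   \tret\n"
)
print(opcodes_from_objdump(SAMPLE))
```

In practice the input text could come from something like `subprocess.run(["objdump", "-d", path], capture_output=True, text=True).stdout`. Note that this simple parser takes only the first token of each instruction, so a prefixed instruction such as `rep stos` would be recorded as `rep`.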
Execute detect_malware.py with the appropriate command-line args to train/test models, e.g.
python detect_malware.py --deep_feedforward
python detect_malware.py --deep_rnn
python detect_malware.py --shallow_ml
python detect_malware.py --transfer_conv_ml
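The commands above select a model family with a boolean flag. As a hedged sketch of how such a command line could be parsed, the snippet below defines one `store_true` flag per family using `argparse`. Only the flag names come from the commands above; the helper names are hypothetical, and whether the real `detect_malware.py` accepts several flags at once or takes further options is not stated here.

```python
import argparse
from typing import List

# Flag names are taken from the commands above; everything else is hypothetical.
MODEL_FLAGS = ("deep_feedforward", "deep_rnn", "shallow_ml", "transfer_conv_ml")


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one boolean switch per model family."""
    parser = argparse.ArgumentParser(description="Select model families to train/test.")
    for flag in MODEL_FLAGS:
        parser.add_argument(f"--{flag}", action="store_true")
    return parser


def selected_models(argv: List[str]) -> List[str]:
    """Return the model families requested on the command line, in flag order."""
    args = build_parser().parse_args(argv)
    return [flag for flag in MODEL_FLAGS if getattr(args, flag)]


print(selected_models(["--deep_rnn", "--shallow_ml"]))
```

A dispatcher could then loop over the returned names and call the matching training routine for each.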
Apply for access here: https://forms.gle/65SNHJpQ7U4TYkCU7
Prajapati P., Stamp M. (2021) An Empirical Analysis of Image-Based Learning Techniques for Malware Classification. In: Stamp M., Alazab M., Shalaginov A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_16 or
@Inbook{Prajapati2021,
  author={Prajapati, Pratikkumar and Stamp, Mark},
  editor={Stamp, Mark and Alazab, Mamoun and Shalaginov, Andrii},
  title={An Empirical Analysis of Image-Based Learning Techniques for Malware Classification},
  bookTitle={Malware Analysis Using Artificial Intelligence and Deep Learning},
  year={2021},
  publisher={Springer International Publishing},
  address={Cham},
  pages={411-435},
  abstract={In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role---specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.},
  isbn={978-3-030-62582-5},
  doi={10.1007/978-3-030-62582-5_16},
  url={https://doi.org/10.1007/978-3-030-62582-5_16}
}