A MACHINE LEARNING FRAMEWORK FOR DOMAIN GENERATION ALGORITHM (DGA) BASED MALWARE DETECTION
ABSTRACT:-
Attackers usually use a Command and Control (C2) server to manipulate communication. To perform an attack, threat actors often employ a Domain Generation Algorithm (DGA), which allows malware to communicate with the C2 server by generating a variety of network locations. Traditional malware control methods, such as blacklisting, are insufficient to handle DGA threats. In this paper, we propose a machine learning framework for identifying and detecting DGA domains to alleviate the threat. We collect real-time threat data from real-life traffic over a one-year period. We also propose a deep learning model to classify a large number of DGA domains. The proposed machine learning framework consists of a two-level model and a prediction model. In the two-level model, we first classify the DGA domains apart from normal domains and then use a clustering method to identify the algorithms that generate those DGA domains. In the prediction model, a time-series model is constructed to predict incoming domain features based on the Hidden Markov Model (HMM). Furthermore, we build a Deep Neural Network (DNN) model to enhance the proposed machine learning framework by handling the large dataset that we gradually collected. Our extensive experimental results demonstrate the accuracy of the proposed framework and the DNN model. To be precise, we achieve an accuracy of 95.89% for the classification in the framework and 97.79% in the DNN model, 92.45% for the second-level clustering, and 95.21% for the HMM prediction in the framework.
Keywords: Domain Generation Algorithm (DGA), Command and Control (C2), Malware Detection, Machine Learning, DBSCAN Clustering, Hidden Markov Model (HMM), Deep Neural Network (DNN).
OBJECTIVES OF THE PROJECT:-
The objectives of the system development are:
- In the DBSCAN algorithm, we use the extracted domain features to calculate the distance between domains and to group together the domains generated by the same DGA according to their feature differences.
- Separate the model into training and prediction stages.
- The nodes in each layer of the DNN are fully connected to the nodes in the next layer. The training will not miss any local minima, but it will take a long time to converge.
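The DBSCAN objective above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the project's implementation: the two-dimensional feature vectors (name length, character entropy), the Euclidean distance, and the `eps` and `min_pts` values are all hypothetical choices made for the example.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point joins the cluster
            if labels[j] is not UNVISITED:
                continue               # already assigned
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical (length, entropy) features for seven domains.
features = [
    (12.0, 3.4), (12.5, 3.5), (11.8, 3.3),   # assumed to come from one DGA
    (24.0, 4.1), (24.3, 4.0), (23.9, 4.2),   # assumed to come from another DGA
    (6.0, 1.0),                              # isolated domain
]
labels = dbscan(features, eps=1.0, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Domains that share a label are treated as candidates for having been generated by the same DGA, and the label -1 marks domains that fit no dense group. In practice the features would need scaling so that no single feature dominates the distance.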
EXISTING SYSTEM:-
Threat models: A DGA must function under several conditions in a network environment: filtering by a firewall that protects the communication, and queries for an empty (unregistered) cell in the Internet domain space, which result in an NXDOMAIN error.
Each HMM data record represents a series of domain observations. First, a sequence of domain names is processed by a feature extractor, and each of the resulting feature vectors is used as a training record.
Then, similar sequences are clustered as a group of DGA domain names with certain outcomes. After the training process, if a sequence does not have an HMM sequence representation (that is, it appears in the test data but not in the training data), the HMM model generates the future predicted results. Otherwise, we use the existing HMM sequence representation.
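The prediction step described above can be illustrated with a small discrete HMM. This is a sketch under assumptions, not the model from the project: it assumes each domain's feature vector has already been discretized into a symbol (here 0 for a hypothetical "low entropy" bucket and 1 for "high entropy"), and the start, transition, and emission probabilities are made-up values rather than trained ones.

```python
def forward(obs, start, trans, emit):
    """Forward algorithm for a discrete HMM.

    Returns the normalized state distribution after the last observation
    and the likelihood of the whole observation sequence.
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    likelihood = sum(alpha)
    if likelihood == 0.0:
        raise ValueError("observation sequence is impossible under this model")
    return [a / likelihood for a in alpha], likelihood

def predict_next(obs, start, trans, emit):
    """Probability of each symbol being observed next, given obs."""
    state, _ = forward(obs, start, trans, emit)
    n = len(start)
    next_state = [sum(state[p] * trans[p][s] for p in range(n)) for s in range(n)]
    return [
        sum(next_state[s] * emit[s][o] for s in range(n))
        for o in range(len(emit[0]))
    ]

# Hypothetical two-state model; all numbers are illustrative only.
start = [0.5, 0.5]
trans = [[0.9, 0.1],
         [0.2, 0.8]]
emit = [[0.8, 0.2],    # state 0 mostly emits symbol 0
        [0.3, 0.7]]    # state 1 mostly emits symbol 1

observed = [1, 1, 1]   # three high-entropy domains in a row
next_dist = predict_next(observed, start, trans, emit)
print(next_dist)       # probabilities for symbol 0 and symbol 1
```

For long sequences the unscaled `alpha` values underflow, so a real implementation would rescale at every step or work in log space.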
DISADVANTAGES OF EXISTING SYSTEM:-
- A firewall protects the communication, and a query for an empty (unregistered) cell in the Internet domain space results in an NXDOMAIN error.
- Queries that do not match the existing knowledge are stored in a backlog of the software.
PROPOSED SYSTEM:-
In the proposed system, domains are extracted from DGAs. The machine learning framework encompasses multiple feature extraction techniques and models to classify DGA domains apart from normal domains, cluster the DGA domains, and predict a DGA domain.
A deep learning model handles large datasets. Multiple online sources, found through simple Google searches, provide example code for DGA construction.
Online threat intelligence feeds provide a way to examine current, live threats in a real-world environment.
We measure the accuracy of the proposed approach using real-time, active malicious domains derived from DGAs on the public Internet.
The data is structured in CSV format with three fields: domain name, originating malware, and DGA membership. The daily file size is approximately 110 MB.
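A feed in this CSV layout could be loaded as sketched below. The column order, the absence of a header row, the `dga`/`legit` membership values, and the sample rows are all assumptions made for illustration; the actual feed format may differ.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple

class Record(NamedTuple):
    domain: str
    malware: str
    is_dga: bool

def load_records(handle):
    """Parse rows of (domain, originating malware, DGA membership)."""
    records = []
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        domain, malware, membership = (field.strip() for field in row)
        records.append(Record(domain.lower(), malware, membership.lower() == "dga"))
    return records

# Made-up sample rows; real feeds are far larger (about 110 MB per day).
SAMPLE = """\
xkqjzvtrwplm.example,family_a,dga
qwmzoriplkxa.example,family_a,dga
example.org,none,legit
"""

records = load_records(io.StringIO(SAMPLE))
family_counts = Counter(r.malware for r in records if r.is_dga)
print(len(records), dict(family_counts))  # 3 {'family_a': 2}
```

For a daily file of this size, iterating over an open file handle row by row, as `csv.reader` does here, avoids reading the whole file into memory at once.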
We propose a machine learning framework that consists of three important steps, as shown in the figure below.
The input to the framework is the DNS queries together with their payload.
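Before classification, each queried domain name has to be turned into numeric features. The sketch below shows one possible way to do this. The specific features (length, Shannon entropy, digit ratio, vowel ratio) are assumptions commonly used for DGA detection, not a list confirmed by this document, and the function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )

def extract_features(domain):
    """Compute simple lexical features from the leftmost label of a domain.

    Using the leftmost label is a simplification; handling names such as
    'www.example.com' properly would require a public-suffix list.
    """
    label = domain.lower().rstrip(".").split(".")[0]
    n = len(label)
    return {
        "length": n,
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(ch.isdigit() for ch in label) / n if n else 0.0,
        "vowel_ratio": sum(ch in VOWELS for ch in label) / n if n else 0.0,
    }

features = extract_features("abc123.example")
print(features["length"], features["digit_ratio"])  # 6 0.5
```

Algorithmically generated names tend to be longer and closer to uniformly random than human-chosen names, which is why entropy and character ratios are plausible inputs for both the classifier and the clustering step.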
MODULES:-
- Data: The dataset used for training and evaluation.
- Feature Engineering: We collect data from the source and apply preprocessing and feature engineering to convert raw data into clean data.
- Algorithms: We use ML classification algorithms: Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), and Decision Tree (DT).
- Accuracy: We select the RF algorithm because it achieves the highest accuracy.
- UI: We develop the UI in Flask. The user provides a URL as input.
- Model Trained: The trained model returns the result, DGA or non-DGA.
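The model selection step in the modules above, choosing the classifier with the highest accuracy, can be sketched independently of any particular library. The labels and predictions below are dummy values for illustration only, not experimental results, and `model_a` and `model_b` are placeholder names standing in for candidates such as LR, NB, RF, and DT.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def best_model(y_true, predictions):
    """Return the name of the most accurate model and all accuracy scores."""
    scores = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
    return max(scores, key=scores.get), scores

# Dummy held-out labels: 1 = DGA, 0 = non-DGA.
y_true = [1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 0, 1, 1],   # 4 of 6 correct
    "model_b": [1, 1, 0, 0, 1, 1],   # 5 of 6 correct
}

winner, scores = best_model(y_true, predictions)
print(winner)  # model_b
```

Accuracy alone can be misleading when DGA and non-DGA samples are imbalanced, so precision and recall are worth checking alongside it before finalizing a model.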
ADVANTAGES OF PROPOSED SYSTEM:-
- The system targets the Domain Generation Algorithm (DGA), which allows malware to generate numerous domain names until it finds its corresponding C&C server.
- A DGA is highly resilient to detection systems and reverse engineering, while allowing the C&C server to have several redundant domain names.
HARDWARE AND SOFTWARE REQUIREMENTS :-
HARDWARE:-
- Processor: Intel Core i3 or higher.
- RAM: 4GB or more.
- Hard disk: 250 GB or more.
SOFTWARE:-
- Operating System: Windows 7, 8, or 10.
- Python
- Anaconda
- Spyder, Jupyter notebook, Flask.
- MySQL