Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection

Summary

The main objective of this work is to give an overview of machine learning (ML) and data mining (DM) methods for cyber analytics in support of intrusion detection. ML lets a computer learn without being explicitly programmed, while DM explores the data to discover previously unknown properties.

Cyber Security

Cyber security is designed to protect computers, networks, programs, and data from external and internal attacks and from unauthorized access. Its components include firewalls, antivirus software, and intrusion detection systems (IDS). An IDS helps recognize unauthorized access. Three types of cyber analytics support an IDS: misuse-based, anomaly-based, and hybrid.

Misuse-based systems are designed to identify known attacks. They generate the lowest false alarm rate, but they cannot recognize zero-day (new) attacks. Anomaly-based systems identify deviations from normal behavior. Because the profile of normal behavior is tailored to each system, they can also detect zero-day attacks. Hybrid systems combine misuse and anomaly detection and are used to increase the detection rate and decrease false positives (FP) for unknown attacks.

An IDS can also be network-based or host-based. A network-based IDS identifies intrusions by monitoring traffic passing through network devices, while a host-based IDS monitors process and file activity.

ML and DM methods follow one of three approaches: unsupervised, semi-supervised, and supervised. In the unsupervised approach, the fundamental task is to find patterns and structures in the data, while the semi-supervised approach applies when part of the data has been labeled, either during acquisition or by experts.
Finally, in the supervised approach the data is completely labeled, and the task is to find a model that explains the data.

ML involves three main phases: training, validation, and testing. The steps usually performed are:

1. Identify the attributes (features) from the training data.
2. Perform dimensionality reduction.
3. Learn the model using the training data.
4. Use the trained model to classify unknown data.

DM involves six main operations:

1. Defining the problem
2. Data preparation
3. Data exploration
4. Modeling
5. Evaluation of the model
6. Deployment and updating

The CRISP-DM model organizes the above operations to solve DM problems. Business understanding helps define the DM problem, while data understanding collects and examines the data. The next phase, data preparation, arrives at the final data set. In modeling, DM and ML methods are applied and tuned to fit the best model. The evaluation phase assesses the method with appropriate metrics, while deployment ranges from presenting a report to a complete implementation. Finally, the data analyst usually carries out the phases up to deployment, while the customer carries out the deployment phase.

Cyber Security Datasets for ML and DM

This part covers the types of data used by ML and DM approaches: packet-level data, NetFlow data, and public datasets.

Packet-level data: Nearly 144 Internet protocols are listed by the Internet Engineering Task Force (IETF), including the widely used ones. The purpose of these protocols is to transfer packets across the network. Network packets are transmitted and received at a physical interface of the computer, where they can be captured through an application programming interface (API) known as pcap.

NetFlow data: NetFlow was introduced as a router feature by Cisco. Version 5 of Cisco's NetFlow defines a flow as a sequence of packets in one direction.
The packet attributes that identify a flow are: ingress interface, source IP address, destination IP address, IP protocol, source port, destination port, and type of service.

Public datasets: Many experiments and publications use the data sets produced by the Defense Advanced Research Projects Agency (DARPA) in 1998 and 1999, which contain basic features captured by pcap. DARPA defined four types of attacks in 1998: remote-to-local (R2L) attacks, user-to-root (U2R) attacks, denial-of-service (DoS) attacks, and probes or scans.

ML and DM Procedures for Cyber Security

ML and DM for cyber security include the following procedures.

Artificial Neural Network: An artificial neural network (ANN) consists of a network of neurons in which the output of one node is the input of another. An ANN can serve as a multi-category classifier for intrusion detection, covering misuse, hybrid, and anomaly detection. The nine main features produced by the data preprocessing stage are: protocol ID, source address, destination address, source port, destination port, ICMP code, ICMP type, raw data, and data length.

Association rules and fuzzy association rules: The former indicate how frequently a given relationship appears in the data, while the latter can handle numerical as well as categorical variables.

Bayesian network: A Bayesian network is a probabilistic graphical model that represents variables and the relationships between them. The network is composed of nodes, which are discrete or continuous random variables, connected to form a directed acyclic graph.

Clustering: Clustering is a set of techniques for discovering patterns in high-dimensional unlabeled data. One of its main advantages for intrusion detection is that it can learn from audit data without explicit descriptions supplied by the system administrator.

Decision Trees: A decision tree is a tree-like structure whose leaves represent classes and whose branches represent the combinations of features that lead to those classes. An example is classified by testing its feature values against the nodes of the decision tree. The best-known algorithms for building decision trees automatically are ID3 and C4.5.
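As a rough illustration of how a tree-building algorithm picks the feature to test at a node, the sketch below computes information gain, the splitting criterion used by ID3. The connection records, the feature names `protocol` and `failed_logins`, and the helper names are hypothetical and chosen only for this example. A full ID3 implementation would apply the same choice recursively to each resulting subset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the examples on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Feature with the highest information gain (the ID3 choice at a node)."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy connection records, not taken from any real data set.
rows = [
    {"protocol": "tcp", "failed_logins": "many"},
    {"protocol": "tcp", "failed_logins": "none"},
    {"protocol": "udp", "failed_logins": "many"},
    {"protocol": "udp", "failed_logins": "none"},
]
labels = ["attack", "normal", "attack", "normal"]

chosen = best_feature(rows, labels, ["protocol", "failed_logins"])
print(chosen)
```

In this toy data the split on `failed_logins` separates the two classes completely, so it has the higher gain and is the feature selected.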
The major advantages of decision trees are intuitive knowledge expression, accurate classification, and simple implementation. Their main disadvantage arises when the data includes categorical variables with different numbers of levels, because information gain is then biased toward the features with more levels.

Ensemble learning: Ensemble methods combine several hypotheses and try to form a better hypothesis than the best individual one. Usually, ensemble methods use several weak learners to build one strong learner. Some of the most popular ensemble techniques are boosting, bagging, and random forests. Boosting is a family of ensemble algorithms that trains multiple learners. Bagging (bootstrap aggregating) is a technique for improving the generalization of a predictive model in order to reduce overfitting. It is based on model averaging and is known to improve the performance of nearest neighbor clustering.

The Random Forest classifier is an ML technique that combines ensemble learning and decision trees. The input attributes of each tree are picked at random, and the variance is controlled. Random forests have several advantages: few control parameters, resistance to overfitting, and no need for feature selection. Another advantage is that the variance of the model decreases as the number of trees in the forest increases. Random forests also have some disadvantages: low model interpretability, a loss of performance due to correlated variables, and dependence on the random generator.

Evolutionary Computing: Evolutionary computing covers six main algorithms: genetic programming (GP), genetic algorithms (GA), ant colony optimization, artificial immune systems, evolution strategies, and particle swarm optimization. This subsection highlights the two most commonly used: GA and GP. Both are based on the principle of survival of the fittest. They operate on a population of individuals that evolves through specific operators. The commonly used operators are selection, crossover, and mutation.
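The three operators can be sketched for a genetic algorithm over bit strings as follows. This is a minimal illustration under stated assumptions, not a method from any particular intrusion detection system: the fitness function (count of 1-bits), tournament selection, single-point crossover, elitism, and all parameter values are hypothetical choices made for the example.

```python
import random


def fitness(bits):
    # Hypothetical toy fitness: the number of 1-bits in the individual.
    return sum(bits)


def select(population, rng, k=3):
    # Tournament selection: the fittest of k randomly drawn individuals.
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    # Single-point crossover: head of one parent, tail of the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate=0.02):
    # Flip each bit independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(pop_size=30, length=20, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism (an assumption of this sketch): keep the best individual.
        elite = max(population, key=fitness)
        children = [
            mutate(crossover(select(population, rng),
                             select(population, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

In an intrusion detection setting the bit string would instead encode something like a candidate detection rule, and the fitness would measure how well that rule separates attacks from normal traffic on labeled data.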
Genetic algorithms and genetic programming differ in how individuals are represented. In GA, individuals are represented as bit strings, and the crossover and mutation operations are very simple. In GP, individuals represent programs and are expressed as trees with operators such as addition, subtraction, multiplication, division, not, and or. The crossover and mutation operators in GP are much more complicated than those used in GA.

Hidden Markov Models: A Markov chain is a set of states connected by transition probabilities, which determine the topology of the model. In a hidden Markov model (HMM), the system being modeled is assumed to be a Markov process with unknown parameters. In one illustration, each host is modeled by four states: Scanned, Good, Attacked, and Compromised. An edge running from one node to another indicates that a host in the state of the source node can move to the state of the destination node.

Inductive Learning: Two practices are used to infer information from data: deduction and induction. Deduction follows a logical sequence from the information present in the data and proceeds from the top down, while inductive reasoning works in the opposite direction, from the bottom up. In inductive learning, we begin with specific observations and measurements, begin to detect patterns and regularities, formulate tentative hypotheses to explore, and finally arrive at general conclusions or theories. Several ML algorithms are inductive, but when researchers refer to inductive learning they mainly mean Repeated Incremental Pruning to Produce Error Reduction (RIPPER) and the Algorithm Quasi-optimal (AQ). RIPPER induces rules using a separate-and-conquer approach. It learns one rule at a time, so that each rule covers a maximal set of examples in the current training set.

Naive Bayes: The Naive Bayes classifier is based on Bayes' theorem.
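For reference, Bayes' theorem can be written as below. The symbols are assumptions introduced here for illustration: C stands for a class (for example, attack or normal) and x for the observed feature vector.

```latex
\[
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
\]
```

The classifier assigns x to the class C with the largest posterior probability P(C | x).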
The name comes from the assumption that the input features are independent, which reduces the task of high-dimensional density estimation to one-dimensional kernel density estimation. The Naive Bayes classifier has several limitations, yet it is an optimal classifier when the features are in fact conditionally independent. One of its major advantages is that it is an online algorithm whose training completes in linear time.

Sequential Pattern Mining: Sequential pattern mining is an important DM method for transactional databases in which each transaction has a temporal ID, a user ID, and an item set. An item set is a binary representation in which an item either is or is not present. A sequence is an ordered list of item sets. The number of item sets in a sequence defines its length, while their order is obtained from the temporal ID. A sequence A of length n is contained in another sequence B of length m if all elements of A are subsets of elements of B. While the elements in sequence B that are not a subset of.
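The containment relation between two sequences can be sketched as follows. This is a minimal illustration that assumes the usual order-preserving definition, in which the item sets of A must be matched by item sets of B in the same order. The function name and the alert names in the example are hypothetical.

```python
def is_subsequence(a, b):
    """Return True if every item set of `a` is a subset of a distinct
    item set of `b`, with the matches occurring in the same order."""
    position = 0
    for itemset in a:
        # Skip item sets of b that do not contain the current item set of a.
        while position < len(b) and not itemset <= b[position]:
            position += 1
        if position == len(b):
            return False
        position += 1  # each item set of b may be matched at most once
    return True


# Hypothetical sequence of alert item sets observed for one host, in time order.
observed = [{"scan"}, {"login_fail", "scan"}, {"root_shell"}]

pattern = [{"scan"}, {"root_shell"}]
reversed_pattern = [{"root_shell"}, {"scan"}]

print(is_subsequence(pattern, observed))
print(is_subsequence(reversed_pattern, observed))
```

The first pattern is contained in the observed sequence, while the second is not, because its item sets appear in the opposite temporal order.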