A SIGNATURE-BASED IDENTIFICATION PROCESS FOR AUTOMATED NETWORK TRAFFIC CLASSIFICATION

A Thesis by JUSTIN THARP

Submitted to the Office of Graduate Studies of Texas A&M University-Commerce in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE, May 2014

Approved by:
Advisor: Jinoh Kim
Committee: Sang C. Suh, R. Daniel Creider
Head of Department: Sang C. Suh
Dean of the College: Grady Blount
Dean of Graduate Studies: Arlene Horne

ABSTRACT

A SIGNATURE-BASED IDENTIFICATION PROCESS FOR AUTOMATED NETWORK TRAFFIC CLASSIFICATION
JUSTIN THARP, MS
Texas A&M University-Commerce, 2014
Advisor: Jinoh Kim, PhD

Accurate application identification is one of the core elements of network operations and management, providing enhanced network services and security. While the signature-based approach, which examines packet content for identification, is attractive for its greater accuracy than the traditional technique relying on TCP port numbers, one challenge is that applications can generate hundreds of signatures, making it impractical to classify network traffic with such a large signature set due to the high computational complexity of signature matching. In this thesis, I explore a set of signature refinement techniques that improve the quality of the signatures used to identify unknown flows and that decrease the number of signatures, saving memory and putting less strain on the processor. Another potential challenge is multiple matches, arising when more than a single application identifies the data stream in question. In that case, the input stream cannot be adequately classified solely with the application signatures, and an additional process is necessary to reconcile such multiple matches and make the final identification decision.
In this thesis, I also present selection methods that efficiently address the problem of multiple matches. As a result, this thesis provides an effective process for signature-based network application identification.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
1. TABLE OF CONTENTS
2. INTRODUCTION
   Statement of the Problem
   Purpose of the Study
   Significance of the Study
   Method of Procedure
3. RELATED WORK
4. DATASET DESCRIPTION
5. RECONCILING MULTIPLE MATCHES
   Problem Description
      Problem Statement
      Motivation
   Selection Heuristics
      Greatest Number of Matches (GREATEST)
      Fraction-Based Selection (FRAC)
      Probabilistic Selection (PROB)
   Experimental Methodologies
      The Experimental Procedure
   Performance Metrics
   Experimental Results
      Results with the Default Parameters
      Impact of the Number of Bytes Examined (B)
      Impact of the Minimal Signature Length Threshold (T)
      Impact of the Coverage Factor (F)
      Discussion
6. SIGNATURE REFINEMENT
   Problem Statement
   Refining Process
      Expanding Signature Set by Merging
      Identification of Core Signatures
         A. Random-N
         B. Top-N
         C. Covering Subset
         D. Google Refine
   Closing Remarks
7. CONCLUSION & FUTURE STEPS
REFERENCES
VITA

LIST OF TABLES

TABLE
1. Applications Used in the Experiments
2. Fraction of testing flows with at least a single signature for the application
3. Number of signatures before merging
4. Number of signatures after merging

LIST OF FIGURES

FIGURE
1. The Experimental Procedure
2. Data collection network configuration: The data collector gathers captured traffic via (1) the wired Ethernet, (2) the WiFi network, and (3) 3G and LTE
3. Fraction of single vs. multiple matches (%)
4. % that the candidate set C includes the "real" application
5. The pseudo code for signature table construction
6. Precision with default parameters
7. Recall with default parameters
8. Precision over the number of bytes inspected (B)
9. Recall over the number of bytes inspected (B)
10. The number of signatures with respect to the minimal signature length threshold factor (T)
11. Precision over the minimal signature length threshold factor (T)
12. Recall over the minimal signature length threshold factor (T)
13. The number of signatures with respect to the coverage factor (F)
14. Precision with respect to the coverage factor (F)
15. Recall with respect to the coverage factor (F)
16. Pseudo code for the first step of signature refinement through merging
17. Upper TP and FP after the first step of signature refinement (merging)
18. Upper TP for Random-N
19. Upper FP for Random-N
20. Upper TP for Top-N
21. Upper FP for Top-N
22. Upper TP and FP for covering subset
23. Number of signatures selected by covering subset
24. Upper TP and FP after applying Google Refine

Chapter 1
INTRODUCTION

Accurate application identification is one of the core elements of network operations and management for various purposes, including network planning and capacity provisioning, QoS assurance, security policy management and enforcement, network usage and performance monitoring, traffic engineering, and so forth (Bernaille, Teixeira & Salamatian, 2006; Grimaudo, Mellia & Baralis, 2012; En-Najjary, Urvoy-Keller, Pietrzyk & Costeux, 2010; Xie, Iliofotou, Keralapura, Faloutsos & Nucci, 2012). A classical technique for classifying network traffic is to use TCP/UDP port numbers, but it has become inaccurate and unreliable because ports can be allocated to applications dynamically and some applications run on port numbers outside their registered range. In addition to the use of non-standard port numbers, tunneling such as HTTP tunneling makes it complicated to identify the corresponding applications from traffic based on port numbers alone (Pietrzyk, En-Najjary & Urvoy-Keller, 2010; Crotti, Dusi, Gringoli & Salgarelli, 2007). For these reasons, classifying traffic has become more difficult, and new methods have been in development. An alternative method widely considered is to examine packet content and search for common patterns, also known as "signatures," that repeatedly appear in packets belonging to a specific application. A set of collected signatures can then be used for identifying the applications associated with input traffic in the future.
Creating signatures is often performed with a training data set ("training"), and classifying input traffic then takes place with the signature set ("identification"). This signature-based approach is known to be highly accurate compared to techniques relying on packet header information and/or statistical information of input streams, such as the number of bytes and time interval for transport layer sessions (e.g., TCP connections), which makes it attractive despite the high computational complexity of deep packet inspection. However, existing studies are often limited to signature generation or to a part of the identification process. In this research, I work comprehensively on the whole identification process, including signature generation and network traffic classification using the signatures.

Statement of the Problem

The first potential challenge is refining the signatures generated for the identification process. When generating signatures for 10 applications, it was observed that each application produced over 50 signatures. This means over 50 iterations were performed per application while the signatures were being generated. In real-world use this is impractical, because computing and creating all of these signatures puts a large amount of strain on the processor, making real-time network classification improbable. Because traditional network classification is largely done manually, automating it will make classification faster.

The next potential challenge when employing the signature-based technique for application identification is multiple matches, arising when more than a single application identifies the data stream in question. For instance, an input stream may have a string that matches a signature for application X and, at the same time, include a signature for application Y elsewhere in that data stream.
When this happens, it is hard to tell which application the input stream belongs to, since it contains signatures for both X and Y.

Purpose of the Study

The purpose of this study is to investigate and compare a variety of identification methods and signature generation algorithms in order to develop new algorithms that improve the overall performance of the identification phase. Thus, the goals are to develop a signature refinement algorithm that generates fewer, higher-quality signatures, and to develop selection heuristics that give highly accurate results during the identification phase when multiple matches appear.

Significance of the Study

Studying both the refinement method and the identification process will lead to higher-quality signatures being generated and higher accuracy when identifying unknown flows. With this, memory will be saved and there will be less strain on the processor. With higher-quality signatures, new patterns can be discovered in various applications.

Method of Procedure

Figure 1. The Experimental Procedure

Figure 1 shows the experimental procedure that will be used in this study. First, the training flows enter the signature creation phase. The parameters B (number of bytes inspected), T (minimal length threshold), and F (coverage factor threshold) are used to aid the signature creation phase. Once the signatures are created, they enter the identification phase along with the input flow. The identification first checks whether the input produced multiple matches. If it did, the flow enters the selection heuristics for identification; if not, the flow is checked to see whether the correct application signature identified it. The refinement techniques mainly focus on the signature creation phase. Generating refined signatures is very important for the identification phase.
With refined signatures, the identification phase will be more accurate and will take less processing power. In addition, with the number of signatures reduced, memory is saved in the system. The multiple matches problem focuses on the identification phase of the procedure, since other papers have not focused on identifying unknown flows when multiple matches occur. If the signatures are not able to correctly identify the unknown flow, then this method of traffic classification will be unreliable as well.

Chapter 2
RELATED WORK

This section lists and describes known algorithms that are currently used to generate signatures for research purposes; signature-based identification is not yet used commercially.

Longest Common Subsequence (LCS) (Bergroth, Hakonen & Raita, 2000) is an algorithm that compares two strings and finds the longest common subsequence of the two. For example, for the strings "AABBCCDD" and "ABCDFFGG", LCS returns "ABCD" as the longest common subsequence, because those characters appear in the same order in both strings. The resulting subsequence characters do not have to occupy the same positions in both strings or be contiguous; they only have to appear in the same relative order in both. LCS is used in a variety of domains such as gene sequence alignment (Yang, 2010), natural language processing (Duan, 2009), and network security (Coull, 2003). LCS is considered for network traffic classification in this thesis.

LASER (Park, Won, Kim & Hong, 2008) is a modified LCS algorithm that searches for the longest common substring as a signature for traffic classification. Instead of finding the longest common subsequence, LASER searches for the longest contiguous chain of characters that forms a substring found in two network streams, producing an "exact" string in which the characters appear identically in both strings, whereas an LCS signature is not necessarily exact and is more like a pattern.
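To make the subsequence-versus-substring distinction concrete, here is a minimal sketch (my own illustration, not code from the thesis) of the two notions of a common pattern: the LCS-style pattern and the LASER-style contiguous "exact" string.

```python
def lcs(a, b):
    """Longest common subsequence: characters in the same relative order,
    not necessarily contiguous (the LCS-style 'pattern' signature)."""
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def longest_common_substring(a, b):
    """Longest contiguous run shared by both strings
    (the LASER-style 'exact' signature)."""
    best = ""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
                if dp[i + 1][j + 1] > len(best):
                    best = a[i - dp[i + 1][j + 1] + 1 : i + 1]
    return best
```

For the example strings from the text, `lcs("AABBCCDD", "ABCDFFGG")` yields "ABCD", while the longest common substring is only two characters long (e.g., "AB"), illustrating why exact substrings and LCS patterns are matched differently during identification.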
AutoSig (Ye, Xu, Wu & Po, 2009) also searches for common substrings in flows but does not rely on the LCS algorithm. It instead breaks a flow into multiple chunks (called "shingles") that are compared and merged together to generate signatures. It also uses a tree to separate the substrings into further substrings so that all possible signature combinations can be recognized and used for flow identification. Like LASER, AutoSig produces a set of "exact" strings as application signatures.

The authors of (Wang, Xiang, Zhou & Yu, 2012) incorporated several techniques for signature creation for application identification. The input traffic data go through several processes: first the data are extracted and tokenized, then a sequence alignment technique is applied, and finally the signatures are constructed once all the data have been processed. As with LCS, it does not assume exact strings for application signatures but uses regular expressions to represent them.

Automated Construction of Application Signatures (ACAS) (Haffner, Sen, Spatscheck & Wang, 2005) also utilizes packet content (i.e., application information) for traffic classification, but it differs from the above techniques that search for common subsequences or common substrings in network traffic. ACAS instead relies on machine learning algorithms for classifying network traffic. To enable this, the first N bytes are encoded and used for training, based on which application identification takes place with machine learning techniques such as AdaBoost. The authors showed promising results with a high degree of accuracy, but my work focuses more on traditional string matching for network traffic classification.

Covering subset (Saha & Khuller, 2012) is an algorithm to find a collection of subsets that covers an entire set using the smallest number of subsets.
For example, suppose we have a set U = {1, 2, 3, 4, 5} and a collection of subsets S = {{1, 2, 3}, {1, 3, 5}, {3, 4}, {4, 5}}. The covering subset problem would choose the subsets {1, 2, 3} and {4, 5}, since these two alone cover every element of U. One of the methods we present for selecting a subset of signatures employs this algorithm, as described later.

Google Refine (Trippe, 2013) is free software that can be downloaded from the internet. Google Refine has many features that help with cleaning data, such as clustering and merging. There are several clustering algorithms that Google Refine uses, and we used these features for our signature refinement (i.e., merging). The first set of algorithms is called key collision. The first key collision algorithm is Fingerprinting, which tokenizes the signature at whitespace and edits it in a certain way to generate a key for the signature. Next is N-Gram Fingerprint, which does the same as Fingerprinting but tokenizes the signatures at length intervals based on a user-supplied N. The other class of clustering algorithms that Google Refine uses is Nearest Neighbors. The first Nearest Neighbor algorithm is Levenshtein Distance, which groups together signatures within a certain number of edits of the comparison signature. The other Nearest Neighbor algorithm is PPM, which uses a mathematical formula to group signatures based on string comparisons.

Chapter 3
DATASET DESCRIPTION

The traffic data set was collected in the period of March 2011 – December 2011 in a LAN located in Korea, exclusively configured for data collection. Figure 2 shows the configuration of the LAN.
As seen from the figure, the LAN is connected to the Internet via a dedicated router, and client machines including PCs, laptops, and smart phones are connected via wired or wireless Ethernet. The client machines generate service requests for the targeted applications, and the data collector captures user packets in promiscuous mode, capturing traffic in both directions. Figure 2 shows the logical illustration of capturing packets at the data collector from both the wired and wireless networks.

In addition to collecting data from the LAN, mobile devices were also used for data collection. The mobile devices launched apps that access the Internet via 3G or LTE, and traffic for those applications was captured with a packet sniffing app ('Shark for Root', n.d.). The data collected on the mobile devices were manually copied to the collector machine. The collected data files are in the pcap format (Wiki.wireshark.org, n.d.). The pcap files were inspected, and any noise (including non-application traffic such as background maintenance ICMP messages) was eliminated. The total data size is 12.5 GB, with 930 pcap files for 98 applications.

Figure 2. Data collection network configuration: The data collector gathers captured traffic via (1) the wired Ethernet, (2) the WiFi network, and (3) 3G and LTE.

TABLE 1. Applications Used in the Experiments

Application   Category           # Total Flows
BitTorrent    P2P                3,676
Cyworld       Social computing   628
Daum                             1,915
Naver                            1,381
Wemakeprice                      523
Afreeca       Streaming          1,029
GomTV                            940
PandoraTV                        731
Tving                            537
Melon                            685

From the collected data, we obtained flow information based on the five tuples (i.e., source and destination IP addresses, source and destination port numbers, and protocol type). We explicitly distinguish sessions from flows: a session is bidirectional, with a pair of unidirectional flows. For TCP, a session begins with the connection initiation (the SYN packet) and lasts until the corresponding FIN packets are seen.
Since UDP has no connection establishment and termination, we assume that a session begins when the first packet appears and terminates when no packet with the identical five tuples is seen for 60 seconds. Among the large set of applications, we selected ten applications for our experiments, as shown in Table 1. There was no particular preference in the selection; we simply chose the applications with the greatest numbers of flows.

Chapter 4
RECONCILING MULTIPLE MATCHES

This chapter introduces and describes the problem of multiple matches. A multiple match occurs when more than one application signature identifies an unknown flow. For example, an input stream may be identified by a signature from application X and, at the same time, by a signature from application Y. When this situation happens during the identification process, how is the unknown stream properly classified? The literature I have reviewed has not dealt with this problem, only with the signature generation portion. To attack this problem, we developed three different selection heuristics that determine the final classification decision based on different identifying parameters.

The rest of this chapter is organized as follows. The next section defines the problem further and gives the motivation for this research. We then give formal descriptions of the selection heuristics and how they work, followed by the experimental methodologies used for this research topic and the results of the experiments with a brief discussion. The chapter ends with closing remarks on the multiple matches research before moving on to the next chapter.

Problem Description

In this section, we formulate the problem of multiple matches that we focus on throughout this thesis. We also present our observation that a substantial number of identification attempts result in multiple matches, which motivated us to pursue this research.
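The flow and session bookkeeping described in the dataset chapter (five-tuple keying with a 60-second idle timeout for UDP) can be sketched as follows; the Packet record and its field names are hypothetical stand-ins for parsed pcap data, not structures from the thesis:

```python
from collections import namedtuple

# Hypothetical parsed-packet record; field names are illustrative only.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto payload")

UDP_IDLE_TIMEOUT = 60.0  # seconds, per the session rule in the dataset chapter

def group_into_flows(packets):
    """Group packets into unidirectional flows keyed by the five tuple.

    For UDP, a new flow is started when the gap since the previous packet
    with the same five tuple exceeds the idle timeout.
    """
    flows = {}      # five-tuple -> list of flows (each flow = list of packets)
    last_seen = {}  # five-tuple -> timestamp of the most recent packet
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)
        expired = (pkt.proto == "UDP"
                   and key in last_seen
                   and pkt.ts - last_seen[key] > UDP_IDLE_TIMEOUT)
        if key not in flows or expired:
            flows.setdefault(key, []).append([])  # open a new flow for this key
        flows[key][-1].append(pkt)
        last_seen[key] = pkt.ts
    return flows
```

A full implementation would also honor the TCP SYN/FIN session boundaries; this sketch only shows the five-tuple keying and the UDP timeout rule.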
Problem Statement

Suppose a set of applications, A = {Ai | i > 0}, to which we classify input data streams. In this work, we consider a unidirectional flow as the basic unit of classification. By convention, a flow consists of a set of packets that have identical five tuples (source and destination IP addresses, source and destination port numbers, and protocol type). From a given set of flows for an application, F = {fi | i > 0}, we collect a set of signatures, S = {si | i > 0}, for that application. Since we assume multiple applications for classification, we use FX and SX for the set of flows and signatures, respectively, for application X. Note that we use the cardinality notation to represent the number of elements in a set, e.g., |SX| for the size of the signature set for application X.

The identification problem is to map the input flow f in question to an application Ai, and the mapping function ψ is defined formally as follows:

ψ: f → Ai

The function ψ utilizes the known signature sets of all applications for mapping, that is, SALL = {SA, SB, …}. For a given input flow f, we denote by si ⊢ f that si is found in f (i.e., the signature matches the flow). Let M(f) be the set of signatures, each element mi of which holds mi ⊢ f. If the size of M(f) is one (i.e., |M(f)| = 1), we refer to it as a "single" match, while it is considered a "multiple" match if |M(f)| > 1 holds. The remaining case is a "non-match," and the input flow is classified as "unknown," since no signature is found in the input flow.

For a single match (i.e., |M(f)| = 1), the identification procedure is straightforward, as the mapping function ψ can simply map f to the application to which the signature in M(f) belongs. For a multiple match (i.e., |M(f)| > 1), however, we must consider N:1 mapping, since there could be more than a single application in question. We define a candidate set C as the set of applications to which any element of M(f) belongs.
As an example, suppose a case of five matches for f, i.e., M(f) = {m1, m2, m3, m4, m5}. If we suppose m1 is an element of SX, m2 is an element of SY, and the rest of the elements of M(f) belong to SZ, then the candidate application set should be C = {X, Y, Z}. This means that the mapping function ψ must be able to discriminate one candidate from the others in the event of multiple matches, and the task reduces to a selection problem. To deal with this, we establish a heuristic function H that picks one element from the given candidate set C, as follows:

H: C → c, where c ∈ C

Suppose the real application for f is X. If c = X, then we say that the identification is correct; otherwise, it is incorrect.

Motivation

We now present why we are interested in this problem, based on our observations. We conducted a set of experiments with ten applications from a recent traffic data set, described in detail in Chapter 3. We then randomly selected 100 flows, without any preference, for each application from the data pool. We considered three techniques, LCS, LASER, and AutoSig, which have often been referenced by past work (Wang, Xiang, Zhou & Yu, 2012; Szabó, Turányi, Toka, Molnár & Santos, 2011; Hjelmvik & John, 2009; Pietrzyk, En-Najjary & Urvoy-Keller, 2010). Note that our focus is not on comparing the quality of signatures across the existing techniques, but on exploring the problem of multiple matches. For this reason, we simply used the identical flow set for training and identification in our experiments.

To construct a signature set, we examined the first N bytes of each flow. We employed the default values defined by each technique.
For the LASER algorithm, the number-of-packets constraint was set to 100 and the minimal length constraint was 4, because the authors did not present optimal values for those parameters. For LCS, we used two additional thresholds to consider the minimal signature length and the coverage: only a candidate signature whose byte length is greater than or equal to the minimal signature length threshold is accepted, and the coverage threshold defines the minimal fraction of flows in which the candidate signature should appear for that application. If the pattern is observed more widely than the coverage threshold, the candidate signature is accepted. A candidate signature that meets both threshold constraints is finally selected as a signature for the application.

Figure 3. Fraction of single vs. multiple matches (%)

Figure 4. % that the candidate set C includes the "real" application

The constructed signature set is then used as an input to the identification process. For identification, we considered "exact matching" for LASER and AutoSig but not for LCS, since LCS is based more on pattern matching. Exact matching means that if the signature appears in the input flow in the exact byte order without skipping, we consider the signature to match the flow in question. In contrast, we employed the LCS algorithm again for identification when using LCS: if the result of the LCS algorithm applied to the signature and the input flow is exactly the same as the signature itself, we consider that the flow includes the signature (i.e., strcmp(signature, LCS(flow, signature)) == 0 ? "matched" : "not matched").

Figure 3 illustrates the fraction of single and multiple matches over the number of bytes examined for both signature construction and identification. When we inspect a relatively small number of bytes (i.e., N=16 and N=32), we can see a dominant number of multiple matches in the figure. As N increases, the number of single matches also increases.
For N=64 bytes, almost 50% of the total identification attempts incurred multiple matches. The fraction of single matches further increases with a greater N, as shown in the figure. At the same time, the number of non-matches also increases significantly as N goes up, which is critical to identification performance because we are unable to classify the traffic if no signature matches the input flow. Therefore, it is very important to manage N reasonably. Since it has become more important to identify applications as early as possible for real-time analysis and response (Bernaille, Teixeira & Salamatian, 2006; Haffner, Sen, Spatscheck & Wang, 2005; Xie, Iliofotou, Keralapura, Faloutsos & Nucci, 2012), and since it helps limit the number of non-matches, we consider a relatively small number of bytes for inspection in this work. Consequently, effective handling of multiple matches is essential in the identification process.

So the key question we now have is: how can we settle multiple matches to reach the final identification? The good news is that the quality of signatures is fairly acceptable when N=64. Figure 4 shows the fraction of cases in which the candidate application set C contains the "real" application to which the flow truly belongs. As can be seen from the figure, almost 100% of the candidate sets contained the real application when using either LCS or AutoSig. This means that if we can successfully address the problem of multiple matches, we can expect a high degree of identification accuracy even with the traditional signature-based identification techniques. These observations motivated us, and in this thesis we address the problem of multiple matches. In the next section, we present how we identify the application associated with a traffic flow even in the event of multiple matches.
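The matching step described above — exact substring matching over the first N bytes, yielding a single match, a multiple match, or a non-match — can be sketched as follows (my own illustration; the function name and dict-based signature store are assumptions, not the thesis's implementation):

```python
def classify(flow_bytes, signature_sets, n_bytes=64):
    """Match a flow against per-application signature sets using exact
    substring matching over the first n_bytes.

    signature_sets: dict mapping application name -> list of byte-string
    signatures. Returns ("single", app), ("multiple", candidates) where
    candidates maps each candidate application to its matched signatures
    (the candidate set C), or ("unknown", None).
    """
    prefix = flow_bytes[:n_bytes]
    matches = {app: [s for s in sigs if s in prefix]
               for app, sigs in signature_sets.items()}
    matches = {app: hit for app, hit in matches.items() if hit}
    total = sum(len(hit) for hit in matches.values())  # |M(f)|
    if total == 0:
        return ("unknown", None)              # non-match
    if total == 1:
        return ("single", next(iter(matches)))  # single match: direct mapping
    return ("multiple", matches)              # multiple match: needs a heuristic
```

The "multiple" branch is exactly where the selection heuristics of the next section take over.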
Selection Heuristics

To effectively deal with multiple matches, we designed the following three heuristics: greatest number of matches (GREATEST), fraction-based selection (FRAC), and probabilistic selection (PROB). In this section, we describe how each heuristic makes a selection from a given set of candidate applications. The presented heuristics are then evaluated in the following sections.

Greatest Number of Matches (GREATEST)

This technique chooses the candidate application that has the greatest number of application signatures identifying the flow. Formally, we assume a counter set associated with the candidate set. That is, for the candidate set C = {c1, c2, ..., ck}, there is an associated counter set N = {n1, n2, ..., nk}, where ni is the number of signatures for application ci that matched the input flow. Based on the counter set, the heuristic selects the application with the greatest number of matching signatures:

select(C) = argmax over ci ∈ C of ni

For example, say 5 signatures from application X identified the flow and 3 signatures from application Y identified the flow. GREATEST would identify the flow as application X, since more of its signatures identified the flow than Y's signatures did.

Fraction-Based Selection (FRAC)

The above technique (GREATEST) is fairly intuitive, since it discriminates based on the number of signatures that match the given flow. However, there can be discrepancies in the number of signatures each application has. In the above example, application X may have a greater number of signatures than application Y; then there might be a higher chance that X's signatures are found in the input flow than Y's.
To account for these discrepancies in the number of signatures per application, we designed another heuristic, fraction-based selection (FRAC), which takes a fraction into account to discriminate among the candidate applications. The fraction is the number of application signatures that identified the flow divided by the total number of signatures for that application. That is, we consider a set of signatures for each application in addition to the counter set N defined for the previous heuristic. As earlier, we use SX as the signature set for application X, whose cardinality |SX| is the number of elements in the set. The heuristic is then defined as:

select(C) = argmax over ci ∈ C of ni / |Sci|

For example, if 5 out of the 25 signatures from application X identified the flow, the fraction for X becomes 0.2. Similarly, if 3 out of 6 signatures from application Y identified the flow, the fraction for Y will be 0.5. In that case, FRAC will identify the flow as application Y, as Y has the greater fraction (i.e., 0.5 > 0.2), while GREATEST would select X over Y.

Probabilistic Selection (PROB)

The last heuristic we developed relies on a simple probability model to make the selection. Suppose a set of signatures MX = {m1, m2, ..., mk} that match the input flow for application X. We define pi as the probability that signature mi appears in flows of that application, computed over the training data set (i.e., the coverage fraction of mi among X's training flows). We then define the collective probability PX as the probability that the input flow belongs to application X:

PX = 1 − Π over i of (1 − pi)

Thus, PX stands for the probability that at least one of the matched signatures for application X appearing in the flow indicates that the flow belongs to X. We also refer to the collective probability P as the confidence.
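As a concrete sketch (the function names are mine, not the thesis's), the FRAC fraction and the collective probability can be computed as:

```python
def frac_score(n_matched: int, sig_set_size: int) -> float:
    """FRAC's discriminator: matched signatures over the application's
    total signature count, i.e., n_i / |S_X|."""
    return n_matched / sig_set_size


def confidence(match_probs) -> float:
    """Collective probability P = 1 - prod(1 - p_i) over the per-signature
    probabilities p_i of the signatures that matched the flow."""
    out = 1.0
    for p in match_probs:
        out *= 1.0 - p
    return 1.0 - out
```

With the example above, `frac_score(3, 6)` gives Y's fraction of 0.5. Note how `confidence` rewards both a single strong match and an accumulation of weaker ones: one matched signature with p = 0.9 yields P = 0.9, while two matches with p = 0.5 each yield P = 0.75.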
Then we choose the candidate with the maximum confidence (P); based on this intuition, we define the heuristic as:

select(C) = argmax over ci ∈ C of Pci

The heuristics developed in this section choose one application out of the given candidate application set, each with its own discrimination function, to reach the final identification. In the event of a tie, where the discrimination function yields more than a single application, the heuristic simply uses a random function to break the tie.

Experimental Methodologies

We have thus far discussed the problem of multiple matches and the selection heuristics that can settle them. We next evaluate the proposed heuristics. We introduce our experimental methodologies in this section, and then present experimental results in the next section.

The Experimental Procedure

Figure 1 from Chapter 1 illustrates the overall procedure. The first stage in the procedure is to construct the signature set from the training data for all applications. In our experiments, we employed LCS because of its rich set of signatures compared to LASER and AutoSig. In addition, contrary to our expectation, the quality of LCS signatures is not at all inferior to the other techniques, as also shown in Figure 1 of Chapter 1. In any case, the purpose of our experiments is not to discover the best way to search for adequate signatures but to help identify flows effectively. When constructing the signature set from the training flows, we employed three control parameters:

Number of bytes examined (B): B is a byte-size constraint that determines how many bytes of the flow data will be used in searching for signatures. Since protocol specifications are often placed at the first part of a flow, limiting the number of bytes for inspection is desirable for resource saving and early application identification.
Minimal length threshold factor (T): T defines a character-length constraint: only signatures that contain at least a certain number of bytes are accepted. The minimal length threshold is computed as ⌊B × T⌋. For example, if B=64 and T=0.1, the minimal length required for a signature is 6 bytes.

Coverage threshold factor (F): The coverage constraint F is used to track how frequently the signature is found for that application. Coverage is simply calculated by dividing the number of flows in which the signature is observed by the total number of flows for that application in the training data set. If the fraction is greater than or equal to F, the signature is considered a candidate signature.

These parameters dominate the signature construction. The default parameters we used are B = 64 (bytes), T = 0.1, and F = 10%. In other words, the first 64 bytes are examined in the signature construction and application identification stages, the minimal signature length is 6 bytes, and at least 10% of the total flows should contain the candidate sequence for it to be considered a signature. Figure 5 provides pseudocode for constructing a table that stores the signatures for an application. The function takes a set of training flows to search for signatures and produces a signature table (i.e., sigTable). In lines 5-17, a bitmap is created for every signature found, if the size of the signature meets the minimal length constraint. The bitmaps are then used for computing the "coverage" of signatures in the output table, as shown in lines 19-21. Afterwards, the computed coverage fractions are checked against the coverage constraint. Each table output by the function holds the signatures for one application.
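As a small illustrative sketch (the helper names are mine), the two threshold checks reduce to:

```python
import math


def min_sig_length(B: int, T: float) -> int:
    """Minimal accepted signature length in bytes: floor(B * T)."""
    return math.floor(B * T)


def accept_candidate(sig: bytes, flows_with_sig: int, total_flows: int,
                     B: int = 64, T: float = 0.1, F: float = 0.10) -> bool:
    """Keep a candidate only if it meets both the length constraint (T)
    and the coverage constraint (F); defaults are the thesis defaults."""
    coverage = flows_with_sig / total_flows
    return len(sig) >= min_sig_length(B, T) and coverage >= F
```

With the defaults, `min_sig_length(64, 0.1)` evaluates to 6, matching the example above.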
1  function construct_sig_table()
2    input: vector flows;
3    output: table sigTable;
4  begin
5    table sigBitmap;
6    for (i = 0; i < flows.size(); i++) {
7      for (j = i+1; j < flows.size(); j++) {
8        sigs = LCS(flows[i], flows[j]);
9        foreach sig (sigs) {
10         if (sig.length < threshold) continue;
11         if (!sigBitmap.exists(sig)) {
12           create a new entry for sig and add it to sigBitmap;
13         }
14         sigBitmap.get(sig).bitmap.set(i);
15         sigBitmap.get(sig).bitmap.set(j);
16       }
17     }
18   }
19   foreach bmap (sigBitmap) {
20     sigTable.insert(bmap.key, bmap.bitmap.count / flows.size());
21   }
22 end

Figure 5. The pseudo code for signature table construction

Once all the signatures are gathered, they are passed as a parameter for identification together with the input flow in question. In the identification stage, each signature in the signature table is run through the LCS function with the input flow, and if the LCS function returns the signature itself, the signature is marked as identifying the flow as its application. If only one signature identifies the flow, it is a single match; if more than a single signature identifies the flow, it is a multiple match. Referring back to Figure 4, if the flow has multiple matches, it is examined by a selection heuristic to make the final decision. For multiple matches, we set up the heuristics described in the previous section. In addition, we also used a simple baseline heuristic (RAND) that chooses one application from the candidate set uniformly at random. Each heuristic chooses one application from the given candidate set, and the selected result is compared with the "real" application to which the input flow actually belongs. We compared the selection heuristics in terms of identification performance. The next section describes the metrics employed for performance comparison.

Performance Metrics

To measure performance, we consider metrics widely adopted for application identification.
Basically, TP stands for true positives, the total number of flows correctly identified for the application; FP is false positives, the total number of flows misclassified into the application; and FN is false negatives, the total number of flows that are not classified into the application although they actually belong to it. In particular, we consider the following measures that combine the basic metrics:

Precision: the number of correctly classified flows out of the total number of flows classified into the application (i.e., Precision = TP/(TP+FP)).

Recall: the number of correctly classified flows out of the total number of flows in the set (i.e., Recall = TP/(TP+FN)).

Each of these measures ranges between 0 and 100%; the greater, the better with respect to identification performance.

Experimental Results

In this section, we present our experimental results and core observations. We first run experiments with the default parameters, and then examine sensitivity by exploring a diverse set of values for each parameter.

Results with the Default Parameters

The first run of the experiment uses the default parameters (i.e., B = 64, T = 0.1, and F = 10%). Our experiments in this section focus on the performance of the selection heuristics, particularly with respect to precision and recall. Figure 6 illustrates precision with the default parameters. While our heuristics show 81-94% precision on average, RAND works poorly at 74%. This strongly suggests that effective handling of multiple matches is essential to achieve high identification accuracy. PROB works fairly well, with an average precision over 90%. Interestingly, as can be seen from the figure, GREATEST works very well with 94% precision on average despite its simplicity. FRAC outperforms RAND, but it lags well behind GREATEST and PROB with an average precision of 81%.

Figure 6. Precision with default parameters

Figure 7.
Recall with default parameters

The precision metric accounts for false positives, while recall focuses on false negatives. Figure 7 compares recall for the selection heuristics across the applications. The overall result is similar to the precision results in Figure 6: GREATEST (90%) outperforms the other heuristics, followed by PROB (88%), and RAND is the worst. From the results shown in Figures 6 and 7, we see that when more signatures match the input stream, there is a higher probability that the flow belongs to the application. To see whether this holds in other settings, we continue to explore a diverse set of values for the three parameters (i.e., B, T, and F).

Impact of the Number of Bytes Examined (B)

We next investigate the impact of the number of bytes inspected on the selection performance. Recall that the parameter B represents the number of bytes extracted from the flow data for both training and identification. We fix T and F at their default values (i.e., T = 0.1 and F = 10%) and use B = {16, 32, 64, 128, 256}. Figures 8 and 9 show average precision and recall, respectively, over the number of bytes inspected (B). As can be seen from Figure 8, B has a significant impact on precision: as the number of bytes inspected increases, precision also increases, implying greater identification accuracy. As in the default setting discussed in the previous section, GREATEST consistently outperforms the other heuristics over a diverse set of B. With B=128 (bytes), GREATEST shows a precision over 98%. However, we observed no significant difference with B greater than 128 bytes. PROB follows GREATEST and shows over 90% precision with B≥64, and RAND is the worst. One interesting observation is that FRAC significantly improves precision as B increases. In particular, FRAC is comparable with GREATEST at B=128 and slightly better at B=256 (98.7% for GREATEST and 99.1% for FRAC).
In Figure 9, FRAC also yields comparable results for recall when B≥128 bytes. Figure 9 shows recall over a diverse set of B. Both precision and recall show a similar pattern. However, recall does not improve, and is even degraded, when B>128 for all heuristics. This suggests that using the first 128 bytes of a flow, or fewer, is recommended for signature-based identification.

Figure 8. Precision over the number of bytes inspected (B)

Figure 9. Recall over the number of bytes inspected (B)

The reason for the steady improvement with a greater B can be explained by the increase of the fraction of single matches as B increases, as discussed with Figure 3. We observed that the true positive rate for single matches is very promising, over 98%. This implies that with an adequate means to settle multiple matches, the signature-based identification can yield a high degree of classification accuracy.

Impact of the Minimal Signature Length Threshold (T)

We next investigate how the minimal signature length impacts the performance by varying T. Note that the minimal length to be accepted as a signature is computed as ⌊B × T⌋. We used T = {0.05, 0.1, 0.15, 0.2, 0.25}, B=64, and F=10% in this experiment.

Figure 10. The number of signatures with respect to the minimal signature length threshold factor (T)

Figure 11. Precision over the minimal signature length threshold factor (T)

Figure 12. Recall over the minimal signature length threshold factor (T)

Figure 10 shows the number of signatures generated with different T. We see fewer signatures with a greater T, as expected. The average number of signatures is 62 with T=0.05, 0.1, and 0.15. It is reduced to 60 with T=0.2, and 44 with T=0.25. As shown in Figures 11 and 12, a greater T slightly increases precision but decreases recall. This means that using a greater minimal length threshold can reduce false positives, but it increases false negatives at the same time.
In sum, there is a trade-off in choosing a value for the minimal signature length threshold with respect to the false positive and false negative rates.

Impact of the Coverage Factor (F)

The last experiment investigates the impact of the coverage factor by varying F. We used F = {5%, 10%, 15%, 20%, 25%} and the default values for B and T. As shown in Figure 13, a greater coverage factor leads to a smaller number of signatures, because a pattern must be more widely observed to be considered a signature. For example, with F=25%, a signature should appear in ¼ of the flows in the training data set, while F=5% requires a signature to appear in only 1/20 of the training flows. For F=5%, the average number of signatures per application is around 180, while it drops to 16 for F=20%.

Figure 13. The number of signatures with respect to the coverage factor (F)

Figure 14. Precision with respect to the coverage factor (F)

Figure 15. Recall with respect to the coverage factor (F)

Overall, identification performance worsens as F increases. As shown in Figures 14 and 15, both precision and recall decrease significantly for GREATEST and PROB as F grows. However, FRAC and RAND work better with F between 10% and 20%. GREATEST yields the best result with F=5%: 97% for both precision and recall. Therefore, it is beneficial to choose a relatively small value for the coverage factor, around 5% of the entire number of flows in the training data set, for better identification performance.

Discussion

We have seen that GREATEST outperforms the other selection techniques over diverse settings despite its simplicity. To see why, let us look at some examples. For one Afreeca flow, we observed that 7 Afreeca signatures identified it, 2 Pandora TV signatures did, and no signatures from the other applications matched. Thus, GREATEST correctly classified the flow as Afreeca.
Here is another example, for a Pandora TV flow: 2 Afreeca signatures matched, 2 for Daum, and 3 for Pandora TV; as a result, GREATEST also identifies its application accurately. The results for the FRAC method tell a different story. FRAC is heavily influenced by the number of signatures generated for each application. Let us consider the two flows above again. For the Pandora TV flow, the fraction was 0.018 for Afreeca signatures, 0.034 for Daum, and 0.097 for Pandora TV. Even though Afreeca and Daum were only down by 1 under GREATEST, they were at a disadvantage under FRAC, since Afreeca has 109 signatures, Daum has 59, and Pandora TV has 31. Therefore, FRAC was able to correctly identify the Pandora TV flow. However, for the Afreeca flow example, the results gave the Afreeca signatures a fraction of 0.064 and Pandora TV 0.065. Even though Afreeca had the greatest number of matches, it lost to Pandora TV, which produced a slightly larger fraction. This second example shows why FRAC did not do as well as GREATEST in terms of accuracy. Now we consider PROB with the above example flows. For the Pandora TV flow, PROB had confidence P=0.23 for Afreeca, P=0.28 for Daum, and P=0.31 for Pandora TV. Even though not all the signatures for Pandora TV identified the flow, this example shows that the signatures that identified it had relatively high confidence values. For the Afreeca flow, Afreeca had P=0.78 and Pandora TV had P=0.23, which enabled PROB to classify the input flow accurately. Although PROB is slightly behind GREATEST with respect to overall accuracy, it would be more reliable when there is a high degree of discrepancy in the number of available signatures across applications, since PROB yields stable accuracies fairly close to GREATEST.
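The GREATEST and FRAC decisions in these examples can be sketched and checked directly (illustrative Python; the names and structure are mine, with the flow statistics taken from the discussion above):

```python
def greatest(counts: dict) -> str:
    """GREATEST: the candidate with the most matching signatures wins
    (ties here fall to whichever candidate max() sees first)."""
    return max(counts, key=counts.get)


def frac(counts: dict, set_sizes: dict) -> str:
    """FRAC: the candidate with the largest matched fraction n_i / |S_i| wins."""
    return max(counts, key=lambda a: counts[a] / set_sizes[a])


# Signature-set sizes from the text.
sizes = {"Afreeca": 109, "Daum": 59, "Pandora TV": 31}

# Pandora TV flow: 2 Afreeca, 2 Daum, and 3 Pandora TV signatures matched.
pandora_counts = {"Afreeca": 2, "Daum": 2, "Pandora TV": 3}

# Afreeca flow: 7 Afreeca and 2 Pandora TV signatures matched.
afreeca_counts = {"Afreeca": 7, "Pandora TV": 2}
```

Both heuristics agree on the Pandora TV flow, but FRAC misclassifies the Afreeca flow because 2/31 slightly exceeds 7/109, matching the discussion above.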
Chapter Summary

While the signature-based approach is attractive for its greater accuracy and is thus widely adopted for network traffic classification, one potential challenge is the multiple matches that occur in the identification process. It is therefore necessary to handle such multiple matches effectively in order to reach an adequate identification decision. To this end, we developed a set of selection heuristics that settle multiple matches and help identify an input stream: GREATEST, which selects based on the number of matches for each application; FRAC, which considers the signature set size in addition to the number of matches; and PROB, which is based on the probability that any of the matched signatures for an application appears in a flow belonging to that application. To evaluate the effectiveness of these heuristics, we conducted an extensive set of experiments with a recently collected data set. Our experimental results show that GREATEST and PROB work fairly well, yielding up to 99% precision and recall. In particular, we observed that GREATEST consistently outperforms the other techniques despite its simplicity. Some directions remain to be explored. From our experiments, we observed that LCS produces a rich set of signatures; however, we also observed that a substantial number of signatures for an application look closely similar. Refining the signature set would be beneficial not only for reducing the number of signatures but also for enhancing their quality. We plan to incorporate such refinement with the reconciliation presented in this thesis, in order to provide a systematic methodology for signature-based application identification. We also plan to employ additional traffic data sets for further verification of our techniques.

Chapter 5

SIGNATURE REFINEMENT

Payload inspection is known to be highly accurate with respect to identification performance, although its complexity is also high.
However, our observations with existing signature generation algorithms indicate that simply relying on the collected signatures, without refining them, may lead to very poor accuracy: less than 60% true positives even in an optimal case, as we show in this chapter. The primary reason for this unacceptable identification accuracy is the poor quality of the signatures. In general, signature generation is an iterative process in which flows in the training data set are repeatedly compared to obtain common patterns, and the patterns that appear relatively frequently over the training data are chosen as signatures. Frequency alone, however, does not guarantee the quality of signatures. Suppose two signatures for application XYZ, "HTTP/1.1 200 OK XYZ" and "HTTP/1.1 302 Cache-Control XYZ", obtained using LCS. If an input stream actually belonging to that application has the pattern "HTTP/1.1 304 Not Modified XYZ", neither of the above signatures can identify this flow. However, if we had a more common pattern derived from the signatures (e.g., "HTTP/1.1 XYZ"), we would expect a better identification result. This implies that the raw signatures produced by the iterative signature creation process can still be insufficiently general and somewhat specific to only a subset of flows in that application. It also suggests that it is necessary to improve the quality of signatures. We refer to this post-processing as a signature refining process. In this work, we present a signature refining process that creates high-quality signatures from the crude signatures initially acquired from a signature generation method.
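To illustrate the idea on the XYZ example, one can run an LCS over the two raw signatures (a sketch, assuming LCS means the classic longest common subsequence). The merged pattern keeps the shared "HTTP/1.1 ... XYZ" skeleton, though some residual characters (a stray digit and spaces) survive, which is one reason the length and coverage thresholds are checked again after merging:

```python
def lcs(a: str, b: str) -> str:
    """One longest common subsequence of a and b (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    # Backtrack to rebuild one optimal subsequence.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


merged = lcs("HTTP/1.1 200 OK XYZ", "HTTP/1.1 302 Cache-Control XYZ")
```

The merged signature starts with "HTTP/1.1 " and ends with "XYZ", close to the desired common pattern, and it would match the "HTTP/1.1 304 Not Modified XYZ" flow that neither raw signature could.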
Our post-processing includes the following two steps: (i) signature merging, which considers every possible combination of the raw signatures to relax their specificity, and (ii) signature identification, which discovers a manageable set of signatures, particularly with respect to volume, since the cost of signature-based identification is strongly correlated with the number of signatures used for payload examination. Our experimental results show that using a new set of signatures after refining can dramatically improve the upper bound of true positives up to 99% (from less than 60% before refining), which is the first step toward an acceptable degree of identification accuracy.

Problem Statement

In this section, we demonstrate the problem of using the raw signatures obtained from existing signature generation algorithms, with our preliminary experimental results. We randomly selected 200 flows for each application from our data set to use as training flows, and a disjoint set of 100 randomly selected flows was used for measuring identification performance (i.e., the testing set). Therefore, the intersection of the training and testing sets is empty. Initially, we ran the three signature generation algorithms (LCS, LASER, and AutoSig) and observed that LCS is at least comparable to the others in identification accuracy. Hence, we chose LCS as the signature generation algorithm in this study. We employed a set of parameters for LCS whose default values were chosen based on the previous chapter.

TABLE 2. Fraction of testing flows with at least a single signature for the application

Table 2 shows our initial results for identifying the testing set with the signatures generated by LCS. Each cell in the table shows the fraction of testing flows for the application in that row that matched at least one signature of the application in that column.
For example, 45% of Afreeca testing flows had one or more signatures for Afreeca, 15% of Afreeca testing flows had one or more signatures for BitTorrent, 48% had one or more signatures for Cyworld, and so forth. Therefore, the highlighted diagonal cells in the table indicate the maximum bound of true positives, while the remaining cells show maximum false positives. In this chapter, we define uTP as the highest possible true positive rate that can be achieved when the signatures are used in the identification phase (i.e., the diagonal cells), while uFP is defined as the highest possible false positive rate (i.e., the non-diagonal cells). As can be seen from Table 2, the upper bound of true positives (uTP) is very low, ranging from 4% to 83% with an average of 59%. This means that identification accuracy cannot go beyond about 60%, which is unacceptable by and large. We also explored different B and T values, B = {64, 128} and T = {0.1, 0.125, 0.15}. No significant changes were observed with different T values, while B=128 yielded rather poor uTP, as shown in the table. This would be a critical problem if the signatures were used to identify live traffic, and it motivates our research. Our goal is to maximize uTP while keeping uFP low, so as to improve identification performance; in particular, we focus on maximizing uTP. In the next section, we show how we can increase uTP through our refining process.

Refining Process

As discussed in the previous section, low uTP is detrimental to overall identification performance. In this section, we introduce our refinement process for obtaining high-quality signatures that maximize uTP. We first describe the merging step, which merges raw signatures to expand the signature set. We then introduce a set of methods to identify an essential set of signatures out of the large expanded set produced by merging.
Expanding the Signature Set by Merging

There can be several reasons for the poor uTP obtained with the raw signature set. One reason we observed is that a large number of signatures in that set fail to contain truly common patterns; rather, they are specific to only a small set of flows. For example, some signatures held overly specific information such as dates (possible if a substantial number of training flows for that application were collected on the same day), and a signature containing a specific date would not be useful for future testing flows. To help combat this problem, we merge the raw signatures by running the signature generation algorithm (LCS) against the set of signatures itself, as illustrated in Figure 16. In the merge process, each signature is compared with every other signature to generate a new signature. If the new signature meets the length threshold, we keep it. By doing so, it is possible to collect more common patterns for each application. If there are n signatures before merging, there can be at most n(n-1)/2 new signatures after merging. After the new set of signatures is constructed, the frequency is recomputed for each signature against the training data set.

1  function merge_sigs()
2    input: vector sigs;
3    output: vector refSigs;
4  begin
5    for (i = 0; i < sigs.size(); i++) {
6      for (j = i+1; j < sigs.size(); j++) {
7        newSig = LCS(sigs[i], sigs[j]);
8        if (newSig.length() >= threshold) {
9          refSigs.push(newSig);
10       }
11     }
12   }
13 end

Figure 16. The pseudo code for merging signatures

REFERENCES

https://play.google.com/store/apps/details?id=lv.n3o.shark&hl=en

Szabó, G., Turányi, Z., Toka, L., Molnár, S., & Santos, A. (2011). Automatic Protocol Signature Generation Framework for Deep Packet Inspection. 5th International ICST Conference on Performance Evaluation Methodologies and Tools (VALUETOOLS '11).

Trippe, T. (2013).
Patent Assignee Cleanup Using Google Refine (Open Refine) - Text Facets and Clustering. http://www.patinformatics.com/blog/patent-assignee-cleanup-using-google-refine-open-refine-text-facets-and-clustering/

Wang, Y., Xiang, Y., Zhou, W., & Yu, S. (2012). Generating Regular Expression Signatures for Network Traffic Classification in Trusted Network Management. J. Netw. Comput. Appl., 992-1000.

Wiki.wireshark.org. (n.d.). Libpcap File Format. http://wiki.wireshark.org/Development/LibpcapFileFormat

Xie, G., Iliofotou, M., Keralapura, R., Faloutsos, M., & Nucci, A. (2012). Subflow: Towards Practical Flow-Level Traffic Classification. INFOCOM.

Yang, J., Xu, Y., & Shang, Y. (2010). An Efficient Parallel Algorithm for Longest. World Congress on Engineering.

Ye, M., Xu, K., Wu, J., & Po, H. (2009). AutoSig: Automatically Generating Signatures for Applications. CIT.

VITA

Justin Tharp was born and raised in Greenville, Texas. He attended Texas A&M University in College Station, Texas, earning his Bachelor of Science in Aerospace Engineering in May 2012. In the Fall of 2012 he began attending Texas A&M University-Commerce in Commerce, Texas, for his Master of Science in Computer Science. In January 2013 he became a Graduate Assistant in Research, studying the identification of network traffic using signatures. He will graduate from Texas A&M University-Commerce in May 2014. Email: js_tharp@yahoo.com