Summary:

The main idea of the paper is to apply traditional data mining techniques to the detection of malicious executables. The authors argue that although traditional signature-based methods and heuristics do a good job of recognizing malicious code they already have information about, they cannot detect new, previously unseen malicious executables.

They present two simple yet effective data mining techniques for detecting new malicious code, modified slightly so that they run over binary code instead of traditional data. The approach is basic classification, in which the algorithm is first trained on a set of data. The main methods presented are the Naïve Bayes and Multi-Naïve Bayes algorithms.
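To make the idea of running Naïve Bayes over binary code concrete, here is a minimal sketch. It is not the authors' implementation: the choice of byte 2-grams as features, the Laplace smoothing, the class name ByteNaiveBayes, and the toy byte strings are all assumptions made for illustration. For simplicity it scores only the n-grams that are present in a sample and ignores absent ones.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> set:
    """Return the set of distinct byte n-grams found in a binary blob."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class ByteNaiveBayes:
    """Hypothetical Naive Bayes classifier over byte n-gram presence."""

    def __init__(self, n: int = 2):
        self.n = n
        self.class_counts = Counter()   # label -> number of training samples
        self.feature_counts = {}        # label -> Counter(n-gram -> samples containing it)
        self.vocab = set()              # every n-gram seen during training

    def fit(self, samples, labels):
        for data, label in zip(samples, labels):
            grams = byte_ngrams(data, self.n)
            self.class_counts[label] += 1
            self.feature_counts.setdefault(label, Counter()).update(grams)
            self.vocab |= grams
        return self

    def log_scores(self, data: bytes) -> dict:
        # Only n-grams seen in training contribute to the score.
        grams = byte_ngrams(data, self.n) & self.vocab
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior P(label)
            for gram in grams:
                # Laplace-smoothed estimate of P(gram present | label)
                p = (self.feature_counts[label][gram] + 1) / (count + 2)
                score += math.log(p)
            scores[label] = score
        return scores

    def predict(self, data: bytes) -> str:
        scores = self.log_scores(data)
        return max(scores, key=scores.get)


# Toy training data (made-up byte strings, not real malware or real programs).
train_samples = [
    b"\x90\x90\xcc\xcc\xeb\xfe",
    b"\xcc\xcc\xeb\xfe\x90\x90",
    b"\x55\x8b\xec\x5d\xc3",
    b"\x55\x8b\xec\x33\xc0\x5d\xc3",
]
train_labels = ["malicious", "malicious", "benign", "benign"]

model = ByteNaiveBayes(n=2).fit(train_samples, train_labels)
print(model.predict(b"\x90\x90\xcc\xcc"))
print(model.predict(b"\x55\x8b\xec"))
```

A Multi-Naïve Bayes variant would, roughly, train several such classifiers on different subsets of the features and combine their outputs. That step is left out here because the summary does not describe how the combination is done.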

The main point that came out of the discussion was that although the paper doesn’t present a ground-breaking new scheme, it does a great job of applying existing techniques from one field of computer science to a problem where they haven’t been tried before. In hindsight, the solution presented is almost obvious.

Some of the pros and cons raised in the discussion were:

Pros:

The authors have done considerable work: they implemented all of their ideas and presented experimental results.
Even if the proposed solution is not directly applicable, it gives experts a tool for developing heuristics.
The methods suggested by the authors generate rules that experts can use effectively.

Cons:

The authors’ results show quite a few false positives. Although the number is small by data mining standards, it would still create problems for the detection of executables, because good and possibly important software might be classified as malicious.
Another problem is that the authors compare their method against a signature-based method that they developed themselves. They didn’t do a satisfactory job of comparing their results with commercial anti-virus software, which would have increased the credibility of their solution. The reason they give isn’t satisfactory either: they could have run their experimental data set against an older version of a commercial product that had no information about the latest malicious executables.

The voting for the paper was as follows:

Accept: 1 Weak Accept: 3

Reject: 0 Weak Reject: 4