In this blog post I will discuss the project I built during my final year of college. I am very interested in information security and cloud computing, but I had not yet developed any project related to those fields. So I decided to build one that follows my interests.

When we think of information security, we think about protecting sensitive or private information from intruders. The question is how an intruder gets access to a user’s information. Often, the answer is malware. So I decided to build a web-based application that helps the user check whether a file is malicious or legitimate.

Now I will discuss the implementation of the project. First, I found a data-set consisting of more than 100k file names, categorized into two categories: malicious and non-malicious. This categorization is based on 54 unique features. Size of Optional Header, Major Linker Version, Minor Linker Version and Size of Code are some of the features included in the data-set. Let me give you a sneak peek of the data-set.

Data-set image

Then I started working on the data-set. I extracted the important features that help determine whether a file is malicious or non-malicious, and found that around 10 of the 54 features were important. Here is a sneak peek of the code as well as the result. Then I tried various classification algorithms (Decision Tree, AdaBoost, Random Forest and Gradient Boost) to compare their accuracy. Random Forest had the highest success rate of all, with around 99.3% accuracy. I then saved the trained classifier so that the back-end of this application could use it to test real files. Here is a sneak peek of the result.

Results from the data-set
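
For readers who want something runnable, the feature-selection and model-comparison steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes scikit-learn and NumPy are installed, uses a synthetic data-set from make_classification as a stand-in for the real 54-feature data, ranks features by Random Forest importance (one common choice, assumed here), and uses pickle to save the winner. All variable names are hypothetical, and the scores it produces say nothing about the 99.3% figure above.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data-set: 54 features, binary label.
X, y = make_classification(n_samples=1000, n_features=54, n_informative=10,
                           random_state=0)

# Hold out a test set first so feature ranking never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Rank features by tree-based importance and keep the 10 strongest.
ranker = RandomForestClassifier(n_estimators=50, random_state=0)
ranker.fit(X_train, y_train)
top_features = np.argsort(ranker.feature_importances_)[::-1][:10]
X_train_sel = X_train[:, top_features]
X_test_sel = X_test[:, top_features]

# The four algorithm families mentioned in the post.
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoost": GradientBoostingClassifier(n_estimators=50,
                                                random_state=0),
}

# Train each candidate and record its accuracy on the held-out rows.
scores = {}
for name, model in candidates.items():
    model.fit(X_train_sel, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_sel))

best_name = max(scores, key=scores.get)
best_model = candidates[best_name]

# Serialize the winning classifier so a back-end can load it later.
saved_classifier = pickle.dumps(best_model)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Ranking the features on the training split only keeps the held-out accuracy honest. The pickled bytes could just as well be written to a file and loaded again with pickle.loads when the web application starts.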

Next I started working on the back-end of this application, which works on actual files. First, I loaded the classifier that I built earlier. Then I extracted information from the uploaded file, keeping the important features selected earlier, and passed it to the Random Forest classifier to get the final result: whether the file is malicious or not. This part was very complex to build, as it was my first time working with real files. Thanks to the internet, especially Stack Overflow, I was able to resolve my issues, but it took a lot of time to build. Here is a sneak peek of some of the code.
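
To give a rough idea of what extracting header features from a real file can look like, here is a minimal sketch using only Python's standard library. It is an illustration under assumptions, not the project's actual code: it assumes the uploaded file is a Windows PE executable and reads only the four header fields named earlier. The function names extract_pe_header_features and build_feature_vector are hypothetical, and a real back-end would likely need many more fields.

```python
import struct


def extract_pe_header_features(data: bytes) -> dict:
    """Read a few PE header fields from the raw bytes of a Windows executable."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")

    # The 4-byte field at offset 0x3C (e_lfanew) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")

    # The 20-byte COFF file header follows the 4-byte signature.
    (_machine, _num_sections, _timestamp, _symbol_table, _num_symbols,
     size_of_optional_header, _characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)

    # The optional header starts right after it: magic, linker versions,
    # then the size of the code section(s).
    _magic, major_linker, minor_linker, size_of_code = struct.unpack_from(
        "<HBBI", data, pe_offset + 24)

    return {
        "SizeOfOptionalHeader": size_of_optional_header,
        "MajorLinkerVersion": major_linker,
        "MinorLinkerVersion": minor_linker,
        "SizeOfCode": size_of_code,
    }


def build_feature_vector(features: dict, feature_order: list) -> list:
    """Arrange features in the column order the classifier was trained on."""
    return [features[name] for name in feature_order]
```

With a classifier loaded from disk, the vector returned by build_feature_vector could then be passed to its predict method, for example classifier.predict([vector]), provided feature_order matches the columns used during training.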

Then I started working on the UI of this web application. The UI is built using HTML and CSS. It consists of two pages: on the first page the user uploads the file they think is suspicious, and the second page shows the result, that is, whether the file is safe to use or malicious. Here is a sneak peek of the Home Page of the application.

If the back-end of this application predicts that the file is malicious, the following page is displayed.

And if the back-end of this application predicts that the file is legitimate, the following page is displayed.

That is all about my project. It is still in beta, and more features will be added in the future. One planned feature is a full report on the file, shared with the user, that explains why the file was judged malicious or legitimate. I hope you like my final year project. Feel free to share your thoughts in the comment section, and feel free to give me some claps if you love it.
