How does my spam filter program work?
By Asim Krishna Prasad
Posted on 21/01/15
Tag: Project
Someone once told me that if I can't explain my code to someone else, then my code is a failure. Since I don't wanna be a loser :P, in this post I am going to explain how my Spam Filter works, what happens inside it, and on what basis the mails are filtered. Jumping straight to the application (V-2.0.0), the following are the files present in it and what they do.
The whole application is driven by the init.sh shell script. This is how it works:
- First it executes the download_mail file, which is the executable version of download_mail.py
- download_mail lets the user log in to their Gmail account and download the mails of a specific folder to their system.
- If the download is successful, it creates the required directories and saves the name of the folder in a text file, temp.txt
- Then it exits and sends the control back to init.sh
- In case there is no Internet or the login fails, it exits the whole application
- init.sh now reads the temp.txt file and transfers all the required files to the target folder.
- The files transferred are :
- sc.sh, a driver shell script
- classify, an executable file of classify.cpp
- spam_dict.txt, a text file which contains spam-words
- nonspam_dict.txt, a text file which contains non-spam-words
- extract_words, an executable file of extract_words.cpp
- Now the control is transferred to sc.sh
- It first takes all the mails and saves their names in a text file, listfiles.txt
- Then two folders are created, SPAM_MAILS and NONSPAM_MAILS
- Now every file is processed using classify
- classify takes a .txt file (a mail) and counts the number of Spam words and Non-Spam words in it, using spam_dict.txt and nonspam_dict.txt as resources.
- The output of classify is saved in a text file, classified.txt. The output is two numbers: the Spam word count and the Non-Spam word count
- For each mail, sc.sh reads classified.txt and :
- If the number of Spam words is greater
- All the words of this mail are appended to spam_dict.txt using extract_words
- The mail is copied to the SPAM_MAILS folder.
- If the number of Non-Spam words is greater
- All the words of this mail are appended to nonspam_dict.txt using extract_words
- The mail is copied to the NONSPAM_MAILS folder.
- classified.txt is removed after each mail is processed
- After all the mails are processed :
- spam_dict.txt and nonspam_dict.txt are transferred to the main directory
- spam_dict.txt is removed from SPAM_MAILS folder
- nonspam_dict.txt is removed from NONSPAM_MAILS folder
- listfiles.txt is removed
- Control goes back to init.sh
- Now the dictionaries, spam_dict.txt and nonspam_dict.txt, are cleaned using clean_dict, which is an executable file of clean_dict.cpp
- clean_dict takes a dictionary, removes all the duplicates from it and sorts it, using a temporary text file, temp_dict.txt, for dumping and retrieving data
- Now temp.txt and temp_dict.txt are deleted.
- The application terminates
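To make the clean_dict step above a bit more concrete, here is a rough sketch of the same idea in Python. It is only an illustration: the real clean_dict is a C++ program that goes through temp_dict.txt, while this version works in memory, and it assumes the dictionary stores one word per line. The function names are made up for this example.

```python
def clean_dict(words):
    """Return the dictionary words with duplicates removed, in sorted order.

    Assumption: each entry is one word, possibly with a trailing newline.
    """
    # Strip surrounding whitespace and skip blank lines before deduplicating.
    unique = {w.strip() for w in words if w.strip()}
    return sorted(unique)


def clean_dict_file(path):
    """Rewrite the dictionary file at `path` in place, one word per line."""
    with open(path, encoding="utf-8") as fh:
        cleaned = clean_dict(fh)
    with open(path, "w", encoding="utf-8") as fh:
        # Keep a trailing newline only when there is something to write.
        fh.write("\n".join(cleaned) + ("\n" if cleaned else ""))
    return cleaned
```

Calling clean_dict_file("spam_dict.txt") and then clean_dict_file("nonspam_dict.txt") would roughly correspond to what init.sh does with the two dictionaries at the end of a run.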
A Bayesian algorithm is used in this project so far (V-2.0.0). It's not that efficient, but at least it produces satisfactory results :P.
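The per-mail rule that classify and sc.sh apply (count the dictionary hits, compare, then grow the matching dictionary) could be sketched roughly like this. It is written in Python for brevity even though classify itself is C++, and it is a plain count comparison as described in the steps above, not a full probabilistic classifier. The function names, the tokenization, and what happens on a tie are all assumptions; the steps above do not say how a tie is handled.

```python
import re


def tokenize(text):
    """Split mail text into lowercase words (an assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())


def count_hits(mail_text, spam_words, nonspam_words):
    """Return (spam_count, nonspam_count), like the two numbers in classified.txt."""
    words = tokenize(mail_text)
    spam_count = sum(1 for w in words if w in spam_words)
    nonspam_count = sum(1 for w in words if w in nonspam_words)
    return spam_count, nonspam_count


def classify_and_learn(mail_text, spam_words, nonspam_words):
    """Pick a target folder and add the mail's words to the winning dictionary."""
    spam_count, nonspam_count = count_hits(mail_text, spam_words, nonspam_words)
    if spam_count > nonspam_count:
        spam_words.update(tokenize(mail_text))      # role of extract_words
        return "SPAM_MAILS"
    if nonspam_count > spam_count:
        nonspam_words.update(tokenize(mail_text))   # role of extract_words
        return "NONSPAM_MAILS"
    return None  # tie: left undecided here (assumption)
```

For example, with spam_words = {"free", "winner"} and nonspam_words = {"meeting", "report"}, the text "Free prize, you are a WINNER" has two spam hits and no non-spam hits, so it goes to SPAM_MAILS and "prize" joins the spam dictionary for the next mails.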
A video of the application working in real time can be found here.
Hope it helps :)
Asim Krishna Prasad