Machine Learning: Filtering Email for Spam or Ham

You may have seen our previous posts on machine learning, specifically how to let your code learn from text and how to work with stop words, stemming, and spam. Today we're going to build our own machine-learning-based spam filter using the tools we walked through in those posts: a tokenizer, a stemmer, and a naive Bayes classifier. We'll work with the Bluebird promise library here, so if you're not used to promises, please take a look at the Bluebird API reference.

Training and Testing Dataset

Before we begin, it's important to have good training data. You can download some here; we are interested in two files:

- TR-mails.zip, the corpus of raw emails
- spam-mail.tr, the correct labels (spam or ham) associated with each training email in TR-mails.zip, where each line tells us…
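To make the pipeline named in the introduction (tokenizer, stemmer, naive Bayes classifier) more concrete, here is a minimal, library-free sketch in JavaScript. It is not the code from the earlier posts. The names `tokenize`, `stem`, `features`, `STOP_WORDS`, and `NaiveBayes` are hypothetical, the stemmer is a crude suffix stripper standing in for a real one such as Porter's, and the four training sentences are invented for illustration rather than taken from TR-mails.zip.

```javascript
// Lowercase the text and split it into alphabetic tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Crude suffix stripping: a stand-in for a real stemmer (assumption).
function stem(word) {
  return word.length > 4 ? word.replace(/(ing|ed|s)$/, '') : word;
}

// Tiny illustrative stop-word list (assumption, not from the posts).
const STOP_WORDS = new Set(['a', 'an', 'the', 'to', 'of', 'and', 'is', 'your', 'for']);

// Text -> list of stemmed, non-stop-word tokens.
function features(text) {
  return tokenize(text).filter((t) => !STOP_WORDS.has(t)).map(stem);
}

// Multinomial naive Bayes with add-one (Laplace) smoothing.
class NaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.totalWords = new Map(); // label -> total number of words seen
    this.vocabulary = new Set();
    this.totalDocs = 0;
  }

  train(text, label) {
    if (!this.docCounts.has(label)) {
      this.docCounts.set(label, 0);
      this.wordCounts.set(label, new Map());
      this.totalWords.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs += 1;
    const counts = this.wordCounts.get(label);
    for (const word of features(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.totalWords.set(label, this.totalWords.get(label) + 1);
      this.vocabulary.add(word);
    }
  }

  classify(text) {
    const words = features(text);
    let best = null;
    let bestScore = -Infinity;
    for (const [label, docs] of this.docCounts) {
      // Log prior plus the sum of smoothed log likelihoods.
      let score = Math.log(docs / this.totalDocs);
      const denominator = this.totalWords.get(label) + this.vocabulary.size;
      for (const word of words) {
        const count = this.wordCounts.get(label).get(word) || 0;
        score += Math.log((count + 1) / denominator);
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

// Toy training data, invented for this example.
const classifier = new NaiveBayes();
classifier.train('Win a free prize now, click here', 'spam');
classifier.train('Cheap pills, free offer, click now', 'spam');
classifier.train('Meeting notes attached for the project review', 'ham');
classifier.train('Can we reschedule lunch to Friday?', 'ham');

console.log(classifier.classify('Claim your free prize'));          // spam
console.log(classifier.classify('Notes from the project meeting')); // ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and add-one smoothing keeps a word never seen under a label from forcing that label's probability to zero. With a real corpus, the same `train` and `classify` calls would be fed the email bodies and their spam or ham labels.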

