So you have learned a bit about machine learning and web crawling. Now you are excited to build something simple yet still useful. What can you make? Here is one idea.
Let’s say you manage a Facebook Page for your business or community and want to regularly post relevant news or articles from the internet. Doing it manually is boring, and since you have just learned machine learning and web crawling, we can build a Facebook Page news auto-poster instead.
As you may notice from the diagram, Python was chosen for this project. The main reason is simply that there are some amazing open-source libraries that simplify this project:
The development is separated into 2 phases:
- Preparation: create news crawlers to gather data, then use that data to train a text classification model using machine learning and NLP (Natural Language Processing) techniques
- Execution: create two executables/jobs:
First, the sources of the news articles need to be decided. In this case, three news portals (detik.com, liputan6.com and kompas.com) were chosen, and a crawler is created for each of them. To make adding new sources easier, each crawler extends the abstract class below:
```python
from abc import ABCMeta, abstractmethod


class BaseCrawler(metaclass=ABCMeta):
    @abstractmethod
    def crawl(self, silent=False):
        raise NotImplementedError
```
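A concrete crawler then only has to implement crawl(). Here is a minimal stub showing the contract (the class name and returned titles are made-up placeholders; a real crawler would fetch and parse the portal's HTML). The base class is repeated for completeness:

```python
from abc import ABCMeta, abstractmethod


class BaseCrawler(metaclass=ABCMeta):
    @abstractmethod
    def crawl(self, silent=False):
        raise NotImplementedError


class DummyCrawler(BaseCrawler):
    """Stand-in for a real portal crawler; a real one would fetch and parse HTML."""

    def crawl(self, silent=False):
        titles = ["Example headline 1", "Example headline 2"]  # fabricated placeholders
        if not silent:
            print(f"crawled {len(titles)} titles")
        return titles


crawler = DummyCrawler()
print(crawler.crawl(silent=True))  # → ['Example headline 1', 'Example headline 2']
```

Because crawl() is abstract, BaseCrawler itself cannot be instantiated, so every new portal is forced to follow the same interface.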
In this project, article’s title is considered sufficient to determine whether an article is relevant or not. After all, many people nowadays only read news title then immediately make conclusion regardless of the content, right? 😛
By running the crawlers for several hours, a few hundred news titles should be collected. Next, each news title should be manually labeled as either 1 (relevant) or 0 (irrelevant) and saved as a CSV. This will be the dataset used to train the text classification model.
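The labeled CSV might look like the tiny excerpt below (the column names and the example titles are assumptions for illustration; the real titles are Indonesian headlines collected by the crawlers):

```python
import csv
import io

# Hypothetical two-row excerpt of the labeled dataset.
raw = """title,label
Harga cabai naik menjelang lebaran,1
Selebriti X membuka restoran baru,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))
titles = [row["title"] for row in rows]
labels = [int(row["label"]) for row in rows]
print(labels)  # → [1, 0]
```

Reading the file into parallel titles/labels lists like this is exactly the shape most training code expects.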
There are many techniques available for training a text classification model. In this project, I chose classic ones (mainly due to the production server’s limited computational resources):
- Preprocessing: Stemming (using Sastrawi) and dictionary-based stopwords removal
- Vectorization: TF-IDF vectorization (using Sklearn)
- Classification: SGD Classifier, which is a Linear SVM with SGD (using Sklearn, again 💕)
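The vectorization and classification steps above can be sketched as a scikit-learn pipeline. This is a minimal sketch: the training data here is a tiny placeholder (the real dataset is a few hundred labeled titles), and the Sastrawi stemming plus stopword removal would run on the titles before they reach this pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny placeholder dataset; 1 = relevant, 0 = irrelevant.
titles = [
    "harga beras naik di pasar",
    "harga cabai turun pekan ini",
    "artis membuka restoran baru",
    "selebriti liburan ke luar negeri",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    # Stemming and stopword removal would happen before this point;
    # here we go straight to TF-IDF.
    ("tfidf", TfidfVectorizer()),
    # hinge loss makes SGDClassifier a linear SVM trained with SGD
    ("clf", SGDClassifier(loss="hinge", random_state=42)),
])
model.fit(titles, labels)
print(model.predict(["harga minyak goreng naik"]))
```

Wrapping both steps in a Pipeline means a single fitted object can later be pickled and reused for prediction.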
The complete code, train.py, also includes another technique that utilizes Facebook’s pretrained Indonesian Word2Vec embeddings to determine word weights during TF-IDF vectorization (I got this code from another awesome blog post). When this vectorization technique is combined with an RBF SVM classifier, it actually yields a better result (~9% better) than the previous technique. But since I haven’t optimized that code for low-RAM machines, I chose to stick with the previous technique.
The output of the training is a trained model saved as a pickle file. By persisting the trained model to a file, we don’t have to retrain it every time we want to classify a new input (news title).
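Saving and reloading with pickle is a one-liner each. A quick sketch (the file name is arbitrary, and the dict below is just a placeholder for the fitted pipeline; note that scikit-learn's docs also suggest joblib for large models):

```python
import pickle

# Placeholder "model": any picklable object works the same way;
# in the project this would be the fitted sklearn pipeline.
model = {"vectorizer": "tfidf", "classifier": "sgd"}

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, e.g. at the top of crawler.py, load it once at startup:
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)  # → True
```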
crawler.py is set to run periodically using cron to “produce” news titles (reusing the previous crawlers and the trained text classification model) and store them in a sqlite3 database. Basically, this script mostly reuses what we have done in the preparation phase.
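A minimal sketch of this “produce” step (the table schema, the toy relevance rule, and the sample data are all made up for illustration; the real code would call the unpickled model’s predict() and write to a file-backed database):

```python
import sqlite3

def is_relevant(title):
    """Stand-in for the trained classifier; real code would call model.predict()."""
    return "harga" in title.lower()  # toy rule for illustration only

conn = sqlite3.connect(":memory:")  # the real script would use a file database
conn.execute(
    "CREATE TABLE IF NOT EXISTS news (title TEXT UNIQUE, url TEXT, posted INTEGER DEFAULT 0)"
)

# Placeholder crawl results (title, url) pairs.
crawled = [
    ("Harga beras naik lagi", "https://example.com/a"),
    ("Gosip selebriti terbaru", "https://example.com/b"),
]
for title, url in crawled:
    if is_relevant(title):
        # UNIQUE + OR IGNORE keeps re-crawled duplicates out of the queue.
        conn.execute("INSERT OR IGNORE INTO news (title, url) VALUES (?, ?)", (title, url))
conn.commit()

stored = conn.execute("SELECT title FROM news").fetchall()
print(stored)  # → [('Harga beras naik lagi',)]
```

The `posted` flag lets the consumer job mark which rows it has already published, so the two scripts can share the database safely.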
fb.py is also set to run periodically to “consume” stored news titles by publishing them to the Facebook Page using the available API. Obviously, crawler.py must run more often than fb.py to make sure there is enough supply of news titles (and links).
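The publishing call itself goes to the Graph API’s `/{page-id}/feed` endpoint with `message`, `link`, and `access_token` parameters. A hedged sketch that builds the request without sending it (the API version in the URL is an assumption, and token handling changes over time, so check the current Graph API docs before relying on this):

```python
import requests

GRAPH_URL = "https://graph.facebook.com/v19.0"  # API version is an assumption

def build_post_request(page_id, access_token, message, link):
    """Prepare (but don't send) a Graph API request to publish to the Page feed."""
    req = requests.Request(
        "POST",
        f"{GRAPH_URL}/{page_id}/feed",
        data={"message": message, "link": link, "access_token": access_token},
    )
    return req.prepare()

prepared = build_post_request("123", "PAGE_TOKEN", "Headline", "https://example.com/a")
print(prepared.url)  # → https://graph.facebook.com/v19.0/123/feed
# fb.py would then actually send it: requests.Session().send(prepared)
```

On success the API responds with the new post’s id, which fb.py can use to mark the row as posted in the database.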
There you have it: a “simple” yet (hopefully) useful machine learning project. It should at least give you some hands-on experience with web crawling, text classification and Facebook API integration.
At the time this article was written, an instance of this project was still posting links to the “Pantau Harga” Facebook Page (the content is mainly in Indonesian). It personally helps me keep the page “alive” with almost no effort (though I still need to delete invalid/false-positive posts manually). So, I hope you find it useful as well. Cheers!