Bot that detects spam in comments #2 (more training data, SVM classifier, checking user previous comments, whitelist / blacklist / scamlist)

I updated a bot whose purpose is to detect spam comments on the Steem blockchain. It uses the Multinomial Naive Bayes algorithm combined with SVM (model stacking). It can reply to a spam comment and downvote it. I built it for the #polish community, but it can be adapted to any tag (or all tags); that is only a matter of the training file.

Github repository

Log from console:

I have stacked 4 algorithms: Multinomial Naive Bayes and 3 variants of SVM.

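from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, NuSVC

# probability=True makes the SVM variants expose predict_proba.
self.model = StackedModel([
            MultinomialNB(),
            SVC(kernel='linear', C=C, probability=True),
            SVC(kernel='rbf', gamma=0.7, C=C, probability=True),
            NuSVC(probability=True)
        ])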
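
The post does not show how StackedModel combines its members. One simple way to combine classifiers is soft voting, a simpler relative of stacking in which the class probabilities of all members are averaged. The sketch below illustrates that idea only; it is an assumption, not the repository's code. `AveragingStack` and `ConstantModel` are hypothetical names, and members only need scikit-learn-style `fit` and `predict_proba` methods.

```python
from typing import List, Sequence


class AveragingStack:
    """Hypothetical combiner: averages predict_proba outputs of its members."""

    def __init__(self, models: Sequence) -> None:
        self.models = list(models)

    def fit(self, X, y) -> "AveragingStack":
        # Every member is trained on the same data.
        for model in self.models:
            model.fit(X, y)
        return self

    def predict_proba(self, X) -> List[List[float]]:
        # One probability table per member: rows are samples, columns are classes.
        tables = [model.predict_proba(X) for model in self.models]
        n_models = len(tables)
        averaged = []
        for rows in zip(*tables):  # the same sample as scored by each member
            averaged.append([sum(column) / n_models for column in zip(*rows)])
        return averaged


class ConstantModel:
    """Stand-in for a fitted classifier, used only to demonstrate the averaging."""

    def __init__(self, spam_probability: float) -> None:
        self.spam_probability = spam_probability

    def fit(self, X, y) -> "ConstantModel":
        return self

    def predict_proba(self, X) -> List[List[float]]:
        p = self.spam_probability
        return [[1.0 - p, p] for _ in X]


stack = AveragingStack([ConstantModel(0.9), ConstantModel(0.7), ConstantModel(0.5)])
stack.fit(["some comment"], ["spam"])
probabilities = stack.predict_proba(["very nice"])
print(probabilities)
```

With the three stand-in members above, the averaged spam probability is the mean of 0.9, 0.7 and 0.5. Real estimators such as the four listed in the post could be passed in the same way once the comments are vectorized.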

To check the accuracy, I calculated a confusion matrix for each algorithm.

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) 

Confusion matrix:
[[65  1]
 [ 0 45]] 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False) 

Confusion matrix:
[[65  1]
 [ 0 45]] 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.7, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False) 

Confusion matrix:
[[66  0]
 [ 0 45]] 

NuSVC(cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, nu=0.5, probability=True, random_state=None,
   shrinking=True, tol=0.001, verbose=False) 

Confusion matrix:
[[65  1]
 [ 1 44]]

The confusion matrix for the stacked model looks as follows.

Stacked model
Confusion matrix:
[[65  1]
 [ 0 45]]

As you can see, the results are similar for each algorithm on its own and for the stacked model. You have to experiment a bit here to find the best combination. The results will probably change slightly as the data set grows.
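
To compare the models with a single number, accuracy can be read off each matrix: the diagonal holds the correctly classified comments. The snippet below is a small sketch using the matrices printed above; the `accuracy` helper and the dictionary labels are names chosen here for illustration.

```python
def accuracy(matrix):
    """Share of correct predictions: diagonal sum divided by the total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices copied from the console output above.
results = {
    "MultinomialNB": [[65, 1], [0, 45]],
    "SVC linear": [[65, 1], [0, 45]],
    "SVC rbf": [[66, 0], [0, 45]],
    "NuSVC": [[65, 1], [1, 44]],
    "Stacked model": [[65, 1], [0, 45]],
}

for name, matrix in results.items():
    print(f"{name}: {accuracy(matrix):.4f}")
```

Each matrix sums to 111 comments, so the models differ by at most two misclassified comments, which is why a larger data set is needed before preferring one over another.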

The bot checks not only the current comment but also the user's previous comments. I think a single "nice photo" comment is OK, but if a user posts this type of comment all the time, it is considered spam:

The bot also pays attention to repeated, generic comments:

And even scams (if the user is on the scamlist):
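
Taken together, the checks described above could be combined along the following lines. This is a sketch under assumptions, not the bot's actual logic: the function name, the averaging rule and the 0.5 default threshold are hypothetical. Only the ideas of looking at previous comments and consulting a scamlist come from the post.

```python
from typing import Iterable, Sequence


def is_spammer(
    author: str,
    current_probability: float,
    previous_probabilities: Sequence[float],
    scamlist: Iterable[str],
    probability_threshold: float = 0.5,
) -> bool:
    """Hypothetical rule: flag scamlisted authors, else average over the history."""
    if author in set(scamlist):
        return True
    # A single generic comment is tolerated; a habit of them is not, so the
    # current spam probability is averaged with those of earlier comments.
    scores = [current_probability, *previous_probabilities]
    return sum(scores) / len(scores) >= probability_threshold


# One spammy-looking comment among normal ones: not flagged.
occasional = is_spammer("alice", 0.9, [0.1, 0.05, 0.2], scamlist=[])
# The same kind of comment posted all the time: flagged.
habitual = is_spammer("bob", 0.9, [0.8, 0.95, 0.85], scamlist=[])
# A scamlisted author is flagged regardless of the text.
listed = is_spammer("mallory", 0.1, [], scamlist=["mallory"])
```

Averaging is only one possible rule; a count of previous comments above the threshold would serve the same purpose.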

Running

$ POSTING_KEY=<posting_key> spam_detector.py config.json

The private posting key is passed as an environment variable, so it does not have to be stored in the config file.
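
For readers adapting the script, the command line above implies reading the key from the environment and the config path from the arguments. The following is a minimal sketch of that step, assuming nothing about how spam_detector.py actually does it; `load_settings` is a hypothetical helper.

```python
import json


def load_settings(argv, environ):
    """Return (posting_key, config) from the command line and environment."""
    if len(argv) != 2:
        raise SystemExit(f"usage: POSTING_KEY=<posting_key> {argv[0]} config.json")
    posting_key = environ.get("POSTING_KEY")
    if not posting_key:
        raise SystemExit("POSTING_KEY environment variable is not set")
    with open(argv[1], encoding="utf-8") as handle:
        config = json.load(handle)
    return posting_key, config


# In a script this would be called as: load_settings(sys.argv, os.environ)
```

Passing `argv` and `environ` as parameters keeps the function easy to test without touching the real environment.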

Configuration

All parameters are stored in the config.json file.

| Key | Value |
| --- | --- |
| account | account used by the bot |
| nodes | list of Steem nodes |
| tags | tags that are observed |
| probability_threshold | threshold above which a comment is classified as spam |
| training_file | input training file |
| blacklist_file | file containing the blacklist |
| whitelist_file | file containing the whitelist |
| scamlist_file | file containing users who post scams |
| reply_mode | 0 - without reply, 1 - with reply |
| vote_mode | 0 - without vote, 1 - with vote |
| vote_weight | weight of the vote, from the range [-100.0, 100.0] |
| num_previous_comments | number of the user's previous comments that are examined |
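
A configuration with these keys might look like the dictionary below, shown in Python rather than JSON so it can be checked in place. Every value is invented for illustration (account name, node URL, file names, numbers); only the key names and the documented ranges for reply_mode, vote_mode and vote_weight come from the table. The `validate_config` helper is hypothetical, and the [0, 1] range for probability_threshold is an assumption.

```python
EXAMPLE_CONFIG = {
    "account": "my-bot-account",                # placeholder account name
    "nodes": ["https://example-node.invalid"],  # placeholder node URL
    "tags": ["polish"],
    "probability_threshold": 0.8,
    "training_file": "training.txt",
    "blacklist_file": "blacklist.txt",
    "whitelist_file": "whitelist.txt",
    "scamlist_file": "scamlist.txt",
    "reply_mode": 1,
    "vote_mode": 1,
    "vote_weight": -10.0,
    "num_previous_comments": 10,
}


def validate_config(config):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = [f"missing key: {key}" for key in EXAMPLE_CONFIG if key not in config]
    if problems:
        return problems
    if config["reply_mode"] not in (0, 1):
        problems.append("reply_mode must be 0 or 1")
    if config["vote_mode"] not in (0, 1):
        problems.append("vote_mode must be 0 or 1")
    if not -100.0 <= config["vote_weight"] <= 100.0:
        problems.append("vote_weight must be within [-100.0, 100.0]")
    # Assumption: the threshold is a probability, so it should lie in [0, 1].
    if not 0.0 <= config["probability_threshold"] <= 1.0:
        problems.append("probability_threshold must be within [0, 1]")
    return problems
```

A negative vote_weight is used in the example because the bot downvotes spam; the exact value is a matter of choice.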

The training file contains rows labeled ham or spam, like the ones below:

ham    Wow. Even though I was well aware of Churchill's later career, I actually didn't know he was here during the Anglo Boer war, let alone as a prisoner of war. Thank you for a very interesting and informative post!
ham    Yea this post isn't really about fixing all the problems on Steem - it's just that there always seems to be a lot of drama over the trending page, and i think it's a bad thing for new people coming to the site to see first, so just throwing out the idea of getting rid of it for now.
ham    Yea, I believe there was something about notifications in one of the SteemIt, Inc roadmaps but don't quote me on that. Notifications are really important though, can't expect everyone to use
ham    Yeah, I may have to sit down & do a post or two myself! It’s fun to imagine! Other than promoted posts, I do think we should have advertising, albeit in a very user focused & friendly way.
ham    Yes I agree. My suggestion was based on how things actually are currently which as you said is not representative of the best posts. I don’t believe that is going to change any time soon, if ever, so in the mean time I think it would be better to just get rid of that page.
ham    Yes! This thought never occurred to me before, but your idea is perfect!! I think it would help underpaid content creators be noticed. Better yet, don't sort people based on potential payout. Create an algorithm that sorts out such things as grammatical and spelling errors, "articles" that are too short, authors that post 10 times per day, copy/paste content, ect. and only the highest quality bloggers would make it to the top...
ham    Yes, there are only a few flagging because majority is scared. He has already ruined many people's accounts and reps and flagged all of their posts to $0.00 for voicing opinons. People disagree with the rewards of his posts. You are well aware of haejin's 10-12 posts per day reaching an easy $350 per post every time. I don't think anyone is against his predictions in the sense that anyone is able to use common sense and choose if they invest or not based on his predictions. I have not seen any whales helping recover these people's accounts for flagging him. Perhaps this is not an unjustified flag war? I have sacrificed my entire blog and all earnings for six weeks to try and lower the rewards. I am not scared of the consequences as I know what they are. People are scared though so I think if a lot of users delegate a small portion of their Steem power to one of these accounts then the rewards can be lower substantially. I also feel that it would be a more organized approach at flagging him as it will be a scheduled downvote of 10 posts every evening. I feel that if enough people make the delegation's he will be unable to flag every user that delegated down to $0.00 as he would have to use all of his power flagging instead of upvoting himself. You can count on support from whales to resolve unjustified flag wars, if you feel like post are more over-valued than the majority of Steem content then flag them and don't be scared of reprisals.
ham    you are right. As it is now, he's spending a tonne of his vote power flagging anyone who disagrees with his rewards. He cannot flag everyone it would cut into his profits, as his vote power drains to 0. If rancho comes in and starts flagging too, then they are making even less money because now he's wasting his vote power by flagging instead of upvoting the 10 posts a day that he has to.
ham    You know.. I delegated what little SP i can afford exactly because you took the risk. Now if he did wanna go all out flag, he'd had to waste his vp on both you and me. if enough people did it we can even go against the biggest abusers too.
ham    Your concept is very solid, it might seem hard to implement in the start but I know that if you keep at it you will reach your goal!I cannot wait to start using your system!
spam    i follow you
spam    Upvote, follow, resteem
spam    UPVOTED
spam    UPVOTED & RESTEEMED
spam    Upvoted and followed you back
spam    UPVOTED RESTEEMED
spam    very funny
spam    very nice
spam    Write Link, send 0.100 sbd. 3000+ followers can see you (resteem)
spam    Yes very nice post.
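
Rows in this format can be split into a label and a text with a few lines of Python. The sketch assumes the label is separated from the comment by whitespace (a tab or several spaces), which is how the rows above appear; `parse_training_rows` is a hypothetical helper, not part of the repository.

```python
def parse_training_rows(lines):
    """Yield (label, text) pairs from 'ham<whitespace>text' style rows."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2 or parts[0] not in ("ham", "spam"):
            raise ValueError(f"line {number}: expected 'ham' or 'spam' followed by text")
        yield parts[0], parts[1]


sample = [
    "ham\tThanks, this was an interesting read.",  # invented ham example
    "spam    UPVOTED & RESTEEMED",
    "spam    very nice",
]
rows = list(parse_training_rows(sample))
```

Rejecting rows with an unknown label early makes it easier to spot formatting mistakes when the training file is edited by hand.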

Technology Stack

Python 3.6
libraries: steem-python, scikit-learn, pandas, textblob, bs4

The repository contains a requirements.txt file.

Roadmap:

~~enlarging the training set~~
~~adding new algorithm such as Support Vector Machine~~
~~taking into account previous comments, not only current one~~
~~adding to blacklist / whitelist~~
taking into account user reputation
tuning parameters in existing algorithms
adding new algorithms such as a neural network and maybe Random Forest
enlarging the training set (again)

Posted on Utopian.io - Rewarding Open Source Contributors
