Detecting Spam Messages Is NOT That Hard

Hello again, everyone. Today we will detect spam messages. So how do we do that? We will learn from old messages and then remove the unwanted ones from our inbox. I want this article to be short and to the point, something you can read right away and take something from. In just a few minutes, you can detect spam messages with just a few lines of code.

As always, we start by adding the necessary libraries. Today I will also show you how to visualize your projects better with WordCloud, so we need to include the libraries for that as well.

We read our data and look at a few random samples with df.sample().
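Here is a minimal sketch of this step. The original data file is not included in this article, so the example reads a tiny inline CSV that only imitates its layout (the v1, v2 and “Unnamed” columns). The three rows and the random_state value are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real data file; these rows are invented for illustration.
csv_text = """v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
ham,"Sorry, I'll call later",,,
spam,WINNER!! Claim your free prize now,,,
ham,Are we still meeting today?,,,
"""

# With the real dataset you would pass a file path instead of a StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

# df.sample() returns random rows; random_state makes the pick repeatable.
sample = df.sample(n=2, random_state=0)
print(sample)
```

Without random_state, df.sample() shows different rows on every run, which is handy for getting a feel for the data.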

As you can see, we have no use for the “Unnamed” columns, so we drop them with the df.drop() function. We also give the columns we need more understandable names. The v2 column holds the text messages, so we name it “Text”. The v1 column shows whether a message is spam or not, so we call it “Class”. That way, everything becomes clearer.
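The drop and rename steps could look roughly like this. The small DataFrame is a made-up stand-in for the real data, and selecting the columns by their “Unnamed” prefix is just one possible way to do it.

```python
import pandas as pd

# Made-up stand-in for the raw data, including two empty "Unnamed" columns.
df = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Sorry, I'll call later", "WINNER!! Claim your free prize now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
})

# Collect every column whose name starts with "Unnamed" and drop them all.
unnamed_cols = [col for col in df.columns if col.startswith("Unnamed")]
df = df.drop(columns=unnamed_cols)

# Give the remaining columns more descriptive names.
df = df.rename(columns={"v1": "Class", "v2": "Text"})
print(df.columns.tolist())
```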

Also, as a bit of feature engineering, we add a new column named “Label” to our data. Values in this column are 1 if the message is spam and 0 if it is not. In other words, we have added a numerical form of the “Class” column to the data.

Now let’s group the messages according to the “Class” column. This way we can see the total number of spam and ham messages and how many of them are unique. We can even see the most repeated spam and ham messages.
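A short sketch of the “Label” column, assuming the “Class” column contains the strings “spam” and “ham” (the sample rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["ham", "spam", "ham"],
    "Text": ["See you soon", "Free prize, claim now", "Call me later"],
})

# The comparison gives True/False; astype(int) turns that into 1/0.
df["Label"] = (df["Class"] == "spam").astype(int)
print(df[["Class", "Label"]])
```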

Thanks to include = “O”, only the object (text) columns are included in the summary.
There are 747 spam and 4825 ham messages in the data.
4516 of the ham messages are unique, and 653 of the spam messages are unique.
The most repeated ham message is “Sorry, I’ll call later”, which appears 30 times.
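The grouped summary can be sketched on a few invented rows. As a small variation on the include = “O” approach, this version selects the “Text” column directly, which gives the same four statistics (count, unique, top, freq) for each class.

```python
import pandas as pd

# Invented rows: one ham message is repeated on purpose.
df = pd.DataFrame({
    "Class": ["ham", "ham", "ham", "spam", "spam"],
    "Text": [
        "Sorry, I'll call later",
        "Sorry, I'll call later",
        "See you soon",
        "Free prize, claim now",
        "Win cash today",
    ],
})

# For a text column, describe() reports count, unique, top and freq per group.
summary = df.groupby("Class")["Text"].describe()
print(summary)
```

Here “top” is the most repeated message in each class and “freq” is how many times it occurs.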

With a small visualization, we can see the distribution of ham and spam messages more clearly.

We can easily distinguish spam and ham messages thanks to the “Class” column. If you want to see these messages separately: just like Adidas, just do it!
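Separating the two groups is a simple filter on the “Class” column. A quick sketch with invented rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["ham", "spam", "ham"],
    "Text": ["See you soon", "Free prize, claim now", "Call me later"],
})

# Boolean masks keep only the rows of one class.
ham_messages = df[df["Class"] == "ham"]
spam_messages = df[df["Class"] == "spam"]
print(len(ham_messages), len(spam_messages))
```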

Sorry about this joke, but I had to. I saw it on Instagram :)) I hope I was able to make you laugh too. Anyway, let’s continue.

Now we have come to a step that every NLP project needs. We have to strip unnecessary punctuation and numbers from every message in the “Text” column. At the same time, we convert all letters to lowercase and remove unnecessary words. If you have not done an NLP project before, you can check the sample projects in my GitHub account. I will add the code for this project at the end of the article. Maybe it will help you understand why this step is the most important one.

First, let’s send the ham messages to the “words_cleaner” function. Next, let’s find out how many times each word in the ham messages is repeated. We see the 10 most frequent words.

Have you noticed that some characters have not been removed? What do you think could be the reason? Later in the post, I will show you how to get rid of them completely with a more advanced function, so you will see the difference.
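The body of “words_cleaner” is not shown here, so the following is only a sketch of what such a function might look like. The stop-word set and the sample messages are made up for illustration; a real project would probably use a fuller stop-word list.

```python
import string
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "i", "you", "to", "is", "are", "we", "me", "my", "in", "it", "and"}

def words_cleaner(messages):
    """Return lowercase words with ASCII punctuation, digits and stop words removed."""
    # string.punctuation only covers ASCII punctuation characters.
    strip_table = str.maketrans("", "", string.punctuation + string.digits)
    words = []
    for message in messages:
        cleaned = message.lower().translate(strip_table)
        words.extend(word for word in cleaned.split() if word not in STOP_WORDS)
    return words

ham_messages = ["Sorry, I'll call later", "Call me at 5, ok?", "Ok, see you later!"]
ham_words = words_cleaner(ham_messages)

# most_common(10) returns (word, count) pairs, most frequent first.
top_words = Counter(ham_words).most_common(10)
print(top_words)
```

Because string.punctuation only contains ASCII characters, symbols such as curly quotes or currency signs would pass through a cleaner like this one untouched.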

And it’s time for WordCloud. I told you that I would show you how to make great visualizations with WordCloud. I’m sure it will help you impress people while presenting your project :))

First, we will write code consisting of just a few lines. Later I’ll show you how to do it in a real cloud. But let’s go the easy way for now.

First, let’s send the ham messages we cleaned to our function.

Now, let’s clean the spam messages and send them to the WordCloud function as well.
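As a rough sketch of the idea, the code below counts words in already cleaned messages and hands the counts to the wordcloud package. The function names, image size, background color and sample messages are all assumptions made for this example, and the drawing function needs the wordcloud package to be installed.

```python
from collections import Counter

def word_frequencies(cleaned_messages):
    """Count how often each word occurs across already cleaned messages."""
    counts = Counter()
    for message in cleaned_messages:
        counts.update(message.split())
    return dict(counts)

def draw_wordcloud(frequencies, path):
    """Render the word counts as a word cloud image and save it to `path`."""
    # Imported here so the counting part runs even without the wordcloud package.
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)

# Invented, already cleaned spam messages.
spam_cleaned = ["free prize claim now", "claim your free ticket"]
spam_freq = word_frequencies(spam_cleaned)
print(spam_freq)
```

Calling draw_wordcloud(spam_freq, "spam_cloud.png") would then write the image to disk (the file name is only an example). The more often a word occurs, the larger it is drawn.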

As I mentioned above, there is a more advanced function for cleaning the “Text” column, and now we come to that part. With the “features_cleaner” function, we clean all messages in the data. The “Label” column will be our target column in machine learning, so we assign it to y. After cleaning the “Text” column, we assign it to x.

Here I will stop for a minute and show you the difference between the cleaned and uncleaned data.

Okay, I promised you a short article. Don’t worry, we are almost at the end. Only one last step is left before we move on to the machine learning part.
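The actual “features_cleaner” is not reproduced here. One possible version, sketched below, keeps only the letters a to z and whitespace, so every other character is removed regardless of its type. The regular expression and the two sample rows are assumptions for illustration.

```python
import re
import pandas as pd

def features_cleaner(text):
    """Lowercase the text and keep only the letters a-z, separated by single spaces."""
    text = text.lower()
    # Replace everything that is not a lowercase ASCII letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of spaces left behind by the replacement.
    return " ".join(text.split())

# Invented rows containing digits, a currency sign and non-ASCII punctuation.
df = pd.DataFrame({
    "Text": ["WINNER!! Claim £1000 now…", "Sorry, I’ll call later"],
    "Label": [1, 0],
})

x = df["Text"].apply(features_cleaner)  # cleaned messages (features)
y = df["Label"]                         # target column

# Print the uncleaned and cleaned versions side by side.
for before, after in zip(df["Text"], x):
    print(repr(before), "->", repr(after))
```

A whitelist like this is stricter than removing a fixed list of punctuation marks, which is why no stray symbols survive.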

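The last step is CountVectorizer. It builds a vocabulary from all the words in the messages and turns each message into a vector of word counts, with one column per word. You know what it looks like: it is much like applying dummy variables to object data.

Now everything is ready, and we can move on to classification.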
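To make this concrete, here is a small sketch with scikit-learn’s CountVectorizer on three invented messages, using the default settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented, already cleaned messages.
corpus = ["free prize now", "call me now", "free call"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per message

# Columns follow the alphabetical order of the learned vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X.toarray())
```

Each row has a count in the column of every word the message contains and 0 everywhere else, which is where the similarity to dummy variables comes from.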

Isn’t the result super cool? We can distinguish between spam and ham messages with a success rate of 97.77 percent. But one piece of advice: always predict with a few different models, because it can be difficult to know in advance which model will work better. So I also predicted with RandomForest and DecisionTree.

We have come to the end of another post. I hope I was able to explain everything well. We did a fun and different project together. I leave a link to the code of this project in my GitHub account. If you have anything to ask, you can leave a comment. See you in my next article. Hope you stay healthy and safe.
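Comparing several models can be sketched as follows. The twelve messages are invented, so the scores printed here say nothing about the real dataset. MultinomialNB is included only as an assumed example of a first model; RandomForest and DecisionTree are the two mentioned above. The split size and random_state values are also arbitrary choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: 1 = spam, 0 = ham.
texts = [
    "free prize claim now", "win cash prize today", "claim your free ticket",
    "urgent win free cash", "free entry win prize", "cash prize waiting claim",
    "see you at lunch", "call me later please", "are we meeting today",
    "sorry running late", "thanks for the help", "see you tomorrow then",
]
labels = [1] * 6 + [0] * 6

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Learn the vocabulary on the training messages only, then reuse it for the test set.
vectorizer = CountVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)

models = {
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(x_train_vec, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test_vec))

print(scores)
```

Keeping the models in a dictionary makes it easy to add another one and compare all of them on the same split.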

Data Scientist / Electric and Electronic Engineering Ege University / https://github.com/ariseyda / https://www.linkedin.com/in/%C5%9Feyda-ar%C4%B1-0316371a9?li