E-mail Spam Classification using LSTM

Posted on October 2022


import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from wordcloud import WordCloud, STOPWORDS
from collections import Counter
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

This notebook analyzes email data: a set of messages, each labeled as spam or non-spam. First I use word frequencies to check whether the spam messages show patterns distinctive enough to support a classifier. Then I use an LSTM to build a machine learning model that classifies whether an email is spam or not.

First, let's load the data.

df_train = pd.read_csv('../input/email-classification-nlp/SMS_train.csv', encoding='latin_1')

df_test = pd.read_csv('../input/email-classification-nlp/SMS_test.csv', encoding='latin_1')
df_train.describe()
S. No.
count 957.000000
mean 479.000000
std 276.406404
min 1.000000
25% 240.000000
50% 479.000000
75% 718.000000
max 957.000000
df_test.describe()
S. No.
count 125.000000
mean 63.000000
std 36.228442
min 1.000000
25% 32.000000
50% 63.000000
75% 94.000000
max 125.000000

The data is divided into two parts: a training set and a test set. We will use the training set to both train and validate the model; the test set serves as unseen data to measure the model's real performance.

df_train
S. No. Message_body Label
0 1 Rofl. Its true to its name Non-Spam
1 2 The guy did some bitching but I acted like i'd... Non-Spam
2 3 Pity, * was in mood for that. So...any other s... Non-Spam
3 4 Will ü b going to esplanade fr home? Non-Spam
4 5 This is the 2nd time we have tried 2 contact u... Spam
... ... ... ...
952 953 hows my favourite person today? r u workin har... Non-Spam
953 954 How much you got for cleaning Non-Spam
954 955 Sorry da. I gone mad so many pending works wha... Non-Spam
955 956 Wat time ü finish? Non-Spam
956 957 Just glad to be talking to you. Non-Spam

957 rows × 3 columns

First, let's take a look at a wordcloud of the messages labeled as spam.

Wordcloud for spam messages

df_visualize = df_train[df_train['Label'] == 'Spam']


comment_words = ''
stopwords = set(STOPWORDS)

# iterate over the spam messages
for val in df_visualize['Message_body']:
     
    # typecast each val to string
    val = str(val)
 
    # split the value
    tokens = val.split()
     
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
     
    comment_words += " ".join(tokens)+" "
 
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10).generate(comment_words)
 
# plot the WordCloud image                      
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()

[Figure: wordcloud of spam messages]

Based on this wordcloud, we can see that words like “call”, “free”, and “mobile” are used a lot in spam emails. In my experience, advertisements rely heavily on exactly these kinds of words, so the data looks distinctive enough for us to build a machine learning classifier.
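The wordcloud impression can be backed up with a plain frequency count. A minimal sketch, using a hypothetical toy sample in place of the real spam messages (the notebook itself would iterate over `df_visualize['Message_body']`):

```python
from collections import Counter

# Hypothetical toy stand-in for the spam subset of the training data.
spam_messages = [
    "FREE entry! Call now to claim your free mobile prize",
    "Call this number for a free mobile upgrade",
]

counts = Counter()
for msg in spam_messages:
    counts.update(msg.lower().split())

print(counts.most_common(3))  # [('free', 3), ('call', 2), ('mobile', 2)]
```

Sorting by raw counts like this gives the same ranking the wordcloud encodes in font size, but as exact numbers.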

Neural Network spam detection

Let's count how many unique words we have in order to create the tokenizer.

# count unique word
def counter_word(text):
    count = Counter()
    for i in text.values:
        for word in i.split():
            count[word] += 1
    return count

text_values = df_train['Message_body']

counter = counter_word(text_values)

Define a few model parameters. We will use 80% of the training data (about 765 messages) as the actual training set; the remaining 20% will be used as the validation set.

# Model parameter

vocab_size = len(counter)
embedding_dim = df_train['Message_body'].str.len().max()

max_length = 20

training_size = 765
training_sentences = df_train['Message_body'][0:training_size]
training_labels = df_train['Label'][0:training_size]

val_sentences = df_train['Message_body'][training_size:]
val_labels = df_train['Label'][training_size:]
training_labels = training_labels.replace(['Spam'], 1)
training_labels = training_labels.replace(['Non-Spam'], 0)

val_labels = val_labels.replace(['Spam'], 1)
val_labels = val_labels.replace(['Non-Spam'], 0)

Let's tokenize and pad the data.

tokenizer = Tokenizer(num_words=vocab_size, oov_token='OOV')
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length)
print(df_train['Message_body'][1])
print(training_sequences[1])
The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free
[6, 354, 130, 120, 1143, 38, 2, 1144, 57, 536, 40, 713, 8, 420, 229, 311, 164, 78, 10, 70, 537, 13, 3, 140, 12, 45]
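A quick way to sanity-check the encoding is to invert it: Keras' `Tokenizer` offers `sequences_to_texts` for this, and the idea itself is just a reverse lookup. A minimal plain-Python sketch with a hypothetical tiny vocabulary:

```python
# Hypothetical tiny word index, mirroring the structure of
# tokenizer.word_index in the notebook (1 is the OOV token).
word_index = {'OOV': 1, 'the': 2, 'guy': 3, 'free': 4}
index_word = {idx: word for word, idx in word_index.items()}

sequence = [2, 3, 4]
decoded = ' '.join(index_word.get(i, 'OOV') for i in sequence)
print(decoded)  # the guy free
```

Decoding `training_sequences[1]` this way should reproduce the original message above (lowercased and with punctuation stripped by the tokenizer's default filters).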

Tokenize and pad the validation dataset as well.

val_sequences = tokenizer.texts_to_sequences(val_sentences)
val_padded = pad_sequences(val_sequences, maxlen=max_length)

Model definition

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 20, 446)           2167560   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               261632    
_________________________________________________________________
dense (Dense)                (None, 16)                2064      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 2,431,273
Trainable params: 2,431,273
Non-trainable params: 0
_________________________________________________________________
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model_architecture.png', show_shapes=True, show_layer_names=True)

[Figure: model architecture diagram]

Training the model

# start training
epochs = 10
history = model.fit(training_padded, training_labels, epochs=epochs, validation_data=(val_padded, val_labels))

Epoch 1/10
24/24 [==============================] - 7s 95ms/step - loss: 0.3344 - accuracy: 0.8784 - val_loss: 0.1613 - val_accuracy: 0.9427
Epoch 2/10
24/24 [==============================] - 1s 55ms/step - loss: 0.1055 - accuracy: 0.9843 - val_loss: 0.0853 - val_accuracy: 0.9635
Epoch 3/10
24/24 [==============================] - 1s 58ms/step - loss: 0.0280 - accuracy: 0.9961 - val_loss: 0.1016 - val_accuracy: 0.9688
Epoch 4/10
24/24 [==============================] - 1s 58ms/step - loss: 0.0125 - accuracy: 0.9961 - val_loss: 0.1515 - val_accuracy: 0.9583
Epoch 5/10
24/24 [==============================] - 1s 56ms/step - loss: 0.0048 - accuracy: 0.9987 - val_loss: 0.1156 - val_accuracy: 0.9635
Epoch 6/10
24/24 [==============================] - 1s 57ms/step - loss: 0.0010 - accuracy: 1.0000 - val_loss: 0.1277 - val_accuracy: 0.9635
Epoch 7/10
24/24 [==============================] - 1s 55ms/step - loss: 6.2759e-04 - accuracy: 1.0000 - val_loss: 0.1338 - val_accuracy: 0.9635
Epoch 8/10
24/24 [==============================] - 1s 55ms/step - loss: 4.5663e-04 - accuracy: 1.0000 - val_loss: 0.1370 - val_accuracy: 0.9635
Epoch 9/10
24/24 [==============================] - 2s 85ms/step - loss: 3.5204e-04 - accuracy: 1.0000 - val_loss: 0.1391 - val_accuracy: 0.9635
Epoch 10/10
24/24 [==============================] - 1s 55ms/step - loss: 2.8094e-04 - accuracy: 1.0000 - val_loss: 0.1443 - val_accuracy: 0.9583
model_loss = pd.DataFrame(model.history.history)
model_loss
loss accuracy val_loss val_accuracy
0 0.334381 0.878431 0.161303 0.942708
1 0.105477 0.984314 0.085291 0.963542
2 0.027962 0.996078 0.101602 0.968750
3 0.012480 0.996078 0.151530 0.958333
4 0.004808 0.998693 0.115609 0.963542
5 0.001038 1.000000 0.127676 0.963542
6 0.000628 1.000000 0.133848 0.963542
7 0.000457 1.000000 0.136957 0.963542
8 0.000352 1.000000 0.139128 0.963542
9 0.000281 1.000000 0.144294 0.958333
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

[Figure: training and validation accuracy/loss curves]

The final model reaches 100% accuracy on the training data and about 96% accuracy on the validation data.
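The gap between those two numbers, and the fact that the validation loss bottoms out around epoch 2 and then creeps upward, is a classic overfitting signature. In Keras this is usually handled with `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` passed to `model.fit`. A minimal plain-Python sketch of the same patience logic, applied to the recorded `val_loss` values from the table above:

```python
# val_loss per epoch, copied from the training log above.
val_losses = [0.1613, 0.0853, 0.1016, 0.1515, 0.1156,
              0.1277, 0.1338, 0.1370, 0.1391, 0.1443]

def early_stop_epoch(losses, patience=2):
    """Return (epoch training would stop at, epoch with best val loss)."""
    best, best_epoch, wait = float('inf'), 0, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses) - 1, best_epoch

stop_at, best = early_stop_epoch(val_losses)
print(stop_at, best)  # stops at epoch 3; best weights are from epoch 1
```

With `restore_best_weights=True`, Keras would roll the model back to the epoch-1 weights, which here have both the lowest validation loss and competitive validation accuracy.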

Prediction on Unseen data

Let's use the unseen test data to see the real performance of the model.

Let's tokenize and pad the test data.

testing_sentences = df_test['Message_body']
testing_labels = df_test['Label']

testing_labels = testing_labels.replace(['Spam'], 1)
testing_labels = testing_labels.replace(['Non-Spam'], 0)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)

Predict the test data

predictions = model.predict(testing_padded)
testing_labels = pd.DataFrame(testing_labels)
testing_labels['Prediction'] = predictions
sns.histplot(testing_labels, x='Prediction', hue='Label', element='poly')
<AxesSubplot:xlabel='Prediction', ylabel='Count'>

[Figure: histogram of predicted scores, split by true label]

As the plot above shows, the model seems to do a good job of separating the two classes. But let's quantify the performance with proper metrics.

# Scoring metric for prediction

testing_labels['Prediction_labels'] = (testing_labels['Prediction'] > 0.5).astype(int)
accuracy = accuracy_score(testing_labels['Label'], testing_labels['Prediction_labels'])

print('Accuracy: {}'.format(accuracy))
print(classification_report(testing_labels['Label'], testing_labels['Prediction_labels']))
Accuracy: 0.936
              precision    recall  f1-score   support

           0       0.92      0.92      0.92        49
           1       0.95      0.95      0.95        76

    accuracy                           0.94       125
   macro avg       0.93      0.93      0.93       125
weighted avg       0.94      0.94      0.94       125
from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(testing_labels['Label'], testing_labels['Prediction_labels']))
cm.columns = ['Predict Non-Spam', 'Predict Spam']
cm.index = ['Actual Non-Spam', 'Actual Spam']
cm.iloc[0] = cm.iloc[0]/cm.sum(axis=1)[0]
cm.iloc[1] = cm.iloc[1]/cm.sum(axis=1)[1]

sns.heatmap(cm, annot=True, cmap='Blues', fmt='.2%')
<AxesSubplot:>

[Figure: normalized confusion matrix heatmap]

The weighted F1 score of the model on the unseen data is about 94%, which is good. Per the classification report, the model reaches roughly 92% recall on non-spam messages and 95% recall on spam. In my opinion this is a reasonable trade-off, since we especially don't want the model to falsely flag a non-spam email as spam, while letting the occasional spam email through is acceptable to a certain degree.
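If false positives on non-spam mail matter most, the 0.5 decision threshold used above is itself a tunable knob: raising it trades spam recall for fewer wrongly flagged legitimate messages. A minimal sketch with hypothetical toy scores (the notebook would use `testing_labels['Prediction']` and `testing_labels['Label']` instead):

```python
import numpy as np

# Hypothetical toy predicted scores and true labels (1 = spam).
scores = np.array([0.05, 0.40, 0.55, 0.60, 0.90, 0.95])
labels = np.array([0,    0,    0,    1,    1,    1])

for threshold in (0.5, 0.7):
    preds = (scores > threshold).astype(int)
    false_pos = int(((preds == 1) & (labels == 0)).sum())  # non-spam flagged as spam
    true_pos = int(((preds == 1) & (labels == 1)).sum())   # spam caught
    print(f"threshold={threshold}: false positives={false_pos}, spam caught={true_pos}")
```

Here moving the threshold from 0.5 to 0.7 removes the single false positive at the cost of missing one spam message; `sklearn.metrics.precision_recall_curve` can sweep all thresholds at once.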

Thank you for reading this!