Decision Tree - Credit Card Fraud Detection

Ujwal Watgule

Aug 15, 2023

CPOL


Credit card fraud detection is an important application of machine learning techniques.

Introduction

In this article, we'll implement the Decision Tree algorithm for credit card fraud detection. The Decision Tree is a popular and powerful supervised machine learning algorithm used for both classification and regression tasks.

Background

The Decision Tree algorithm builds a tree-like model of decisions based on the features of the data. Each internal node of the tree represents a decision based on a feature, and each leaf node represents a class label or a predicted value.

Please refer to my Medium article "Machine Learning - Decision Tree" to understand the Decision Tree concept in detail.
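
To make the node and leaf terminology concrete, here is a minimal sketch of a hand-written tree stored as nested dictionaries. The feature names (`amount`, `hour`), the thresholds, and the `classify` helper are hypothetical and chosen only for illustration; they are not learned from any dataset.

```python
# Hypothetical tree: feature names and thresholds are made up for illustration.
toy_tree = {
    "feature": "amount",          # internal node: test on one feature
    "threshold": 1000.0,
    "left": {"label": "legitimate"},   # leaf node: holds a class label
    "right": {
        "feature": "hour",
        "threshold": 6.0,
        "left": {"label": "fraudulent"},
        "right": {"label": "legitimate"},
    },
}


def classify(node, transaction):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        # Go left when the feature value is at or below the threshold.
        go_left = transaction[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(classify(toy_tree, {"amount": 2500.0, "hour": 3.0}))
print(classify(toy_tree, {"amount": 40.0, "hour": 3.0}))
```

A trained model has the same shape, except that the features and thresholds at each internal node are chosen by the learning algorithm rather than by hand.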

High Level Steps

Below is an overview of the high-level steps involved in detecting credit card fraud using the Decision Tree algorithm:

Data Collection: Collect a labeled dataset that includes historical credit card transactions, where each transaction is labeled as either fraudulent or legitimate. The dataset should contain relevant features such as transaction amount, merchant information, transaction time, and other related variables.

Data Preprocessing: Preprocess the dataset by performing tasks such as data cleaning, handling missing values, feature selection and normalization. Check the class balance as well: fraud data is usually heavily skewed toward legitimate transactions, so consider resampling the training data or weighting the classes to prevent bias in the model.

Splitting the Dataset: Split the preprocessed dataset into training and testing sets. The training set will be used to build the Decision Tree model, while the testing set will be used to evaluate the model's performance.

Decision Tree Model: Build a Decision Tree model on the training data. The features of the dataset will serve as inputs, and the label (fraudulent or legitimate) will be the target variable. The Decision Tree algorithm will learn patterns and decision rules based on the features to classify transactions as either fraudulent or legitimate.

Model Training: Train the Decision Tree model on the training data, using a suitable metric such as Information Gain or Gini Impurity to determine the best feature to split the data at each node.

Model Evaluation: Evaluate the trained model using the testing data. Calculate metrics such as accuracy, precision, recall, and F1-score to assess the model's performance in correctly identifying fraudulent transactions and minimizing false positives and false negatives.

Fine Tuning: Adjust the Decision Tree model's hyperparameters, such as maximum depth, minimum samples per leaf, and splitting criterion, to prevent overfitting and improve the model's generalization ability.

Prediction: Use the trained Decision Tree model to make predictions on new, unseen credit card transactions. The model will classify each transaction as either fraudulent or legitimate based on the learned decision rules.
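
The splitting metrics named in the Model Training step can be computed by hand. The following is a minimal sketch, assuming binary labels held in plain Python lists; the helper names (`gini`, `entropy`, `information_gain`) and the toy labels are illustrative and are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy labels: 0 = legitimate, 1 = fraudulent (made-up values).
parent = [0, 0, 0, 0, 1, 1, 1, 1]
mixed_split = ([0, 0, 0, 1], [0, 1, 1, 1])    # children are still impure
perfect_split = ([0, 0, 0, 0], [1, 1, 1, 1])  # children are pure

print("Gini of parent:", gini(parent))
print("Gain of mixed split:", information_gain(parent, *mixed_split))
print("Gain of perfect split:", information_gain(parent, *perfect_split))
```

At each node, the algorithm prefers the candidate split with the lowest weighted impurity, which is the same as the highest gain; here the perfect split wins.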

Using the Code

Below is the implementation of the algorithm. The code is written in Python in a Jupyter notebook.

# importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

You can use any dataset containing credit card transactions. The dataset used in this implementation was downloaded from Kaggle.

# load dataset
creditdata_df = pd.read_csv("~path~//creditcard.csv")
print("Dataset Shape :-")
print(creditdata_df.shape)

Output

Dataset Shape :-
(284807, 31)

After loading the creditcard.csv data into a dataframe, let us inspect the data.

#view data 
creditdata_df.head(10)

Output

Let us count the legitimate and fraudulent records in the dataset:

# The Class column labels each transaction:
# 1 => fraudulent (stored in `false`), 0 => legitimate (stored in `true`)
false = creditdata_df[creditdata_df['Class'] == 1]
true = creditdata_df[creditdata_df['Class'] == 0]
# ratio of fraudulent to legitimate transactions
n = len(false) / float(len(true))
print(n)
print('False Detection : {}'.format(len(false)))
print('True Detection:{}'.format(len(true)), "\n")

Output

0.0017304750013189597
False Detection : 492
True Detection:284315
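
These counts show how skewed the data is, and they also set a baseline for reading the accuracy figure later on. As a small worked calculation using only the two counts printed above (the variable names are illustrative), a model that labels every transaction as legitimate would already score very high on accuracy:

```python
# Counts taken from the output above.
fraud_count = 492
legit_count = 284315
total = fraud_count + legit_count

# Share of fraudulent transactions in the whole dataset, in percent.
fraud_share = fraud_count / total * 100

# Accuracy of a trivial model that always predicts "legitimate", in percent.
baseline_accuracy = legit_count / total * 100

print(f"Fraud share: {fraud_share:.3f}%")
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.2f}%")
```

Because the baseline is already close to 100%, precision and recall on the fraud class say more about a model than accuracy does.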

Check summary statistics of the transaction amount for both types of records:

# Fraudulent transactions (Class == 1)
print("False Detection Transaction")
print("============================")
print(false.Amount.describe(),"\n")

# Legitimate transactions (Class == 0)
print("True Detection Transaction")
print("============================")
print(true.Amount.describe(),"\n")

Output

Now it's time to separate features and target variable:

X = creditdata_df.drop('Class', axis=1)
y = creditdata_df['Class']

Split data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
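
With so few fraudulent rows, a purely random split can leave the two sets with different fraud rates. One option, shown here only as a sketch on synthetic data (the `X_demo` and `y_demo` arrays are made up, and this is not the split used for the results in this article), is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 2% positive (fraud-like) labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the 2% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())
```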

Create a Decision Tree classifier:

classifier = DecisionTreeClassifier()
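
The constructor above relies on default settings. As a sketch of how the hyperparameters mentioned in the Fine Tuning step could be passed, the example below fits a constrained tree on synthetic data. The specific values (`max_depth=3`, `min_samples_leaf=5`, and so on) and the `X_demo`/`y_demo` arrays are assumptions for illustration, not settings tuned for creditcard.csv.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a minority positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2.0).astype(int)

tuned = DecisionTreeClassifier(
    criterion="entropy",      # split on information gain instead of Gini
    max_depth=3,              # cap tree depth to limit overfitting
    min_samples_leaf=5,       # every leaf must cover at least 5 samples
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=42,          # make tie-breaking reproducible
)
tuned.fit(X_demo, y_demo)

print("Depth:", tuned.get_depth())
print("Leaves:", tuned.get_n_leaves())
```

In practice, such values are usually chosen by cross-validation rather than fixed by hand.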

Now let us train the classifier:

classifier.fit(X_train, y_train)

Now let us try to make predictions on the test set:

y_pred = classifier.predict(X_test)

Calculate accuracy of the model:

accuracy = accuracy_score(y_test, y_pred) * 100
print("Accuracy:", accuracy) 

confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat)

Output

Accuracy: 99.90695551420245
Confusion Matrix: 
[[56833 31] 
[ 22 76]]
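
The four numbers in this matrix are enough to work out the evaluation metrics by hand. For labels 0 and 1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so the calculation can be sketched as follows (the variable names are illustrative):

```python
# Values read from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 56833, 31
fn, tp = 22, 76

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all correct predictions
precision = tp / (tp + fp)                          # flagged transactions that are fraud
recall = tp / (tp + fn)                             # fraud transactions that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy * 100:.2f}")
print(f"Precision: {precision * 100:.2f}")
print(f"Recall:    {recall * 100:.2f}")
print(f"F1:        {f1 * 100:.2f}")
```

The 31 false positives and 22 missed frauds barely move the accuracy, but they have a large effect on precision and recall.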

Finally, let us evaluate the model on the fraud class (pos_label=1) using precision, recall and F1 score:

# Precision = TP / (TP + FP)
precision=precision_score(y_test, y_pred, pos_label=1)*100
print('\n Score Precision :\n',precision )

# Recall = TP / (TP + FN)
recall=recall_score(y_test, y_pred, pos_label=1)*100
print("\n Recall Score :\n", recall)

# F1 = 2 * Precision * Recall / (Precision + Recall)
fscore=f1_score(y_test, y_pred, pos_label=1)*100
print("\n F1 Score :\n", fscore)

Output

 Score Precision :
 71.02803738317756

 Recall Score :
 77.55102040816327

 F1 Score :
 74.14634146341463

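As you can see, the Decision Tree algorithm applied to the creditcard.csv dataset reached 99.91% accuracy. Because only 492 of the 284,807 transactions are fraudulent, accuracy alone is misleading here; the precision (71.03%) and recall (77.55%) on the fraud class give a more realistic picture of the model's performance.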

Conclusion

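In conclusion, a decision tree classifier provides a simple baseline for credit card fraud detection. On this dataset, it identified about 78% of the fraudulent transactions in the test set with about 71% precision, which shows its potential in safeguarding financial transactions from fraudulent activities while leaving room for improvement through fine tuning.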

History

15th August, 2023: Initial version