Coding an Object-Oriented Decision Tree in Python with MongoDB

Jay Parekh
12 min read · Aug 30, 2022
Source: Johannes Plenio

Introduction

Object-oriented programs (OOP) benefit from their modularity, as sections of the code can be isolated for troubleshooting. Functions can also be called from other functions, reducing the coding time required.

The decision tree is a popular algorithm due to its low computational cost compared to neural networks. Although decision trees are less adaptable because their parameters are fixed once trained, they are useful for data structures that aren't prone to change.

MongoDB is a secure document database that can be connected to Python in order to build secure, database-enabled programs. It is a NoSQL database, which means it stores data as flexible JSON-like documents rather than in fixed relational tables.
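For example, one record of the stroke dataset can be stored as a document like the one below. In Python (via pymongo) a document is simply a dictionary; the fields shown here are a subset of the real dataset's columns.

# A MongoDB document is a JSON-like record; in pymongo it is represented as a Python dictionary.
sample_document = {
    "id": 9046,
    "gender": "Male",
    "age": 67.0,
    "hypertension": 0,
    "heart_disease": 1,
    "avg_glucose_level": 228.69,
    "stroke": 1
}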

In this project, we use Python to develop an object-oriented program that uses a decision tree to predict the risk of stroke, using MongoDB for database storage.

For data security, any sensitive information will be censored.

Step 1: Obtain the data.

I used Kaggle to find a dataset with high votes and high usability, to prevent any issues regarding data integrity. I downloaded the Stroke Prediction Dataset by fedesoriano, who has uploaded 30 datasets to Kaggle, all with a usability score of 10.0; check them out! The file is provided in comma-separated values (.csv) format.
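Before moving on, it is worth a quick sanity check of the download. A minimal sketch, assuming the file has been saved as healthcare-dataset-stroke-data.csv in your working directory:

# Quick sanity check of the downloaded CSV.
import pandas as pd

df = pd.read_csv('healthcare-dataset-stroke-data.csv')
print(df.shape)         # Number of rows and columns.
print(df.isna().sum())  # Missing values per column (the bmi column contains some NaNs).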

Step 2: Configure the MongoDB client.

To use MongoDB you'll need to create an account. The shared tier is completely free, and paid subscriptions are available if required. Once registered, you will be directed to the Projects section of the MongoDB dashboard. Click "New project".

MongoDB dashboard projects section.

Name your project (I named mine 0), and press “next”.

Name your project

Add members and set permissions for your project by entering their email addresses to send them an invite. When doing so, you can set each member's role using the drop-down box to ensure the security of the project. Click "Create Project".

Add members and set permissions.

You will be taken to the “Database Deployments” section of the project dashboard and will be prompted to create a database. Click “build a database”.

Create a database prompt

You will then deploy a cloud database, which on the free tier runs on a shared server. Click "Create" on the box for shared servers.

Cloud database deployment options

MongoDB will automatically configure your cluster. You can customize the cloud provider and region, change the cluster tier, explore backup options, and name your cluster. I used the default settings.

Cluster configuration

Set your login details for the cluster. Alternatively, an X.509 certificate can be used for passwordless authentication. Once you have done so, the login method will appear in a table.

Security configuration
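If you do opt for certificate authentication, the connection from Python looks slightly different. Below is a rough sketch of an X.509 connection with pymongo; the cluster address and certificate path are placeholders, and Atlas generates the exact connection string for you.

# Connect using an X.509 client certificate instead of a username and password.
from pymongo import MongoClient

client = MongoClient(
    "mongodb+srv://cluster0.example.mongodb.net/?authSource=%24external&authMechanism=MONGODB-X509&retryWrites=true&w=majority",
    tls=True,
    tlsCertificateKeyFile="path/to/client-certificate.pem"  # Certificate downloaded from Atlas.
)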

Next, choose whether you want to connect locally or through the cloud. I chose to connect locally for demonstration purposes. Click “Add your current IP address” to connect to the cluster, then click “Finish and close”.

Connection configuration

When prompted, click “Go to Databases”.

Congratulations prompt

You will return to the “Database Deployments” page to find the new cluster. Click “Connect”.

Database Deployments

To connect to the Python application, click “Connect your application”.

Application connection options

Use the following code to check your Python version.

from platform import python_version
python_version()

Select your driver and version in MongoDB and tick "include full driver code example". Copy the code to your clipboard.

Connecting the Python application to the cluster

Create a blank Jupyter notebook in the same folder as the dataset. Import pymongo (installing it first if needed, e.g. with pip install "pymongo[srv]"), then paste the code from MongoDB on the next line, replacing "<password>" (including the <>) with your password.

# MongoDB connection.
# Import pymongo to connect to MongoDB.
import pymongo
# Using the code from MongoDB, connect the account to the client.
client = pymongo.MongoClient("mongodb+srv://user:<password>@cluster0.2z3xdjn.mongodb.net/?retryWrites=true&w=majority")
# Reference the default 'test' database so the connection can be checked.
db = client.test

Now test your connection by typing “db” on the next line, and running it. In the output you should see “connect=True” if it worked.

Testing the client connection
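If you would like a more explicit check, pymongo can also ping the cluster. A minimal sketch, reusing the client object created above:

# Ping the cluster; an exception is raised if the connection failed.
client.admin.command('ping')
print("Successfully connected to MongoDB")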

Well done! If you’ve made it this far you’ve successfully configured the MongoDB Client.

Step 3: Upload the dataset to MongoDB.

Use the following code snippet to load the stroke dataset, convert it to dictionary format, create a database, and upload the converted data. If an alternative dataset were used, you would simply change the filename to match.

# Uploading data to MongoDB.
# Import pandas to read the (.csv) file.
import pandas as pd
# Load the dataset using pandas' read_csv function.
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
# Convert the dataset to a list of dictionaries, one per row.
data = df.to_dict(orient='records')
# Create a database in the client.
db = client['Database']
# Insert the converted records into a 'stroke' collection in the database.
db.stroke.insert_many(data)

PyMongo confirms the upload when successful.

Successful upload comment

Go back to “Database Deployments” on MongoDB and on your cluster click “Browse Collections”. As expected, the data was transferred successfully.

Dataset uploaded to MongoDB
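The upload can also be verified from Python. A quick sketch, reusing the db object created above:

# Count the documents in the 'stroke' collection; it should match the number of rows in the CSV file.
print(db.stroke.count_documents({}))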

Step 4: Design the Object-Oriented Program.

Define the class as “Model”.

# The class is defined and hosts all the functions necessary for the task.
class Model:

Configure the "__init__" function of the class by adding any variables used throughout the class. The dataset is loaded from MongoDB and converted to a pandas data frame. The decision tree is imported and defined, with its random seed fixed to ensure repeatability. The dataset is then previewed.

To log in using your own details, copy the code from the connect wizard described earlier and replace my details with it.

To use an alternative model, replace "DecisionTreeClassifier" with the model of your choice.

# Initialization of the dataset and the model.
def __init__(self):
    # Import necessary libraries.
    import pandas as pd  # For data manipulation.
    from sklearn.tree import DecisionTreeClassifier  # For use as the classification algorithm.
    # Pymongo libraries for MongoDB integration.
    import pymongo  # To communicate with MongoDB.
    from pymongo.mongo_client import MongoClient  # To log in to MongoDB.
    from pymongo.server_api import ServerApi  # To communicate with the MongoDB server.

    # Log into the client using the username and password set up at MongoDB.
    client = pymongo.MongoClient("mongodb+srv://jayparekh:<password>@cluster0.m6t3kym.mongodb.net/?retryWrites=true&w=majority", server_api=ServerApi('1'))
    # Access the database.
    db = client['CMP6221']
    # Access the 'stroke' collection in the database.
    test = db.stroke
    # Convert the collection to a pandas dataframe.
    test = pd.DataFrame(list(test.find()))
    # Define the converted dataframe.
    self.df = test
    # Define the decision tree, setting the random state for repeatability.
    self.classifier = DecisionTreeClassifier(random_state=100)
    # Display the dataset.
    print('Raw dataset: \n', self.df.head())
Dataset preview.

Configure a function called "corr" to use the data frame to generate a correlation matrix and gain insights into the data. Plot the matrix as a heatmap for better visualization.

If your data frame is called something other than "df", change the name within the function.

# Visualizing the data.
def corr(self):
    from matplotlib import pyplot as plt  # To visualize data.
    import seaborn as sns  # To plot visualizations.
    # Use the corr function to calculate the correlations within the dataset, and define them for use in a heatmap.
    corrdata = self.df.corr()
    # Plot a heatmap to show the correlations within the dataset, titled "Correlation heatmap".
    fig = plt.subplots(figsize=(10, 10))
    plt.title("Correlation heatmap")
    sns.heatmap(corrdata)
    plt.show()

Configure a function called "clean" to drop null values and to encode string variables as integers so the decision tree can interpret them.

If your data frame is called something other than "df", change the name within the function.

If you were to encode features in another dataset, replace features such as "ever_married" with your own, duplicating the line as needed.

# Data cleaning: drop null values and encode string variables.
def clean(self):
    from sklearn import preprocessing  # For label encoding.
    # Drop rows containing null values.
    self.df = self.df.dropna()
    # Encode each string column as integers.
    le = preprocessing.LabelEncoder()
    self.df['ever_married'] = le.fit_transform(self.df['ever_married'])
    self.df['work_type'] = le.fit_transform(self.df['work_type'])
    self.df['smoking_status'] = le.fit_transform(self.df['smoking_status'])
    self.df['Residence_type'] = le.fit_transform(self.df['Residence_type'])
    self.df['gender'] = le.fit_transform(self.df['gender'])
    # Display the cleaned dataset.
    print('Cleaned dataset: \n', self.df.head())

Configure a function called "split" to separate the necessary feature data, selected by viewing the correlation heatmap, from the label data, and then use "train_test_split" to split the data into training and test sets.

Use “test_size” as a variable so a user can alter the size of the split without modifying the script.

If your data frame is called something other than "df", change the name within the function.

If you want to change the variables used for the feature data, you can select them in the “self.X” section. Use “self.y” to set the label variable.

# Train-test split with a ratio of 70:30, set by the variable test_size. Residence type and gender were omitted as no correlation was shown.
def split(self, test_size):
    import numpy as np  # For manipulation of arrays.
    from sklearn.model_selection import train_test_split  # For preparing the data for fitting.
    # Select the feature columns and the label column.
    self.X = np.array(self.df[['age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level', 'bmi', 'smoking_status']])
    self.y = np.array(self.df['stroke'])
    # Split the data into training and test sets.
    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=test_size, random_state=42)

Configure a function called “fit” to train the decision tree model using the training feature and label data.

If your configured algorithm has another name, replace “classifier” with that name.

# Model fitted using the training data.
def fit(self):
    self.model = self.classifier.fit(self.X_train, self.y_train)

Configure a function called “predict” to use the trained model to predict outcomes using the test data.

If your configured algorithm has another name, replace “classifier” with that name.

# Model tested using the test data.
def predict(self):
    self.predictions = self.classifier.predict(self.X_test)
    return self.predictions

Configure a function called “cvscoring” to use 10-fold cross-validation to score the model.

If your configured algorithm has another name, replace “classifier” with that name.

# Cross-validation of the original model.
def cvscoring(self):
    from sklearn.model_selection import cross_val_score  # For cross-validation of the models.
    from matplotlib import pyplot as plt  # To visualize data.
    import seaborn as sns  # To plot visualisations.
    # Score the original model with 10-fold cross-validation.
    orig_cv_scores = cross_val_score(estimator=self.classifier, X=self.X, y=self.y, cv=10)
    # Box plot of the cross-validation scores.
    fig = plt.subplots(1, figsize=(5, 5))
    sns.boxplot(data=orig_cv_scores)
    plt.title("Initial Cross Validation Scores DT")
    plt.show()
    # Print the fitted model's score on the held-out test set.
    print('\n Initial mean CV score: ', self.model.score(self.X_test, self.y_test))

Configure a function called “initial_params” to obtain the parameters of the model.

If your configured algorithm has another name, replace “classifier” with that name.

# pprint obtains the parameters of the initial model.
def initial_params(self):
    from pprint import pprint  # To evaluate model parameters.
    print('\n Initial parameters:')
    pprint(self.classifier.get_params())

Configure a function called “tune” to tune the original model, 10-fold cross-validate it, and return the percentage improvement of the new model.

To alter the parameters used in tuning, modify the param_dist dictionary.

If your configured algorithm has another name, replace “classifier” with that name.

# Model tuning. The model is tuned over a variety of suitable combinations using RandomizedSearchCV.
# max_depth considers the number of splits that can be made before a prediction.
# max_features determines the number of features considered.
# min_samples_leaf defines the minimum number of samples required at a leaf node.
# criterion determines the function used to measure the purity of a split.
def tune(self):
    from pprint import pprint  # To evaluate model parameters.
    from sklearn.model_selection import RandomizedSearchCV  # To tune the model.
    import pandas as pd  # For data manipulation.
    from matplotlib import pyplot as plt  # To visualize data.
    import seaborn as sns  # To plot visualisations.
    from sklearn.metrics import accuracy_score  # To calculate accuracy.
    # Parameter values to sample from.
    param_dist = {"max_depth": [5, 7, 10, 15, None],
                  "max_features": [None, 2, 4, 6, 8, 10],
                  "min_samples_leaf": [1, 2, 4],
                  "criterion": ["gini", "entropy"]}
    self.tree_cv = RandomizedSearchCV(self.classifier, param_dist, cv=10)

    # Fit the randomized search to the training data.
    self.tuned_tree = self.tree_cv.fit(self.X_train, self.y_train)

    # The array of cross-validation results is obtained using '.cv_results_'.
    tuned_results = self.tree_cv.cv_results_
    # Convert the array of results to a dataframe.
    Tuned_Results = pd.DataFrame(tuned_results)
    # Box plot showing the mean cross-validation scores obtained.
    fig = plt.subplots(1, figsize=(5, 5))
    sns.boxplot(data=Tuned_Results, y='mean_test_score')
    plt.title("Tuned Mean CV Scores DT")
    plt.show()

    # Show the tuned model's best parameters.
    print('\n Optimal parameters:')
    print(self.tree_cv.best_params_)

    # Show the mean cross-validation score of the tuned model.
    print("\n Tuned mean CV score:")
    print(self.tree_cv.best_score_)

    # Show the percentage improvement over the original model's test accuracy.
    print("\n Percentage improvement using RandomizedSearchCV")
    print(((self.tree_cv.best_score_ - accuracy_score(self.y_test, self.predictions)) / accuracy_score(self.y_test, self.predictions)) * 100)

Configure a function called “confmatrix” to evaluate the tuned model and visualize its predictions.

# A confusion matrix represents the model output.
def confmatrix(self):
    from sklearn.metrics import confusion_matrix  # To visualize results.
    import seaborn as sns  # To plot visualisations.
    from matplotlib import pyplot as plt  # To visualize data.
    # Compare the test labels with the model's predictions.
    ConfMatrix = confusion_matrix(self.y_test, self.predictions)
    sns.heatmap(ConfMatrix, annot=True)
    plt.title("Tuned Decision Tree Confusion Matrix")

Configure a function called "classreport" to obtain a classification report of the tuned model.

# A classification report for the tuned model.
def classreport(self):
    from sklearn.metrics import classification_report  # To print performance metrics.
    print('Tuned model classification report: \n', classification_report(self.y_test, self.predictions))

Configure a function called “treeviz” to visualize the decision tree.

If your configured algorithm has another name, replace “classifier” with that name.

# Visualizing the decision tree.
def treeviz(self):
    from matplotlib import pyplot as plt  # To visualize data.
    from sklearn import tree  # To plot the decision tree.
    fig = plt.figure(figsize=(25, 20))
    plt.suptitle("Decision Tree Visualization")
    # Use the same feature names (and order) as in split() so the node labels are correct.
    features = ['age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level', 'bmi', 'smoking_status']
    _ = tree.plot_tree(self.classifier, feature_names=features, class_names=['no stroke', 'stroke'], filled=True)
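If the plotted tree is hard to read at this size, scikit-learn's export_text can print the tree as plain text instead. A small optional sketch that could be added at the end of "treeviz", reusing the feature list defined above:

# Optional: print the fitted tree as text.
from sklearn.tree import export_text
print(export_text(self.classifier, feature_names=features))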

Step 5: Design the __main__ block, to execute the model.

Each function is called in the required order, adding arguments where they are needed. To change the train-test split, alter the value "0.3" to the required proportion of test data, between 0 and 1.

# The main block. Each function is called in turn to execute the model.
if __name__ == '__main__':
    model_instance = Model()
    model_instance.corr()
    model_instance.clean()
    model_instance.split(0.3)
    model_instance.fit()
    model_instance.predict()
    model_instance.initial_params()
    model_instance.cvscoring()
    model_instance.tune()
    model_instance.confmatrix()
    model_instance.classreport()
    model_instance.treeviz()
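Once the pipeline has run, the trained classifier can also score a new, unseen record. A minimal sketch; the feature values below simply mirror the first record of the cleaned dataset, and any new record must follow the same column order and label encoding used in "split" and "clean":

# Hypothetical patient: [age, hypertension, heart_disease, ever_married, work_type, avg_glucose_level, bmi, smoking_status].
import numpy as np
new_patient = np.array([[67.0, 0, 1, 1, 2, 228.69, 36.6, 1]])
print(model_instance.classifier.predict(new_patient))  # 1 = stroke predicted, 0 = no stroke.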

Results

Raw dataset: 
_id id gender age hypertension heart_disease \
0 630c2d66ccb4d1494b0bba11 9046 Male 67.0 0 1
1 630c2d66ccb4d1494b0bba12 51676 Female 61.0 0 0
2 630c2d66ccb4d1494b0bba13 31112 Male 80.0 0 1
3 630c2d66ccb4d1494b0bba14 60182 Female 49.0 0 0
4 630c2d66ccb4d1494b0bba15 1665 Female 79.0 1 0

ever_married work_type Residence_type avg_glucose_level bmi \
0 Yes Private Urban 228.69 36.6
1 Yes Self-employed Rural 202.21 NaN
2 Yes Private Rural 105.92 32.5
3 Yes Private Urban 171.23 34.4
4 Yes Self-employed Rural 174.12 24.0

smoking_status stroke
0 formerly smoked 1
1 never smoked 1
2 never smoked 1
3 smokes 1
4 never smoked 1
Correlation heatmap
Cleaned dataset: 
_id id gender age hypertension heart_disease \
0 630c2d66ccb4d1494b0bba11 9046 1 67.0 0 1
2 630c2d66ccb4d1494b0bba13 31112 1 80.0 0 1
3 630c2d66ccb4d1494b0bba14 60182 0 49.0 0 0
4 630c2d66ccb4d1494b0bba15 1665 0 79.0 1 0
5 630c2d66ccb4d1494b0bba16 56669 1 81.0 0 0

ever_married work_type Residence_type avg_glucose_level bmi \
0 1 2 1 228.69 36.6
2 1 2 0 105.92 32.5
3 1 2 1 171.23 34.4
4 1 3 0 174.12 24.0
5 1 2 1 186.21 29.0

smoking_status stroke
0 1 1
2 2 1
3 3 1
4 2 1
5 1 1

Initial parameters:
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 100,
'splitter': 'best'}
Initial mean CV score: 0.9205702647657841
Initial cross validation score
Optimal parameters:
{'min_samples_leaf': 1, 'max_features': 4, 'max_depth': 5, 'criterion': 'gini'}

Tuned mean CV score:
0.9540180690216286
Percentage improvement using RandomizedSearchCV
3.6333787366415136
Tuned cross validation score
Tuned model classification report: 
precision recall f1-score support

0 0.97 0.95 0.96 1413
1 0.15 0.20 0.17 60

accuracy 0.92 1473
macro avg 0.56 0.58 0.56 1473
weighted avg 0.93 0.92 0.93 1473
Tuned confusion matrix
Decision tree visualization

Conclusion

In conclusion, the model saw a 3.63% (to 2 d.p.) increase in cross-validation score after tuning. The program used object-oriented programming, although most functions did not take inputs, which would make it difficult to change the dataset, for example. An improved method would be a file named "config" that stores the parameters of each function, so the program can be controlled securely.
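As a rough sketch of that idea (the file name, keys, and values below are all hypothetical), the parameters could live in a small JSON file and be passed into the relevant functions:

# config.json (hypothetical) might contain:
# {"mongo_uri": "mongodb+srv://user:<password>@cluster0.example.mongodb.net/", "test_size": 0.3, "random_state": 100}
import json

with open('config.json') as f:  # Hypothetical config file.
    config = json.load(f)

# The values can then be passed to the relevant calls, for example:
# model_instance.split(config['test_size'])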

