Coding an Object-Oriented Decision Tree in Python with MongoDB

Jay Parekh
12 min read · Aug 30, 2022
Source: Johannes Plenio

Introduction

Object-oriented programs (OOP) benefit from their modularity, as sections of the code can be isolated for troubleshooting. Functions can also be called from other functions, reducing the coding time required.

The decision tree is a popular algorithm due to its low computational cost compared to neural networks. Although decision trees are less adaptable because their parameters are fixed once trained, they are useful for data structures that aren't prone to change.

MongoDB is a secure document database that can be connected to Python in order to build secure, database-enabled programs. It is a NoSQL database, which means it stores data as flexible JSON-like documents rather than in fixed relational tables.
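For example, one record of the stroke dataset can be stored as a document like the one below. In Python (via pymongo) a document is simply a dictionary; the fields shown here are a subset of the real dataset's columns.

# A MongoDB document is a JSON-like record; in pymongo it is represented as a Python dictionary.
sample_document = {
    "id": 9046,
    "gender": "Male",
    "age": 67.0,
    "hypertension": 0,
    "heart_disease": 1,
    "avg_glucose_level": 228.69,
    "stroke": 1
}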

In this project, we use Python to develop an object-oriented program that uses a decision tree to predict the risk of stroke, using MongoDB for database storage.

For data security, any sensitive information will be censored.

Step 1: Obtain the data.

I used Kaggle to find a dataset with high votes and high usability, to prevent any issues regarding data integrity. I downloaded the Stroke Prediction Dataset by fedesoriano, who has uploaded 30 datasets to Kaggle, all with a usability score of 10.0; check them out! The file is provided in comma-separated values (.csv) format.
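Before moving on, it is worth a quick sanity check of the download. A minimal sketch, assuming the file has been saved as healthcare-dataset-stroke-data.csv in your working directory:

# Quick sanity check of the downloaded CSV.
import pandas as pd

df = pd.read_csv('healthcare-dataset-stroke-data.csv')
print(df.shape)         # Number of rows and columns.
print(df.isna().sum())  # Missing values per column (the bmi column contains some NaNs).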

Step 2: Configure the MongoDB client.

To use MongoDB you'll need to create an account. The shared tier is completely free, and paid subscriptions are available if required. Once registered, you will be directed to the Projects section of the MongoDB dashboard. Click "New project".

MongoDB dashboard projects section.

Name your project (I named mine 0), and press “next”.

Name your project

Add members and set permissions for your project by entering their email addresses to send them an invite. When doing so, you can set each member's role using the drop-down box to ensure the security of the project. Click "Create Project".

Add members and set permissions.

You will be taken to the “Database Deployments” section of the project dashboard and will be prompted to create a database. Click “build a database”.

Create a database prompt

You will then deploy a cloud database, which on the free tier runs on a shared server. Click "Create" on the box for shared servers.

Cloud database deployment options

MongoDB will automatically configure your cluster. You can customize the cloud provider and region, change the cluster tier, explore backup options, and name your cluster. I used the default settings.

Cluster configuration

Set your login details for the cluster. Alternatively, an X.509 certificate can be used for passwordless authentication. Once you have done so, the login method will appear in a table.

Security configuration
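If you do opt for certificate authentication, the connection from Python looks slightly different. Below is a rough sketch of an X.509 connection with pymongo; the cluster address and certificate path are placeholders, and Atlas generates the exact connection string for you.

# Connect using an X.509 client certificate instead of a username and password.
from pymongo import MongoClient

client = MongoClient(
    "mongodb+srv://cluster0.example.mongodb.net/?authSource=%24external&authMechanism=MONGODB-X509&retryWrites=true&w=majority",
    tls=True,
    tlsCertificateKeyFile="path/to/client-certificate.pem"  # Certificate downloaded from Atlas.
)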

Next, choose whether you want to connect locally or through the cloud. I chose to connect locally for demonstration purposes. Click “Add your current IP address” to connect to the cluster, then click “Finish and close”.

Connection configuration

When prompted, click “Go to Databases”.

Congratulations prompt

You will return to the “Database Deployments” page to find the new cluster. Click “Connect”.

Database Deployments

To connect to the Python application, click “Connect your application”.

Application connection options

Use the following code to check your Python version.

from platform import python_version
python_version()

Select your driver and version in MongoDB and tick "include full driver code example". Copy the code to your clipboard.

Connecting the Python application to the cluster

Create a blank Jupyter notebook in the same folder as the dataset. Import pymongo (installing it first if needed, e.g. with pip install "pymongo[srv]"), then paste the code from MongoDB on the next line, replacing "<password>" (including the <>) with your password.

# MongoDB connection.
# Import pymongo to connect to MongoDB.
import pymongo
# Using the code from MongoDB, connect the account to the client.
client = pymongo.MongoClient("mongodb+srv://user:<password>@cluster0.2z3xdjn.mongodb.net/?retryWrites=true&w=majority")
# Reference the default 'test' database so the connection can be checked.
db = client.test

Now test your connection by typing “db” on the next line, and running it. In the output you should see “connect=True” if it worked.

Testing the client connection
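If you would like a more explicit check, pymongo can also ping the cluster. A minimal sketch, reusing the client object created above:

# Ping the cluster; an exception is raised if the connection failed.
client.admin.command('ping')
print("Successfully connected to MongoDB")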

Well done! If you’ve made it this far you’ve successfully configured the MongoDB Client.

Step 3: Upload the dataset to MongoDB.

Use the following code snippet to load the stroke dataset, convert it to dictionary format, create a database, and upload the converted data. If an alternative dataset were used, you would simply change the filename to match.

# Uploading data to MongoDB.
# Import pandas to read the (.csv) file.
import pandas as pd
# Load the dataset using pandas' read_csv function.
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
# Convert the dataset to a list of dictionaries, one per row.
data = df.to_dict(orient='records')
# Create a database in the client.
db = client['Database']
# Insert the converted records into a 'stroke' collection in the database.
db.stroke.insert_many(data)

PyMongo confirms the upload when successful.

Successful upload comment

Go back to “Database Deployments” on MongoDB and on your cluster click “Browse Collections”. As expected, the data was transferred successfully.

Dataset uploaded to MongoDB
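The upload can also be verified from Python. A quick sketch, reusing the db object created above:

# Count the documents in the 'stroke' collection; it should match the number of rows in the CSV file.
print(db.stroke.count_documents({}))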

Step 4: Design the Object-Oriented Program.

Define the class as “Model”.

# The class is defined and hosts all the functions necessary for the task.
class Model:

Configure the "__init__" function of the class by adding any variables used throughout the class. The dataset is loaded from MongoDB and converted to a pandas data frame. The decision tree is imported and defined, with its random seed fixed to ensure repeatability. The dataset is then previewed.

To log in using your own details, copy the code from the connect wizard described earlier and replace my details with it.

To use an alternative model, replace "DecisionTreeClassifier" with the model of your choice.

# Initialization of the dataset and the model.
def __init__(self):
    # Import necessary libraries.
    import pandas as pd  # For data manipulation.
    from sklearn.tree import DecisionTreeClassifier  # For use as the classification algorithm.
    # Pymongo libraries for MongoDB integration.
    import pymongo  # To communicate with MongoDB.
    from pymongo.mongo_client import MongoClient  # To log in to MongoDB.
    from pymongo.server_api import ServerApi  # To communicate with the MongoDB server.

    # Log into the client using the username and password set up at MongoDB.
    client = pymongo.MongoClient("mongodb+srv://jayparekh:<password>@cluster0.m6t3kym.mongodb.net/?retryWrites=true&w=majority", server_api=ServerApi('1'))
    # Access the database.
    db = client['CMP6221']
    # Access the 'stroke' collection in the database.
    test = db.stroke
    # Convert the collection to a pandas dataframe.
    test = pd.DataFrame(list(test.find()))
    # Define the converted dataframe.
    self.df = test
    # Define the decision tree, setting the random state for repeatability.
    self.classifier = DecisionTreeClassifier(random_state=100)
    # Display the dataset.
    print('Raw dataset: \n', self.df.head())
Dataset preview.

Configure a function called "corr" to use the data frame to generate a correlation matrix and gain insights into the data. Plot the matrix as a heatmap for better visualization.

If your data frame is called something other than "df", change the name within the function.

# Visualizing the data.
def corr(self):
    from matplotlib import pyplot as plt  # To visualize data.
    import seaborn as sns  # To plot visualizations.
    # Use the corr function to calculate the correlations within the dataset, and define them for use in a heatmap.
    corrdata = self.df.corr()
    # Plot a heatmap to show the correlations within the dataset, titled "Correlation heatmap".
    fig = plt.subplots(figsize=(10, 10))
    plt.title("Correlation heatmap")
    sns.heatmap(corrdata)
    plt.show()

Configure a function called "clean" to drop null values and to encode string variables as integers so the decision tree can interpret them.

If your data frame is called something other than "df", change the name within the function.

If you were to encode features in another dataset, replace features such as "ever_married" with your own, duplicating the line as needed.

# Data cleaning: drop null values and encode string variables.
def clean(self):
    from sklearn import preprocessing  # For label encoding.
    # Drop rows containing null values.
    self.df = self.df.dropna()
    # Encode each string column as integers.
    le = preprocessing.LabelEncoder()
    self.df['ever_married'] = le.fit_transform(self.df['ever_married'])
    self.df['work_type'] = le.fit_transform(self.df['work_type'])
    self.df['smoking_status'] = le.fit_transform(self.df['smoking_status'])
    self.df['Residence_type'] = le.fit_transform(self.df['Residence_type'])
    self.df['gender'] = le.fit_transform(self.df['gender'])
    # Display the cleaned dataset.
    print('Cleaned dataset: \n', self.df.head())

Configure a function called "split" to separate the necessary feature data, selected by viewing the correlation heatmap, from the label data, and then use "train_test_split" to split the data into training and test sets.

Use “test_size” as a variable so a user can alter the size of the split without modifying the script.

If your data frame is called something other than "df", change the name within the function.

If you want to change the variables used for the feature data, you can select them in the “self.X” section. Use “self.y” to set the label variable.

# Train-test split with a ratio of 70:30, set by the variable test_size. Residence type and gender were omitted as no correlation was shown.
def split(self, test_size):
    import numpy as np  # For manipulation of arrays.
    from sklearn.model_selection import train_test_split  # For preparing the data for fitting.
    # Select the feature columns and the label column.
    self.X = np.array(self.df[['age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level', 'bmi', 'smoking_status']])
    self.y = np.array(self.df['stroke'])
    # Split the data into training and test sets.
    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=test_size, random_state=42)

Configure a function called “fit” to train the decision tree model using the training feature and label data.

If your configured algorithm has another name, replace “classifier” with that name.

# Model fitted using the training data.
def fit(self):
    self.model = self.classifier.fit(self.X_train, self.y_train)

Configure a function called “predict” to use the trained model to predict outcomes using the test data.

If your configured algorithm has another name, replace “classifier” with that name.

# Model tested using the test data.
def predict(self):
    self.predictions = self.classifier.predict(self.X_test)
    return self.predictions

Configure a function called “cvscoring” to use 10-fold cross-validation to score the model.

If your configured algorithm has another name, replace “classifier” with that name.

# Cross-validation of the original model.
def cvscoring(self):
    from sklearn.model_selection import cross_val_score  # For cross-validation of the models.
    from matplotlib import pyplot as plt  # To visualize data.
    import seaborn as sns  # To plot visualisations.
    # Score the original model with 10-fold cross-validation.
    orig_cv_scores = cross_val_score(estimator=self.classifier, X=self.X, y=self.y, cv=10)
    # Box plot of the cross-validation scores.
    fig = plt.subplots(1, figsize=(5, 5))
    sns.boxplot(data=orig_cv_scores)
    plt.title("Initial Cross Validation Scores DT")
    plt.show()
    # Print the fitted model's score on the held-out test set.
    print('\n Initial mean CV score: ', self.model.score(self.X_test, self.y_test))

Configure a function called “initial_params” to obtain the parameters of the model.

If your configured algorithm has another name, replace “classifier” with that name.

# pprint obtains the parameters of the initial model.
def initial_params(self):
    from pprint import pprint  # To evaluate model parameters.
    print('\n Initial parameters:')
    pprint(self.classifier.get_params())

Configure a function called “tune” to tune the original model, 10-fold cross-validate it, and return the percentage improvement of the new model.

To alter the parameters used in tuning, modify the param_dist dictionary.

If your configured algorithm has another name, replace “classifier” with that name.

# Model tuning. The model is tuned over a variety of suitable combinations using RandomizedSearchCV.
# max_depth considers the number of splits that can be made before a prediction.
# max_features determines the number of features considered.
# min_samples_leaf defines the minimum number of samples required at a leaf node.
# criterion determines the function used to measure the purity of a split.
def tune(self):
    from pprint import pprint  # To evaluate model parameters.
    from sklearn.model_selection import RandomizedSearchCV  # To tune the model.
    import pandas as pd  # For data manipulation.
    from matplotlib import pyplot as plt  # To visualize data.
    import seaborn as sns  # To plot visualisations.
    from sklearn.metrics import accuracy_score  # To calculate accuracy.
    # Parameter values to sample from.
    param_dist = {"max_depth": [5, 7, 10, 15, None],
                  "max_features": [None, 2, 4, 6, 8, 10],
                  "min_samples_leaf": [1, 2, 4],
                  "criterion": ["gini", "entropy"]}
    self.tree_cv = RandomizedSearchCV(self.classifier, param_dist, cv=10)

    # Fit the randomized search to the training data.
    self.tuned_tree = self.tree_cv.fit(self.X_train, self.y_train)

    # The array of cross-validation results is obtained using '.cv_results_'.
    tuned_results = self.tree_cv.cv_results_
    # Convert the array of results to a dataframe.
    Tuned_Results = pd.DataFrame(tuned_results)
    # Box plot showing the mean cross-validation scores obtained.
    fig = plt.subplots(1, figsize=(5, 5))
    sns.boxplot(data=Tuned_Results, y='mean_test_score')
    plt.title("Tuned Mean CV Scores DT")
    plt.show()

    # Show the tuned model's best parameters.
    print('\n Optimal parameters:')
    print(self.tree_cv.best_params_)

    # Show the mean cross-validation score of the tuned model.
    print("\n Tuned mean CV score:")
    print(self.tree_cv.best_score_)

    # Show the percentage improvement over the original model's test accuracy.
    print("\n Percentage improvement using RandomizedSearchCV")
    print(((self.tree_cv.best_score_ - accuracy_score(self.y_test, self.predictions)) / accuracy_score(self.y_test, self.predictions)) * 100)

Configure a function called “confmatrix” to evaluate the tuned model and visualize its predictions.

# A confusion matrix represents the model output.
def confmatrix(self):
    from sklearn.metrics import confusion_matrix  # To visualize results.
    import seaborn as sns  # To plot visualisations.
    from matplotlib import pyplot as plt  # To visualize data.
    # Compare the test labels with the model's predictions.
    ConfMatrix = confusion_matrix(self.y_test, self.predictions)
    sns.heatmap(ConfMatrix, annot=True)
    plt.title("Tuned Decision Tree Confusion Matrix")

Configure a function called "classreport" to obtain a classification report of the tuned model.

# A classification report for the tuned model.
def classreport(self):
    from sklearn.metrics import classification_report  # To print performance metrics.
    print('Tuned model classification report: \n', classification_report(self.y_test, self.predictions))

Configure a function called “treeviz” to visualize the decision tree.

If your configured algorithm has another name, replace “classifier” with that name.

# Visualizing the decision tree.
def treeviz(self):
    from matplotlib import pyplot as plt  # To visualize data.
    from sklearn import tree  # To plot the decision tree.
    fig = plt.figure(figsize=(25, 20))
    plt.suptitle("Decision Tree Visualization")
    # Use the same feature names (and order) as in split() so the node labels are correct.
    features = ['age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level', 'bmi', 'smoking_status']
    _ = tree.plot_tree(self.classifier, feature_names=features, class_names=['no stroke', 'stroke'], filled=True)
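If the plotted tree is hard to read at this size, scikit-learn's export_text can print the tree as plain text instead. A small optional sketch that could be added at the end of "treeviz", reusing the feature list defined above:

# Optional: print the fitted tree as text.
from sklearn.tree import export_text
print(export_text(self.classifier, feature_names=features))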

Step 5: Design the __main__ block, to execute the model.

Each function is called in the required order, adding arguments where they are needed. To change the train-test split, alter the value "0.3" to the required proportion of test data, between 0 and 1.

# The main block. Each function is called in turn to execute the model.
if __name__ == '__main__':
    model_instance = Model()
    model_instance.corr()
    model_instance.clean()
    model_instance.split(0.3)
    model_instance.fit()
    model_instance.predict()
    model_instance.initial_params()
    model_instance.cvscoring()
    model_instance.tune()
    model_instance.confmatrix()
    model_instance.classreport()
    model_instance.treeviz()
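Once the pipeline has run, the trained classifier can also score a new, unseen record. A minimal sketch; the feature values below simply mirror the first record of the cleaned dataset, and any new record must follow the same column order and label encoding used in "split" and "clean":

# Hypothetical patient: [age, hypertension, heart_disease, ever_married, work_type, avg_glucose_level, bmi, smoking_status].
import numpy as np
new_patient = np.array([[67.0, 0, 1, 1, 2, 228.69, 36.6, 1]])
print(model_instance.classifier.predict(new_patient))  # 1 = stroke predicted, 0 = no stroke.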

Results

Raw dataset: 
_id id gender age hypertension heart_disease \
0 630c2d66ccb4d1494b0bba11 9046 Male 67.0 0 1
1 630c2d66ccb4d1494b0bba12 51676 Female 61.0 0 0
2 630c2d66ccb4d1494b0bba13 31112 Male 80.0 0 1
3 630c2d66ccb4d1494b0bba14 60182 Female 49.0 0 0
4 630c2d66ccb4d1494b0bba15 1665 Female 79.0 1 0

ever_married work_type Residence_type avg_glucose_level bmi \
0 Yes Private Urban 228.69 36.6
1 Yes Self-employed Rural 202.21 NaN
2 Yes Private Rural 105.92 32.5
3 Yes Private Urban 171.23 34.4
4 Yes Self-employed Rural 174.12 24.0

smoking_status stroke
0 formerly smoked 1
1 never smoked 1
2 never smoked 1
3 smokes 1
4 never smoked 1
Correlation heatmap
Cleaned dataset: 
_id id gender age hypertension heart_disease \
0 630c2d66ccb4d1494b0bba11 9046 1 67.0 0 1
2 630c2d66ccb4d1494b0bba13 31112 1 80.0 0 1
3 630c2d66ccb4d1494b0bba14 60182 0 49.0 0 0
4 630c2d66ccb4d1494b0bba15 1665 0 79.0 1 0
5 630c2d66ccb4d1494b0bba16 56669 1 81.0 0 0

ever_married work_type Residence_type avg_glucose_level bmi \
0 1 2 1 228.69 36.6
2 1 2 0 105.92 32.5
3 1 2 1 171.23 34.4
4 1 3 0 174.12 24.0
5 1 2 1 186.21 29.0

smoking_status stroke
0 1 1
2 2 1
3 3 1
4 2 1
5 1 1

Initial parameters:
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 100,
'splitter': 'best'}
Initial mean CV score: 0.9205702647657841
Initial cross validation score
Optimal parameters:
{'min_samples_leaf': 1, 'max_features': 4, 'max_depth': 5, 'criterion': 'gini'}

Tuned mean CV score:
0.9540180690216286
Percentage improvement using RandomizedSearchCV
3.6333787366415136
Tuned cross validation score
Tuned model classification report: 
precision recall f1-score support

0 0.97 0.95 0.96 1413
1 0.15 0.20 0.17 60

accuracy 0.92 1473
macro avg 0.56 0.58 0.56 1473
weighted avg 0.93 0.92 0.93 1473
Tuned confusion matrix
Decision tree visualization

Conclusion

In conclusion, the model saw a 3.63% (to 2 d.p.) increase in cross-validation score after tuning. The program used object-oriented programming, although most functions did not take inputs, which would make it difficult to change the dataset, for example. An improved method would be a file named "config" that stores the parameters of each function, so the program can be controlled securely.
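As a rough sketch of that idea (the file name, keys, and values below are all hypothetical), the parameters could live in a small JSON file and be passed into the relevant functions:

# config.json (hypothetical) might contain:
# {"mongo_uri": "mongodb+srv://user:<password>@cluster0.example.mongodb.net/", "test_size": 0.3, "random_state": 100}
import json

with open('config.json') as f:  # Hypothetical config file.
    config = json.load(f)

# The values can then be passed to the relevant calls, for example:
# model_instance.split(config['test_size'])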

