Diabetes Prediction using PIMA Dataset
Predict the onset of diabetes based on diagnostic measures
What is Diabetes?
Diabetes is a disorder that occurs when blood glucose, often referred to as blood sugar, becomes too high. Blood glucose is the body's primary energy source and comes from the food you consume. Insulin, a pancreatic hormone, allows glucose from food to reach the cells for use as energy. Sometimes the body does not produce enough insulin, or any at all, or does not use insulin well; glucose then stays in the blood and does not reach the cells.
Prerequisites
- Python 3.x
- Understanding of libraries (Scikit Learn, Numpy, Pandas, Matplotlib, Seaborn)
- Jupyter Notebook or Google Colab
- Basic understanding of classification algorithms
Dataset: Pima Indians Diabetes Database
This dataset is used to predict whether or not a patient has diabetes, based on the diagnostic measurements it contains.
Data Exploration
Upon obtaining a data set, we first examine it to familiarize ourselves with the data, gain some understanding of the possible features, and see whether any data cleaning is required.
First we import the required libraries and then load the dataset using the read_csv function of the pandas library.
We can examine the data set using the dataframe.head() method.
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# importing the dataset
pima = pd.read_csv("datasets_228_482_diabetes.csv")
pima.head()
Using the describe() method we get summary statistics of the numerical columns of the DataFrame, such as the count, mean, standard deviation, min, max, and percentiles.
Here we divide the columns into independent variables (the features) and the dependent variable (the target), based on their role.
pima.describe()

# independent variables
ind_var = pima[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
# dependent variable
dep_var = pima.Outcome
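Before moving on, it is worth checking whether any cleaning is needed. One known quirk of the PIMA data is that missing values in several columns are recorded as zeros. A minimal sketch of such a check:

# check for explicit missing values
print(pima.isnull().sum())

# in the PIMA data, a zero in these columns really means "not recorded"
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((pima[zero_cols] == 0).sum())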
Feature Engineering
Feature engineering is the process of transforming the collected data into features that better represent the problem to the model, improving its accuracy and efficiency.
Feature engineering can create new features from existing ones, or combine several features into a more intuitive feature to feed into the model, as in the sketch below.
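For illustration only (this derived feature is hypothetical and not used in the rest of the tutorial), a sketch of building a new feature by blending two existing ones:

# hypothetical derived feature: ratio of Glucose to Insulin
# (zero Insulin values are treated as missing to avoid division by zero)
pima_fe = pima.copy()
pima_fe['GlucoseInsulinRatio'] = pima_fe['Glucose'] / pima_fe['Insulin'].replace(0, np.nan)
pima_fe['GlucoseInsulinRatio'].head()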
First we estimate the importance of each feature relative to the target variable using ExtraTreesClassifier: we create the model, fit it on the independent and dependent variables, extract the feature importances, and plot them with matplotlib.
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(ind_var, dep_var)
# print(model.feature_importances_)
feat_imp = pd.Series(model.feature_importances_, index=ind_var.columns)
feat_imp.nlargest(8).plot(kind='barh')
plt.show()
Next we compute the correlation matrix and plot it as a heatmap with seaborn.
cor_mat = pima.corr()
plt.figure(figsize=(10, 10))
# plot the correlation matrix (not the raw data) as a heatmap
g = sns.heatmap(cor_mat, annot=True)
Splitting Data into Train/Test using Scikit Learn
Next, we can split the features and target variable into train and test portions.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    ind_var, dep_var, test_size=0.25, random_state=0)
Model Selection
Model selection, the process of choosing the algorithm that performs best on the data set at hand, is the core of machine learning and perhaps its most interesting step.
Since the target variable is categorical (we have to predict whether or not a patient has diabetes), this is a binary classification problem, so we will try several different classification algorithms.
Logistic Regression Model
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='liblinear')
lr_model.fit(X_train, y_train)
y_pred_logistic = lr_model.predict(X_test)
K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=9)
y_pred_knn = knn_model.fit(X_train, y_train).predict(X_test)
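The choice of n_neighbors=9 is somewhat arbitrary. A quick, illustrative sketch of scanning a few values of k on the test split (a proper search would use cross-validation instead):

# compare test accuracy for several odd values of k (illustrative only)
for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))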
Naive Bayes
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
y_pred_naive = nb_model.fit(X_train, y_train).predict(X_test)
Support Vector Machine
from sklearn.svm import LinearSVC

svc_model = LinearSVC(random_state=0)
y_pred_svc = svc_model.fit(X_train, y_train).predict(X_test)
Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

dtc_model = DecisionTreeClassifier()
y_pred_dtc = dtc_model.fit(X_train, y_train).predict(X_test)
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
y_pred_rfc = rfc_model.fit(X_train, y_train).predict(X_test)
Performance Evaluation
from sklearn import metrics
Performance measure for Logistic Regression
The accuracy we get from the Logistic Regression model is about 0.82.
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_logistic))print("Precision:",metrics.precision_score(y_test, y_pred_logistic))print("Recall:",metrics.recall_score(y_test, y_pred_logistic))print("F-Score:",metrics.f1_score(y_test, y_pred_logistic))Accuracy for Logistic Regression : 0.8181818181818182
Precision for Logistic Regression : 0.7567567567567568
Recall for Logistic Regression : 0.5957446808510638
F-Score for Logistic Regression : 0.6666666666666666
Our precision for the model stands at about 0.76. This indicates that when our model classifies a patient as diabetic, it is correct about 76% of the time.
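To see exactly where these numbers come from, we can also print the confusion matrix (a minimal sketch using the metrics module imported above):

# rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(metrics.confusion_matrix(y_test, y_pred_logistic))

Precision is TP / (TP + FP) and recall is TP / (TP + FN), which is why a model can score well on one and poorly on the other.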
Performance measure for K-Nearest Neighbors classifier
The accuracy we get from the K-Nearest Neighbors model is about 0.77.
print("Accuracy for K neighbor Classifier :", metrics.accuracy_score(y_test, y_pred_knn))
print("Precision for K neighbor Classifier :", metrics.precision_score(y_test, y_pred_knn))
print("Recall for K neighbor Classifier :", metrics.recall_score(y_test, y_pred_knn))
print("F-Score for K neighbor Classifier :", metrics.f1_score(y_test, y_pred_knn))

Accuracy for K neighbor Classifier : 0.7727272727272727
Precision for K neighbor Classifier : 0.6304347826086957
Recall for K neighbor Classifier : 0.6170212765957447
F-Score for K neighbor Classifier : 0.6236559139784946
Performance measure for naive Bayes
The accuracy we get from the naive Bayes model is about 0.79.
print("Accuracy for naive Bayes :", metrics.accuracy_score(y_test, y_pred_naive))
print("Precision for naive Bayes :", metrics.precision_score(y_test, y_pred_naive))
print("Recall for naive Bayes :", metrics.recall_score(y_test, y_pred_naive))
print("F-Score for naive Bayes :", metrics.f1_score(y_test, y_pred_naive))

Accuracy for naive Bayes : 0.7922077922077922
Precision for naive Bayes : 0.6744186046511628
Recall for naive Bayes : 0.6170212765957447
F-Score for naive Bayes : 0.6444444444444444
Performance measure for Support vector machine
The accuracy we get from the Support Vector Machine model is only about 0.36; with a recall of 1.0 and precision of 0.32, the model is classifying nearly every patient as diabetic.
print("Accuracy for Support vector machine :", metrics.accuracy_score(y_test, y_pred_svc))
print("Precision for Support vector machine :", metrics.precision_score(y_test, y_pred_svc))
print("Recall for Support vector machine :", metrics.recall_score(y_test, y_pred_svc))
print("F-Score for Support vector machine :", metrics.f1_score(y_test, y_pred_svc))

Accuracy for Support vector machine : 0.35714285714285715
Precision for Support vector machine : 0.3219178082191781
Recall for Support vector machine : 1.0
F-Score for Support vector machine : 0.48704663212435234
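LinearSVC is sensitive to the scale of the input features, which is the likely cause of the weak score above. A minimal sketch of standardizing the features first (the resulting scores will differ from those reported here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance before the linear SVM
svc_scaled = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
svc_scaled.fit(X_train, y_train)
print("Scaled SVM accuracy :", metrics.accuracy_score(y_test, svc_scaled.predict(X_test)))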
Performance measure for Decision tree classifier
The accuracy we get from the Decision Tree classifier model is about 0.79.
print("Accuracy for Decision tree classifier :", metrics.accuracy_score(y_test, y_pred_dtc))
print("Precision for Decision tree classifier :", metrics.precision_score(y_test, y_pred_dtc))
print("Recall for Decision tree classifier :", metrics.recall_score(y_test, y_pred_dtc))
print("F-Score for Decision tree classifier :", metrics.f1_score(y_test, y_pred_dtc))

Accuracy for Decision tree classifier : 0.7922077922077922
Precision for Decision tree classifier : 0.6415094339622641
Recall for Decision tree classifier : 0.723404255319149
F-Score for Decision tree classifier : 0.68
Performance measure for Random forest classifier
The accuracy we get from the Random Forest classifier is about 0.79.
print("Accuracy for Random forest classifier :", metrics.accuracy_score(y_test, y_pred_rfc))
print("Precision for Random forest classifier :", metrics.precision_score(y_test, y_pred_rfc))
print("Recall for Random forest classifier :", metrics.recall_score(y_test, y_pred_rfc))
print("F-Score for Random forest classifier :", metrics.f1_score(y_test, y_pred_rfc))

Accuracy for Random forest classifier : 0.7857142857142857
Precision for Random forest classifier : 0.6521739130434783
Recall for Random forest classifier : 0.6382978723404256
F-Score for Random forest classifier : 0.6451612903225806
Performance Visualization
log = metrics.accuracy_score(y_test, y_pred_logistic)
knn = metrics.accuracy_score(y_test, y_pred_knn)
naive = metrics.accuracy_score(y_test, y_pred_naive)
svc = metrics.accuracy_score(y_test, y_pred_svc)
dtc = metrics.accuracy_score(y_test, y_pred_dtc)
rfc = metrics.accuracy_score(y_test, y_pred_rfc)

model = ['LR', 'KNN', 'NAIVE', 'SVC', 'DTC', 'RFC']
accuracy = [log, knn, naive, svc, dtc, rfc]
data = pd.DataFrame({'Model': model, 'Score': accuracy})
print(data)

a = sns.barplot(x='Model', y='Score', data=data)
a.set(xlabel='Models', ylabel='Accuracy Score')
for p in a.patches:
    height = p.get_height()
    a.text(p.get_x() + p.get_width() / 2, height + 0.005,
           '{:1.4f}'.format(height), ha="center")
plt.show()
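The bar plot ranks the models using a single train/test split, which can be noisy. As a sanity check, a sketch of cross-validating the best candidate on the full feature set (using the ind_var and dep_var defined earlier):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the selected model
cv_scores = cross_val_score(LogisticRegression(solver='liblinear'), ind_var, dep_var, cv=5)
print("Mean CV accuracy :", cv_scores.mean())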
Conclusion
We thus select Logistic Regression as the final model, due to its high accuracy and precision scores (note that its recall is lower than that of the Decision Tree).
Based on the feature importance:
- Glucose is the most important factor in determining the onset of diabetes, followed by BMI and Age.
- Other factors such as Diabetes Pedigree Function, Insulin, Pregnancies, Blood Pressure, and Skin Thickness also contribute to the prediction.
As we can see, the findings from the feature importance analysis make sense, since glucose level is one of the first measurements tracked in patients at higher risk of diabetes.