Diabetes Prediction using PIMA Dataset
Predict the onset of diabetes based on diagnostic measures
What is Diabetes?
Diabetes is a disorder that occurs when blood glucose, often referred to as blood sugar, becomes too high. Blood glucose is the body's primary energy source and comes from the food you consume. Insulin, a pancreatic hormone, allows glucose from food to reach the cells for use as energy. Sometimes the body does not produce enough insulin, or any at all, or does not use insulin well; glucose then stays in the blood and does not reach the cells.
Prerequisites
- Python 3.x
- Understanding of libraries (Scikit Learn, Numpy, Pandas, Matplotlib, Seaborn)
- Jupyter Notebook or Google Colab
- Basic understanding of classification algorithms
Dataset: Pima Indians Diabetes Database
This dataset is used to predict whether or not a patient has diabetes, based on the diagnostic measurements it contains.
Data Exploration
Upon obtaining a data set, we first examine it to familiarize ourselves with the data, gain some understanding of the possible features, and see whether any data cleaning is required.
First we import the required libraries and then load the dataset using the read_csv function of the pandas library.
We can examine the data set using the dataframe.head() method.
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# importing the dataset
pima = pd.read_csv("datasets_228_482_diabetes.csv")
pima.head()
Using the describe() method we get summary statistics of the numerical columns of the DataFrame, such as the count, mean, standard deviation, min, max, and percentiles.
Here we divide the columns into independent variables (the features) and the dependent variable (the target), based on their role.
pima.describe()

# independent variables
ind_var = pima[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
# dependent variable
dep_var = pima.Outcome
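Before moving on, it is worth checking whether any cleaning is needed. One known quirk of the PIMA data is that missing values in several columns are recorded as zeros. A minimal sketch of such a check:

# check for explicit missing values
print(pima.isnull().sum())

# in the PIMA data, a zero in these columns really means "not recorded"
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((pima[zero_cols] == 0).sum())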
Feature Engineering
Feature engineering is the process of transforming the collected data into features that better represent the problem to the model, improving its accuracy and efficiency.
Feature engineering can create new features from existing ones, or combine several features into a more intuitive feature to feed into the model, as in the sketch below.
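For illustration only (this derived feature is hypothetical and not used in the rest of the tutorial), a sketch of building a new feature by blending two existing ones:

# hypothetical derived feature: ratio of Glucose to Insulin
# (zero Insulin values are treated as missing to avoid division by zero)
pima_fe = pima.copy()
pima_fe['GlucoseInsulinRatio'] = pima_fe['Glucose'] / pima_fe['Insulin'].replace(0, np.nan)
pima_fe['GlucoseInsulinRatio'].head()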
First we estimate the importance of each feature relative to the target variable using ExtraTreesClassifier: we create the model, fit it on the independent and dependent variables, extract the feature importances, and plot them with matplotlib.
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(ind_var, dep_var)
# print(model.feature_importances_)
feat_imp = pd.Series(model.feature_importances_, index=ind_var.columns)
feat_imp.nlargest(8).plot(kind='barh')
plt.show()
Next we compute the correlation matrix and plot it as a heatmap with seaborn.
cor_mat = pima.corr()
plt.figure(figsize=(10, 10))
# plot the correlation matrix (not the raw data) as a heatmap
g = sns.heatmap(cor_mat, annot=True)
Splitting Data into Train/Test using Scikit Learn
Next, we can split the features and target variable into train and test portions.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    ind_var, dep_var, test_size=0.25, random_state=0)
Model Selection
Model selection, the process of choosing the algorithm that performs best on the data set at hand, is the core of machine learning and perhaps its most interesting step.
Since the target variable is categorical (we have to predict whether or not a patient has diabetes), this is a binary classification problem, so we will try several different classification algorithms.
Logistic Regression Model
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='liblinear')
lr_model.fit(X_train, y_train)
y_pred_logistic = lr_model.predict(X_test)
K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=9)
y_pred_knn = knn_model.fit(X_train, y_train).predict(X_test)
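The choice of n_neighbors=9 is somewhat arbitrary. A quick, illustrative sketch of scanning a few values of k on the test split (a proper search would use cross-validation instead):

# compare test accuracy for several odd values of k (illustrative only)
for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))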
Naive Bayes
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
y_pred_naive = nb_model.fit(X_train, y_train).predict(X_test)
Support Vector Machine
from sklearn.svm import LinearSVC

svc_model = LinearSVC(random_state=0)
y_pred_svc = svc_model.fit(X_train, y_train).predict(X_test)
Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

dtc_model = DecisionTreeClassifier()
y_pred_dtc = dtc_model.fit(X_train, y_train).predict(X_test)
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
y_pred_rfc = rfc_model.fit(X_train, y_train).predict(X_test)
Performance Evaluation
from sklearn import metrics
Performance measure for Logistic Regression
The accuracy we get from the Logistic Regression model is about 0.82.
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_logistic))print("Precision:",metrics.precision_score(y_test, y_pred_logistic))print("Recall:",metrics.recall_score(y_test, y_pred_logistic))print("F-Score:",metrics.f1_score(y_test, y_pred_logistic))Accuracy for Logistic Regression : 0.8181818181818182
Precision for Logistic Regression : 0.7567567567567568
Recall for Logistic Regression : 0.5957446808510638
F-Score for Logistic Regression : 0.6666666666666666
Our precision for the model stands at about 0.76. This indicates that when our model classifies a patient as diabetic, it is correct about 76% of the time.
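To see exactly where these numbers come from, we can also print the confusion matrix (a minimal sketch using the metrics module imported above):

# rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(metrics.confusion_matrix(y_test, y_pred_logistic))

Precision is TP / (TP + FP) and recall is TP / (TP + FN), which is why a model can score well on one and poorly on the other.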
Performance measure for K-Nearest Neighbors classifier
The accuracy we get from the K-Nearest Neighbors model is about 0.77.
print("Accuracy for K neighbor Classifier :", metrics.accuracy_score(y_test, y_pred_knn))
print("Precision for K neighbor Classifier :", metrics.precision_score(y_test, y_pred_knn))
print("Recall for K neighbor Classifier :", metrics.recall_score(y_test, y_pred_knn))
print("F-Score for K neighbor Classifier :", metrics.f1_score(y_test, y_pred_knn))

Accuracy for K neighbor Classifier : 0.7727272727272727
Precision for K neighbor Classifier : 0.6304347826086957
Recall for K neighbor Classifier : 0.6170212765957447
F-Score for K neighbor Classifier : 0.6236559139784946
Performance measure for naive Bayes
The accuracy we get from the naive Bayes model is about 0.79.
print("Accuracy for naive Bayes :", metrics.accuracy_score(y_test, y_pred_naive))
print("Precision for naive Bayes :", metrics.precision_score(y_test, y_pred_naive))
print("Recall for naive Bayes :", metrics.recall_score(y_test, y_pred_naive))
print("F-Score for naive Bayes :", metrics.f1_score(y_test, y_pred_naive))

Accuracy for naive Bayes : 0.7922077922077922
Precision for naive Bayes : 0.6744186046511628
Recall for naive Bayes : 0.6170212765957447
F-Score for naive Bayes : 0.6444444444444444
Performance measure for Support vector machine
The accuracy we get from the Support Vector Machine model is only about 0.36; with a recall of 1.0 and precision of 0.32, the model is classifying nearly every patient as diabetic.
print("Accuracy for Support vector machine :", metrics.accuracy_score(y_test, y_pred_svc))
print("Precision for Support vector machine :", metrics.precision_score(y_test, y_pred_svc))
print("Recall for Support vector machine :", metrics.recall_score(y_test, y_pred_svc))
print("F-Score for Support vector machine :", metrics.f1_score(y_test, y_pred_svc))

Accuracy for Support vector machine : 0.35714285714285715
Precision for Support vector machine : 0.3219178082191781
Recall for Support vector machine : 1.0
F-Score for Support vector machine : 0.48704663212435234
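LinearSVC is sensitive to the scale of the input features, which is the likely cause of the weak score above. A minimal sketch of standardizing the features first (the resulting scores will differ from those reported here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance before the linear SVM
svc_scaled = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
svc_scaled.fit(X_train, y_train)
print("Scaled SVM accuracy :", metrics.accuracy_score(y_test, svc_scaled.predict(X_test)))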
Performance measure for Decision tree classifier
The accuracy we get from the Decision Tree classifier model is about 0.79.
print("Accuracy for Decision tree classifier :", metrics.accuracy_score(y_test, y_pred_dtc))
print("Precision for Decision tree classifier :", metrics.precision_score(y_test, y_pred_dtc))
print("Recall for Decision tree classifier :", metrics.recall_score(y_test, y_pred_dtc))
print("F-Score for Decision tree classifier :", metrics.f1_score(y_test, y_pred_dtc))

Accuracy for Decision tree classifier : 0.7922077922077922
Precision for Decision tree classifier : 0.6415094339622641
Recall for Decision tree classifier : 0.723404255319149
F-Score for Decision tree classifier : 0.68
Performance measure for Random forest classifier
The accuracy we get from the Random Forest classifier is about 0.79.
print("Accuracy for Random forest classifier :", metrics.accuracy_score(y_test, y_pred_rfc))
print("Precision for Random forest classifier :", metrics.precision_score(y_test, y_pred_rfc))
print("Recall for Random forest classifier :", metrics.recall_score(y_test, y_pred_rfc))
print("F-Score for Random forest classifier :", metrics.f1_score(y_test, y_pred_rfc))

Accuracy for Random forest classifier : 0.7857142857142857
Precision for Random forest classifier : 0.6521739130434783
Recall for Random forest classifier : 0.6382978723404256
F-Score for Random forest classifier : 0.6451612903225806
Performance Visualization
log = metrics.accuracy_score(y_test, y_pred_logistic)
knn = metrics.accuracy_score(y_test, y_pred_knn)
naive = metrics.accuracy_score(y_test, y_pred_naive)
svc = metrics.accuracy_score(y_test, y_pred_svc)
dtc = metrics.accuracy_score(y_test, y_pred_dtc)
rfc = metrics.accuracy_score(y_test, y_pred_rfc)

model = ['LR', 'KNN', 'NAIVE', 'SVC', 'DTC', 'RFC']
accuracy = [log, knn, naive, svc, dtc, rfc]
data = pd.DataFrame({'Model': model, 'Score': accuracy})
print(data)

a = sns.barplot(x='Model', y='Score', data=data)
a.set(xlabel='Models', ylabel='Accuracy Score')
for p in a.patches:
    height = p.get_height()
    a.text(p.get_x() + p.get_width() / 2, height + 0.005,
           '{:1.4f}'.format(height), ha="center")
plt.show()
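The bar plot ranks the models using a single train/test split, which can be noisy. As a sanity check, a sketch of cross-validating the best candidate on the full feature set (using the ind_var and dep_var defined earlier):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the selected model
cv_scores = cross_val_score(LogisticRegression(solver='liblinear'), ind_var, dep_var, cv=5)
print("Mean CV accuracy :", cv_scores.mean())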
Conclusion
We thus select Logistic Regression as the final model, due to its high accuracy and precision scores (note that its recall is lower than that of the Decision Tree).
Based on the feature importance:
- Glucose is the most important factor in determining the onset of diabetes, followed by BMI and Age.
- Other factors such as Diabetes Pedigree Function, Insulin, Pregnancies, Blood Pressure, and Skin Thickness also contribute to the prediction.
As we can see, the findings from the feature importance analysis make sense, since glucose level is one of the first measurements tracked in patients at higher risk of diabetes.