Optimizing Diabetes Prediction with ROC Analysis in Python

Introduction

Python is widely used in healthcare analytics for building predictive models. This article uses logistic regression to predict diabetes outcomes and evaluates the model with Receiver Operating Characteristic (ROC) analysis. We’ll work with the Pima Indians Diabetes dataset, a common benchmark for binary classification.

Python Libraries and Dataset

Python offers a rich ecosystem of libraries for machine learning. For this task, we’ll use `pandas` for data handling, `scikit-learn` for modeling, and `matplotlib` for visualization.

Dataset Exploration

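The Pima Indians Diabetes dataset, originally distributed through the UCI Machine Learning Repository, contains eight medical measurements for each of 768 Pima Indian women, plus a binary outcome indicating whether diabetes is present (1) or absent (0). In this article we load a CSV copy hosted on GitHub.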

Data Preparation

First, let’s import the necessary libraries and load the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=columns)
```
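Before modeling, it can help to check the class balance and look for placeholder values. The sketch below is one possible way to do that; the helper name `summarize`, the constant `ZERO_AS_MISSING`, and the tiny `demo` frame are hypothetical and not part of the original workflow. It assumes, as is commonly noted for this dataset, that a value of 0 in columns such as `Glucose` or `BMI` stands in for a missing measurement.

```python
import pandas as pd

# Columns where a value of 0 is physiologically implausible. Treating these
# zeros as missing is an assumption; the CSV itself does not mark them.
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def summarize(df, target='Outcome', zero_cols=ZERO_AS_MISSING):
    """Return the shape, class proportions, and per-column zero counts."""
    present = [c for c in zero_cols if c in df.columns]
    return {
        'shape': df.shape,
        'class_balance': df[target].value_counts(normalize=True).sort_index().to_dict(),
        'zero_counts': (df[present] == 0).sum().to_dict(),
    }

# Tiny made-up frame so the sketch runs without a download.
demo = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'BloodPressure': [72, 66, 64, 0],
    'SkinThickness': [35, 29, 0, 23],
    'Insulin': [0, 0, 0, 94],
    'BMI': [33.6, 26.6, 23.3, 28.1],
    'Outcome': [1, 0, 1, 0],
})
summary = summarize(demo)
print(summary)
```

Calling `summarize(data)` on the loaded DataFrame gives the same report for the real data. Whether to impute, drop, or keep the zero values is a modeling choice this article does not cover.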

Model Training and Evaluation

We’ll hold out 20% of the rows as a test set and fit a logistic regression model on the remaining 80%.

```python
# Splitting the dataset
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Logistic Regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
```
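A single train/test split yields one AUC estimate, and that estimate depends on which rows happen to land in the test set. As a hedged complement, the sketch below estimates AUC with stratified 5-fold cross-validation via scikit-learn’s `cross_val_score`. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `cv_model` are assumptions, used only so the example runs offline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, so the sketch runs offline);
# in the article's workflow, X and y would come from the diabetes DataFrame.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=7,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
cv_model = LogisticRegression(solver='liblinear')

# scoring='roc_auc' computes the area under the ROC curve on each held-out fold.
auc_scores = cross_val_score(cv_model, X_demo, y_demo, cv=cv, scoring='roc_auc')
print("AUC per fold:", auc_scores.round(3))
print("Mean AUC: %0.3f (std %0.3f)" % (auc_scores.mean(), auc_scores.std()))
```

Replacing `X_demo` and `y_demo` with `X` and `y` applies the same check to the diabetes data. The spread across folds gives a rough sense of how much a single-split AUC could vary.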

ROC Curve Analysis

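The ROC curve is a standard tool for evaluating binary classifiers. It plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the decision threshold applied to the predicted probabilities varies. The area under the curve (AUC) summarizes the result in one number: 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.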
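To make the threshold sweep concrete, here is a minimal sketch that computes one ROC point at a time by hand. The function `rates_at_threshold` and the labels and scores in `y_demo` and `p_demo` are made up for illustration; they are not taken from the diabetes data.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are labeled positive.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

# Made-up labels and predicted probabilities, for illustration only.
y_demo = [0, 0, 1, 1, 0, 1, 0, 1]
p_demo = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90]

# Each threshold yields one (FPR, TPR) point on the ROC curve.
for t in (0.3, 0.5, 0.7):
    tpr_t, fpr_t = rates_at_threshold(y_demo, p_demo, t)
    print("threshold=%.1f  TPR=%.2f  FPR=%.2f" % (t, tpr_t, fpr_t))
```

Raising the threshold can only lower or hold both rates, which is why the curve runs from the top-right corner toward the origin. scikit-learn’s `roc_curve` performs this sweep over the distinct score values for you.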

```python
# Predict probabilities
probs = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and ROC area
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = roc_auc_score(y_test, probs)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
```
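The curve describes every possible threshold, but a deployed model needs one. `roc_curve` also returns the thresholds themselves, which the code above discards as `_`. One common heuristic, sketched below, is to pick the threshold that maximizes Youden’s J statistic (TPR − FPR). This is an illustrative choice rather than the method of the original analysis; the helper `youden_threshold` and the demo values are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) maximizing J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    best = int(np.argmax(tpr - fpr))  # index of the ROC point with the largest J
    return float(thresholds[best]), float(tpr[best]), float(1.0 - fpr[best])

# Made-up labels and predicted probabilities, for illustration only.
y_demo = [0, 0, 1, 1, 0, 1, 0, 1]
p_demo = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90]

best_t, sensitivity, specificity = youden_threshold(y_demo, p_demo)
print("threshold=%.2f  sensitivity=%.2f  specificity=%.2f"
      % (best_t, sensitivity, specificity))
```

For the article’s model, the equivalent call would be `youden_threshold(y_test, probs)`. Youden’s J weights sensitivity and specificity equally, which is itself an assumption; a screening setting might reasonably accept more false positives in exchange for fewer missed cases.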

Conclusion

Logistic regression combined with ROC analysis gives a threshold-independent view of how well the model separates patients with diabetes from those without. That matters in healthcare, where the balance between sensitivity and specificity determines how many cases are missed versus how many false alarms are raised, and can therefore guide clinical decisions.

End-to-End Python Example

Here’s the complete Python script that encapsulates loading the dataset, training the logistic regression model, and evaluating it using the ROC curve:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=columns)

# Split dataset
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Train logistic regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Predict probabilities and compute ROC curve
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = roc_auc_score(y_test, probs)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
```

When run, the script displays a single figure: the ROC curve for the held-out test set, with the AUC reported in the legend.

 
