June 17, 2022

Welcome to the wonderful world of widgets.

Today's widget is a Jupyter Notebook, converted to an HTML file, that introduces supervised learning. A support vector machine (SVM) classifier is trained on labeled patient data to predict whether a new input falls on the positive or the negative side of the separating hyperplane learned from the training set, that is, whether the patient is classified as diabetic or non-diabetic.
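To make the two sides of a hyperplane concrete, here is a minimal sketch of the decision rule a linear SVM applies once it has been trained. The weights w and bias b are made-up values for a two-feature toy case, not the ones learned in this notebook, and the helper names are hypothetical.

```python
# Decision rule of a linear classifier: the sign of w.x + b picks the side.
# w and b are hypothetical values chosen for illustration only.
w = [0.8, -0.5]
b = -0.1


def decision_value(x):
    """Signed score; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_side(x):
    """Return 1 for the positive side, 0 for the negative side."""
    return 1 if decision_value(x) > 0 else 0


print(predict_side([1.0, 0.2]))   # score = 0.8 - 0.1 - 0.1 = 0.6
print(predict_side([-0.5, 1.0]))  # score = -0.4 - 0.5 - 0.1 = -1.0
```

Training is the part that finds w and b; prediction afterwards is only this sign check.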

data: BMI, blood glucose, insulin level, and other patient measurements (8 features in total)

STEPS

pre-process data: standardize the features so they are on the same scale
split the data into training and test sets
create a support vector machine classifier
train the support vector machine classifier on the training set

#https://www.youtube.com/watch?v=xUE7SjVx9bQ&list=PLfFghEzKVmjsNtIRwErklMAN8nJmebB0I&index=30
#for the parameters of pd.read_csv, enter: "pd.read_csv?"

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
data = pd.read_csv("dataset/diabetes.csv")
data.head()
data.shape
data.describe()
data["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

0 represents non-diabetic; 1 represents diabetic.
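The two counts are not balanced, which matters later when reading the accuracy scores. A small sketch of the arithmetic, using only the counts printed by value_counts() above:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 500, 1: 268}
total = sum(counts.values())

# Share of each class in the whole dataset.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {counts[label]} rows, {share:.1%} of {total}")
```

Roughly 65% of the rows are non-diabetic, so a model that always predicted 0 would already be right about 65% of the time on this dataset. That is a useful baseline to compare the SVM against.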

data.groupby("Outcome").mean()

# separate the features (x) from the labels (y)
x = data.drop(columns="Outcome")
y = data["Outcome"]
print(x)
print(y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Data Standardization

scaler = StandardScaler()
#the two steps below can be combined into a single fit_transform call
#scaler.fit(x)
#standardized_data = scaler.transform(x)
standardized_data = scaler.fit_transform(x)
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
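The values printed above are z-scores. As a rough sketch of what StandardScaler computes for a single column: subtract the column mean, then divide by the population standard deviation. The glucose readings below are a made-up sample for illustration, not rows from diabetes.csv.

```python
from statistics import fmean, pstdev

# Hypothetical glucose readings, for illustration only.
glucose = [85.0, 148.0, 183.0, 89.0, 137.0]

mean = fmean(glucose)
std = pstdev(glucose)  # population standard deviation (divides by n)

# z-score: how many standard deviations each value lies from the mean.
standardized = [(value - mean) / std for value in glucose]

print([round(z, 3) for z in standardized])
```

After this step the column has mean 0 and standard deviation 1, so features measured in very different units (for example insulin and BMI) contribute on a comparable scale.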

x = standardized_data
y = data["Outcome"]
#print (x)
#print (y)


Train Test Split

#reserve 20% of the dataset for testing; stratify=y keeps the nondiabetic/diabetic ratio the same in the train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=1)
print(x.shape, x_train.shape, x_test.shape)

(768, 8) (614, 8) (154, 8)
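The shapes above follow from simple arithmetic. As a sketch, assuming the usual scikit-learn behavior of rounding a fractional test size up to a whole number of rows:

```python
import math

n_samples = 768
test_size = 0.2

# With a fractional test_size, the test count is rounded up and the
# remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

This reproduces the 614 and 154 rows reported by the shape printout.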

Train model

classifer = svm.SVC(kernel="linear")
classifer.fit(x_train, y_train)

SVC(kernel='linear')

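Evaluate Model

Compare the accuracy score on the training data with the accuracy score on the test data. Overfitting shows up as a high accuracy score on the training data but a low accuracy score on the test data.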

x_train_prediction = classifer.predict(x_train)
#accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
print("Accuracy on Training Data : ", training_data_accuracy)

Accuracy on Training Data :  0.7833876221498371

x_test_prediction = classifer.predict(x_test)
#accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print("Accuracy on Test Data : ", test_data_accuracy)

Accuracy on Test Data :  0.7792207792207793
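Accuracy is simply the fraction of predictions that match the true labels. A minimal pure-Python sketch of that calculation follows; the function name and the labels are hypothetical and only illustrate the idea, they are not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels and predictions, for illustration only.
y_true = [0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 6 of 8 match
```

In this notebook the training accuracy (about 0.783) and the test accuracy (about 0.779) differ by less than half a percentage point, so there is no sign of strong overfitting.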

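Making a Predictive System

Convert input_data from a tuple to a NumPy array, which is faster to process. The model was trained on a 2-D array with one row per example (614 of the 768 examples, 8 features each), so a single data point must be reshaped into one row of 8 values before it can be passed to the scaler and the classifier.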
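The effect of the reshape is easiest to see by printing array shapes. This sketch uses the same input_data tuple as the notebook and assumes only that NumPy is installed:

```python
import numpy as np

# One patient's 8 feature values, taken from the notebook's input_data.
input_data = (5, 120, 78, 23, 79, 28.4, 0.323, 34)

as_array = np.asarray(input_data)
print(as_array.shape)  # 1-D: eight values, no row dimension

# reshape(1, -1): one row, with the column count inferred from the data.
as_row = as_array.reshape(1, -1)
print(as_row.shape)  # 2-D: one sample with eight features
```

The scaler and the classifier both expect the 2-D form, where each row is one sample.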

#known nondiabetic data
#input_data = (1,121,78,39,74,39,0.261,28)

#known diabetic data
#input_data = (2,174,88,37,120,44.5,0.646,24)

input_data = (5,120,78,23,79,28.4,0.323,34)

input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

#standardize data, since training data was standardized
std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifer.predict(std_data)
print(prediction)

if prediction[0] == 0:
    print("Model predicts a nondiabetic")
else:
    print("Model predicts a diabetic")

[[  2. -10.   0.   0.   0.   0.   0.   0.]]
[0]
Model predicts a nondiabetic

Mean of each feature by class, from data.groupby("Outcome").mean():

Outcome  Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin        BMI  DiabetesPedigreeFunction        Age
0           3.298000  109.980000      68.184000      19.664000   68.792000  30.304200                  0.429734  31.190000
1           4.865672  141.257463      70.824627      22.164179  100.335821  35.142537                  0.550500  37.067164