June 17, 2022

Welcome to the wonderful world of widgets.

Today's widget is a Jupyter Notebook, converted to an HTML file, that introduces supervised learning. A support vector machine (SVM) model is trained to predict which side of the separating hyperplane a new input falls on: the "positive" side (diabetic) or the "negative" side (nondiabetic).

data: BMI, blood glucose, insulin level (among 8 features in total)

STEPS

  1. pre-process data: standardize the features so they are all on the same scale
  2. split the data into training and test sets
  3. create a support vector machine classifier
  4. train the support vector machine classifier
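
The four steps above can be sketched end to end in a few lines. This is only a minimal sketch, not the notebook's code: it uses synthetic data from make_classification as a stand-in for diabetes.csv, and it chains the scaler and the classifier with make_pipeline so that the scaler is fit on the training split only (the notebook standardizes the whole dataset before splitting).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes table: 768 rows, 8 numeric features (assumption).
X, y = make_classification(n_samples=768, n_features=8, random_state=1)

# Step 2: hold out 20% for testing, keeping the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Steps 1, 3 and 4: standardize, then fit a linear SVM. The pipeline fits the
# scaler on the training split only and reuses it at prediction time.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

Because the pipeline stores the fitted scaler, model.predict can be called on raw, unscaled rows later on.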
In [7]:
#https://www.youtube.com/watch?v=xUE7SjVx9bQ&list=PLfFghEzKVmjsNtIRwErklMAN8nJmebB0I&index=30
#for the parameters of pd.read_csv, enter: "pd.read_csv?"

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
data = pd.read_csv("dataset/diabetes.csv")
data.head()
data.shape
data.describe()
data["Outcome"].value_counts()
Out[7]:
0    500
1    268
Name: Outcome, dtype: int64

0 represents non-diabetic; 1 represents diabetic.

In [8]:
data.groupby("Outcome").mean()
Out[8]:
         Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin        BMI  DiabetesPedigreeFunction        Age
Outcome
0           3.298000  109.980000      68.184000      19.664000   68.792000  30.304200                  0.429734  31.190000
1           4.865672  141.257463      70.824627      22.164179  100.335821  35.142537                  0.550500  37.067164
In [12]:
# separating features (x) and labels (y)
x = data.drop(columns="Outcome")
y = data["Outcome"]
print(x)
print(y)
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Data Standardization

In [13]:
scaler = StandardScaler()
#scaler.fit(x)
#standardized_data = scaler.transform(x)
#the two calls above can be combined into a single fit_transform call
standardized_data = scaler.fit_transform(x)
print(standardized_data)
[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
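
To make the standardization step concrete, the sketch below checks that StandardScaler gives the same result as subtracting each column's mean and dividing by its standard deviation. The small raw array is made up for illustration and is not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only (two columns on very different scales).
raw = np.array([[1.0, 100.0],
                [2.0, 150.0],
                [3.0, 200.0]])

scaled = StandardScaler().fit_transform(raw)

# Same computation by hand: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the SVM just because of its units.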
In [16]:
x = standardized_data
y = data["Outcome"]
#print (x)
#print (y)

Train Test Split

In [18]:
#reserve 20% of the dataset for testing; stratify=y keeps the nondiabetic/diabetic ratio the same in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=1)
print(x.shape, x_train.shape, x_test.shape)
(768, 8) (614, 8) (154, 8)
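
The effect of stratify can be checked on the labels alone. The sketch below rebuilds a label array with the same class counts as the Outcome column (500 zeros, 268 ones) and compares the share of class 1 before and after the split. The feature column is a placeholder, not real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the notebook's Outcome column:
# 500 zeros and 268 ones. The feature column is a placeholder.
y = np.array([0] * 500 + [1] * 268)
x = np.arange(len(y)).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=1
)

print("overall share of class 1:", y.mean())
print("train share of class 1:  ", y_train.mean())
print("test share of class 1:   ", y_test.mean())
```

All three shares should come out nearly identical. Without stratify, the test split could end up with noticeably more or fewer diabetic cases by chance.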

Train model

In [19]:
classifer = svm.SVC(kernel="linear")
classifer.fit(x_train, y_train)
Out[19]:
SVC(kernel='linear')
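
With a linear kernel, the fitted model is a single hyperplane w . x + b = 0, and the predicted class depends on which side of it a point falls. The sketch below illustrates this on synthetic data (an assumption, standing in for the standardized features) by rebuilding the signed score from coef_ and intercept_ and comparing it with decision_function and predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for the standardized features (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

clf = SVC(kernel="linear").fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
signed_score = X @ w + b

# A positive score falls on the class-1 side, a negative score on the class-0 side.
side = (signed_score > 0).astype(int)

print("matches decision_function:", np.allclose(signed_score, clf.decision_function(X)))
print("agreement with predict:   ", np.mean(side == clf.predict(X)))
```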

Evaluate Model

Metric: accuracy score. Overfitting shows up as a high accuracy score on the training data and a low accuracy score on the test data.

In [20]:
x_train_prediction = classifer.predict(x_train)
#accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
print("Accuracy on Training Data : ", training_data_accuracy)
Accuracy on Training Data :  0.7833876221498371
In [22]:
x_test_prediction = classifer.predict(x_test)
#accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print("Accuracy on Test Data : ", test_data_accuracy)
Accuracy on Test Data :  0.7792207792207793
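
One way to put the accuracy score in context is to compare it with a trivial baseline that always predicts the majority class. The comparison below is not part of the original notebook. It reuses the class counts and the test accuracy printed above, and computes the baseline from the full-dataset counts (the stratified test split has nearly the same ratio).

```python
# Class counts reported by data["Outcome"].value_counts() in the notebook.
nondiabetic = 500
diabetic = 268

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = nondiabetic / (nondiabetic + diabetic)

# Test accuracy printed by the notebook.
test_accuracy = 0.7792207792207793

print(f"majority-class baseline: {baseline_accuracy:.3f}")
print(f"SVM test accuracy:       {test_accuracy:.3f}")
print(f"gain over baseline:      {test_accuracy - baseline_accuracy:.3f}")
```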

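Making a Predictive System

The input_data tuple is converted to a NumPy array, which is faster to process. The model was trained on a 2-D array of many examples (614 of the 768 rows in the dataset), so a single data point has to be reshaped into one row of 8 columns with reshape(1, -1).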

In [51]:
#known nondiabetic data
#input_data = (1,121,78,39,74,39,0.261,28)

#known diabetic data
#input_data = (2,174,88,37,120,44.5,0.646,24)

input_data = (5,120,78,23,79,28.4,0.323,34)

input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

#standardize data, since training data was standardized
std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifer.predict(std_data)
print(prediction)

if prediction[0] == 0:
    print("Model predicts a nondiabetic")
else:
    print("Model predicts a diabetic")
[[  2. -10.   0.   0.   0.   0.   0.   0.]]
[0]
Model predicts a nondiabetic
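
The convert, reshape, scale, predict sequence can be wrapped in a small helper so it is applied the same way each time. This is a sketch under stated assumptions: predict_one is a hypothetical name, and the scaler and model here are trained on synthetic data standing in for diabetes.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_one(scaler, model, features):
    """Scale a single sample and return the predicted label as an int."""
    # reshape(1, -1) turns shape (8,) into (1, 8): one row, eight columns.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return int(model.predict(scaler.transform(row))[0])


# Synthetic training data standing in for diabetes.csv (assumption).
X, y = make_classification(n_samples=768, n_features=8, random_state=1)
scaler = StandardScaler().fit(X)
model = SVC(kernel="linear").fit(scaler.transform(X), y)

sample = X[0]
print(sample.shape, sample.reshape(1, -1).shape)
label = predict_one(scaler, model, sample)
print("Model predicts a nondiabetic" if label == 0 else "Model predicts a diabetic")
```

The helper accepts a tuple, a list, or a 1-D array, since np.asarray handles all three.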