
DATA SCIENCE TERM PROJECT
(DSCI 64210-001-201880)
PROF. JASON COLON
PREDICT IF THE PERSON WILL GET A STROKE OR NOT.

SANIL UMAKANT KAMAT
811015517
INDEX
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
References


BUSINESS UNDERSTANDING
• Stroke is now a major and common cause of death, and there is no telling when a patient will suffer one.
• Its after-effects are perilous: a stroke can put a person into a coma, cause paralysis, or even cause sudden death.
• Preventive measures for stroke are always more beneficial and cost-effective than responsive medical treatment such as surgery.
• Through this project, I am trying to predict beforehand whether a person will get a stroke, so that timely medication can save a life.

DATA UNDERSTANDING
• The main idea of this project is to predict whether a "person will get a stroke or not"; the target variable is therefore "stroke". I obtained the dataset from Kaggle.com (Saumya Agarwal, 2018).
• The dataset contains the following attributes:
  - Id
  - Gender
  - Age
  - Hypertension
  - Heart disease
  - Ever_married
  - Work_type
  - Residence_type
  - Avg_glucose_level
  - Bmi
  - Smoking status
  - Stroke


DATA PREPARATION
• Data preparation is the most challenging and most crucial part of the data mining process.
• The steps I carried out during this phase were:
  - Convert the dataset from CSV format to ARFF format.
  - Identify which attributes are useful for predicting the target variable "stroke" and which are not.
  - Clean the data by removing records with blank values.
  - Remove the attributes that are not useful.
  - After removing the unuseful attributes, filter the data by applying filtering techniques.
• The final list of attributes after filtering is:
  - Gender
  - Age
  - Hypertension
  - Heart disease
  - Ever married
  - Work type
  - Residence type
  - Avg glucose level
  - Bmi
  - Smoking status
  - Stroke
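Outside the WEKA/ARFF workflow actually used, the cleaning steps above can be sketched in plain Python: drop the non-predictive "id" attribute and discard records with blank values. The attribute names follow the Kaggle list; the two sample records are invented for illustration.

```python
def clean_rows(rows):
    """Drop the id attribute and any record containing a blank value."""
    cleaned = []
    for row in rows:
        row = {k: v for k, v in row.items() if k != "id"}  # id is not predictive
        if any(v == "" for v in row.values()):             # blank cell -> drop record
            continue
        cleaned.append(row)
    return cleaned

# Invented sample records mirroring the dataset's attributes.
raw = [
    {"id": "1", "gender": "Male", "age": "67",
     "smoking_status": "never smoked", "stroke": "1"},
    {"id": "2", "gender": "Female", "age": "45",
     "smoking_status": "", "stroke": "0"},  # blank smoking status
]
rows = clean_rows(raw)
```

The second record is dropped because of its blank smoking status, and "id" is removed from the surviving record.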
DATA CLEANING
Before: (screenshot omitted)

After: (screenshot omitted)
• After obtaining the desired attributes, I applied a supervised filtering technique: the Discretize filter under the attribute section.

• Before filtering: (screenshot omitted)

• After filtering: (screenshot omitted)
• I converted the target variable from numeric to nominal, since numeric class values are not accepted by J48 and other methods.
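The numeric-to-nominal conversion of the target can be sketched as a simple relabelling: the 0/1 stroke values are mapped to class labels so the class is treated as nominal. The label names "No"/"Yes" are assumptions for illustration.

```python
# Assumed label names for the two stroke classes (0 = no stroke, 1 = stroke).
LABELS = {0: "No", 1: "Yes"}

def to_nominal(values):
    """Map numeric 0/1 target values to nominal class labels."""
    return [LABELS[int(v)] for v in values]

nominal = to_nominal([0, 1, 0])
```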

MODELING
• After filtering the data, I will be using a supervised data mining technique to predict the target variable.
• The reason for using a supervised technique is that the output target variable is predicted from the labelled input variables (Machine Learning Mastery, 2016).
• Classification: classification models include algorithms such as the J48 decision tree, Naïve Bayes, and Adaptive Bayes. I will be using the Naïve Bayes algorithm.

OUTPUT:
NAÏVE BAYES MODEL:
• Naïve Bayes uses the concept of conditional probabilities. Its advantages include:
  - It is fast.
  - It is highly scalable.
• The disadvantage of this model is its implicit assumption that the attributes in the dataset are mutually independent.
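The independence assumption can be illustrated with a toy scoring function: P(class | attributes) is scored as P(class) multiplied by each P(attribute | class), i.e. the attribute probabilities simply multiply. All numbers below are invented and are not taken from the model's output.

```python
def naive_bayes_score(prior, likelihoods):
    """Score a class as prior * product of per-attribute likelihoods."""
    score = prior
    for p in likelihoods:
        score *= p  # independence assumption: likelihoods multiply
    return score

# Invented figures: a rare "Yes" class with strong attribute evidence
# can still lose to a dominant "No" class.
score_yes = naive_bayes_score(0.018, [0.6, 0.4])
score_no = naive_bayes_score(0.982, [0.3, 0.2])
predicted = "Yes" if score_yes > score_no else "No"
```

This also hints at why a class-imbalanced dataset like this one tends to predict "No": the large prior for "No" dominates the product.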

Output Of My Model

• I used the 10-fold cross-validation method to obtain the output.
• The accuracy of this model is 96.7258%.
• Correctly classified instances: 41979 out of 43400.
• Incorrectly classified instances: 1421 out of 43400.
• The total number of instances in the dataset is 43400.
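The reported figures are internally consistent, as a quick arithmetic check shows:

```python
# Reported cross-validation counts: 41979 correct of 43400 instances.
correct, total = 41979, 43400
incorrect = total - correct               # 1421, as reported
accuracy_pct = 100 * correct / total      # matches the reported 96.7258%
```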

EVALUATION

• The training data is heavily biased towards "Class No", which is label 0, as shown below.

• Confusion matrix (rows = actual class, columns = predicted class):

                Predicted No   Predicted Yes
  Actual No          41892             725
  Actual Yes           696              87

• As most of the training data belongs to "Class No", 41892 instances are correctly modelled as "person will not get a stroke".
• 87 instances are correctly modelled as "person will get a stroke".
• Therefore, the total number of correctly classified instances is 41979 (41892 + 87), which is 96.7258%.
• The total number of incorrectly classified instances is 1421 (725 + 696), which is 3.2742%.

True Positive Rate, False Positive Rate, Precision, Recall.

• According to the TP rate, 98.3% of the people who will not get a stroke are correctly modelled, but only 11.1% of the people who will get a stroke are.
• Recall is also called the True Positive Rate or Sensitivity.
  - It is given by Recall = TP / (TP + FN).
  - From the confusion matrix above, for Class No, TP is 41892 and FN is 725.
  - Substituting these values of TP and FN into the recall formula gives 0.983 for Class No (the people who will not get a stroke).
  - For Class Yes, TP is 87 and FN is 696; therefore Recall is 0.111 (the people who will get a stroke).
• Precision is also known as Positive Predictive Value (PPV).
  - Precision = TP / (TP + FP).
  - From the confusion matrix above, for Class No, TP is 41892 and FP is 696.
  - Substituting these values of TP and FP into the formula gives 0.984 for Class No (the people who will not get a stroke).
  - For Class Yes, TP is 87 and FP is 725; therefore Precision is 0.107 (the people who will get a stroke).
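The per-class recall and precision figures can be recomputed directly from the confusion matrix quoted above (rows = actual class, columns = predicted class):

```python
# Confusion matrix from the report: rows are actual, columns are predicted.
matrix = {
    "No":  {"No": 41892, "Yes": 725},
    "Yes": {"No": 696,   "Yes": 87},
}

def recall(cls):
    """TP / (TP + FN): TP on the diagonal, FN in the rest of the row."""
    tp = matrix[cls][cls]
    fn = sum(n for pred, n in matrix[cls].items() if pred != cls)
    return tp / (tp + fn)

def precision(cls):
    """TP / (TP + FP): TP on the diagonal, FP in the rest of the column."""
    tp = matrix[cls][cls]
    fp = sum(matrix[actual][cls] for actual in matrix if actual != cls)
    return tp / (tp + fp)
```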

ROC (Receiver Operating Characteristic) Curve:

• This curve plots the false positive rate on the x-axis and the true positive rate on the y-axis.
• The area under the ROC curve is 0.8383 (83.83%).
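The area under a ROC curve is obtained by integrating the curve, e.g. with the trapezoidal rule over (FPR, TPR) points. The three points below are invented for illustration; the reported area of 0.8383 comes from WEKA's output, not from this sketch.

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return area

# Invented three-point curve from (0, 0) to (1, 1).
example_auc = auc([(0.0, 0.0), (0.2, 0.7), (1.0, 1.0)])
```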

Deployment:

• This system for predicting whether a patient will suffer a stroke would be valuable to medical insurers: preventive measures could be taken for the patients identified as being at risk, and the benefit could be demonstrated by tracking and comparing medical costs. Patients who adopt preventive measures and achieve better outcomes can be seen as having had their increased stroke risk caught early.
• A risk of this approach is falsely identifying patients as being at high risk. Project expenses could be justified if the deployed tests demonstrate cost savings.

References:
Machine Learning Mastery. (2016, September 22). Supervised and Unsupervised Machine Learning Algorithms. Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/.
Saumya Agarwal. (2018, April 16). Patient data train and test. Retrieved from https://www.kaggle.com/asaumya/patient-data-train-and-test/version/1.
Simafore. (n.d.). 4 key advantages of using decision trees for predictive analytics. Retrieved from http://www.simafore.com/blog/bid/62333/4-key-advantages-of-using-decision-trees-for-predictive-analytics.
Simafore. (n.d.). 3 challenges with Naive Bayes classifiers and how to overcome. Retrieved from http://www.simafore.com/blog/3-challenges-with-naive-bayes-classifiers-and-how-to-overcome.
Wikipedia. (2018, November 2). Precision and Recall. Retrieved from https://en.wikipedia.org/wiki/Precision_and_recall#Precision.
