Name: Sree Lakshmi Addepalli
Email: [email protected] Number: N12311918
PREDICTIVE ANALYTICS HOMEWORK 1
I
Difference Between Forecasting and Predictive Analysis

Source: https://powerpivotpro.com/2017/08/difference-forecasting-predictive-analytics/

Forecasting
It anticipates a trend over a given period (mostly months or years) for a large population.

The timelines have to be long in order to anticipate what comes next.

Time is an implicit attribute.

It deals with a large population rather than with an individual.

It is specific to one kind of result (such as weather or production) and deals with numbers.

It looks at a time series of numeric data and predicts future values from the trends in it.

In the example above, forecasting takes a large population into consideration, along with timelines spanning months, and concludes that the population is likely to choose Product A.

Predictive Analytics
It anticipates what an individual customer may select or purchase, based on the current point in time.

The timelines are short.

Time is not an implicit attribute.

It deals with a single customer and his past actions.

It is a broad field that covers a number of tools and techniques, such as ML models.

It takes into consideration the various inputs that affect a user’s decision in order to predict his future behaviour, and the inputs need not be numbers.

In the example above, from the user’s past searches we can predict that he might select Product B, because Product B is similar to what he searched for. Neither numbers nor long timelines are involved.

Nate Silver Forecasting
For the 2012 presidential election, Nate Silver gave Obama a 90.9% likelihood of winning, and he predicted the results of 50 out of 50 states correctly. He produced these predictions using statistical forecasting.

Source: https://twitter.com/cosentino/status/266042007758200832/photo/1

Nate’s model has two parts: the process, i.e. the percentage of the population who intend to vote for each candidate, and the sampling, i.e. how polls are affected by actual voting intentions. He then models the intended vote in each state, and this is used to predict the actual vote. The polling data may have been collected months before the election, and intentions vary from day to day, so a time series has to be incorporated. For each polling day, attributes such as race and gender were also taken into consideration.

Because the vote could go either way on different days, Nate Silver applied forecasting after collecting data for months and then predicted the outcome. Time is an implicit variable here, and hence an integral part of forecasting.

Source: http://blog.revolutionanalytics.com/2012/11/in-the-2012-election-data-science-was-the-winner.html

President Obama Predictive Analytics Campaign
For the 2012 campaign, President Obama’s team worked with a group of data scientists and used the idea of an uplift model. An uplift model is a type of predictive analytics that predicts which voters will respond positively after being influenced by a call, an ad or a mailing, and which will respond negatively.

In the uplift (or persuasion) model, the lost causes are those who will not vote for Obama even if you try to influence them, because they are supporters of Romney. The do-not-disturbs, or sleeping dogs, are those whose vote may turn towards Romney if they are sent an ad or banner, and who should therefore not be contacted. The sure things are those who will vote for Obama even if they receive no ad or campaign contact, so investing in them is a waste of resources. Only the persuadables need to be targeted for campaigning.
After its experiments succeeded, the analysts used a matrix of political, household and demographic data to develop a model and obtain a score for each person. Those whose score fell in the persuadable range were called, visited at home or shown ads in the final weeks of the campaign. The models helped the campaign reach voters at the individual level, based on the message they were most likely to respond to and the form of campaigning most likely to persuade them.

So here the predictive model took political, household and demographic data as its factors and predicted a score. There were no timelines, and it was targeted at each individual. So this is a predictive model.

References:
https://www.quora.com/What-is-the-real-difference-between-forecasting-and-predictive-analytics
https://powerpivotpro.com/2017/08/difference-forecasting-predictive-analytics/
https://searchbusinessanalytics.techtarget.com/video/How-uplift-modeling-helped-Obamas-campaign-and-can-aid-marketers
https://www.theguardian.com/science/grrlscientist/2012/nov/08/nate-sliver-predict-us-election

Obama Campaign using Uplifting model
For the 2012 presidential election, Obama needed to campaign in a way that would win him more support from voters.

The campaign also had to use its resources as efficiently as possible.

Hence, after consultation with a group of data scientists, it was decided to use an uplift model for the campaign.

An uplift model is a type of predictive analytics that predicts who will respond positively after being influenced by a call, an ad or a mailing, and who will respond negatively.

In the diagram below, the lost causes are those who will not vote for Obama even if you try to influence them, because they are supporters of Romney. The do-not-disturbs, or sleeping dogs, are those whose vote may turn towards Romney if they are sent an ad or banner, and who should therefore not be contacted. The sure things are those who will vote for Obama even if they receive no ad or campaign contact, so investing in them is a waste of resources. Only the persuadables need to be targeted for campaigning.

Source: https://www.analyticbridge.datasciencecentral.com/profiles/blogs/what-are-uplift-models

To apply uplift modelling to President Obama’s campaign, the team first set up an experiment in which some voters were called and some were not; later both groups were polled to see whether they would vote for President Obama. Support increased by 5 percentage points in the persuadable group.

After this experiment succeeded, the analysts used a matrix of political, household and demographic data to develop a model and obtain a score for each person. Those whose score fell in the persuadable range were called, visited at home or shown ads in the final weeks of the campaign.

For TV ads, an optimizer was created to list “good buys”, based on which channels large numbers of persuadable voters watched.
The models helped the campaign reach voters at the individual level, based on the message they were most likely to respond to and the form of campaigning most likely to persuade them.
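
The experiment described above (contact some voters, leave others alone, then poll both groups) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Voter` fields, the segment labels "A" and "B" and all the counts are made up, and the campaign's actual model is not described in the sources.

```python
from dataclasses import dataclass

@dataclass
class Voter:
    segment: str     # hypothetical segment label, e.g. derived from a demographic score
    contacted: bool  # True if the voter was in the called (treated) group
    supports: bool   # True if the voter later said they would support the candidate

def support_rate(voters):
    """Share of the given voters who support the candidate."""
    return sum(v.supports for v in voters) / len(voters) if voters else 0.0

def uplift_by_segment(voters):
    """Uplift = support rate when contacted minus support rate when not contacted."""
    result = {}
    for seg in sorted({v.segment for v in voters}):
        treated = [v for v in voters if v.segment == seg and v.contacted]
        control = [v for v in voters if v.segment == seg and not v.contacted]
        result[seg] = support_rate(treated) - support_rate(control)
    return result

# Made-up poll results for two segments.
voters = (
    [Voter("A", True, True)] * 6 + [Voter("A", True, False)] * 4 +
    [Voter("A", False, True)] * 5 + [Voter("A", False, False)] * 5 +
    [Voter("B", True, True)] * 3 + [Voter("B", True, False)] * 7 +
    [Voter("B", False, True)] * 5 + [Voter("B", False, False)] * 5
)
uplift = uplift_by_segment(voters)

# Only segments with positive uplift are worth contacting.
targets = [seg for seg, u in uplift.items() if u > 0]
```

In this toy data segment A behaves like the persuadables (contact raises support), while segment B behaves like the sleeping dogs (contact lowers support), so only A ends up in `targets`.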

References:
https://www.analyticbridge.datasciencecentral.com/profiles/blogs/what-are-uplift-models
https://searchbusinessanalytics.techtarget.com/video/How-uplift-modeling-helped-Obamas-campaign-and-can-aid-marketers

Failure of Forecasting in 2016 US presidential Elections
The outcome of the 2016 election ran counter to every major forecast, including those of the Princeton Election Consortium and Nate Silver’s FiveThirtyEight, undercutting belief in the power of big data science and predictive analytics. All of them predicted that the Democrats had the advantage, but this did not happen. Several factors contributed to this failure.

Trump’s campaign targeted demographic groups that were present in large numbers, which improved his results in the swing states Florida and Pennsylvania.

Clinton performed poorly in several swing states compared with Obama’s performance in 2012.

A large number of Clinton’s potential voters lived in populous, traditionally “blue” states, but also in some very populous, traditionally “red” states such as Texas, which were considered safe for Trump.
Pollsters stated that most of the polls were accurate, but that pundits’ interpretation of these polls neglected polling error. According to Public Opinion Quarterly, the main sources of polling error were “a late swing in vote preference toward Trump” and a failure to adjust for the over-representation of college graduates, who favoured Clinton, whereas the share of undisclosed Trump voters proved to be negligible.
Nate Silver found that the high number of third-party and undecided voters in the election was neglected in most of these models, and that many of these voters decided to vote for Trump.
References: https://en.wikipedia.org/wiki/United_States_presidential_election,_2016

Use of Predictive Analytics in elections of 2016
Hillary Clinton
Clinton’s campaign used a complex computer algorithm named “ADA”. The algorithm played a part in almost every strategic decision Clinton’s aides took, such as when and where to deploy the candidate and where to air television ads. Decisions on county-level campaign offices and on staging high-profile concerts also depended on ADA.

The algorithm was fed with private and public polling numbers as well as ground-level voter data. Once early voting began, those numbers were factored in too. ADA ran 400,000 simulations per day on that data to show what the race against Trump looked like. This guided Clinton in deciding where to spend her time and resources.

Because of the algorithm, the importance of some states, such as Michigan and Wisconsin, was underestimated: Michigan was visited only late in the campaign and Wisconsin was not visited at all.

Reference: https://www.washingtonpost.com/news/post-politics/wp/2016/11/09/clintons-data-driven-campaign-relied-heavily-on-an-algorithm-named-ada-what-didnt-she-see/?noredirect=on&utm_term=.319173b2e98f

Donald Trump
Cambridge Analytica entered American politics with the goal of giving Republicans big data tools to compete with the Democrats.

It worked on developing detailed psychological profiles of every American voter so that campaign pitches could be personalised.

The data was obtained from a 120-question survey about people’s personality and behaviour, which scored them on five personality traits, including openness, agreeableness and extroversion.

The results were joined with polls, voters’ online activity and voter records.

Cambridge Analytica also obtained a lot of data from a professor, Aleksandr Kogan, who created an app that required signing in with Facebook and thereby collected data on its users as well as on their friends, leading to a privacy breach.

Reference: https://www.npr.org/2018/03/20/595338116/what-did-cambridge-analytica-do-during-the-2016-election

II
Predictive Analytics
Predictive analytics is a methodology that applies various tools and techniques to a knowledge base collected from historical events in order to predict how likely it is that a given event will occur in the future.

The tools and techniques can be any kind of statistical method or algorithm. Basically, predictive analytics extracts the hidden information in the knowledge base, and the hidden relations between its attributes, which can then be used to predict how a certain attribute will behave in the future. Companies can use this to make decisions now, so that an undesirable event that is predicted for the future can be prevented.

For example, for a company that monitors the cash level in its ATMs, predictive analytics can predict the cash replenishment date in advance, which helps the company manage its inventory without having to check the ATMs manually. Here the data source is the transaction history, and the hidden variables are the transaction amount per day, the time of peak withdrawal and the denominations withdrawn most often. One technique is to fit a mathematical model to the tabulated transaction data and use it to obtain the required value for the future.
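
As a minimal sketch of the ATM example, the simplest possible "model" is the mean daily withdrawal, from which the number of days of cash left can be estimated. The function name, the balance and the withdrawal history below are all made up for illustration; a real system would fit a richer model to the transaction data.

```python
def days_until_empty(balance, daily_withdrawals):
    """Estimate the days of cash left from the mean of past daily withdrawals."""
    mean_daily = sum(daily_withdrawals) / len(daily_withdrawals)
    if mean_daily <= 0:
        return None  # no outflow observed, so no depletion date can be estimated
    return balance / mean_daily

history = [12000, 15000, 9000, 14000, 10000]  # made-up daily withdrawal totals
days_left = days_until_empty(90000, history)  # 90000 / 12000 = 7.5 days
```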

Reference: https://cs.nyu.edu/~abari/PA/Ch1/Ch1_IntrotoPAandRelatedFieldsPP.pdf

Prescriptive Analytics
Any analytical problem can be approached in 3 steps, namely:
Descriptive Analytics: Looks into the details of the problem by mining the data, finding relationships between the attributes and summarising them.

Predictive Analytics: Looks through the data and predicts how a certain attribute will behave in the future.

Prescriptive Analytics: While predictive analytics looks into what will happen and when, prescriptive analytics looks into why it will happen and suggests the decisions to take accordingly.
For example, consider the previous example, where we predicted the day on which to replenish the cash. Prescriptive analytics goes further and explains why the cash is being used up at that rate. Suppose the cash runs out very quickly because the ATM is in a crowded neighbourhood; the ATM company can then decide to install another ATM nearby to reduce this rate. If the cash is used up slowly, the company can look for a location where the ATM would be used more. Taking such decisions is possible through prescriptive analytics.

Source: https://en.wikipedia.org/wiki/Prescriptive_analytics#/media/File:Three_Phases_of_Analytics.png

Reference:
https://en.wikipedia.org/wiki/Prescriptive_analytics

CRISP-DM
It stands for Cross Industry Standard Process for Data Mining.

It is a project life cycle used extensively by data scientists for data analytics projects.

It consists of 6 steps, and the cycle repeats whenever changes to the model are needed.

Business Understanding: For a data mining project this usually involves gathering the business requirements, defining the end goals, and preparing the estimates, timelines and project plan. For the previous example, the end goal would be to work out the number of days before the cash needs to be replenished.

Data Understanding: This step is about obtaining the data sources, looking through them, understanding the terms and terminology, and checking whether the data is of good quality or contains random values and anomalies. In the previous example, the transaction logs can be studied to find out how many notes of each denomination are being used, and to check whether the transactions are correct or contain discrepancies.

Data Preparation: Here the data from all the required sources is pulled, integrated and put into a structured form by removing discrepant data, filling in missing values and creating new attributes. In the previous example, the transaction logs need to be converted from an unstructured to a structured format, new attributes such as notes consumed per hour must be created, and all missing values must be filled in appropriately or removed to maintain consistency. Roughly 70% of the time in a data analytics project is commonly said to go into this step.

Modelling: Here a mathematical model is applied to predict the value of an attribute. Various models can be tested to see which fits the prepared data best, which is why the Data Preparation and Modelling steps are bidirectional. This is done to obtain good accuracy. For example, a linear regression can be applied to the transaction data to predict the number of days before the cash needs to be replenished.

Evaluation: The model is reworked until its results meet the criteria required for it to be declared a success.

Deployment: After the model has been evaluated and marked successful, it is embedded in the application that requires it and finally deployed. If the requirements change, it has to go through the 6 steps again.

Source: https://en.wikipedia.org/wiki/Crossindustry_standard_process_for_data_mining#/media/File:CRISP-DM_Process_Diagram.png

References:
https://cs.nyu.edu/~abari/PA/Ch1/Ch1_IntrotoPAandRelatedFieldsPP.pdf
https://www.ibm.com/developerworks/bpm/library/techarticles/1407_chandran/index.html

Kafka
Kafka is used by many companies to build real-time streaming applications that transform streams of data, and data pipelines that move data between systems.
It is a distributed publish-subscribe messaging system: messages are stored as records, grouped into topics and published for applications to subscribe to. A record contains a key, a value and a timestamp. Subscribers to a topic receive the messages published to it.

Kafka can handle huge amounts of data and is fault tolerant, i.e. the messages are persisted on disk and replicated across the servers of the cluster.

Kafka is reliable, durable and scalable, and is designed to run without downtime.

Kafka can be used for operational metrics, stream processing and log aggregation.

Kafka has 4 APIs, namely:
Producer API: allows an application to publish streams of records to a topic.

Consumer API: allows an application to subscribe to a topic and receive its stream of records.
Streams API: consumes an input stream, transforms it and produces the transformed stream for others to consume.

Connector API: allows building and running reusable producers and consumers that connect topics to existing applications.
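
The publish-subscribe idea can be illustrated with a small in-memory toy. This is not the Kafka client API: `ToyBroker`, `publish` and `poll` are hypothetical names invented for this sketch, and there is no persistence, replication or networking. It only shows records with a key, a value and a timestamp being appended to a topic, and each consumer reading from its own position.

```python
import time
from collections import defaultdict
from typing import NamedTuple

class Record(NamedTuple):
    key: str
    value: str
    timestamp: float

class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only log."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, key, value):
        """Producer side: append a record to the end of the topic's log."""
        self.topics[topic].append(Record(key, value, time.time()))

    def poll(self, consumer, topic):
        """Consumer side: return unseen records and advance this consumer's offset."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]

broker = ToyBroker()
broker.publish("orders", "user-1", "bought book")
broker.publish("orders", "user-2", "bought pen")
first = broker.poll("billing", "orders")   # both records published so far
broker.publish("orders", "user-1", "returned book")
second = broker.poll("billing", "orders")  # only the record added since the last poll
other = broker.poll("audit", "orders")     # a new consumer reads the log from the start
```

Because every consumer keeps its own offset, publishing and consuming are decoupled: a late subscriber such as "audit" still sees the full history of the topic.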

Source: https://kafka.apache.org/intro

References:
https://kafka.apache.org/intro
https://www.tutorialspoint.com/apache_kafka/apache_kafka_introduction.htm

5. Apache Spark
It is a distributed big data framework and a fast, unified analytics engine for machine learning and big data.

It provides native bindings for Scala, Python, Java and R, and supports graph processing, SQL, machine learning and streaming data.

It has an in-memory data engine, which makes it faster than MapReduce systems for many workloads.
The Spark API hides much of the complexity of distributed processing behind method calls, which makes developers’ lives easier.

It works on the idea of the resilient distributed dataset (RDD). An RDD is an immutable collection of objects that can be split across a computing cluster.

RDDs support joining datasets, filtering and aggregation.

A driver process splits the application into tasks and distributes them to executors, whose number can be scaled up or down.
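
The driver and executor roles can be imitated on a single machine in plain Python. The sketch below is not the Spark API; `split` and `task` are hypothetical helpers, and a thread pool stands in for the executors. It only shows an immutable dataset being partitioned, one task running per partition, and the partial results being combined.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_parts):
    """Driver side: cut the dataset into roughly equal immutable partitions."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [tuple(data[i:i + size]) for i in range(0, len(data), size)]

def task(partition):
    """Executor side: keep the even numbers, square them and sum the partition."""
    return sum(x * x for x in partition if x % 2 == 0)

data = list(range(1, 11))
partitions = split(data, 3)  # (1, 2, 3, 4), (5, 6, 7, 8), (9, 10)

# Each worker thread plays the role of an executor running one task.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(task, partitions))

total = sum(partial_sums)  # aggregation of the per-partition results
```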

References:
https://www.infoworld.com/article/3236869/analytics/what-is-apache-spark-the-big-data-analytics-platform-explained.html
https://databricks.com/spark/about

6. Neural Network
It is a system that is inspired by, and modelled on, the human brain.

Neural networks learn to perform tasks such as classification from samples. For example, if a network is given images labelled as cats, without hand-written features such as “it has fur”, then after enough training samples it can decide whether a new image shows a cat or not. It derives the features it needs while learning from the images.

It consists of layers of nodes connected to each other.
In a feedforward network the connections are initially assigned random weights.
Each node in a subsequent layer receives values from the nodes of the previous layer, multiplies each value by the weight of its connection and adds the products together. Only if this weighted sum is above a threshold does the node forward data to the next layer.

During training the weights and thresholds are adjusted until the labels output by the network match the labels of the training images.

Once this is achieved we have our model, to which we can pass a test image to get the class it belongs to.

Algorithms such as backpropagation are used to adjust the weights and improve accuracy.
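
The weighted sum, the threshold and the weight adjustment can be seen in the smallest possible case, a single node trained with the classic perceptron rule. This is a sketch under assumptions, not backpropagation and not a multi-layer network: the weights start at zero instead of random values so that the run is reproducible, and the training data is a made-up AND gate.

```python
def predict(weights, bias, inputs):
    """Forward a 1 only if the weighted sum of the inputs exceeds the threshold (0)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=20, rate=1):
    """Perceptron rule: nudge the weights whenever the output label is wrong."""
    weights, bias = [0, 0], 0  # zero start (an assumption) keeps the run deterministic
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
        if mistakes == 0:  # every training label is reproduced, so stop
            break
    return weights, bias

# Training samples: (inputs, label) for a logical AND.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_gate)
outputs = [predict(weights, bias, inputs) for inputs, _ in and_gate]
```

After training, `outputs` reproduces the labels of the four samples. A single node like this can only separate classes with a straight line, which is why real networks stack several layers.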

Source: https://en.wikipedia.org/wiki/Artificial_neural_network

References:
http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414
https://en.wikipedia.org/wiki/Artificial_neural_network

7. Big Data
Data that cannot be managed or processed by traditional databases because of its size, type or speed is classified as Big Data.

Big Data has 3 properties, namely:
Volume: It can range from megabytes to gigabytes, terabytes or petabytes.

Velocity: Data may arrive in batches (monthly, weekly, daily or hourly) or in real time.
Variety: It can be sensor data, images, audio, video or click streams.
Big Data can be structured, semi-structured or unstructured.

Source: https://www.semanticscholar.org/paper/On-the-inequality-of-the-3V’s-of-Big-Data-A-case-Ivanov-Korfiatis/8b2837f15dfa82432fdf6dfbef0f370eeb67b6c7/figure/0

References:
https://www.ibm.com/analytics/hadoop/big-data-analytics
https://cs.nyu.edu/~abari/PA/Ch1/Ch1_IntrotoPAandRelatedFieldsPP.pdf

8. Recommender Systems
It is a system that predicts which item, out of many, a customer is likely to purchase or rate highly.

Recommender systems are used to recommend items in e-commerce stores, movies, books and even matches on dating sites.

There are three types of recommender systems:
Collaborative Filtering: It is of two types, item-based collaborative filtering and user-based collaborative filtering. The user-based method assumes that if two people have similar likes and dislikes on the same items, then their likes and dislikes will be similar on other items as well. For example, suppose person A likes certain movies. We find a person B with tastes similar to A’s and then recommend to A the movies that B has liked and A has not watched. In item-based collaborative filtering, similarity is computed between items: if a user views item A and it is similar to items B and C, then the user is recommended items B and C.

Content-based Filtering: This method recommends items based on the history of the user’s preferences as well as the items’ profiles. So, to make a recommendation for person A, his user profile is examined for his history of likes and purchases, and each candidate item’s profile is matched against it. The best-matching items are then recommended.

Hybrid Filtering: This is a mix of collaborative filtering and content-based filtering.
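
User-based collaborative filtering, as described above, can be sketched as follows. The ratings are made up, and cosine similarity over the commonly rated items is just one possible similarity measure, chosen here as an assumption; taking only the single most similar user is also a simplification.

```python
from math import sqrt

ratings = {  # made-up ratings: user -> {movie: score}
    "A": {"m1": 5, "m2": 4},
    "B": {"m1": 5, "m2": 4, "m3": 5},
    "C": {"m1": 1, "m2": 2, "m4": 5},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Recommend the items rated by the most similar user that `user` has not seen."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(set(ratings[nearest]) - set(ratings[user]))

picks = recommend("A")  # B rates m1 and m2 like A does, so B's extra movie is picked
```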

References:
https://en.wikipedia.org/wiki/Collaborative_filtering
https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering
https://en.wikipedia.org/wiki/Recommender_system
https://cs.nyu.edu/~abari/PA/Ch1/Ch1_IntrotoPAandRelatedFieldsPP.pdf

9. Trust Based Recommender Systems
10. Linear Regression
Linear regression is a statistical technique that models the relationship between a dependent variable (Y) and one or more independent variables (X). It is called simple linear regression if there is one independent variable, and multiple linear regression if there are several.

It is used in forecasting and in prediction. Once a linear regression has been fitted to a dataset, it can be used to predict the outcome for new inputs.

Consider the example of house area in square feet (X) versus house cost (Y). If we fit a linear model to the dataset, we can predict the price of a house of any area, including areas not present in the dataset, using the following equation:
Y = mX + C
And if there are multiple features (X1, X2, X3, …):
Y = m1X1 + m2X2 + … + C
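
For the single-feature case, m and C can be computed directly with ordinary least squares. The sketch below uses made-up house data that is exactly linear, so the fitted line is easy to check by hand; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + C with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope: price change per extra square foot
    c = mean_y - m * mean_x  # intercept
    return m, c

square_feet = [1000, 1500, 2000, 2500]
prices = [200000, 275000, 350000, 425000]  # made-up values on the line 150*X + 50000

m, c = fit_line(square_feet, prices)
predicted = m * 1800 + c  # price for an area that is not in the dataset
```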
References: https://en.wikipedia.org/wiki/Linear_regression

11. Gradient Boosting
Gradient boosting is an important technique for building predictive models. As the name suggests, it is used to boost a weak model, i.e. to minimise the model’s prediction error. Gradient boosting has three elements, namely:
A loss function: measures the difference between the predicted value and the real value, which is to be minimised. For regression we may use the squared error (least squares).
A weak learner that makes predictions: for example a regression tree or a decision tree.

An additive model that adds the weak learners together to minimise the loss function. Trees are added one at a time, and each new tree is chosen so that adding it reduces the loss.

For squared-error loss the procedure is:
Fit an initial model F1(x) to y.
Fit a model h1(x) to the residuals y - F1(x).
Create a new model F2(x) = F1(x) + h1(x).
And similarly, Fm(x) = Fm-1(x) + hm-1(x), where hm-1(x) is fitted to the residuals y - Fm-1(x).
So each step improves the model.
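
The residual-fitting loop can be written out for one-dimensional data with squared-error loss. This is a bare-bones sketch under assumptions: the weak learner is a one-split regression tree (a "stump"), the initial model is simply the mean of y, there is no learning rate, and the data points are made up.

```python
def fit_stump(xs, residuals):
    """Weak learner: a one-split regression tree minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds):
    """Start from the mean of y, then keep adding stumps fitted to the residuals."""
    base = sum(ys) / len(ys)
    learners = []

    def model(x):
        return base + sum(h(x) for h in learners)

    for _ in range(rounds):
        residuals = [y - model(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return model

def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.0]  # made-up data
errors = [mse(boost(xs, ys, rounds), xs, ys) for rounds in range(4)]
```

On this data the training error in `errors` drops with every added stump, which is the additive improvement the steps above describe. Practical implementations usually also scale each new learner by a small learning rate to reduce overfitting.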

References:
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

12. Knowledge Discovery
It is the process of extracting insights or knowledge from data. Various methodologies are involved in extracting knowledge, such as statistics, machine learning, pattern recognition and query search optimisation. Data mining methods such as data selection and data cleansing are applied before the knowledge is extracted. The steps of knowledge discovery are:
Understand the user’s requirements, i.e. what knowledge needs to be extracted.

Accordingly, select the dataset from which the required knowledge can be extracted.

Pre-process the data using data cleansing, feature reduction and filling in of missing values to obtain clean, tabular data.

Extract the required knowledge from the cleaned data, using data mining algorithms to find hidden patterns, and compare it with the user’s requirements.

If they match, document the steps of the process for further use.

Reference:
https://researcher.watson.ibm.com/researcher/view_group.php?id=144
https://www.techopedia.com/definition/25827/knowledge-discovery-in-databases-kdd

13. Class Label
A class label is used in supervised machine learning (classification), where we need to predict the class label from a given set of attributes. In the training set, the model is trained with both the attributes and the known class label. In the test set, only the attributes are given and the class label has to be predicted. Class labels are discrete and take a finite number of values.

For example, suppose a student obtains 9, 8, 7 in the subjects s1, s2, s3 in the final exams and the class label is ‘pass’. Another student obtains 4, 5, 6 and the class label is ‘failed’. The goal is then to predict the class label of a student who obtains marks of 2, 2, 2 in s1, s2, s3. The class label predicted here is ‘failed’, since these marks fall on the ‘failed’ side of the classification boundary.

Reference:
https://stackoverflow.com/questions/36362541/in-data-mining-what-is-a-class-label-please-give-an-example

14. KNN

Source: https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm

It is known as the k-Nearest Neighbours algorithm and is used for classification tasks in machine learning.

It is a lazy, non-parametric algorithm.

It uses data points that are already classified into different classes, and assigns a new data point to the class of the points nearest to it.

As it is non-parametric, it makes no assumption about the distribution of the data, so it is worth considering when there is no prior knowledge of how the data is distributed. It works well on nonlinear data.

As KNN is a lazy algorithm, it does not build a generalised model from the training data. It keeps the training data and uses it during the testing phase.

It works on the idea of feature similarity: how closely the test sample resembles the training samples determines its classification. The test point is assigned the class held by the majority of its nearest neighbours.

Prediction is computationally expensive and memory-intensive, because all the training data is stored and searched.

Algorithm:
Select a positive integer k and find the k training points nearest to the test example. The most common class among those k points is the class assigned to the test example.
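
The algorithm fits in a few lines. The sketch below assumes Euclidean distance (via `math.dist`, available from Python 3.8) and uses made-up two-dimensional points; ties between classes are not handled specially.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Return the majority class among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: (point, class label).
train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((6, 6), "blue"), ((6, 7), "blue"), ((7, 6), "blue")]

label_a = knn_predict(train, (2, 2), k=3)  # surrounded by red points
label_b = knn_predict(train, (6, 5), k=3)  # surrounded by blue points
```

Note that no model is built in advance: every prediction sorts the whole training set by distance, which is the cost of a lazy algorithm.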

Reference: https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm

15. Analytic
It is the discovery and interpretation of meaningful insights from data.
Performing analytics requires knowledge of computer science and statistics.

Companies use analytics to describe, predict and take decisions on business performance.

It includes big data analytics, retail analytics, prescriptive analytics and many more.

It basically focuses on what will happen, when and why, for any case study.

Many tools and software packages, such as RapidMiner and Weka, can perform analytics on data.

Reference: https://en.wikipedia.org/wiki/Analytics

16. Hadoop 2.0
Hadoop 2.0 represents a general architectural shift from the original Hadoop.

YARN has become the operating system and HDFS the file system of the new Hadoop, making it more powerful.

YARN allows multiple applications to run in Hadoop, rather than only on top of it.

In 2.0 the JobTracker and TaskTracker entities no longer exist; YARN instead has 3 components:
ResourceManager: schedules the allocation of the resources available in the cluster to the applications that need them at that time.

NodeManager: runs on each node of the cluster and manages the resources on that node. It reports to the ResourceManager.

ApplicationMaster: runs a specific YARN job and negotiates resources from the ResourceManager. It also works with the NodeManager to execute and monitor containers. A container is where the data processing takes place.

HDFS also has changes in 2.0:
The NameNode has automated failover and resilience.

Snapshots are available for disaster recovery, backup and protection against errors.

Federation supports multiple namespaces in the cluster to increase isolation and scalability.

Source: https://www.oreilly.com/ideas/an-introduction-to-hadoop-2-0-understanding-the-new-data-operating-system

References:
https://www.oreilly.com/ideas/an-introduction-to-hadoop-2-0-understanding-the-new-data-operating-system

17. Deep Belief Networks
It is an unsupervised pretrained network (UPN).
It is composed of multiple hidden layers, where each layer is connected to the next but the units within a layer are not connected to one another.

It is composed of two parts:

a) Belief Network
b) Restricted Boltzmann machine
18. Deep Learning
19. Convolutional Neural Network
20. Feature Selection
4.

DATA DRIVEN SOLUTION FOR MARKETING TEAM
Approaching the problem through CRISP-DM
Agenda:
The business has five types of customers:
1. Loyal customer – Loyal customers are those who make up at least 55% of the business’s sales.

2. Discount customer – Discount customers are customers whose purchase decision depends on the discount the business is offering.

3. Wandering customer – Wandering customers are customers whose purchase patterns are unpredictable.

4. Need-based customer – Need-based customers are customers who have a clear intention of buying specific products. The items they will buy can be predicted from their purchase history.

5. High-returned-items customer – High-returned-items customers are customers who tend to buy multiple items and return most of the items they ordered.
Business requisites:
The business has collected labelled historical data on several customers.
The data was mainly collected from the business’s online e-commerce site.
Customers start by creating accounts (customer registration) that hold their personal information.
All their transactions are saved under their accounts.
Output:
1. The business would like to predict the type of a given customer after a certain number of transactions made by that customer.
2. Depending on the customer’s type, the marketer will decide on the marketing strategy to adopt to target the customer.

3. Provide an overview of the next steps in the data analytics lifecycle that would follow in order to predict the type of customer, using the historical data collected by the business and the present data on the customer’s behaviour over a number of transactions.
Data Understanding
Data Source: The e-commerce website will have a database that stores all transactions, user details and purchases. This will be used as the data source for this project.

Data Type: As the data is already stored in a structured format, it is structured data, and as the customers interact directly with the e-commerce site to make purchases, it is behavioural data.

Attributes:
Table Name | Attribute Name | Description
Customers_Table | CustomerID | Nominal, Customer Unique ID
Customers_Table | FirstName | Nominal, First Name of Customer
Customers_Table | LastName | Nominal, Last Name of Customer
Customers_Table | Building | Nominal, Building Name
Customers_Table | Address1 | Nominal, Billing Address
Customers_Table | Address2 | Nominal, Billing Address
Customers_Table | City | Nominal, Billing Address
Customers_Table | State | Nominal, Billing Address
Customers_Table | Country | Nominal, Billing Address
Customers_Table | Postal Code | Numeric, Billing Address
Customers_Table | Phone | Numeric, Customer Phone Number
Customers_Table | Email | Nominal, Customer Email Address
Customers_Table | Password | Nominal, Customer Password
Customers_Table | CreditCardNumber | Numeric, Customer Credit Card Details
Customers_Table | CreditCardTypeID | Nominal, Customer Credit Card Details
Customers_Table | CardExpMonth | Numeric, Customer Credit Card Details
Customers_Table | CardExpYear | Numeric, Customer Credit Card Details
Customers_Table | items_purchased_per_month | Numeric, Customer total purchase quantity per month
Customers_Table | Customer_Total_transaction | Numeric, Customer total transactions till date
Customers_Table | CustomerType | Nominal, Marketing Category Value
Orders_Table | OrderID | Numeric, Order Unique ID
Orders_Table | CustomerID | Nominal, Customer who purchased this order
Orders_Table | OrderNumber | Numeric
Orders_Table | PaymentID | Numeric
Orders_Table | OrderDate | Numeric
Orders_Table | Timestamp | Numeric
Orders_Table | TransactionStatus | Nominal
Orders_Table | PaymentDate | Numeric
Orders_Table | OrderStatus | Nominal
OrderID, ProductID, OrderNumber, Price, Quantity, Discount, Total, OrderDetailsID, BillDate

Data Problems:
Data might not be clean. This can be resolved by identifying the values to be cleaned and pre-processing them.
Data may be wrong, incomplete, irrelevant, noisy or inconsistent. Wrong data and irrelevant or noisy fields can be deleted to remove errors; incomplete data can be completed by filling in the missing values appropriately; inconsistent data can be made consistent by normalisation and standardisation. Data quality checks and validation should be in place.
Data Matrix Architecture
The database will be:

Reference: https://www.princeton.edu/~rcurtis/ultradev/ecommdatabase.html
The data matrix will contain only the required fields:
a. CustomerID – Unique for each customer
b. City – Customers living in urban areas shop more, so this can be a parameter for identifying loyal customers
c. OrderNumber – Order number generated during purchase
d. OrderDate – Date of purchase
e. Timestamp – Time of purchase
f. OrderStatus – Delivered/Returned. Used to track high-return customers
g. ProductName – Required to track need-based customers, as they purchase specific products
h. DiscountAvailable – Yes/No. Required to track discount-based customers
i. Price – Cost of item
j. Quantity – Quantity purchased by a customer. Used to track total items purchased per month to find loyal customers
k. Discount – The discount amount
l. Total – Order total. Required per month to calculate the customer’s percentage contribution to business sales, to track loyal customers
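As a minimal sketch of how the required fields listed above could be pulled into one data matrix, the example below joins small in-memory tables. The sample rows, the function name build_data_matrix, and the formula used for Total (price times quantity minus discount) are assumptions made for illustration, not part of the business data.

```python
# Hypothetical rows standing in for the customer, order and order-detail tables.
customers = {"C1": {"City": "New York"}, "C2": {"City": "Albany"}}
orders = [
    {"OrderNumber": 1001, "CustomerID": "C1", "OrderDate": "2018-01-05", "OrderStatus": "Delivered"},
    {"OrderNumber": 1002, "CustomerID": "C2", "OrderDate": "2018-01-07", "OrderStatus": "Returned"},
]
order_details = [
    {"OrderNumber": 1001, "Price": 20.0, "Quantity": 2, "Discount": 5.0},
    {"OrderNumber": 1002, "Price": 15.0, "Quantity": 1, "Discount": 0.0},
]


def build_data_matrix(orders, order_details, customers):
    """Join the tables on OrderNumber and CustomerID, keeping only required fields."""
    orders_by_number = {order["OrderNumber"]: order for order in orders}
    matrix = []
    for detail in order_details:
        order = orders_by_number[detail["OrderNumber"]]
        matrix.append({
            "CustomerID": order["CustomerID"],
            "City": customers[order["CustomerID"]]["City"],
            "OrderNumber": order["OrderNumber"],
            "OrderDate": order["OrderDate"],
            "OrderStatus": order["OrderStatus"],
            "Price": detail["Price"],
            "Quantity": detail["Quantity"],
            "Discount": detail["Discount"],
            # Assumed formula: the source does not define how Total is computed.
            "Total": detail["Price"] * detail["Quantity"] - detail["Discount"],
        })
    return matrix


data_matrix = build_data_matrix(orders, order_details, customers)
```

Each row of data_matrix then describes one purchased item together with the customer and order fields needed for the later steps.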

Data Reduction
Dimensionality Reduction – Pulling only the required columns from a database of multiple tables into a single table reduces the number of dimensions; this has been done above.

Principal Component Analysis – PCA combines correlated numeric attributes into a smaller number of components that capture most of the variance in the data.
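To make the PCA idea concrete, here is a small sketch for just two numeric attributes, using the closed-form eigenvalues of a 2x2 covariance matrix. The function name pca_2d and the sample values (hypothetical monthly item counts and monthly totals) are assumptions; a real project would more likely use a library and standardise the attributes first.

```python
from math import hypot, sqrt
from statistics import mean


def pca_2d(xs, ys):
    """Return the first principal direction and its share of the total variance."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = mid + spread, mid - spread
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, largest - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(vx, vy)
    total = largest + smallest
    explained = largest / total if total > 0 else 0.0
    return (vx / norm, vy / norm), explained


# Hypothetical values: items per month and total spend per month for four customers.
items_per_month = [1, 2, 3, 4]
total_per_month = [2, 4, 6, 8]
direction, explained = pca_2d(items_per_month, total_per_month)
```

Because the two sample attributes move together exactly, the first component explains essentially all of the variance, which is the situation in which one component can stand in for both columns.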

Data Preparation
Techniques for Preparing Data
Clean the data:
Wrong data and irrelevant or noisy fields can be deleted to remove errors.

Incomplete data can be completed by filling in the missing values appropriately, and inconsistent data can be made consistent by normalisation and standardisation.
Ensure data quality checks and validation.
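The cleaning steps can be sketched roughly as follows, assuming missing numeric values are marked as None and are filled with the mean of the observed values. Filling with the mean is only one possible choice, and the function names and sample values are illustrative.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


quantities = [2, None, 4, 6]          # hypothetical Quantity column with one gap
filled = fill_missing(quantities)     # the gap is filled with the mean of 2, 4 and 6
scaled = standardise(filled)
```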
Feature Extraction
It creates new features from the original features. For example, we can create a new feature, items purchased per month, calculated as the sum of quantity per month for a particular customer.
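The items-purchased-per-month feature could be derived along these lines. The sketch assumes OrderDate is stored as an ISO date string such as 2018-01-05, which the source does not specify, and the sample rows are made up.

```python
from collections import defaultdict


def items_purchased_per_month(rows):
    """Sum Quantity per (CustomerID, year-month) over data-matrix rows."""
    totals = defaultdict(int)
    for row in rows:
        month = row["OrderDate"][:7]  # assumes ISO dates: "2018-01-05" -> "2018-01"
        totals[(row["CustomerID"], month)] += row["Quantity"]
    return dict(totals)


rows = [
    {"CustomerID": "C1", "OrderDate": "2018-01-05", "Quantity": 2},
    {"CustomerID": "C1", "OrderDate": "2018-01-20", "Quantity": 3},
    {"CustomerID": "C1", "OrderDate": "2018-02-02", "Quantity": 1},
    {"CustomerID": "C2", "OrderDate": "2018-01-11", "Quantity": 4},
]
monthly_items = items_purchased_per_month(rows)
```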

Feature Selection
Returns a subset of the columns, mostly the important ones that can help in prediction.

Data Reduction
Can use Dimensionality Reduction or PCA to get the attributes contributing to the predictive solution.

Singular Value Decomposition
Didn’t understand how to apply it.
It’s useful to calculate the SVD if we can relate columns to rows in a data matrix.

[Figure: supervised learning diagram with the labels Data Matrix, Feature Vectors, Expected Label, Machine Learning Algorithm, Predictive Model, New Text, Feature Vector, Class Label]
Modelling
The classification takes place in two phases. In the training phase, the model is trained with the transactional data and the input class labels. In the testing phase, data is passed to the model to check what it predicts.

Modelling can be done with decision trees, which show the various possible outcomes, and regression analysis can be used to find the relationship between dependent and independent variables. Cluster analysis groups items together based on patterns and similarity.

Data Classification:
To classify the customers into the 5 types, we need to apply a supervised machine learning algorithm, namely a support vector machine.

For the Loyal Customer type, we have data about the number of items the customer buys per month and the total cost. If the sum of quantity exceeds a set threshold, or the total cost per month is more than 55% of sales, we can mark the customer as Loyal in the customer type.

For the Discount Customer type, if the attribute DiscountAvailable is always true for a customer’s orders, the customer can be marked as a Discount customer.

If a customer’s product names in the data matrix are repetitive and differ from the norm, the customer is only interested in those products and can be marked as a Need-based Customer.

If a customer’s Order Status in the data matrix has the value “Returned” more often than a certain threshold, the customer can be marked as a High Return Customer and marketed to accordingly.

Other customers may have created an account without purchasing anything; they can be classified as default, and the remaining customers are Wanderers.
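The labelling rules described above could be written down roughly as follows, for example to produce the class labels used in training. Only the 55% figure comes from the text; the other thresholds, the order in which the rules are checked, and the field names of the per-customer summary are assumptions.

```python
def classify_customer(summary, quantity_threshold=20, sales_share_threshold=0.55,
                      return_rate_threshold=0.5, repeat_share_threshold=0.8):
    """Apply the rules in a fixed (assumed) order and return the first match."""
    if summary["orders"] == 0:
        return "Default"  # created an account but never purchased
    if (summary["items_per_month"] > quantity_threshold
            or summary["share_of_monthly_sales"] > sales_share_threshold):
        return "Loyal"
    if summary["discounted_orders"] == summary["orders"]:
        return "Discount"  # DiscountAvailable was true for every order
    if summary["returned_orders"] / summary["orders"] > return_rate_threshold:
        return "High Return"
    if summary["top_product_share"] >= repeat_share_threshold:
        return "Need-based"  # purchases concentrate on the same product
    return "Wanderer"


# Hypothetical per-customer summary built from the data matrix.
example = {
    "orders": 4,
    "items_per_month": 5,
    "share_of_monthly_sales": 0.01,
    "discounted_orders": 1,
    "returned_orders": 3,
    "top_product_share": 0.25,
}
label = classify_customer(example)
```

Since a customer can satisfy several rules at once, the order of the checks decides the label; a different priority may suit the business better.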
Data Clustering
Clustering can be used in deciding the 5 types by grouping the customers on the basis of city, as people in urban areas may shop more and hence are more likely to be loyal customers than people in rural areas.

Similarly, the OrderStatus attribute can be clustered into two groups, one with status Returned and the other with Delivered. Through this we can find the High Return Customers.
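One way to sketch the return-based grouping is a basic k-means with two clusters on each customer’s share of returned orders. Clustering on the return share rather than on the raw status values is an assumption, as are the function name and the sample rates.

```python
def two_means_1d(values, iterations=20):
    """Split one-dimensional values into two clusters with a basic k-means (k = 2)."""
    low, high = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        if not high_group:  # all values are identical, nothing to split
            break
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break  # centroids stopped moving
        low, high = new_low, new_high
    return low, high


# Hypothetical share of orders with status Returned, per customer.
return_rates = {"C1": 0.05, "C2": 0.10, "C3": 0.80, "C4": 0.90, "C5": 0.00}
low, high = two_means_1d(list(return_rates.values()))
high_return_customers = sorted(
    customer for customer, rate in return_rates.items()
    if abs(rate - high) < abs(rate - low)
)
```

Customers closer to the higher centroid form the candidate High Return group.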

Modelling involves selecting suitable modelling techniques and producing a test design to test the models.
Validation
The data can be divided into a 60% training set and a 40% test set. The model is trained on the 60% and then applied to the test data to get the predicted results. Any difference between the expected class and the predicted class counts as an error and lowers the accuracy of the model.
A second type of validation splits the data into 60% training data, 30% test data and 10% validation data. The model is trained on the training data, and the validation set is passed to the model to get results. To improve accuracy, we can modify the parameters until the accuracy on the validation set is acceptable, and then run the test data for the final evaluation.
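Both splits and the accuracy measure can be sketched as below. The helper names, the fixed random seed, and the sample labels are assumptions for illustration.

```python
import random


def split_data(rows, fractions, seed=0):
    """Shuffle rows and cut them into consecutive parts sized by `fractions`."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for fraction in fractions[:-1]:
        end = start + round(fraction * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes the remainder
    return parts


def accuracy(expected, predicted):
    """Fraction of rows whose predicted class equals the expected class."""
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


rows = list(range(100))  # stand-in for 100 labelled customers
train, test = split_data(rows, [0.6, 0.4])
train3, test3, validation3 = split_data(rows, [0.6, 0.3, 0.1])

expected = ["Loyal", "Discount", "Wanderer", "Loyal"]
predicted = ["Loyal", "Discount", "Loyal", "Loyal"]
score = accuracy(expected, predicted)  # one mismatch out of four
```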

We should make sure not to overfit the model to the data. We can also use bootstrapping.
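A bootstrap sample can be drawn as in the sketch below: rows are sampled with replacement to form a training set of the original size, and the rows never drawn can serve as a held-out set. The function name and the sample customer IDs are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; also return the rows never drawn."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    chosen = set(indices)
    sample = [rows[i] for i in indices]
    out_of_bag = [row for i, row in enumerate(rows) if i not in chosen]
    return sample, out_of_bag


customer_ids = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
sample, out_of_bag = bootstrap_sample(customer_ids, random.Random(0))
```

Repeating this several times and averaging the accuracy on the left-out rows gives an estimate that depends less on any single split.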

Finally, check whether the output matches the business success criteria.

Model Deployment
The model finally classifies the customers into 5 categories to help the marketing team send promotional mails accordingly. The model can be deployed on the marketing team’s internal website, where they can feed in a new customer’s data and get the group the customer belongs to, so that the required marketing strategy can be used. The deployed application should be accessible to everyone on the marketing team. The deployment will require ongoing maintenance and updating of the model as new changes come in.

The model can also be deployed as a service so that it is consumable by any application that wants this feature. New data is fed to the model’s API, and the API returns the predicted label, which can be integrated as required. Depending on the size of the training data, we can decide whether to store it in an RDBMS or use a distributed architecture.
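The service idea can be illustrated with a request handler that takes a JSON body and returns a JSON label. This is only a sketch: predict_label is a hypothetical stand-in for the trained model, and the field names in the request are assumptions.

```python
import json


def predict_label(features):
    """Hypothetical stand-in for the trained model's prediction."""
    return "High Return" if features.get("return_rate", 0) > 0.5 else "Wanderer"


def handle_request(body):
    """Parse a JSON request body and return a JSON response with the label."""
    try:
        features = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(features, dict):
        return json.dumps({"error": "expected a JSON object"})
    return json.dumps({
        "CustomerID": features.get("CustomerID"),
        "label": predict_label(features),
    })


response = handle_request('{"CustomerID": "C3", "return_rate": 0.8}')
```

A web framework or the internal marketing site would call a handler like this for each new customer record.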

Team Skills And Hiring
Any data scientist working on the project must have business knowledge and acumen.

He or she should have hands-on expertise in big-data tools and languages.
Ability to see patterns in the data.
Ability to design a simple yet scalable and robust design for the project.

Good analytical and visual comprehension.

Great communication and presentation skills to get the idea across.
Be a problem solver and learn to figure out issues independently.

Ability to listen to the points raised by everyone while driving the discussion, as well as directing the team and investors in the right direction.
Knowledge of industry applications and of how the big-data lifecycle works.

Supporting and mentoring juniors in root cause analysis for any issue.
Reference: https://cs.nyu.edu/~abari/PA/Ch1/Ch1_IntrotoPAandRelatedFieldsPP.pdf