1. Data Mining
There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so it is necessary to analyze it and extract useful information from it.
The extraction of information is not the only process we need to perform; the overall task also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are complete, we are in a position to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.
Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to find the information that is hidden and unexpected.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing. Although data mining is still a relatively new technology, it is already used in a number of industries. It is the process of collecting, searching through, and analyzing a large amount of data in a database so as to discover patterns or relationships.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
a) Forecasting: Estimating sales, predicting server loads or server downtime
b) Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
c) Recommendations: Determining which products are likely to be sold together, generating recommendations
d) Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
e) Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities
2. Data Mining Applications
Here is the list of applications of Data Mining:
1. Market Analysis and Management
2. Corporate Analysis & Risk Management
3. Fraud Detection
4. Other Applications
1) Market Analysis and Management
The following are the market fields in which data mining is used:
a) Customer Profiling - Data Mining helps to determine what kind of people buy what kind of products.
b) Identifying Customer Requirements - Data Mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
c) Cross Market Analysis - Data Mining finds associations and correlations between product sales.
d) Target Marketing - Data Mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
e) Determining Customer Purchasing Patterns - Data Mining helps in determining customer purchasing patterns.
f) Providing Summary Information - Data Mining provides various multidimensional summary reports.
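The association idea behind cross market analysis can be illustrated with a small sketch. The baskets and product names below are invented for illustration, and the two measures used (support and confidence) are one common way of quantifying associations, not necessarily the method any particular tool uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each product pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often 'consequent' appears in baskets containing 'antecedent'."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(transactions)
print(support[("bread", "butter")])                  # 0.5
print(confidence(transactions, "butter", "bread"))   # 1.0
```

Here bread and butter appear together in half of the baskets, and every basket that contains butter also contains bread, which is the kind of relationship a retailer might act on.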
2) Corporate Analysis & Risk Management
The following are the fields of the corporate sector in which data mining is used:
a) Finance Planning and Asset Evaluation - It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
b) Resource Planning - It involves summarizing and comparing resources and spending.
c) Competition - It involves monitoring competitors and market directions.
3) Fraud Detection
Data Mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, and the time of day or week. It also analyzes patterns that deviate from expected norms.
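Detecting a deviation from an expected norm can be sketched as follows. This is a minimal illustration that assumes call duration is the only feature and that a z-score threshold of 2 is a reasonable cut-off; the function name and the data are hypothetical, and real fraud detection would combine many more signals.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=2.0):
    """Return the indices of calls whose duration deviates strongly from the mean."""
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls are identical, so nothing stands out
    return [i for i, d in enumerate(durations) if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes; the 120-minute call is the outlier.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 120, 4, 3]
suspicious = flag_unusual_calls(call_minutes)
print(suspicious)  # [7]
```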
4) Other Applications
Data Mining is also used in other fields such as sports, astrology and Internet Web Surf-Aid.
3. KDD
KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories that help humans extract useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them. Data Mining is the application of a specific algorithm in order to extract patterns from data. Nonetheless, the terms KDD and Data Mining are often used interchangeably.
What is KDD?
As mentioned above, KDD is a field of computer science that deals with the extraction of previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques. This process deals with the mapping of low-level data into other forms that are more compact, abstract and useful.
This is achieved by creating short reports, modeling the process that generates the data, and developing predictive models that can predict future cases. Due to the exponential growth of data, especially in areas such as business, KDD has become a very important process for converting this large wealth of data into business intelligence, as manual extraction of patterns has become seemingly impossible in the past few decades. For example, it is currently used for various applications such as social network analysis, fraud detection, science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval and, largely, marketing. KDD is usually used to answer questions such as: which main products might help Wal-Mart obtain high profit next year?
This process has several steps. It starts with developing an understanding of the application domain and the goal, and then creating a target dataset. This is followed by cleaning, preprocessing, reduction and projection of the data. The next step is using Data Mining (explained below) to identify patterns. Finally, the discovered knowledge is consolidated by visualizing and/or interpreting it.
What is the difference between KDD and Data mining?
Although the two terms KDD and Data Mining are heavily used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.
Steps involved in knowledge discovery process
Here is the list of steps involved in the knowledge discovery process:
a) Data Cleaning - In this step, noise and inconsistent data are removed.
b) Data Integration - In this step, multiple data sources are combined.
c) Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
d) Data Transformation - In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
e) Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
f) Pattern Evaluation - In this step, data patterns are evaluated.
g) Knowledge Presentation - In this step, the discovered knowledge is presented to the user.
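The steps listed above can be pictured as a chain of small functions, each feeding the next. The sketch below is only an illustration under simplifying assumptions: the records, field names and functions are hypothetical, the steps are applied in a slightly different order for convenience (cleaning each source before combining them), and the "mining" step is a trivial ranking that stands in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_sales = [
    {"customer": "a", "product": "tea", "amount": 4.0},
    {"customer": "b", "product": "tea", "amount": None},  # noisy record
    {"customer": "c", "product": "coffee", "amount": 6.0},
]
web_sales = [
    {"customer": "a", "product": "coffee", "amount": 5.0},
    {"customer": "d", "product": "tea", "amount": 3.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate the total amount per product."""
    totals = Counter()
    for r in records:
        totals[r["product"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Data mining (stand-in): rank products by total revenue."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(ranked, min_total):
    """Pattern evaluation: keep only results above a threshold."""
    return [(p, t) for p, t in ranked if t >= min_total]

def present(patterns):
    """Knowledge presentation: format the result for a reader."""
    return [f"{p}: {t:.2f}" for p, t in patterns]

data = integrate(clean(store_sales), clean(web_sales))
data = select(data, ["product", "amount"])
report = present(evaluate(mine(transform(data)), min_total=8.0))
print(report)  # ['coffee: 11.00']
```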
4. Primary goals of data mining
The two "high-level" primary goals of data mining, in practice, are prediction and description.
Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.
The relative importance of prediction and description for particular data mining applications can vary considerably. However, in the context of KDD, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition), where prediction is often the primary goal.
The goals of prediction and description are achieved by using the following primary data mining tasks:
1) CLASSIFICATION/PREDICTION
Classification involves the discovery of a predictive learning function that classifies a data item into one of several predefined classes. It involves examining the features of a newly presented object and assigning it to a predefined class.
Classification is a two-step process. First, a model is built describing a predetermined set of data classes or concepts; second, the model is used for classification.
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value range of an attribute that a given sample is likely to have.
Any of the techniques used for classification can be adapted for use in prediction by using training examples where the value of the variable to be predicted is already known, along with historical data for those examples. Typical business-related questions that can be answered using classification or prediction tasks are:
· Which customers will buy?
· Which products will the customer buy?
· How much will the customer buy?
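The two-step nature of classification (build a model, then use it) can be sketched with a very simple nearest-centroid classifier. This technique is chosen here only because it is short; the feature names, labels and data are hypothetical, and a real project would more likely use an established library.

```python
from math import dist

def build_model(samples):
    """Step 1: learn one centroid per predefined class from labeled samples."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(model, features):
    """Step 2: assign the class whose centroid is nearest to the new item."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training data: (age, monthly visits) with a known outcome.
training = [
    ((25, 8), "buys"),
    ((30, 10), "buys"),
    ((55, 1), "does not buy"),
    ((60, 2), "does not buy"),
]
model = build_model(training)
print(classify(model, (28, 9)))  # buys
```

The training examples play the role of the "predetermined set of data classes", and `classify` answers a question of the form "which customers will buy?" for a customer whose outcome is not yet known.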
2) ESTIMATION
While classification deals with discrete outcomes such as yes or no, or debit card, home loan or vehicle financing, estimation deals with continuously valued outcomes. Given some input data, estimation can be used to come up with a value for some unknown continuous variable such as income or height. In estimation, one wants to come up with a plausible value or a range of plausible values for the unknown parameters of a system. Typical examples of estimation, and the business-related questions that can be addressed by making use of it, include the following:
· How many children are in a family?
· Estimating a family’s total household income.
· Estimating the value of a piece of property.
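As an illustration of estimating a continuous value, the sketch below fits a straight line by ordinary least squares and uses it to estimate the value of a property from its floor area. The data are invented and deliberately follow an exact linear rule; a single input variable and a linear relationship are assumptions made for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: floor area (square metres) and value (thousands).
areas = [50, 80, 100, 120]
values = [150, 240, 300, 360]

slope, intercept = fit_line(areas, values)
estimate = slope * 90 + intercept  # estimated value of a 90 square metre property
print(round(estimate, 2))  # 270.0
```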
3) SEGMENTATION
Segmentation simply means making different offers to different market segments, that is, groups of people defined by some combination of demographic variables such as age, gender or income. Segmentation can also be defined as a form of analysis used, for instance, to break down the visitors to a website into distinct groups with individual behaviors.
The grouping can then be used to make statistical projections, such as the potential amount of purchases the members of a group are likely to make. Typical business questions that can be answered using segmentation are:
· What are the different types of visitors attracted to our website?
· Into which age groups do the listeners of a certain radio station fall?
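Segmentation by a demographic variable can be sketched as below. The age bands, the visitor data and the function names are all hypothetical; the point is only that segments are defined in advance and a statistic is then computed per segment.

```python
from collections import defaultdict

def age_band(age):
    """Map an age to a hypothetical, predefined demographic segment."""
    if age < 25:
        return "under 25"
    if age < 45:
        return "25-44"
    return "45 and over"

def segment_average(visitors):
    """Average purchase amount per age segment."""
    groups = defaultdict(list)
    for age, spend in visitors:
        groups[age_band(age)].append(spend)
    return {band: sum(v) / len(v) for band, v in groups.items()}

# Hypothetical website visitors: (age, amount spent).
visitors = [(19, 20.0), (23, 30.0), (31, 50.0), (40, 70.0), (52, 10.0)]
averages = segment_average(visitors)
print(averages)
```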
4) CLUSTERING
Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. Clusters of objects are formed so that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters.
Clustering is commonly used to search for unique groupings within a data set. The distinguishing factor between clustering and classification is that in clustering there are no predefined classes and no examples. The objects are grouped together based on self-similarity. Typical business questions that can be answered using clustering are:
· What are the groupings hidden in your data?
· Which customers should be grouped together for target marketing purposes?
Clustering is grouped under descriptive data mining tasks. It is best used for finding groups of items that are similar; for example, given a data set of customers, identify subgroups of customers that have similar buying behavior.
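The idea of grouping by self-similarity, without predefined classes, can be illustrated with a small one-dimensional k-means sketch. K-means is used here only as one well-known clustering technique; the starting centers, the number of iterations and the spending figures are assumptions chosen for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Group 1-D values around k centers by repeated assign-and-update steps."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer; no labels are given.
monthly_spend = [10, 12, 11, 95, 100, 105]
centers, clusters = k_means_1d(monthly_spend, centers=[0, 50])
print(centers)   # [11.0, 100.0]
print(clusters)  # [[10, 12, 11], [95, 100, 105]]
```

Unlike the classification case, no customer was labeled beforehand; the two subgroups emerge from the similarity of the spending values alone.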
5) DESCRIPTION AND VISUALIZATION
The purpose of data mining is sometimes simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products or processes that produced the data in the first place. A good enough description of behavior will often suggest an explanation for it as well. One of the most powerful forms of descriptive data mining is data visualization. Although visualization is not always easy, the right picture can truly speak a thousand words, since human beings are extremely practiced at extracting meaning from visual scenes.
Visualization can be useful in providing a visual representation of the location and distribution of a company’s major clients on a map of a city, a province or even a country.
6) SEQUENCE/TEMPORAL
Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time. The difference between sequence rules and other rules is the temporal factor.
Example - a retailer’s database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven.
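The microwave oven example can be sketched as a simple count of which items customers bought before their first microwave purchase. The purchase histories and the function name are hypothetical, and the sketch assumes each history is already ordered by time; real sequential pattern mining would also consider longer ordered subsequences and time gaps.

```python
from collections import Counter

def items_preceding(histories, target):
    """Count, per customer, the items bought before the first purchase of target."""
    counts = Counter()
    for purchases in histories:  # each history is ordered by time
        if target in purchases:
            before = purchases[:purchases.index(target)]
            counts.update(set(before))  # count each item once per customer
    return counts

# Hypothetical purchase histories, one list per customer.
histories = [
    ["kettle", "toaster", "microwave oven"],
    ["toaster", "microwave oven", "blender"],
    ["kettle", "blender"],
    ["toaster", "kettle", "microwave oven"],
]
preceding = items_preceding(histories, "microwave oven")
print(preceding.most_common(1))  # [('toaster', 3)]
```

The blender bought after the microwave oven is not counted, which is exactly the temporal factor that distinguishes sequence rules from plain association rules.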
5. Steps of Data Mining Process
In order to conduct data mining analysis systematically, a general process is usually followed. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is widely used by industry members. This model consists of six phases intended as a cyclical process:
1) Business Understanding
Business understanding includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan. Tasks in this phase include:
· Identifying your business goals
· Assessing your situation
· Defining your data mining goals
· Producing your project plan
2) Data Understanding
Once business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and the verification of data quality. Data exploration such as viewing summary statistics (which includes the visual display of categorical variables) can occur at the end of this phase. Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the data. Tasks for this phase include:
· Gathering data
· Describing
· Exploring
· Verifying quality
3) Data Preparation
Once the available data resources are identified, they need to be selected, cleaned, built into the desired form, and formatted. Data cleaning and data transformation in preparation for data modeling need to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding. Tasks for this phase include:
· Selecting data
· Cleaning data
· Constructing
· Integrating
· Formatting
4) Modeling
Data mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Tasks for this phase include:
· Selecting techniques
· Designing tests
· Building models
· Assessing models
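One common way to design a test and assess a model is to hold back part of the data and measure accuracy on it. The sketch below assumes a holdout split, which is only one possible test design and is not prescribed by CRISP-DM itself; the records, the threshold rule standing in for a model, and the function names are hypothetical.

```python
import random

def holdout_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records reproducibly and split them into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, test_set):
    """Share of test records whose predicted label matches the true label."""
    return sum(predict(x) == label for x, label in test_set) / len(test_set)

def threshold_model(spend):
    """Stand-in for a built model: a fixed threshold on spend."""
    return "high" if spend >= 50 else "low"

# Hypothetical labeled data: spend paired with a segment label.
records = [(s, "high" if s >= 50 else "low") for s in range(0, 100, 5)]
train, test = holdout_split(records)
print(len(train), len(test))            # 15 5
print(accuracy(threshold_model, test))  # 1.0
```

The perfect score here is an artifact of the invented data, where the labels follow the same rule as the stand-in model; with real data the held-out accuracy is what tells you whether the model generalizes.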
5) Evaluation
Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations. Tasks for this phase include:
· Evaluating results
· Reviewing the process
· Determining the next steps
6) Deployment
Data mining can be used both to verify previously held hypotheses and for knowledge discovery (identification of unexpected and useful relationships). The resulting models need to be monitored for changes in operating conditions, because what might be true today may not be true a year from now. If significant changes do occur, the model should be redone. It’s also wise to record the results of data mining projects so documented evidence is available for future studies. Tasks for this phase include:
· Planning deployment (your methods for integrating data mining discoveries into use)
· Reporting final results
· Reviewing final results
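Monitoring a deployed model for changed operating conditions can be as simple as comparing its recent accuracy with the accuracy measured when it was built. The sketch below assumes that true outcomes eventually become available and that a drop of more than 5 percentage points counts as significant; both the tolerance and the function name are hypothetical choices for illustration.

```python
def needs_rebuild(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Return True when recent accuracy falls more than tolerance below the baseline."""
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return baseline_accuracy - recent_accuracy > tolerance

# Hypothetical log: True means the model's prediction turned out to be correct.
recent = [True, True, False, True, False, False, True, True, False, True]
print(needs_rebuild(0.9, recent))  # True: accuracy dropped from 0.9 to 0.6
```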