Sunday, March 5, 2017

Introduction to Data Mining



1.      Data Mining

There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so analyzing it and extracting useful information from it is necessary.

Extracting information is not the only process we need to perform; the work also involves Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are complete, we are in a position to use the information in applications such as Fraud Detection, Market Analysis, Production Control and Science Exploration.

Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.

Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to find the information that is hidden and unexpected.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing. Although data mining is still a relatively new technology, it is already used in a number of industries.

It is the process of collecting, searching through, and analyzing a large amount of data in a database so as to discover patterns or relationships.

These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
a)      Forecasting: Estimating sales, predicting server loads or server downtime
b)      Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
c)      Recommendations: Determining which products are likely to be sold together, generating recommendations
d)      Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
e)      Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities

2.      Data Mining Applications
Here is the list of applications of Data Mining:
1.      Market Analysis and Management
2.      Corporate Analysis & Risk Management
3.      Fraud Detection
4.      Other Applications

1)      Market Analysis and Management

The following are the market-related areas where data mining is used:
a)      Customer Profiling - Data Mining helps to determine what kind of people buy what kind of products.
b)      Identifying Customer Requirements - Data Mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
c)      Cross Market Analysis - Data Mining finds associations and correlations between product sales.
d)      Target Marketing - Data Mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits and income.
e)      Determining Customer Purchasing Patterns - Data Mining helps in determining what customers buy, and when.
f)       Providing Summary Information - Data Mining provides various multidimensional summary reports.
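
As a rough illustration of the cross market analysis item above, the sketch below computes two common association measures, support and confidence, for a pair of products. The baskets and the helper names (`support`, `confidence`) are made up for this example and are not taken from any particular tool.

```python
# Hypothetical market baskets; each set is one customer's purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated chance of seeing consequent in a basket that has antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={bread_milk_support:.2f}, confidence={bread_to_milk_confidence:.2f}")
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets containing bread also contain milk.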

2)      Corporate Analysis & Risk Management
The following are the areas of the corporate sector where data mining is used:
a)      Finance Planning and Asset Evaluation - It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
b)      Resource Planning - It involves summarizing and comparing resources and spending.
c)      Competition - It involves monitoring competitors and market directions.

3)      Fraud Detection
Data Mining is also used in credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, its duration, and the time of day or week. It also analyzes patterns that deviate from expected norms.
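
One simple way to look for values that deviate from an expected norm is a z-score check, sketched below. This is only an illustration: the call durations are invented, `flag_outliers` is a hypothetical helper, and the threshold of 2 standard deviations is an arbitrary assumption rather than a recommended setting.

```python
from statistics import mean, pstdev

# Hypothetical call durations in minutes for one subscriber.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]


def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


suspicious = flag_outliers(call_minutes)
print(suspicious)
```

With this data only the 60-minute call is flagged. A single large value also inflates the standard deviation, so real systems often prefer more robust measures.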

4)      Other Applications
Data Mining is also used in other fields such as sports, astronomy and Internet Web Surf-Aid.

3.      KDD

KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories to help humans extract useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them. Data Mining is the application of a specific algorithm in order to extract patterns from data. Nonetheless, the terms KDD and Data Mining are often used interchangeably.

What is KDD?
As mentioned above, KDD is a field of computer science that deals with the extraction of previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract and useful.

This is achieved by creating short reports, modeling the process that generates the data, and developing predictive models that can predict future cases. Due to the exponential growth of data, especially in areas such as business, KDD has become a very important process for converting this large wealth of data into business intelligence, as manual extraction of patterns has become seemingly impossible in the past few decades. For example, it is currently used for applications such as social network analysis, fraud detection, science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval and, largely, marketing. KDD is usually used to answer questions such as: what are the main products that might help Wal-Mart obtain high profit next year?

This process has several steps. It starts with developing an understanding of the application domain and the goal, and then creating a target dataset. This is followed by cleaning, preprocessing, reduction and projection of the data. The next step is using Data Mining (explained below) to identify patterns. Finally, the discovered knowledge is consolidated by visualizing and/or interpreting it.

What is the difference between KDD and Data mining?
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.

Steps involved in knowledge discovery process

Here is the list of steps involved in knowledge discovery process:

a)      Data Cleaning - In this step noise and inconsistent data are removed.
b)      Data Integration - In this step multiple data sources are combined.
c)      Data Selection - In this step data relevant to the analysis task are retrieved from the database.
d)      Data Transformation - In this step data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
e)      Data Mining - In this step intelligent methods are applied in order to extract data patterns.
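f)       Pattern Evaluation - In this step the extracted patterns are evaluated to identify the ones that are genuinely interesting.
g)      Knowledge Presentation - In this step the mined knowledge is presented to the user, for example through visualization.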
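
The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not a reference implementation: the two store datasets, the field names and every function name below are hypothetical, and the "intelligent method" in the mining step is just a count of item pairs bought by the same customer.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
store_a = [
    {"customer": "c1", "item": "bread", "amount": "2.50"},
    {"customer": "c2", "item": "milk", "amount": None},     # missing value
    {"customer": "c1", "item": "milk", "amount": "1.20"},
]
store_b = [
    {"customer": "c3", "item": "bread", "amount": "2.40"},
    {"customer": "c3", "item": "milk", "amount": "1.25"},
    {"customer": "c4", "item": "eggs", "amount": "-3.00"},  # inconsistent value
]


def clean(records):
    """Data cleaning: drop records with missing or non-positive amounts."""
    return [r for r in records if r["amount"] is not None and float(r["amount"]) > 0]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate items into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets


def mine(baskets):
    """Data mining: count how often each pair of items shares a basket."""
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


def evaluate(pairs, min_count):
    """Pattern evaluation: keep only pairs seen at least min_count times."""
    return {pair: n for pair, n in pairs.items() if n >= min_count}


def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} and {b} bought together by {n} customers"
            for (a, b), n in sorted(patterns.items())]


records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, ["customer", "item"]))
report = present(evaluate(mine(baskets), min_count=2))
print(report)
```

Each function corresponds to one lettered step, so the last three lines read as the whole knowledge discovery process applied to the toy data.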

4.      Primary goals of data mining

The two "high-level" primary goals of data mining, in practice, are prediction and description.

Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.

The relative importance of prediction and description for particular data mining applications can vary considerably. However, in the context of KDD, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition) where prediction is often the primary goal of the KDD process.
The goals of prediction and description are achieved by using the following primary data mining tasks:

1)      CLASSIFICATION/PREDICTION
Classification involves the discovery of a predictive learning function that classifies a data item into one of several predefined classes. It involves examining the features of a newly presented object and assigning it to a predefined class.

Classification is a two-step process. First a model is built describing a predetermined set of data classes or concepts and secondly, the model is used for classification.

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value range of an attribute that a given sample is likely to have.

Any of the techniques used for classification can be adapted for use in prediction by using training examples where the value of the variable to be predicted is already known, along with historical data for those examples. Typical business-related questions that can be answered using classification or prediction tasks are:
·         Which customers will buy?
·         Which products will a customer buy?
·         How much will a customer buy?
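
The two-step process described above (build a model, then use it) can be illustrated with a nearest-centroid classifier, chosen here only because it is short. The training data, the feature meaning (age, income in thousands) and the function names are assumptions made for this sketch.

```python
from math import dist

# Hypothetical labeled examples: (age, income in thousands) -> class.
training = [
    ((25, 30), "will not buy"),
    ((30, 35), "will not buy"),
    ((45, 80), "will buy"),
    ((50, 90), "will buy"),
]


def build_model(examples):
    """Step 1: learn one centroid (mean point) per predefined class."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(points) for column in zip(*points))
        for label, points in grouped.items()
    }


def classify(model, features):
    """Step 2: assign a new item to the class with the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], features))


model = build_model(training)
print(classify(model, (48, 70)))
print(classify(model, (28, 40)))
```

The first customer is closer to the "will buy" centroid and the second to the "will not buy" centroid, so they receive those labels.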

2)      ESTIMATION
While classification deals with discrete outcomes such as yes or no, or debit card, home loan or vehicle financing, estimation deals with continuously valued outcomes. If some input data is available, estimation can be used to come up with the value of an unknown continuous variable such as income or height. In estimation, one wants to come up with a plausible value or a range of plausible values for the unknown parameters of a system. Typical examples of estimation and the business-related questions that can be addressed by making use of it include the following:
·         How many children are in a family?
·         Estimating a family’s total household income.
·         Estimating the value of a piece of property.
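
As one possible illustration of estimating a continuous value such as a property price, the sketch below fits a straight line by ordinary least squares. The floor areas and prices are invented and deliberately lie on an exact line, and `fit_line` and `estimate` are hypothetical helper names.

```python
# Hypothetical data: floor area in square metres and sale value in thousands.
areas = [50, 70, 90, 110]
values = [100, 140, 180, 220]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(x, slope, intercept):
    """Estimated value for a property of floor area x."""
    return slope * x + intercept


slope, intercept = fit_line(areas, values)
print(estimate(80, slope, intercept))
```

Unlike a classifier, the result is a number on a continuous scale rather than one of a fixed set of labels.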

3) SEGMENTATION
Segmentation simply means making different offers to different market segments, that is, groups of people defined by some combination of demographic variables such as age, gender or income. It can also be defined as a form of analysis used, for instance, to break down the visitors to a website into unique groups with individual behaviors.

The grouping can then be used to make statistical projections, such as the potential amount of purchases they are likely to make. Typical business questions that can be answered using segmentation are:
·         What are the different types of visitors attracted to our website?
·         Into which age groups do the listeners of a certain radio station fall?
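
A minimal sketch of the radio station question, assuming the segments are fixed age bands chosen in advance. The ages and the band boundaries are made up for illustration.

```python
from collections import Counter

# Hypothetical listener ages from a survey.
listener_ages = [17, 22, 25, 34, 41, 47, 63]


def age_band(age):
    """Map an age to a predefined demographic segment."""
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"


segments = Counter(age_band(age) for age in listener_ages)
for band, count in segments.items():
    print(band, count)
```

The segment boundaries here are supplied by the analyst, which is the main difference from clustering, where the groups are discovered from the data.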

4) CLUSTERING
Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.

Clustering is commonly used to search for unique groupings within a data set. The distinguishing factor between clustering and classification is that in clustering there are no predefined classes and no examples. The objects are grouped together based on self-similarity. Typical business questions that can be answered using clustering are:
·         What are the groupings hidden in your data?
·         Which customers should be grouped together for target marketing purposes?

Clustering is grouped under descriptive data mining tasks. Clustering is best used for finding groups of items that are similar, e.g. given a data set of customers, identifying subgroups of customers that have similar buying behavior.
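
To make the idea of grouping by self-similarity concrete, here is a small one-dimensional k-means sketch. k-means is only one of many clustering methods and is used here as an assumption; the spending figures, the starting centroids and the function name are hypothetical.

```python
# Hypothetical monthly spend per customer, in hundreds.
spend = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]


def kmeans_1d(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its points, until nothing changes."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        updated = [
            sum(c) / len(c) if c else centroids[i]  # keep empty clusters in place
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Starting centroids are chosen by hand so the run is repeatable.
centroids, clusters = kmeans_1d(spend, [1.0, 2.0])
print(centroids)
print(clusters)
```

No class labels are supplied anywhere; the low-spend and high-spend groups emerge from the distances between the points alone.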

5) DESCRIPTION AND VISUALIZATION
The purpose of data mining is sometimes simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products or processes that produced the data in the first place. A good enough description of behavior will often suggest an explanation for it as well. One of the most powerful forms of descriptive data mining is data visualization. Although visualization is not always easy, the right picture can truly speak a thousand words, since human beings are extremely practiced at extracting meaning from visual scenes.

Visualization can be useful in providing a visual representation of the location and distribution of a company’s major clients on a map of a city, a province or even a country.

6) SEQUENCE/TEMPORAL
Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time. The difference between sequence rules and other rules is the temporal factor.
Example - a retailer's database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven.
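
The microwave oven example can be sketched as a count of the items that appear earlier in a customer's purchase history than the target item. The histories below are invented, and `items_preceding` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Hypothetical purchase histories, each listed in time order.
histories = {
    "c1": ["kettle", "toaster", "microwave oven"],
    "c2": ["toaster", "microwave oven", "blender"],
    "c3": ["kettle", "blender"],
    "c4": ["toaster", "kettle", "microwave oven"],
}


def items_preceding(target, histories):
    """Count, per item, how many customers bought it before first buying target."""
    counts = Counter()
    for purchases in histories.values():
        if target in purchases:
            earlier = purchases[:purchases.index(target)]
            counts.update(set(earlier))  # count each item once per customer
    return counts


preceding = items_preceding("microwave oven", histories)
print(preceding.most_common())
```

The blender bought by c2 is ignored because it comes after the microwave oven, which is the temporal factor that separates sequence rules from plain association rules.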

5.      Steps of Data Mining Process
In order to systematically conduct data mining analysis, a general process is usually followed. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is widely used by industry members. This model consists of six phases intended as a cyclical process.

1) Business Understanding
Business understanding includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan. Tasks in this phase include:
·         Identifying your business goals
·         Assessing your situation
·         Defining your data mining goals
·         Producing your project plan

2) Data Understanding
Once business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and the verification of data quality. Data exploration such as viewing summary statistics (which includes the visual display of categorical variables) can occur at the end of this phase. Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the data. Tasks for this phase include:
·         Gathering data
·         Describing
·         Exploring
·         Verifying quality
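
A small sketch of what describing a column and verifying its quality might look like in practice. The order values are made up, missing entries are assumed to be recorded as `None`, and `describe` is a hypothetical helper rather than a function from a data mining package.

```python
from statistics import mean, median

# Hypothetical column of order values; None marks a missing entry.
order_values = [12.0, 15.5, None, 9.0, 15.5, None, 11.0]


def describe(values):
    """Summary statistics plus a simple data quality check (missing count)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


summary = describe(order_values)
for name, value in summary.items():
    print(name, value)
```

A high missing count at this stage is a signal that the data preparation phase will need to fill in, drop or re-collect values.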

3) Data Preparation
Once the available data resources are identified, they need to be selected, cleaned, built into the desired form, and formatted. Data cleaning and data transformation in preparation for data modeling need to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding. Tasks for this phase include:
·         Selecting data
·         Cleaning data
·         Constructing
·         Integrating
·         Formatting

4) Modeling
Data mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Tasks for this phase include:
·         Selecting techniques
·         Designing tests
·         Building models
·         Assessing models
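
The "designing tests" and "assessing models" tasks are often handled by holding back part of the data and measuring accuracy on it. The sketch below assumes a very simple threshold model and a fixed split; the records, the rule that higher incomes mean "yes", and all function names are assumptions for illustration only.

```python
# Hypothetical records: (income in thousands, did the customer buy?).
records = [
    (20, "no"), (25, "no"), (30, "no"), (35, "no"),
    (60, "yes"), (65, "yes"), (70, "yes"), (75, "yes"),
]


def split(records, every=4):
    """Designing tests: hold out every 'every'-th record, train on the rest."""
    train = [r for i, r in enumerate(records) if (i + 1) % every != 0]
    test = [r for i, r in enumerate(records) if (i + 1) % every == 0]
    return train, test


def build_threshold_model(train):
    """Building models: midpoint between the mean income of each class.
    Assumes "yes" customers tend to have the higher incomes."""
    yes = [x for x, label in train if label == "yes"]
    no = [x for x, label in train if label == "no"]
    return (sum(yes) / len(yes) + sum(no) / len(no)) / 2


def accuracy(threshold, test):
    """Assessing models: share of held-out records predicted correctly."""
    correct = sum(
        1 for x, label in test
        if ("yes" if x >= threshold else "no") == label
    )
    return correct / len(test)


train, test = split(records)
threshold = build_threshold_model(train)
print(threshold, accuracy(threshold, test))
```

Measuring accuracy on records the model never saw gives a fairer picture than scoring it on its own training data.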

5) Evaluation
Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations. Tasks for this phase include:
·         Evaluating results
·         Reviewing the process
·         Determining the next steps

6) Deployment
Data mining can be used either to verify previously held hypotheses or for knowledge discovery (identification of unexpected and useful relationships). The resulting models need to be monitored for changes in operating conditions, because what is true today may not be true a year from now. If significant changes do occur, the model should be rebuilt. It’s also wise to record the results of data mining projects so that documented evidence is available for future studies. Tasks for this phase include:
·         Planning deployment (your methods for integrating data mining discoveries into use)
·         Reporting final results
·         Reviewing final results

