Sunday, March 5, 2017

Introduction to Data Mining



1.      Data Mining

There is a huge amount of data available in the information industry, but this data is of no use until it is converted into useful information. It is therefore necessary to analyze this huge amount of data and extract useful information from it.

The extraction of information is not the only process we need to perform; it also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we are in a position to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration etc.

Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.

Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing. Although data mining is still a relatively new technology, it is already used in a number of industries.

It is the process of collecting, searching through, and analyzing a large amount of data in a database in order to discover patterns or relationships.

These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
a)      Forecasting: Estimating sales, predicting server loads or server downtime
b)      Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
c)      Recommendations: Determining which products are likely to be sold together, generating recommendations
d)      Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
e)      Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities

2.      Data Mining Applications
Here is the list of applications of Data Mining:
1.      Market Analysis and Management
2.      Corporate Analysis & Risk Management
3.      Fraud Detection
4.      Other Applications

1)      Market Analysis and Management

Following are the various fields of marketing where data mining is used:
a)      Customer Profiling - Data Mining helps to determine what kind of people buy what kind of products.
b)      Identifying Customer Requirements - Data Mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
c)      Cross Market Analysis - Data Mining performs Association/correlations between product sales.
d)      Target Marketing - Data Mining helps to find clusters of model customers who share the same characteristics such as interest, spending habits, income etc.
e)      Determining Customer Purchasing Patterns - Data mining helps in determining customer purchasing patterns over time.
f)       Providing Summary Information - Data Mining provides various multidimensional summary reports.

2)      Corporate Analysis & Risk Management
Following are the various fields of the corporate sector where data mining is used:
a)      Finance Planning and Asset Evaluation - It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.
b)      Resource Planning - It involves summarizing and comparing the resources and spending.
c)      Competition - It involves monitoring competitors and market directions.

3)      Fraud Detection
Data Mining is also used in the fields of credit card services and telecommunications to detect fraud. In fraudulent telephone calls, it helps to find the destination, duration, and time of day or week of the call. It also analyzes patterns that deviate from expected norms.

4)      Other Applications
Data Mining is also used in other fields such as sports, astrology and Internet Web Surf-Aid.

3.      KDD

KDD (Knowledge Discovery in Databases) is a field of computer science, which includes the tools and theories to help humans extract useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them. Data Mining is the application of a specific algorithm in order to extract patterns from data. Nonetheless, KDD and Data Mining are often used interchangeably.

What is KDD?
As mentioned above, KDD is a field of computer science that deals with the extraction of previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques. This process deals with the mapping of low-level data into other forms that are more compact, abstract and useful.

This is achieved by creating short reports, modeling the process of generating data, and developing predictive models that can predict future cases. Due to the exponential growth of data, especially in areas such as business, KDD has become a very important process for converting this large wealth of data into business intelligence, as manual extraction of patterns has become seemingly impossible in the past few decades. It is currently used for various applications such as social network analysis, fraud detection, science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval and, largely, marketing. KDD is typically used to answer questions such as: which products are most likely to yield a high profit at Wal-Mart next year?

This process has several steps. It starts with developing an understanding of the application domain and the goal, and then creating a target dataset. This is followed by cleaning, preprocessing, reduction and projection of the data. The next step is using Data Mining (explained below) to identify patterns. Finally, the discovered knowledge is consolidated by visualizing and/or interpreting it.

What is the difference between KDD and Data mining?
Although the two terms KDD and Data Mining are heavily used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process which deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.

Steps involved in knowledge discovery process

Here is the list of steps involved in the knowledge discovery process (a short code sketch follows the list):

a)      Data Cleaning - In this step, noise and inconsistent data are removed.
b)      Data Integration - In this step, multiple data sources are combined.
c)      Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
d)      Data Transformation - In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
e)      Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
f)       Pattern Evaluation - In this step, the discovered patterns are evaluated to identify those that truly represent knowledge.
g)      Knowledge Presentation - In this step, the mined knowledge is presented to the user, typically through visualization and knowledge representation techniques.
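To make these steps concrete, here is a minimal Python sketch (not part of the original tutorial) that walks a small synthetic dataset through cleaning, selection, transformation and a simple mining step. The column names, values and choice of KMeans are illustrative assumptions, and the integration step is omitted for brevity.

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical raw data; in practice this would come from several integrated sources.
raw = pd.DataFrame({
    "age":    [25, 32, None, 41, 38, 27],
    "income": [30000, 45000, 52000, None, 61000, 33000],
    "notes":  ["a", "b", "c", "d", "e", "f"],   # irrelevant to the analysis task
})

cleaned = raw.dropna()                    # Data Cleaning: remove rows with missing values
selected = cleaned[["age", "income"]]     # Data Selection: keep only relevant attributes
transformed = (selected - selected.mean()) / selected.std()   # Data Transformation

model = KMeans(n_clusters=2, n_init=10, random_state=0)       # Data Mining step
labels = model.fit_predict(transformed)

# Knowledge Presentation: show each customer with its discovered group.
print(selected.assign(cluster=labels))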

4.      Primary goals of data mining

The two "high-level" primary goals of data mining, in practice, are prediction and description.

Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.

The relative importance of prediction and description for particular data mining applications can vary considerably. However, in the context of KDD, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition), where prediction is often the primary goal.
The goals of prediction and description are achieved by using the following primary data mining tasks:

1)      CLASSIFICATION/PREDICTION
Classification involves the discovery of a predictive learning function that classifies a data item into one of several predefined classes. It involves examining the features of a newly presented object and assigning it to a predefined class.

Classification is a two-step process. First, a model is built describing a predetermined set of data classes or concepts; second, the model is used for classification.

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value range of an attribute that a given sample is likely to have.

Any of the techniques used for classification can be adapted for use in prediction by using training examples where the value of the variable to be predicted is already known, along with historical data for those examples. Typical business-related questions that can be answered using classification or prediction tasks are (a small code sketch follows the list):
·         Which customers will buy?
·         Which products will the customer buy?
·         How much will the customer buy?
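As a minimal illustration of the two-step classification process (an assumption-laden sketch, not from the original text), the following snippet builds a model from labeled examples and then classifies new, unlabeled customers; the features and labels are invented.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, past_purchases] -> bought (1) or not (0).
X_train = [[25, 1], [40, 5], [35, 4], [22, 0], [50, 7], [30, 2]]
y_train = [0, 1, 1, 0, 1, 0]

# Step 1: build a model describing the predetermined set of classes.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: use the model to classify newly presented, unlabeled samples.
new_customers = [[28, 3], [45, 0]]
print(model.predict(new_customers))  # predicted buyers (1) vs. non-buyers (0)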

2)      ESTIMATION
While classification deals with discrete outcomes such as yes or no, or debit card, home loan or vehicle financing, estimation deals with continuously valued outcomes. If some input data is available, estimation can be used to come up with an unknown continuous variable such as income or height. In estimation, one wants to come up with a plausible value or a range of plausible values for the unknown parameters of a system. Typical examples of estimation and the business-related questions that can be addressed by making use of it include the following (a brief code sketch follows the list):
·         How many children are in a family?
·         Estimating a family’s total household income.
·         Estimating the value of a piece of property.
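A minimal regression sketch (assuming scikit-learn is available and using made-up property data) shows how estimation produces a continuous value rather than a discrete class:

from sklearn.linear_model import LinearRegression

# Hypothetical data: [floor area in square metres, number of rooms] -> property value.
X = [[50, 2], [80, 3], [120, 4], [200, 6], [95, 3]]
y = [150_000, 230_000, 340_000, 560_000, 270_000]

model = LinearRegression().fit(X, y)

# Estimate a continuous value for an unseen property.
print(round(model.predict([[100, 3]])[0]))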

3) SEGMENTATION
Segmentation simply means making different offers to different market segments: groups of people defined by some combination of demographic variables such as age, gender or income. Segmentation can be defined as a form of analysis used, for instance, to break the visitors to a website down into unique groups with distinct behaviors.

The grouping can then be used to make statistical projections, such as the potential amount of purchases they are likely to make. Typical business questions that can be answered using segmentation are (a short code sketch follows the list):
·         What are the different types of visitors attracted to our website?
·         In which age groups do the listeners of a certain radio station fall?
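The sketch below (illustrative only; the age bands and spend figures are invented) segments visitors by age band with pandas and projects the average spend per segment:

import pandas as pd

# Hypothetical visitor records with a demographic variable and spending.
visitors = pd.DataFrame({
    "age":   [17, 24, 35, 42, 58, 63, 29, 45],
    "spend": [10, 40, 85, 120, 60, 30, 55, 95],
})

# Define segments by age band, then project average spend per segment.
visitors["segment"] = pd.cut(visitors["age"],
                             bins=[0, 18, 35, 55, 120],
                             labels=["<18", "18-35", "36-55", "55+"])
print(visitors.groupby("segment", observed=True)["spend"].mean())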

4) CLUSTERING
Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.

Clustering is commonly used to search for unique groupings within a data set. The distinguishing factor between clustering and classification is that in clustering there are no predefined classes and no examples. The objects are grouped together based on self-similarity. Typical business questions that can be answered using clustering are:
·         What are the groupings hidden in your data?
·         Which customer should be grouped together for target marketing purposes?

Clustering is grouped under descriptive data mining tasks. It is best used for finding groups of items that are similar: for example, given a data set of customers, identify subgroups of customers that have similar buying behavior.
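For instance, a minimal clustering sketch (with invented buying-behavior data; KMeans is one common choice, not the only one) might look like this:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical buying behavior: [visits per month, average basket value].
customers = np.array([[2, 15], [3, 18], [2, 20],      # occasional, small baskets
                      [12, 90], [11, 85], [13, 95]])  # frequent, large baskets

# No predefined classes: the algorithm discovers the groupings itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("labels: ", km.labels_)
print("centers:", km.cluster_centers_.round(1))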

5) DESCRIPTION AND VISUALIZATION
The purpose of data mining is sometimes simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products or processes that produced the data in the first place. A good enough description of behavior will often suggest an explanation for it as well. One of the most powerful forms of descriptive data mining is data visualization. Although visualization is not always easy, the right picture can truly speak a thousand words, since human beings are extremely practiced at extracting meaning from visual scenes.

Visualization can be useful in providing a visual representation of the location and distribution of a company’s major clients on a map of a city, a province or even a country.
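As a small hedged example (synthetic coordinates, assuming matplotlib is installed), client locations can be plotted with marker sizes encoding revenue so that major clients stand out:

import random
import matplotlib.pyplot as plt

# Hypothetical client coordinates (e.g. projected map positions) and revenue.
random.seed(0)
x = [random.uniform(0, 100) for _ in range(50)]
y = [random.uniform(0, 100) for _ in range(50)]
revenue = [random.uniform(1, 50) for _ in range(50)]

# Marker size encodes revenue, so the major clients stand out on the "map".
plt.scatter(x, y, s=revenue)
plt.title("Major clients by location (synthetic data)")
plt.xlabel("east-west position")
plt.ylabel("north-south position")
plt.show()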

6) SEQUENCE/TEMPORAL
Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time. The difference between sequence rules and other rules is the temporal factor.
Example - a retailer's database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven.
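A toy sketch of this idea (the purchase histories and the target item are invented) counts which items most often precede the target purchase in customers' time-ordered histories:

from collections import Counter

# Hypothetical per-customer purchase sequences, ordered by time.
histories = [
    ["kettle", "toaster", "microwave"],
    ["blender", "toaster", "microwave"],
    ["toaster", "kettle", "microwave"],
    ["kettle", "blender"],
]
TARGET = "microwave"

# Count items bought before the target, once per customer; the temporal order matters.
before = Counter()
for seq in histories:
    if TARGET in seq:
        before.update(set(seq[:seq.index(TARGET)]))
print(before.most_common())  # [('toaster', 3), ('kettle', 2), ('blender', 1)]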

5.      Steps of Data Mining Process
In order to conduct data mining analysis systematically, a general process is usually followed. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is widely used by industry members. This model consists of six phases intended as a cyclical process.

1) Business Understanding
Business understanding includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan. Tasks in this phase include:
·         Identifying your business goals
·         Assessing your situation
·         Defining your data mining goals
·         Producing your project plan

2) Data Understanding
Once business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and the verification of data quality. Data exploration, such as viewing summary statistics (which includes the visual display of categorical variables), can occur at the end of this phase. Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the data. Tasks for this phase include:
·         Gathering data
·         Describing data
·         Exploring data
·         Verifying data quality

3) Data Preparation
Once the available data resources are identified, they need to be selected, cleaned, built into the desired form, and formatted. Data cleaning and data transformation in preparation for data modeling need to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding. Tasks for this phase include:
·         Selecting data
·         Cleaning data
·         Constructing data
·         Integrating data
·         Formatting data

4) Modeling
Data mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Tasks for this phase include:
·         Selecting techniques
·         Designing tests
·         Building models
·         Assessing models

5) Evaluation
Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations. Tasks for this phase include:
·         Evaluating results
·         Reviewing the process
·         Determining the next steps

6) Deployment
Data mining can be used both to verify previously held hypotheses and for knowledge discovery (identification of unexpected and useful relationships). These models need to be monitored for changes in operating conditions, because what might be true today may not be true a year from now. If significant changes do occur, the model should be redone. It is also wise to record the results of data mining projects so that documented evidence is available for future studies. Tasks for this phase include:
·         Planning deployment (your methods for integrating data mining discoveries into use)
·         Reporting final results
·         Reviewing final results


Friday, March 3, 2017

Algorithm Analysis

The efficiency of an algorithm can be analyzed at two different stages, before implementation and after implementation. They are the following −
  • A Priori Analysis − This is a theoretical analysis of an algorithm. Efficiency of an algorithm is measured by assuming that all other factors, for example, processor speed, are constant and have no effect on the implementation.
  • A Posteriori Analysis − This is an empirical analysis of an algorithm. The selected algorithm is implemented using a programming language and then executed on the target machine. In this analysis, actual statistics like running time and space required are collected.
We shall learn about a priori algorithm analysis. Algorithm analysis deals with the execution or running time of various operations involved. The running time of an operation can be defined as the number of computer instructions executed per operation.

Algorithm Complexity

Suppose X is an algorithm and n is the size of the input data. The time and space used by algorithm X are the two main factors that decide the efficiency of X.
  • Time Factor − Time is measured by counting the number of key operations such as comparisons in the sorting algorithm.
  • Space Factor − Space is measured by counting the maximum memory space required by the algorithm.
The complexity of an algorithm, f(n), gives the running time and/or the storage space required by the algorithm in terms of n, the size of the input data.

Space Complexity

Space complexity of an algorithm represents the amount of memory space required by the algorithm in its life cycle. The space required by an algorithm is equal to the sum of the following two components −
  • A fixed part, which is the space required to store certain data and variables that are independent of the size of the problem. For example, simple variables and constants used, program size, etc.
  • A variable part, which is the space required by variables whose size depends on the size of the problem. For example, dynamic memory allocation, recursion stack space, etc.
Space complexity S(P) of any algorithm P is S(P) = C + SP(I), where C is the fixed part and SP(I) is the variable part of the algorithm, which depends on instance characteristic I. Following is a simple example that tries to explain the concept −
Algorithm: SUM(A, B)
Step 1 -  START
Step 2 -  C ← A + B + 10
Step 3 -  Stop
Here we have three variables A, B, and C, and one constant (10). Hence S(P) = 1 + 3. The actual space further depends on the data types of the given variables and constants, and is multiplied accordingly.

Time Complexity

Time complexity of an algorithm represents the amount of time required by the algorithm to run to completion. Time requirements can be defined as a numerical function T(n), where T(n) can be measured as the number of steps, provided each step consumes constant time.
For example, addition of two n-bit integers takes n steps. Consequently, the total computational time is T(n) = c ∗ n, where c is the time taken for the addition of two bits. Here, we observe that T(n) grows linearly as the input size increases.
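A quick empirical check of this linear growth (a posteriori analysis, in the terms above; the timings are machine-dependent) can be done in Python:

import timeit

# The time to sum n numbers should grow roughly linearly in n,
# i.e. T(n) ≈ c * n for some machine-dependent constant c.
for n in [100_000, 200_000, 400_000]:
    t = timeit.timeit(lambda: sum(range(n)), number=50)
    print(f"n = {n:7d}   time = {t:.4f} s")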

Thursday, March 2, 2017

Research Aptitude for Paper 1- Part 1

1.The main purpose of research in education is to _________

a) Help in the personal growth of an individual
b) Help the candidate become an eminent educationist
c) Increase job prospects of an individual
d) Increase social status of an individual 
ANSWER: b) Help the candidate become an eminent educationist
2._______ refers to inferring about the whole population based on the observations made on a small part.

a) Deductive inference
b) Inductive inference
c) Pseudo-inference
d) Objective inference
ANSWER: b) Inductive inference
3.Sampling is advantageous as it ________

a) Helps in capital-saving
b) Saves time
c) Increases accuracy
d) Both (a) and (b)
ANSWER: d) Both (a) and (b)

Sampling is the method of drawing a certain number of individuals from a particular set.


4.Random sampling is helpful as it is __________.

a) An economical method of data collection
b) Free from personal biases
c) Reasonably accurate 
d) All the above
ANSWER: d) All the above

In random sampling, each member of the set has an equal chance of selection.


5.Tippett table refers to ____________

a) Table of random digits
b) Table used in sampling methods
c) Table used in statistical investigations
d) All the above
ANSWER: d) All the above

This table was first published by L. H. C. Tippett in 1927.

6.Type-I Error occurs if ___________________

a) the null hypothesis is rejected even though it is true
b) the null hypothesis is accepted even though it is false
c) both the null hypothesis as well as alternative hypothesis are rejected
d) None of the above
ANSWER: a) the null hypothesis is rejected even though it is true

Type-I Error is also known as an error of the first kind.


7._________ is a preferred sampling method for the population with finite size.

a) Area sampling
b) Cluster sampling
c) Purposive sampling
d) Systematic sampling
ANSWER: d) Systematic sampling

In systematic sampling, members are selected at regular intervals from an ordered list, starting from a randomly chosen point.


8.Research and Development has become the index of development of a country. Which of the following reasons are true with regard to this statement?

a) Because R&D reflect the true economic and social conditions prevailing in a country
b) Because R&D targets the human development
c) Because R&D can improve the standard of living of the people in a country
d) All the above
ANSWER: d) All the above

9.The data of research is ______

a) Qualitative only
b) Quantitative only
c) Both (a) and (b)
d) Neither (a) nor (b)
ANSWER: c) Both (a) and (b)

Qualitative data deals with descriptive data and quantitative data deals with numbers.

10.The longitudinal approach of research deals with _________.

a) Horizontal researches
b) Long-term researches
c) Short-term researches
d) None of the above
ANSWER: b) Long-term researches

Longitudinal studies are often targeted at the same people.

Wednesday, March 1, 2017

COCOMO Model

    Basic COCOMO Model


    The basic COCOMO model helps to obtain a rough estimate of the project parameters. It estimates the effort and time required for development in the following way:

    Effort = a * (KDSI)^b PM
    Tdev = 2.5 * (Effort)^c Months

    where
    • KDSI is the estimated size of the software expressed in Kilo Delivered Source Instructions
    • a, b, c are constants determined by the category of software project
    • Effort denotes the total effort required for the software development, expressed in person months (PMs)
    • Tdev denotes the estimated time required to develop the software (expressed in months)

    The values of the constants a, b, and c are given below:

    Software project     a      b      c
    Organic              2.4    1.05   0.38
    Semi-detached        3.0    1.12   0.35
    Embedded             3.6    1.20   0.32
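A small Python sketch of the basic COCOMO computation, using the constants from the table above (the 32 KDSI example input is arbitrary):

# Constants (a, b, c) per project category, from the table above.
COCOMO_CONSTANTS = {
    "organic":       (2.4, 1.05, 0.38),
    "semi-detached": (3.0, 1.12, 0.35),
    "embedded":      (3.6, 1.20, 0.32),
}

def basic_cocomo(kdsi, category):
    """Return (effort in person-months, development time in months)."""
    a, b, c = COCOMO_CONSTANTS[category]
    effort = a * kdsi ** b        # Effort = a * (KDSI)^b PM
    tdev = 2.5 * effort ** c      # Tdev = 2.5 * (Effort)^c Months
    return effort, tdev

effort, tdev = basic_cocomo(32, "organic")
print(f"Effort ≈ {effort:.1f} PM, Tdev ≈ {tdev:.1f} months")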