Wednesday, March 15, 2017

ER Diagram Representation

Let us learn how the ER Model is represented by means of an ER diagram. Any object, such as an entity, an attribute of an entity, a relationship set, or an attribute of a relationship set, can be represented with the help of an ER diagram.

Entity

Entities are represented by means of rectangles. Rectangles are named with the entity set they represent.
[Figure: Entities in a school database]

Attributes

Attributes are the properties of entities. Attributes are represented by means of ellipses. Every ellipse represents one attribute and is directly connected to its entity (rectangle).
[Figure: Simple Attributes]
If an attribute is composite, it is further divided in a tree-like structure, with every node connected to its parent attribute. That is, a composite attribute is represented by ellipses that are connected to another ellipse.
[Figure: Composite Attributes]
Multivalued attributes are depicted by a double ellipse.
[Figure: Multivalued Attributes]
Derived attributes are depicted by a dashed ellipse.
[Figure: Derived Attributes]

Relationship

Relationships are represented by a diamond-shaped box. The name of the relationship is written inside the box. All the entities (rectangles) participating in a relationship are connected to it by lines.

Binary Relationship and Cardinality

A relationship in which two entities participate is called a binary relationship. Cardinality is the number of instances of one entity that can be associated, via the relationship, with an instance of the other entity. The common cardinalities are listed below; a short sketch after the list shows one way a cardinality can be carried over into application code.
  • One-to-one − When only one instance of an entity is associated with the relationship, it is marked as '1:1'. The following image reflects that only one instance of each entity should be associated with the relationship. It depicts a one-to-one relationship.
  [Figure: One-to-one]
  • One-to-many − When more than one instance of an entity is associated with a relationship, it is marked as '1:N'. The following image reflects that only one instance of the entity on the left and more than one instance of the entity on the right can be associated with the relationship. It depicts a one-to-many relationship.
  [Figure: One-to-many]
  • Many-to-one − When more than one instance of an entity is associated with the relationship, it is marked as 'N:1'. The following image reflects that more than one instance of the entity on the left and only one instance of the entity on the right can be associated with the relationship. It depicts a many-to-one relationship.
  [Figure: Many-to-one]
  • Many-to-many − The following image reflects that more than one instance of the entity on the left and more than one instance of the entity on the right can be associated with the relationship. It depicts a many-to-many relationship.
  [Figure: Many-to-many]
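A cardinality constraint also shapes how the model is eventually programmed. The following is a minimal, hypothetical Python sketch of a one-to-many (1:N) association; the Department and Employee entities are invented for illustration and are not taken from the figures above.

# Hypothetical sketch: one Department (the '1' side) is linked to many
# Employee instances (the 'N' side); each Employee refers back to exactly
# one Department.
class Department:
    def __init__(self, name):
        self.name = name
        self.employees = []          # the 'many' side of the 1:N relationship

class Employee:
    def __init__(self, name, department):
        self.name = name
        self.department = department  # each employee has exactly one department
        department.employees.append(self)

if __name__ == "__main__":
    hr = Department("HR")
    Employee("Asha", hr)
    Employee("Ravi", hr)
    print([e.name for e in hr.employees])   # ['Asha', 'Ravi']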

Participation Constraints

  • Total Participation − Each entity in the entity set is involved in the relationship. Total participation is represented by double lines.
  • Partial Participation − Not all entities in the entity set are involved in the relationship. Partial participation is represented by a single line.
[Figure: Participation Constraints]

Tuesday, March 14, 2017

DBMS - Architecture

The design of a DBMS depends on its architecture, which can be centralized, decentralized, or hierarchical. The architecture of a DBMS can be seen as either single-tier or multi-tier. An n-tier architecture divides the whole system into n related but independent modules that can be independently modified or replaced.
In 1-tier architecture, the DBMS is the only entity: the user works directly on the DBMS, and any changes made are applied directly to the DBMS itself. It does not provide handy tools for end users, so it is the architecture database designers and programmers normally prefer.
If the architecture of a DBMS is 2-tier, it must have an application through which the DBMS is accessed. Programmers use 2-tier architecture when they access the DBMS by means of an application. Here the application tier is entirely independent of the database in terms of operation, design, and programming.

3-tier Architecture

A 3-tier architecture separates its tiers from each other based on the complexity of the users and how they use the data present in the database. It is the most widely used architecture to design a DBMS.

  • Database (Data) Tier − At this tier, the database resides along with its query processing languages. We also have the relations that define the data and their constraints at this level.
  • Application (Middle) Tier − At this tier reside the application server and the programs that access the database. For a user, this application tier presents an abstracted view of the database. End-users are unaware of any existence of the database beyond the application. At the other end, the database tier is not aware of any other user beyond the application tier. Hence, the application layer sits in the middle and acts as a mediator between the end-user and the database.
  • User (Presentation) Tier − End-users operate on this tier and they know nothing about any existence of the database beyond this layer. At this layer, multiple views of the database can be provided by the application. All views are generated by applications that reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are independent and can be changed independently.
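As a toy illustration of this layering (a sketch under assumed names, using Python and the standard-library sqlite3 module purely as an example of a data tier), the three tiers can be kept separate even in a very small program:

# Hypothetical 3-tier sketch: the table, data, and function names are
# invented for illustration and are not tied to any particular DBMS.
import sqlite3

# Database (Data) Tier: the relation and its data live here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO student (name) VALUES ('Alice'), ('Bob')")

# Application (Middle) Tier: mediates between end users and the database.
def list_students():
    # The caller gets an abstracted view and never sees SQL or the schema.
    rows = conn.execute("SELECT name FROM student ORDER BY name").fetchall()
    return [name for (name,) in rows]

# User (Presentation) Tier: end users see only the view offered here.
if __name__ == "__main__":
    for name in list_students():
        print(name)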

Monday, March 13, 2017

Important Questions Series 1

1)       Partial order set is:
a. Reflexive, Transitive
b. Anti Symmetric
c. Symmetric
d. a and b
e. a and c
f. None

If you consider option (e), the relation would become an equivalence relation; a partial order is reflexive, antisymmetric, and transitive, so the answer is (d).

 2) The Relation R={(1,2),(2,2),(2,3),(3,3),(3,2),(4,4),(4,2)} is Reflexive (True / False)
False. A relation R on a set is reflexive if (x, x) is in R for every element x. Here R contains (2,2), (3,3) and (4,4) but not (1,1), so the relation R is not reflexive.

3) A B-tree of minimum degree t has a maximum of how many pointers in a node?
(1) t-1
(2) 2t-1
(3) 2t
(4)t
4) Unix operating system is an example of
(a) batch os
(b) multi user os
(c) partially multi user os
(d) single user os
5) Which of the following allows a simple email service and is responsible for moving messages from one mail server to another?
A. IMAP B. DHCP
C. SMTP D. FTP
E. POP3
6) A device on one network can communicate with devices on another network via a
1. Hub/Switch
2. Utility Server
3. File Server
4. Gateway
7) Congestion control is done by:
1.Network layer
2.Physical Layer
3.Transport Layer
4.Application Layer
8) Which storage device is mounted on 'reels'?
1 Floppy Disk
2 Hard Disk
3 Magnetic Tapes
4 CDROM
9) Consider a relation R(A,B,C,D,E,F,G,H) with functional dependencies ABC→DE, F→FG, H→G, G→H, ABCD→EF. The answer is that it is in 2NF.
ABC is the only candidate key, so D, E, F, G and H are the non-prime attributes. No proper subset of ABC determines any non-prime attribute, so there is no partial dependency and the relation is at least in 2NF. However, a non-key attribute determines another non-key attribute (for example F→G and G→H), so transitive dependencies are present and the relation is not in 3NF. So the answer is 2NF.
10) Which type of link is used for a connection between two DTE devices?
1. X.21
2. Frame relay
3. ATM
4. Modem
11) What does the 'B' in B-tree stand for? It stands for "balanced tree".
12) One can convert a binary tree into its mirror image by traversing it in
(1) postorder
(2) inorder
(3) preorder
(4)any order
Answer: postorder. Traversing the tree in postorder and swapping the left and right children of every node produces the mirror image. (Inorder traversal of a binary search tree gives the nodes in ascending order, and preorder gives a depth-first ordering.)
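A short Python sketch of this idea (the Node class and the sample tree are made up for illustration):

# Mirror a binary tree by swapping children while traversing in postorder.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def mirror(node):
    if node is None:
        return
    mirror(node.left)                               # left subtree first
    mirror(node.right)                              # then right subtree
    node.left, node.right = node.right, node.left   # swap at the node (postorder step)

def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)

root = Node(2, Node(1), Node(3))
mirror(root)
print(inorder(root))   # [3, 2, 1] -- the mirror of the original [1, 2, 3]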
13) Given the following expressions of a grammar
E --> E * F / F + E / F
F --> F – F / id
Which of the following is true?
(A) * has higher precedence than +
(B) – has higher precedence than *
(C) + and – have same precedence
(D) + has higher precedence than *
The lower-priority operators always appear at the first (outermost) level of the grammar and the higher-priority operators at deeper levels. Since * and + appear at the first level (in the productions for E) and – appears at the next level (in the productions for F), – has higher precedence than *. Hence the answer is (B).
14) The amount of uncertainty in a system of symbol is called
1.Bandwidth
2.Entropy
3.Loss
4.Quantum
15) A 10Base-2 network is limited to
1.20 bytes per data field
2.30 stations per segment
3.40 segments
4.50 feet of cable
16) 76% like the UGC exam and 63% like the GATE exam; assuming everyone likes at least one of them, how many like both?
By inclusion-exclusion: 76 + 63 − 100 = 39%.
17) RAID stands for
(1) rapid action in disaster
(2) random access internal disks
(3) rapid application interface design
(4) redundant arrays of independent disks
18) Which of the following statements is/are true?
1 Cache Memories are bigger than RAM
2 Cache Memories are smaller than RAM
3 ROM are faster than RAM
4 Information in ROM can be written by users
19) The graphic method of LP uses
a. Objective function equation
b. Constraint equation
c. linear equations.
d. all of the above
20) In a high resolution mode, the number of dots in a line will usually be ----
1 320
2 640
3 760
4 900

Wednesday, March 8, 2017

Asymptotic Analysis

Asymptotic analysis of an algorithm refers to defining the mathematical bounds of its run-time performance. Using asymptotic analysis, we can draw conclusions about the best-case, average-case, and worst-case scenarios of an algorithm.
Asymptotic analysis is input bound, i.e., if there is no input to the algorithm, it is concluded to work in constant time. Other than the input, all other factors are considered constant.
Asymptotic analysis refers to computing the running time of any operation in mathematical units of computation. For example, the running time of one operation may be computed as f(n) and that of another operation as g(n²). This means the running time of the first operation will increase linearly with the increase in n, while the running time of the second operation will increase quadratically as n increases. Similarly, the running times of both operations will be nearly the same if n is small.
Usually, the time required by an algorithm falls under three types −
  • Best Case − Minimum time required for program execution.
  • Average Case − Average time required for program execution.
  • Worst Case − Maximum time required for program execution.

Asymptotic Notations

Following are the commonly used asymptotic notations to calculate the running time complexity of an algorithm.
  • Ο Notation
  • Ω Notation
  • θ Notation

Big Oh Notation, Ο

The notation Ο(n) is the formal way to express the upper bound of an algorithm's running time. It measures the worst case time complexity or the longest amount of time an algorithm can possibly take to complete.
[Figure: Big O Notation]
For example, for a function f(n)
Ο(f(n)) = { g(n) : there exist constants c > 0 and n0 such that g(n) ≤ c·f(n) for all n > n0 }
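For instance (a worked example with arbitrarily chosen functions, not taken from the figure above), g(n) = 3n + 2 belongs to Ο(n) with c = 4 and n0 = 2, which the short check below confirms numerically:

# Numeric sanity check that 3n + 2 <= 4n for all n > 2 (so 3n + 2 is O(n)).
c, n0 = 4, 2
g = lambda n: 3 * n + 2
f = lambda n: n
assert all(g(n) <= c * f(n) for n in range(n0 + 1, 10000))
print("3n + 2 <= 4*n holds for every tested n >", n0)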

Omega Notation, Ω

The notation Ω(n) is the formal way to express the lower bound of an algorithm's running time. It measures the best case time complexity, or the minimum amount of time an algorithm can possibly take to complete.
[Figure: Omega Notation]
For example, for a function f(n)
Ω(f(n)) = { g(n) : there exist constants c > 0 and n0 such that g(n) ≥ c·f(n) for all n > n0 }

Theta Notation, θ

The notation θ(n) is the formal way to express both the lower bound and the upper bound of an algorithm's running time. It is represented as follows −
[Figure: Theta Notation]
θ(f(n)) = { g(n) : g(n) = Ο(f(n)) and g(n) = Ω(f(n)) }

Common Asymptotic Notations

Following is a list of some common asymptotic notations −
  • constant − Ο(1)
  • logarithmic − Ο(log n)
  • linear − Ο(n)
  • n log n − Ο(n log n)
  • quadratic − Ο(n²)
  • cubic − Ο(n³)
  • polynomial − n^Ο(1)
  • exponential − 2^Ο(n)
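To get a feel for these classes, the small script below (an illustrative sketch, not tied to any particular algorithm) prints how a few of them grow as n increases:

# Rough growth of some common complexity classes at n = 8, 16, 32.
import math

growth = {
    "O(1)":       lambda n: 1,
    "O(log n)":   lambda n: math.log2(n),
    "O(n)":       lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)":     lambda n: n ** 2,
    "O(2^n)":     lambda n: 2 ** n,
}

for name, fn in growth.items():
    print(f"{name:>10}: " + ", ".join(f"{fn(n):,.0f}" for n in (8, 16, 32)))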

Sunday, March 5, 2017

Introduction to Data Mining



1.      Data Mining

There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so analyzing this huge amount of data and extracting useful information from it is necessary.

The extraction of information is not the only process we need to perform; it also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we are in a position to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.

Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to find the information that is hidden and unexpected.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing. Although data mining is still a relatively new technology, it is already used in a number of industries.

It is the process of collecting, searching through, and analyzing a large amount of data in a database, so as to discover patterns or relationships.

These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
a)      Forecasting: Estimating sales, predicting server loads or server downtime
b)      Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
c)      Recommendations: Determining which products are likely to be sold together, generating recommendations
d)      Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
e)      Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities

2.      Data Mining Applications
Here is the list of applications of Data Mining:
1.      Market Analysis and Management
2.      Corporate Analysis & Risk Management
3.      Fraud Detection
4.      Other Applications

1)      Market Analysis and Management

Following are the various areas of market analysis and management where data mining is used:
a)      Customer Profiling - Data Mining helps to determine what kind of people buy what kind of products.
b)      Identifying Customer Requirements - Data Mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
c)      Cross Market Analysis - Data Mining performs Association/correlations between product sales.
d)      Target Marketing - Data Mining helps to find clusters of model customers who share the same characteristics such as interest, spending habits, income etc.
e)      Determining Customer Purchasing Patterns - Data mining helps in determining customer purchasing patterns.
f)       Providing Summary Information - Data Mining provides us with various multidimensional summary reports.

2)      Corporate Analysis & Risk Management
Following are the various fields of Corporate Sector where data mining is used:
a)      Finance Planning and Asset Evaluation - It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.
b)      Resource Planning - It involves summarizing and comparing the resources and spending.
c)      Competition - It involves monitoring competitors and market directions.

3)      Fraud Detection
Data Mining is also used in the fields of credit card services and telecommunication to detect fraud. For fraudulent telephone calls, it helps to find the destination, duration, and time of day or week of the call. It also analyzes patterns that deviate from expected norms.

4)      Other Applications
Data Mining is also used in other fields such as sports, astrology, and Internet Web Surf-Aid.

3.      KDD

KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories to help humans extract useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them. Data Mining is the application of a specific algorithm in order to extract patterns from data. Nonetheless, the terms KDD and Data Mining are often used interchangeably.

What is KDD?
As mentioned above, KDD is a field of computer science that deals with the extraction of previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract, and useful.

This is achieved by creating short reports, modeling the process of generating data, and developing predictive models that can predict future cases. Due to the exponential growth of data, especially in areas such as business, KDD has become a very important process for converting this large wealth of data into business intelligence, as manual extraction of patterns has become seemingly impossible in the past few decades. For example, it is currently used for applications such as social network analysis, fraud detection, science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval and, largely, marketing. KDD is usually used to answer questions like: what are the main products that might help to obtain high profit next year in Wal-Mart?

This process has several steps. It starts with developing an understanding of the application domain and the goal, and then creating a target dataset. This is followed by cleaning, preprocessing, reduction, and projection of the data. The next step is using Data Mining (explained below) to identify patterns. Finally, the discovered knowledge is consolidated by visualizing and/or interpreting it.

What is the difference between KDD and Data mining?
Although the two terms KDD and Data Mining are heavily used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process which deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.

Steps involved in knowledge discovery process

Here is the list of steps involved in the knowledge discovery process (a minimal code sketch after the list walks through them on toy data):

a)      Data Cleaning - In this step, noise and inconsistent data are removed.
b)      Data Integration - In this step, multiple data sources are combined.
c)      Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
d)      Data Transformation - In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
e)      Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
f)       Pattern Evaluation - In this step, data patterns are evaluated.
g)      Knowledge Presentation - In this step, knowledge is represented.
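The following is a deliberately tiny Python walk-through of these steps on made-up records; every field name and threshold is hypothetical and only meant to show where each step fits:

# Toy knowledge-discovery pipeline on invented customer records.
raw = [
    {"customer": "A", "age": 34,   "spend": 120.0},
    {"customer": "B", "age": None, "spend": 80.0},   # inconsistent record
    {"customer": "C", "age": 29,   "spend": 300.0},
]

# Data Cleaning: drop records with missing values.
cleaned = [r for r in raw if all(v is not None for v in r.values())]

# Data Selection / Transformation: keep the relevant fields and bucket spend.
selected = [{"customer": r["customer"],
             "band": "high" if r["spend"] > 200 else "low"} for r in cleaned]

# Data Mining: extract a (very naive) pattern -- customers per spend band.
pattern = {}
for r in selected:
    pattern[r["band"]] = pattern.get(r["band"], 0) + 1

# Pattern Evaluation / Knowledge Presentation: report the discovered pattern.
print("spend-band distribution:", pattern)   # {'low': 1, 'high': 1}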

4.      Primary goals of data mining

The two "high-level" primary goals of data mining, in practice, are prediction and description.

Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.

The relative importance of prediction and description for particular data mining applications can vary considerably. However, in the context of KDD, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition) where prediction is often the primary goal of the KDD process.
The goals of prediction and description are achieved by using the following primary data mining tasks:

1)      CLASSIFICATION/PREDICTION
Classification involves the discovery of a predictive learning function that classifies a data item into one of several predefined classes. It involves examining the features of a newly presented object and assigning to it a predefined class.

Classification is a two-step process. First a model is built describing a predetermined set of data classes or concepts and secondly, the model is used for classification.

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value range of an attribute that a given sample is likely to have.

Any of the techniques used for classification can be adapted for use in prediction by using training examples where the value of the variable to be predicted is already known, along with historical data for those examples. Typical business-related questions that can be answered using classification or prediction tasks are listed below, followed by a small illustrative sketch:
·         Which customers will buy?
·         Which products will the customer buy?
·         How much will the customer buy?
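A minimal sketch of the two-step idea (build a model from labeled examples, then classify a new object); the training data and the 1-nearest-neighbour rule below are invented purely for illustration:

# Step 1: the stored, labeled training examples act as the model.
train = [            # (annual_income_k, age) -> will_buy
    ((25, 23), "no"),
    ((48, 35), "yes"),
    ((60, 41), "yes"),
    ((30, 52), "no"),
]

# Step 2: classify a newly presented object by the label of its nearest example.
def classify(features):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda ex: dist(ex[0], features))
    return label

print(classify((55, 38)))   # -> 'yes'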

2)      ESTIMATION
While classification deals with discrete outcomes such as yes or no, debit card, home loan, or vehicle financing, estimation deals with continuously valued outcomes. If some input data is available, estimation can be used to come up with some unknown continuous variable such as income or height. In estimation, one wants to come up with a plausible value or a range of plausible values for the unknown parameters of a system. Typical examples of estimation, and the business-related questions that can be addressed by making use of it, include the following (a tiny worked example follows the list):
·         How many children are in a family?
·         Estimating a family’s total household income.
·         Estimating the value of a piece of property.
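As a tiny worked example (the numbers are invented), a least-squares line can estimate a continuous value such as income from years of experience:

# Fit income (continuous) from years of experience with a least-squares line.
xs = [1, 3, 5, 7, 9]        # years of experience
ys = [30, 42, 55, 68, 81]   # observed income, in thousands

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"estimated income at 6 years: {intercept + slope * 6:.1f}k")   # about 61.6k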

3) SEGMENTATION
Segmentation simply means making different offers to different market segments: groups of people defined by some combination of demographic variables such as age, gender, or income. Segmentation can be defined as a form of analysis used to, for instance, break down the visitors to a website into unique groups with individual behaviors.

The grouping can then be used to make statistical projections, such as the potential amount of purchases they are likely to make. Typical business questions that can be answered using segmentation are:
·         What are the different types of visitors attracted to our website?
·         Which age groups do the listeners of a certain radio station fall into?

4) CLUSTERING
Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.

Clustering is commonly used to search for unique groupings within a data set. The distinguishing factor between clustering and classification is that in clustering there are no predefined classes and no examples. The objects are grouped together based on self-similarity. Typical business questions that can be answered using clustering are:
·         What are the groupings hidden in your data?
·         Which customer should be grouped together for target marketing purposes?

Clustering is grouped under descriptive data mining tasks. Clustering is best used for finding groups of items that are similar; for example, given a data set of customers, identify subgroups of customers that have similar buying behavior.
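A bare-bones k-means sketch (k = 2, one-dimensional purchase amounts; the data and the choice of k are made up) shows the grouping idea:

# Group purchase amounts into two clusters of similar values.
data = [12.0, 15.0, 14.0, 98.0, 105.0, 110.0]
centres = [data[0], data[-1]]                  # naive initialisation

for _ in range(10):                            # a few refinement passes
    clusters = {0: [], 1: []}
    for x in data:                             # assign each point to its nearest centre
        nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
        clusters[nearest].append(x)
    centres = [sum(pts) / len(pts) if pts else centres[i]
               for i, pts in sorted(clusters.items())]

print(centres)   # two similar subgroups: small spenders and large spenders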

5) DESCRIPTION AND VISUALIZATION
The purpose of data mining is sometimes simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products, or processes that produced the data in the first place. A good enough description of behavior will often suggest an explanation for it as well. One of the most powerful forms of descriptive data mining is data visualization. Although visualization is not always easy, the right picture can truly speak a thousand words, since human beings are extremely practiced at extracting meaning from visual scenes.

Visualization can be useful in providing a visual representation of the location and distribution of a company's major clients on a map of a city, a province, or even a country.

6) Sequence/Temporal
Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time. The difference between sequence rules and other rules is the temporal factor.
Example - a retailer's database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven.
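A small illustrative count (the purchase sequences below are made up) of which item most often precedes a target purchase:

# Count the item that immediately precedes a target purchase in each sequence.
from collections import Counter

sequences = [
    ["kettle", "toaster", "microwave"],
    ["blender", "microwave"],
    ["toaster", "microwave", "kettle"],
]

target = "microwave"
preceding = Counter(seq[seq.index(target) - 1]
                    for seq in sequences
                    if target in seq and seq.index(target) > 0)
print(preceding.most_common(1))   # [('toaster', 2)]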

5.      Steps of Data Mining Process
In order to systematically conduct data mining analysis, a general process is usually followed. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is widely used by industry members. This model consists of six phases intended as a cyclical process.

1) Business Understanding − Business understanding includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan. Tasks in this phase include:
·         Identifying your business goals
·         Assessing your situation
·         Defining your data mining goals
·         Producing your project plan

2) Data Understanding − Once business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and the verification of data quality. Data exploration such as viewing summary statistics (which includes the visual display of categorical variables) can occur at the end of this phase. Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the data. Tasks for this phase include:
·         Gathering data
·         Describing
·         Exploring
·         Verifying quality

3) Data Preparation − Once the available data resources are identified, they need to be selected, cleaned, built into the form desired, and formatted. Data cleaning and data transformation in preparation for data modeling need to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding. Tasks for this phase include:
·         Selecting data
·         Cleaning data
·         Constructing
·         Integrating
·         Formatting

4) Modeling − Data mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Tasks for this phase include:
·         Selecting techniques
·         Designing tests
·         Building models
·         Assessing models

5) Evaluation − Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations. Tasks for this phase include:
·         Evaluating results
·         Reviewing the process
·         Determining the next steps

6) Deployment − Data mining can be used both to verify previously held hypotheses and for knowledge discovery (identification of unexpected and useful relationships). These models need to be monitored for changes in operating conditions, because what might be true today may not be true a year from now. If significant changes do occur, the model should be redone. It's also wise to record the results of data mining projects so documented evidence is available for future studies. Tasks for this phase include:
·         Planning deployment (your methods for integrating data mining discoveries into use)
·         Reporting final results
·         Reviewing final results