24 results in Data Mining and Data Warehousing
12 - Data Warehouse
Summary
Chapter Objectives
✓ To understand the need for an operational data store in OLTP and OLAP systems
✓ To understand data warehousing, its benefits and architecture
✓ To do a comparative study of OLAP, OLTP and ODS
✓ To comprehend the concept of a data mart
In order to understand what an Operational Data Store is, it is important to know what led to the development of such stores and what limitations prevented OLTP systems from answering management queries.
The Need for an Operational Data Store (ODS)
Online Transaction Processing (OLTP) systems have become popular due to their versatility. From financial transactions to daily log operations, they are used by countless multinational companies and global organizations to record transaction details.
In the present scenario, organizations and their branches operate in many locations across the world, and each such branch generates massive amounts of data. The management of a large retail chain operating from multiple locations, for example, will want to know at the end of the day about the transactions done that day. Take the case of a Domino's Pizza store, where the management needs to know the total sales for that day or other details such as the number and types of pizzas sold. Such companies rely on OLTP systems to collect data from multiple stores spanning the world. On these OLTP systems, queries usually run on an indexed database, which makes searching fast and efficient. Unfortunately, data spread over multiple systems leads to a plethora of technical problems when even the simple task of running queries on data stored in OLTP systems is carried out.
When dealing with corporate data, it is also necessary to maintain correct and accurate information in order to provide swift customer support services. This is possible only if the data is consolidated from all information sources.
It is very important for any organization's management to know the real state of affairs of their organization. However, OLTP systems fall short here, as they were not designed to support management queries, which are typically complex and require multiple joins and aggregations. To overcome this limitation of OLTP systems, some solutions were proposed; these are discussed in the following sections.
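As a rough illustration of the kind of management query in question, the following minimal R sketch (with made-up store and sales figures) summarizes a day's transactions by store; in practice such a query would have to be answered across many distributed OLTP databases.

```r
# hypothetical daily transactions collected from several stores
sales <- data.frame(
  store  = c("Delhi", "Delhi", "Mumbai", "Mumbai", "Pune"),
  pizza  = c("Margherita", "Farmhouse", "Margherita", "Peppy Paneer", "Margherita"),
  amount = c(250, 410, 250, 380, 250)
)

# management-style query: total sales per store for the day
aggregate(amount ~ store, data = sales, FUN = sum)

# number of pizzas sold per store
table(sales$store)
```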
List of Figures
9 - Association Mining
Summary
Chapter Objectives
✓ To comprehend the concept of association mining and its applications
✓ To understand the role of support, confidence and lift
✓ To comprehend the Naïve algorithm, its limitations and improvements
✓ To learn about approaches for transaction database storage
✓ To understand and demonstrate the use of the Apriori algorithm and the direct hashing and pruning algorithm
✓ To use dynamic itemset counting to identify association rules
✓ To use FP-growth for mining frequent patterns without candidate generation
Introduction to Association Rule Mining
Association rule mining, often known as ‘market basket’ analysis, is a very effective technique for finding the association of the sale of item X with item Y. In simple words, market basket analysis consists of examining the items in the baskets of shoppers checking out at a market to see what types of items ‘go together’, as illustrated in Figure 9.1.
It would be useful to know what kinds of items people tend to buy during the same shopping trip when they make a trip to the store. For example, as shown in Figure 9.1, a database of customer transactions (i.e., shopping baskets) is given where each transaction consists of a set of items (i.e., products purchased during a visit). Association rule mining is used to identify groups of items which are frequently purchased together, i.e., customers' purchasing behavior. For example, ‘IF one buys bread and milk, THEN he/she also buys eggs with high probability.’ This information is useful to the store manager for better planning of the stocking of items in the store, improving its sales and efficiency.
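To make the role of support, confidence and lift concrete, here is a minimal sketch in base R that evaluates the hypothetical rule {bread, milk} → {eggs} over a small made-up set of baskets; it is only an illustration of the definitions, not the book's worked example.

```r
# toy transaction database: each element is one shopping basket
baskets <- list(
  c("bread", "milk", "eggs"),
  c("bread", "milk"),
  c("milk", "eggs"),
  c("bread", "milk", "eggs", "butter"),
  c("bread", "butter")
)
n <- length(baskets)
has <- function(items) sapply(baskets, function(b) all(items %in% b))

# rule: {bread, milk} -> {eggs}
supp_xy <- sum(has(c("bread", "milk", "eggs"))) / n   # support of the whole itemset
supp_x  <- sum(has(c("bread", "milk"))) / n           # support of the antecedent
supp_y  <- sum(has("eggs")) / n                       # support of the consequent

confidence <- supp_xy / supp_x      # P(eggs | bread, milk)
lift       <- confidence / supp_y   # > 1 indicates a positive association
c(support = supp_xy, confidence = confidence, lift = lift)
```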
Let us suppose that the store manager receives customer complaints about a heavy rush in his store and its consequently slow service. He may then decide to place associated items such as bread and milk together, so that customers can buy the items more easily and quickly than if these were placed at a distance from each other. It also improves the sale of each product. In another scenario, let us suppose his store is new and the store manager wishes to display its full range of products to prospective customers.
List of Tables
Dedication
14 - Online Analytical Processing
Summary
Chapter Objectives
✓ To understand the need for online analytical processing
✓ To distinguish OLAP from OLTP and data mining
✓ To comprehend the representation of multi-dimensional view of data
✓ To understand and apply the concept of data cube in real world applications
✓ To implement multi-dimensional view of data in Oracle
✓ To understand different types of OLAP Servers such as ROLAP and MOLAP
✓ To be able to perform different OLAP operations on the database
Introduction to Online Analytical Processing
With the huge growth of data warehousing in the recent past, the demand for more powerful access tools that support advanced analytical processing of historical data has increased. Online Analytical Processing (OLAP) and data mining are two types of access tools that have been developed to meet the demands of management users. OLAP and data mining are distinct in what they offer the user, and because of this they are complementary technologies.
OLAP is a design paradigm that provides a method to extract useful information from a physical data store. It aggregates information from multiple systems and provides summarized information/view to the management, while data mining is used to find hidden patterns within the data.
OLAP summarizes data and makes forecasts. For example, it answers questions like ‘What are the average sales of cycles, by region and by year?’ Data mining discovers hidden patterns in data and operates at a detailed level instead of a summary level. For instance, in the telecom industry, where customer churn is a key factor, data mining would answer questions like ‘Who is likely to shift service providers, and what are the reasons for that?’ In conclusion, OLAP answers questions such as ‘Is this true?’ while data mining answers questions such as ‘Why is this happening? And what might happen if …?’
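As a rough sketch of such a summary query, the following R snippet (with made-up cycle sales figures) computes average sales by region and by year and then shows the same data as a small two-dimensional region-by-year view; it only illustrates the idea, not an actual OLAP server.

```r
# hypothetical cycle sales data
sales <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  year   = c(2017, 2018, 2017, 2018, 2018, 2017),
  units  = c(120, 150, 90, 130, 160, 100)
)

# OLAP-style summary: average sales of cycles, by region and by year
aggregate(units ~ region + year, data = sales, FUN = mean)

# the same data cross-tabulated as a small two-dimensional view (total units)
xtabs(units ~ region + year, data = sales)
```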
Thus, OLAP and data mining complement each other and help the management in better decision making. An environment that consists of a data warehouse or data marts, together with tools such as OLAP and/or data mining, is collectively known as Business Intelligence (BI) technology.
Defining OLAP
Dr E. F. Codd (1993) defined OLAP as ‘the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data’.
From this definition, it is clear that OLAP deals with a multi-dimensional view of data, as compared to the simple relational view of a database.
8 - Implementing Clustering with Weka and R
Summary
Chapter Objectives
✓ To apply the K-means algorithm in Weka and R language
✓ To interpret the results of clustering
✓ To identify the optimum number of clusters
✓ To apply classification on un-labeled data by using clustering as an intermediate step
Introduction
As discussed earlier, if data is not labeled then we can analyze it by performing cluster analysis, where clustering refers to the task of grouping a set of objects into classes of similar objects.
In this chapter, we will apply clustering to Fisher's Iris dataset. We will use clustering algorithms to group flower samples into clusters with similar flower dimensions. These clusters then become possible ways to group flower samples into species. We will implement a simple k-means algorithm to cluster numerical attributes with the help of Weka and R.
In the case of classification, we know the attributes and classes of instances. For example, the flower dimensions and classes were already known to us for the Iris dataset. Our goal was to predict the class of an unknown sample as shown in Figure 8.1.
Earlier, we used the Weka J48 classification algorithm to build a decision tree on Fisher's Iris dataset using samples with known class, which helped in predicting the class of unknown samples. We used the flower's Sepal length and width, and the Petal length and width as the specific attributes for this. Based on flower dimensions and using this tree, we can identify an unknown Iris as one of three species, Setosa, Versicolor, and Virginica.
In clustering, we know the attributes for the instances, but we don't know the classes. For example, we know the flower dimensions for samples of the Iris dataset but we don't know what classes exist as shown in Figure 8.2. Therefore, our goal is to group instances into clusters with similar attributes or dimensions and then identify the class.
In this chapter, we will learn what happens if we don't know what classes the samples belong to, how many classes there are, or even what defines a class. Since Fisher's Iris dataset is already labeled, we will first make it unlabeled by removing the class attribute, i.e., the species column. Then, we will apply clustering algorithms to cluster this data on the basis of its input attributes, i.e., Sepal length, Sepal width, Petal length, and Petal width.
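A minimal R sketch of this workflow, using the built-in iris data frame, might look as follows; the choice of three clusters and the random seed are assumptions for illustration only.

```r
# drop the class attribute (species) to obtain an unlabeled version of the Iris data
iris_unlabeled <- iris[, -5]

set.seed(42)                                  # k-means starts from random centroids
km <- kmeans(iris_unlabeled, centers = 3)     # ask for three clusters

# compare the discovered clusters against the species labels we held back
table(cluster = km$cluster, species = iris$Species)
```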
Index
5 - Classification
Summary
Chapter Objectives
✓ To comprehend the concept, types and working of classification
✓ To identify the major differences between classification and regression problems
✓ To become familiar with the working of classification
✓ To introduce the decision tree classification system with concepts of information gain and Gini Index
✓ To understand the workings of the Naïve Bayes method
Introduction to Classification
Nowadays, databases are used for making intelligent decisions. Two forms of data analysis, namely classification and regression, are used for predicting future trends by analyzing existing data. Classification models predict a discrete value or class, while regression models predict a continuous value. For example, a classification model can be built to predict whether India will win a cricket match or not, while regression can be used to predict the runs that will be scored by India in a forthcoming cricket match.
Classification is a classical method used by machine learning researchers and statisticians for predicting the outcome of unknown samples. It is used for the categorization of objects (or things) into a given discrete number of classes. Classification problems can be of two types, either binary or multiclass. In binary classification the target attribute can have only two possible values. For example, a tumor is either cancerous or not, a team will either win or lose, the sentiment of a sentence is either positive or negative, and so on. In multiclass classification, the target attribute can have more than two values. For example, a tumor can be of type 1, type 2 or type 3 cancer; the sentiment of a sentence can be happy, sad, angry or of love; news stories can be classified as weather, finance, entertainment or sports news.
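The distinction can be sketched in a few lines of R; the snippet below uses the built-in mtcars data and the rpart package purely as an illustrative assumption, training one model that predicts a discrete class and another that predicts a continuous value.

```r
library(rpart)

# Classification: predict a discrete class (transmission type) from car attributes
cls <- rpart(factor(am) ~ hp + wt, data = mtcars, method = "class")
predict(cls, mtcars[1, ], type = "class")   # returns a class label ("0" or "1")

# Regression: predict a continuous value (fuel consumption) from the same attributes
reg <- rpart(mpg ~ hp + wt, data = mtcars, method = "anova")
predict(reg, mtcars[1, ])                   # returns a numeric mpg estimate
```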
Some examples of business situations where the classification technique is applied are:
• To analyze the credit history of bank customers to identify if it would be risky or safe to grant them loans.
• To analyze the purchase history of a shopping mall's customers to predict whether they will buy a certain product or not.
In the first example, the system will predict a discrete value representing either risky or safe, while in the second example, the system will predict yes or no.
Some more examples to distinguish the concept of regression from classification are:
• To predict how much a given customer will spend during a sale.
4 - Data Preprocessing
Summary
Chapter Objectives
✓ To understand the need for data preprocessing.
✓ To identify different phases of data preprocessing such as data cleaning, data integration, data transformation and data reduction
Need for Data Preprocessing
For any data analyst, one of the most important concerns is the data itself. In fact, the representation and quality of the data being used for an analysis is the first and foremost concern to be addressed by any analyst. In the context of data mining and machine learning, ‘Garbage in, garbage out’ is a popular saying when working with large quantities of data.
Commonly, we end up with a lot of noisy data; for example, income: -400, i.e., a negative income. Sometimes we may have unrealistic and impossible combinations of data; for example, a record with Gender: Male may also be entered as Pregnant: Yes, which is obviously absurd because males do not get pregnant. We also suffer from missing values and other data anomalies. Analyzing such data, which has not been screened before analysis, can produce misleading results. Hence, data preprocessing is the first step of any data mining process.
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format, because real-world data can often be incomplete, inconsistent or even erroneous in nature. Data preprocessing resolves such issues and ensures that the subsequent data mining processes are free from errors. It is a prerequisite preparation for data mining: it prepares the raw data for the core processes.
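A minimal R sketch of this kind of cleaning, over a few made-up records containing exactly the anomalies described above (a negative income, an impossible gender/pregnancy combination, and a missing value), might look as follows.

```r
# hypothetical raw records with the kinds of anomalies described above
raw <- data.frame(
  income   = c(52000, -400, 61000, NA),
  gender   = c("Male", "Female", "Male", "Female"),
  pregnant = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

clean <- raw
# negative income is noise: mark it as missing
clean$income[!is.na(clean$income) & clean$income < 0] <- NA
# impossible combination: a male record cannot be pregnant
clean$pregnant[clean$gender == "Male"] <- "No"
# simple imputation: replace missing incomes with the mean of the known ones
clean$income[is.na(clean$income)] <- mean(clean$income, na.rm = TRUE)
clean
```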
The University Management System Example
For any university management system, a correct set of information about its students or vendors is of utmost importance in order to contact them. Hence, accurate and up-to-date student information is always maintained by a university. Correspondence sent to a wrong address would, for instance, create a bad impression of the university.
Millions of customer support centres across the globe also maintain correct and consistent data about their customers. Imagine a case where a call centre executive is not able to identify, from the phone number, the client or customer he is dealing with. Such scenarios show how an organization's reputation is at stake when it comes to the accuracy of data. On the other hand, obtaining the correct details of students or customers is a very challenging task.
Frontmatter
15 - Big Data and NoSQL
Summary
Chapter Objectives
✓ To discuss the major issues of relational databases
✓ To understand the need for NoSQL
✓ To comprehend the characteristics of NoSQL
✓ To understand different data models of NoSQL
✓ To understand the concept of the CAP theorem
✓ To discuss the future of NoSQL
After about half a century of dominance of relational databases, the current excitement about NoSQL databases comes as a big surprise. In this chapter, we'll explore the challenges faced by relational databases due to changing technological paradigms and why the current rise of NoSQL databases is not a flash in the pan.
Let us start our discussion by looking at relational databases.
The Rise of Relational Databases
Dr E. F. Codd proposed the relational model in 1969. It was soon adopted by the mainstream software industry due to its simplicity and efficiency, replacing the hierarchical and network models that were prevalent at that time. The timeline showing the rise of the relational model is depicted in Figure 15.1.
The reasons for the success of relational databases were their simplicity, the power of SQL, support for transaction management, concurrency control, and recovery management.
Major Issues with Relational Databases
The relational data model organizes data in rows and columns that are arranged in a tabular form. In the relational model, a row is known as a tuple which is a set of key-value pairs and a relation is a set of these tuples. All operations in SQL consume and return relations. This foundation based on relations provides a certain elegance and simplicity, but it also suffers some limitations. The values in a relational tuple have to be simple (atomic)—they cannot contain any structure, such as a nested record or a list.
This limitation does not apply to in-memory data structures, which can take on much richer structures than relations. As a result, if you want to use a richer in-memory data structure, you have to translate it into a relational representation to store it on disk. This problem is known as impedance mismatch, i.e., two different representations that require translation between them, as shown in Figure 15.2.
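A small R sketch can make this concrete: below, a hypothetical order is held in memory as a nested list, but storing it relationally forces it into two flat tables that must later be joined to rebuild the original view.

```r
# in-memory structure: an order with a nested list of line items
order <- list(
  id    = 1001,
  buyer = "Asha",
  items = list(
    list(product = "bread", qty = 2),
    list(product = "milk",  qty = 1)
  )
)

# relational representation: the nested structure is flattened into two flat tables
orders      <- data.frame(order_id = 1001, buyer = "Asha")
order_items <- data.frame(order_id = c(1001, 1001),
                          product  = c("bread", "milk"),
                          qty      = c(2, 1))

# rebuilding the in-memory view requires a join: the inter-translation step
merge(orders, order_items, by = "order_id")
```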
The impedance mismatch is a major source of frustration for application developers. In the 1990s many experts believed that impedance mismatch would lead to relational databases being replaced with databases that replicate the in-memory data structures to disk.
10 - Implementing Association Mining with Weka and R
Summary
Chapter Objectives
✓ To demonstrate the use of the association mining algorithm.
✓ To apply association mining on numeric data
✓ To comprehend the use of class association rules
✓ To compare the decision tree classifier with association mining
✓ To conduct association mining with R language
Association Mining with Weka
Let us consider the ‘to-play-or-not-to-play’ dataset given in Figure 10.1 for getting hands-on experience with association mining in Weka. This dataset is available as a default dataset in the data folder of Weka, with the file name weather.nominal.arff.
This dataset has four attributes describing weather conditions and a fifth, class attribute that indicates, based on the weather conditions of the day, whether Play was held or not. There are 14 instances, or samples, in this dataset.
It is important to note that in classification we are interested in assigning the output attribute the value play or no play. But in association mining we are interested in finding association rules based on the associations among all the attributes taken together. Thus, in association mining the class attribute is not given any special consideration.
If we compare this dataset with the transactions dataset discussed in the last chapter for market basket analysis, we can find an equivalence between transaction ids and the data items purchased in each transaction.
Here, the instance numbers 1 to 14 act as transaction ids, and the attribute values given in the row for a given instance act as the data items of that transaction. We are interested in finding associations by observing facts such as that Outlook = sunny AND Temperature = hot occurring together is more common than Outlook = sunny AND Temperature = cool occurring together, as shown in Figure 10.2.
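In R, the same idea can be sketched with the arules package: when a data frame of nominal attributes is coerced to transactions, every attribute = value pair becomes an item, so the mined rules may involve any attribute. The few rows below are made up in the spirit of the weather data, not the actual weather.nominal.arff file.

```r
library(arules)

# a few made-up rows in the spirit of the to-play-or-not-to-play data (all nominal)
weather <- data.frame(
  Outlook     = factor(c("sunny", "sunny", "overcast", "rainy", "rainy")),
  Temperature = factor(c("hot",   "hot",   "hot",      "mild",  "cool")),
  Humidity    = factor(c("high",  "high",  "high",     "high",  "normal")),
  Windy       = factor(c("FALSE", "TRUE",  "FALSE",    "FALSE", "FALSE")),
  Play        = factor(c("no",    "no",    "yes",      "yes",   "yes"))
)

# each attribute=value pair becomes an item, so rules may involve any attribute
trans <- as(weather, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.8))
inspect(head(sort(rules, by = "confidence")))
```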
Weka contains an Associate tab which aids in applying different association algorithms in order to find association rules from datasets. One such algorithm is the Predictive Apriori association algorithm that optimally combines support and confidence to calculate a value called predictive accuracy as depicted in Figure 10.3.
The user only needs to specify how many rules they would like the algorithm to generate, and the algorithm takes care of optimizing support and confidence to find the best rules.
11 - Web Mining and Search Engines
Summary
Chapter Objectives
✓ To understand what is meant by web mining and its types
✓ To understand the working of the HITS algorithm
✓ To know the brief history of search engines
✓ To understand a search engine's architecture and its working
✓ To understand the PageRank algorithm and its working
✓ To understand the concepts of precision and recall
Introduction
Since Berners-Lee (the inventor of the World Wide Web) created the first web page in 1991, there has been exponential growth in the number of websites worldwide. As of 2018, there were 1.8 billion websites in the world. This growth has been accompanied by another exponential increase in the amount of data available, and by the need to organize this data in order to extract useful information from it.
Early attempts to organize such data included the creation of web directories to group together similar web pages. The web pages in these directories were often manually reviewed and tagged based on keywords. As time passed, search engines became available that employed a variety of techniques to extract the required information from web pages. These techniques are called web mining. Formally, web mining is the application of data mining techniques and machine learning to find useful information in the data present in web pages.
Web mining is divided into three parts, i.e. web content mining, structure mining, and usage mining as shown in Figure 11.1.
We will discuss each type of web mining in brief.
Web Content Mining
Web content mining deals with extracting relevant knowledge from the contents of a web page. During content mining, we ignore how other web pages link to a given web page and how users interact with it. A trivial approach to web content mining is based on the location and frequency of keywords. But this gives rise to two problems: first, the problem of scarcity, and second, the problem of abundance. The problem of scarcity occurs with queries that generate few results or none at all. The problem of abundance occurs with queries that generate too many search results. The root cause of both problems is the nature of the data present on the web. The data is usually present in the form of HTML, which is semi-structured, and useful information is generally scattered across multiple web pages.
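The following minimal R sketch, over three made-up pages, shows this trivial keyword-frequency approach; ranking purely by how often query terms occur is exactly what leads to the scarcity and abundance problems described above.

```r
# hypothetical pages and a keyword query, scored by simple keyword frequency
pages <- c(
  page1 = "data mining extracts patterns from data data data",
  page2 = "web mining applies data mining to web pages",
  page3 = "holiday photos and recipes"
)
query <- c("data", "mining")

score <- sapply(pages, function(txt) {
  words <- strsplit(tolower(txt), "\\s+")[[1]]
  sum(words %in% query)          # count occurrences of the query keywords
})
sort(score, decreasing = TRUE)   # naive ranking: keyword frequency only
```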
Contents
Preface
Summary
In the modern age of artificial intelligence and business analytics, data is considered the oil of the cyber world. The mining of data has huge potential to improve business outcomes, and to carry out this mining there is a growing demand for database mining experts. This book intends to train learners to fill this gap.
This book will give learners sufficient information to acquire mastery over the subject. It covers the practical aspects of data mining, data warehousing, and machine learning in a simplified manner without compromising on the details of the subject. The main strength of the book is the illustration of concepts with practical examples so that learners can grasp the contents easily. Another important feature of the book is the illustration of data mining algorithms with practical hands-on sessions on Weka and the R language (a major data mining tool and language, respectively). In this book, every concept has been illustrated through a step-by-step approach in tutorial form for self-practice in Weka and R. This textbook includes many pedagogical features, such as chapter-wise summaries, exercises including probable problems, a question bank, and relevant references, to provide sound knowledge to learners. It provides students a platform to obtain expertise in the technology, for better placements.
Video sessions on data mining, machine learning, big data and DBMS are also available on my YouTube channel. Learners are requested to subscribe to this channel https://www.youtube.com/user/parteekbhatia to get the latest updates through video sessions on these topics.
Your suggestions for further improvements to the book are always welcome. Kindly e-mail your suggestions to parteek.bhatia@gmail.com.
I hope you enjoy learning from this book as much as I enjoyed writing it.
7 - Cluster Analysis
Summary
Chapter Objectives
✓ To comprehend the concept of clustering, its applications, and features.
✓ To understand various distance metrics for clustering of data.
✓ To comprehend the process of K-means clustering.
✓ To comprehend the process of hierarchical clustering algorithms.
✓ To comprehend the process of the DBSCAN algorithm.
Introduction to Cluster Analysis
Generally, in the case of large datasets, data is not labeled because labeling a large number of records requires a great deal of human effort. The unlabeled data can be analyzed with the help of clustering techniques. Clustering is an unsupervised learning technique which does not require a labeled dataset.
Clustering is defined as the grouping of a set of similar objects into classes or clusters. In other words, during cluster analysis the data is grouped into classes or clusters so that records within a cluster (intra-cluster) have high similarity with one another but are highly dissimilar to objects in other clusters (inter-cluster), as shown in Figure 7.1.
The similarity of records is identified on the basis of the values of the attributes describing the objects. Cluster analysis is an important human activity. The first human beings, Adam and Eve, actually learned through a process of clustering. They did not know the name of any object; they simply observed each and every object and, based on the similarity of their properties, identified these objects in groups or clusters. For example, one group or cluster was named trees, another fruits, and so on. They further classified the fruits on the basis of properties such as size, colour, shape and taste. After that, people assigned labels or names to these objects, calling them mango, banana, orange, and so on. And finally, all objects were labeled. Thus, we can say that the first human beings used clustering for their learning, and they made clusters or groups of physical objects based on the similarity of their attributes.
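Since similarity is judged from attribute values, it is usually quantified with a distance metric. Here is a minimal R sketch, using two made-up records, of the Euclidean and Manhattan distances that clustering algorithms commonly rely on.

```r
# two small records described by numeric attributes (made-up values)
x <- c(age = 25, income = 40)
y <- c(age = 30, income = 55)

# common distance metrics used to judge how similar two records are
euclidean <- sqrt(sum((x - y)^2))   # straight-line distance
manhattan <- sum(abs(x - y))        # city-block distance

# the same metrics via dist() on a small matrix of records
m <- rbind(x, y)
dist(m, method = "euclidean")
dist(m, method = "manhattan")
```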
Applications of Cluster Analysis
Cluster analysis has been widely used in various important applications such as:
• Marketing: It helps marketers find out distinctive groups among their customer bases, and this knowledge helps them improve their targeted marketing programs.
• Land use: Clustering is used for identifying areas of similar land use from the databases of earth observations.
• Insurance: Clustering is helpful for recognizing clusters of insurance policyholders with a high regular claim cost.
6 - Implementing Classification in Weka and R
Summary
Chapter Objectives
✓ To demonstrate the use of the decision tree
✓ To apply the decision tree on a sample dataset
✓ To implement a decision tree process using Weka and R
Building a Decision Tree Classifier in Weka
In this chapter, we will learn how Weka's decision tree feature helps to classify unknown samples of a dataset based on their attribute values. When Weka's decision tree is applied to an unknown sample, it classifies the sample into one of the different classes, such as Class A, Class B or Class C, as shown in Figure 6.1.
For example, suppose we want to predict the class of an unknown flower sample based on the length and width of its Sepal and Petal. The first step would be to measure the Sepal length and width and the Petal length and width of the unknown flower and compare these dimensions with the values of the samples in our dataset of known species. The decision tree algorithm of Weka will help in creating decision rules to predict the class of the unknown flower automatically, as shown in Figure 6.2.
As shown in Figure 6.2, the dimensions of an unknown flower sample will be matched against the rules generated by the decision tree. First, the rules will be matched to determine whether the sample belongs to the Setosa class or not; if yes, the unknown sample will be classified as Setosa. If not, the unknown sample will be checked against the Virginica class. If it matches the conditions of the Virginica class, it will be labeled as Virginica; otherwise, Versicolor. It is important to note that it would not be simple to create these rules on the basis of the values of a single attribute, as shown in Table 6.1. It is clear that for the same Sepal width the flower may be Setosa, Versicolor or Virginica, making it unclear which species an unknown flower belongs to on the basis of Sepal width alone. Thus, the decision tree must make its prediction based on all four flower dimensions.
Due to such overlaps, the decision tree cannot predict with 100% accuracy the class of flower, but can only determine the likelihood of an unknown sample belonging to a particular class. In real situations the decision tree algorithm works on the basis of probability.
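A minimal R sketch of the same workflow is shown below; it uses the rpart package, which builds a CART-style tree rather than Weka's J48 (C4.5), and a made-up unknown flower, so it is only an illustration of the idea.

```r
library(rpart)

# fit a decision tree on the labeled Iris data
model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris, method = "class")
print(model)   # shows the learned decision rules

# classify a hypothetical unknown flower
unknown <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.9,
                      Petal.Length = 4.7, Petal.Width = 1.4)
predict(model, unknown, type = "class")   # predicted species
predict(model, unknown, type = "prob")    # likelihood of each class, as noted above
```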
2 - Introduction to Data Mining
Summary
Chapter Objectives
✓ To learn about the concepts of data mining
✓ To understand the need for, and the applications of, data mining
✓ To understand the difference between data mining and machine learning
✓ To understand the process of data mining
Introduction to Data Mining
In the age of information, an enormous amount of data is available in different industries and organizations. The availability of this massive data is of no use unless it is transformed into valuable information; otherwise, we are sinking in data but starving for knowledge. The solution to this problem is data mining, which is the extraction of useful information from the huge amount of data that is available.
Data mining is defined as follows:
‘Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so they may be used in an enterprise's decision making.’
From this definition, the important takeaways are:
• Data mining is a process of automated discovery of previously unknown patterns in large volumes of data.
• This large volume of data is usually the historical data of an organization known as the data warehouse.
• Data mining deals with large volumes of data, in gigabytes or terabytes, and sometimes as much as zettabytes (in the case of big data).
• Patterns must be valid, novel, useful and understandable.
• Data mining allows businesses to determine historical patterns to predict future behaviour.
• Although data mining is possible with smaller amounts of data, the bigger the data the better the accuracy in prediction.
• There is considerable hype about data mining at present, and the Gartner Group has listed data mining as one of the top ten technologies to watch.
Need for Data Mining
Data mining is a recent buzzword in the field of computer science. It is a computing process that uses intelligent mathematical algorithms to extract relevant data and compute the probability of future actions. It is also known as Knowledge Discovery in Data (KDD).
13 - Data Warehouse Schema
Summary
Chapter Objectives
✓ To understand the concept of dimension, measure and the fact table
✓ To be able to apply different data warehouse schema designs, such as the Star schema, Snowflake schema and Fact Constellation schema, to real-world applications
✓ To understand the differences between these schemas, and their strengths and weaknesses
Introduction to Data Warehouse Schema
The logical description of a database is known as its schema. It is the blueprint of the entire database: it defines how the data is organized and how the relations among the data are associated. A data warehouse schema consists of the names and descriptions of records, including the associated data items and aggregates. A database uses the relational model, whereas a data warehouse uses different types of schema, namely Star, Snowflake, and Fact Constellation.
To start the discussion of these schemas, it is important to understand the basic terminology involved, which is discussed below.
Dimension
The term ‘dimension’ in data warehousing refers to a collection of reference information about a measurable event. These events are stored in a fact table and are known as facts. The dimensions are generally the entities for which an organization wants to preserve records. A data warehouse organizes the descriptive attributes as columns in dimension tables. For example, a student dimension could consist of first and last name, roll number, age and gender attributes, while an address dimension could include street name, state and country attributes.
A dimension table contains a primary key column that uniquely identifies each record (row) of the dimension. A dimension is a framework that consists of one or more hierarchies that classify data. Usually dimensions are de-normalized tables and may contain redundant data.
Let us take a quick recap of the concepts of normalization and de-normalization, as they will be used in this chapter. Normalization is the process of breaking up a larger table into smaller tables free of any possible insertion, update or deletion anomalies. Normalized tables have reduced data redundancy. To obtain the full information, these tables usually have to be joined.
In de-normalization, smaller tables are merged to form larger tables in order to reduce joining operations. De-normalization is particularly performed in cases where retrieval is the major requirement and insert, update, and delete operations are minimal, as in the case of historical data or a data warehouse. These de-normalized tables will contain redundant data.
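As a rough illustration (in R data frames, with made-up columns rather than the book's examples), the snippet below sets up two small dimension tables and a fact table in a star-schema-like layout and then de-normalizes them into one wide table by joining.

```r
# dimension tables: descriptive reference information, one row per entity
dim_product <- data.frame(product_id = c(1, 2),
                          name       = c("Cycle", "Scooter"),
                          category   = c("Non-motorized", "Motorized"))
dim_store   <- data.frame(store_id = c(10, 20),
                          city     = c("Delhi", "Mumbai"))

# fact table: measurable events (measures), keyed by the dimension primary keys
fact_sales  <- data.frame(product_id = c(1, 1, 2),
                          store_id   = c(10, 20, 10),
                          units_sold = c(5, 3, 2))

# de-normalized view: join the dimensions back onto the facts for easy retrieval
wide <- merge(merge(fact_sales, dim_product, by = "product_id"),
              dim_store, by = "store_id")
wide
```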