Concepts and Techniques
(3rd ed.)
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Adapted for CSE 347-447, Lecture 1b, Spring 2015
Introduction
- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kind of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Technologies Are Used?
- What Kind of Applications Are Targeted?
- Major Issues in Data Mining
- A Brief History of Data Mining and Data Mining Society
- Summary
Why Data Mining?
- The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Evolution of Sciences: New Data Science Era
- Before 1600: Empirical science
- 1600-1950s: Theoretical science
  - Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
- 1950s-1990s: Computational science
  - Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  - Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
- 1990-now: Data science
  - The flood of data from new scientific instruments and simulations
  - The ability to economically store and manage petabytes of data online
  - The Internet and computing Grid that make all these archives universally accessible
  - Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
  - Data mining is a major new challenge!
- Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything “data mining”?
  - Simple search and query processing
  - (Deductive) expert systems
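To make the "watch out" point concrete, here is a minimal Python sketch contrasting a simple query (retrieving records that match a condition already known to the user) with a small pattern-discovery step (finding item pairs that co-occur often, which the user did not specify in advance). The transactions, the function names query_contains and frequent_pairs, and the min_support threshold are all hypothetical choices made for this illustration; they are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def query_contains(item):
    """Simple query: return the transactions that contain a known item."""
    return [t for t in transactions if item in t]

def frequent_pairs(min_support):
    """Pattern discovery: item pairs found in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(len(query_contains("beer")))
print(frequent_pairs(min_support=3))
```

The query answers a question the user already knew how to ask; the pair counting surfaces co-occurrences that were implicit in the data. Real frequent-pattern mining uses far more scalable algorithms than this brute-force count.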
Knowledge Discovery (KDD) Process
- This is a view from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
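The stages named in the KDD figure can be sketched as a chain of small functions. This is only an illustrative sketch under stated assumptions: the two source lists, the record fields, and every function name (clean, integrate, select, mine, evaluate) are hypothetical, and the "mining" step is a plain frequency count standing in for a real mining algorithm.

```python
from collections import Counter

# Two hypothetical source "databases" (made-up records for illustration).
store_a = [{"customer": "c1", "item": "milk"},
           {"customer": "c2", "item": None}]
store_b = [{"customer": "c1", "item": "bread"},
           {"customer": "c3", "item": "milk"},
           {"customer": "c3", "item": "milk"}]

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine the sources into one 'warehouse' list."""
    return [r for source in sources for r in source]

def select(warehouse, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in warehouse]

def mine(values):
    """Data mining (stand-in): count how often each value occurs."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a frequency threshold."""
    return {p: n for p, n in patterns.items() if n >= min_count}

warehouse = integrate(clean(store_a), clean(store_b))
task_relevant = select(warehouse, "item")
patterns = evaluate(mine(task_relevant), min_count=2)
print(patterns)
```

Each function corresponds to one box in the figure, so the order of the calls mirrors the flow from databases to evaluated patterns.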
Example: A Web Mining Framework
- Web mining usually involves
  - Data cleaning
  - Data integration from multiple sources
  - Warehousing the data
  - Data cube construction
  - Data selection for data mining
  - Data mining
  - Presentation of the mining results
  - Patterns and