CSM 6404: Data Mining
Dr. Md Geaur Rahman
Director, PGD in ICT
&
Associate Professor in Computer Science
Department of Computer Science and Mathematics
Bangladesh Agricultural University
Email: gea_bau@yahoo.com
Lecture 1
Outline
Definition, motivation & application
Branches of data mining
Major issues in data mining
What Is Data Mining?
Data mining (knowledge discovery in databases):
◦ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
Alternative names and their “inside stories”:
◦ Data mining: a misnomer?
◦ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms
◦ Exploratory data analysis
◦ Data driven discovery
◦ Deductive learning
Motivation:
Data explosion problem
◦ Automated data collection tools and mature database technology lead to
tremendous amounts of data stored in databases, data warehouses and
other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
◦ Data warehousing and on-line analytical processing
◦ Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
◦ Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Let us look at some examples of data sources
Netflix
Amazon
Wal-Mart
Algorithmic Trading/High Frequency Trading
Banks (Segmint)
Google/Yahoo/Microsoft/IBM
CRM/Consumer Behavior Profiling
Consumer Review
Mobile Ads
Social Network (Facebook/Twitter/Google+)
…
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
◦ Remote sensors on a satellite
◦ Telescopes scanning the skies
◦ Microarrays generating gene expression data
◦ Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
◦ in classifying and segmenting data
◦ in Hypothesis Formation
Examples: What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Database Processing vs. Data Mining
Database Processing
◦ Query: well defined; SQL
◦ Data: operational data
◦ Output: precise; subset of database
Data Mining
◦ Query: poorly defined; no precise query language
◦ Data: not operational data
◦ Output: fuzzy; not a subset of database
Evolution of Database Technology
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last
month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with milk. (association
rules)
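The contrast in the examples above can be sketched in a few lines of Python. The `purchases` list and its field names are made up for illustration. The first function is a well-defined lookup whose answer is a subset of the stored records; the second derives co-purchase counts that are not stored anywhere in the data.

```python
from collections import Counter

# Hypothetical toy data: each record is one customer's basket.
purchases = [
    {"customer": "C1", "items": {"milk", "bread", "eggs"}},
    {"customer": "C2", "items": {"milk", "bread"}},
    {"customer": "C3", "items": {"beer", "chips"}},
    {"customer": "C4", "items": {"milk", "bread", "butter"}},
]

def customers_who_bought(item):
    """Database-style query: a precise filter returning a subset of the records."""
    return [p["customer"] for p in purchases if item in p["items"]]

def items_bought_with(item):
    """Mining-style question: count how often other items share a basket with `item`."""
    counts = Counter()
    for p in purchases:
        if item in p["items"]:
            counts.update(p["items"] - {item})
    return counts

milk_buyers = customers_who_bought("milk")
co_purchases = items_bought_with("milk")
print(milk_buyers)
print(co_purchases.most_common(1))
```

Deciding which of the counted items qualify as "frequently purchased with milk" still needs a threshold chosen by the analyst, which is part of what makes the mining query less precisely defined than the database query.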
Potential Applications
Data analysis and decision support
◦ Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
◦ Risk analysis and management
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
◦ Fraud detection and detection of unusual patterns (outliers)
Other Applications
◦ Text mining (news group, email, documents) and Web mining
◦ Stream data mining
◦ Bioinformatics and bio-data analysis
Ex.: Market Analysis and Management
Where does the data come from?—Credit card
transactions, loyalty cards, discount coupons, customer
complaint calls, surveys …
Target marketing
◦ Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
E.g. Most customers with income level 60k – 80k with food expenses $600 - $800
a month live in that area
◦ Determine customer purchasing patterns over time
E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k
usually buy this type of CD player
Cross-market analysis: find associations/correlations between
product sales, and predict based on such associations
◦ E.g. Customers who buy computer A usually buy software B
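A statement such as "customers who buy computer A usually buy software B" is commonly quantified with two standard measures, support and confidence. The slides do not name these measures; the sketch below assumes them, and the transactions are invented for illustration.

```python
# Hypothetical transactions; item names follow the example on the slide.
transactions = [
    {"computer A", "software B"},
    {"computer A", "software B", "printer"},
    {"computer A"},
    {"printer"},
    {"software B"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer A", "software B"}, transactions)
rule_confidence = confidence({"computer A"}, {"software B"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here "usually" corresponds to a high confidence value; what counts as high enough is a threshold set by the analyst.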
Ex.: Market Analysis and Management (2)
Customer requirement analysis
◦ Identify the best products for different customers
◦ Predict what factors will attract new customers
Provision of summary information
◦ Multidimensional summary reports
E.g. Summarize all transactions of the first quarter from three different branches
Summarize all transactions of last year from a particular branch
Summarize all transactions of a particular product
◦ Statistical summary information
E.g. What is the average age for customers who buy product A?
Fraud detection
◦ Find outliers of unusual transactions
Financial planning
◦ Summarize and compare the resources and spending
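As one possible illustration of finding unusual transactions, the sketch below flags amounts that lie far from the mean, measured in standard deviations (a z-score). This is only one simple technique among many and is not prescribed by the slides; the amounts and the threshold of 2.0 are assumptions.

```python
import statistics

def flag_outliers(amounts, threshold=2.0):
    """Return the amounts whose absolute z-score exceeds `threshold`."""
    mean = statistics.mean(amounts)
    spread = statistics.pstdev(amounts)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [a for a in amounts if abs(a - mean) / spread > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = flag_outliers(amounts)
print(flagged)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a z-score test can miss outliers; median-based measures are a common, more robust alternative.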
Data Mining Tasks
Prediction Tasks
◦ Use some variables to predict unknown or future values of other
variables
Description Tasks
◦ Find human-interpretable patterns that describe the data.
Common data mining tasks
◦ Classification [Predictive]
◦ Clustering [Descriptive]
◦ Association Rule Discovery [Descriptive]
◦ Sequential Pattern Discovery [Descriptive]
◦ Regression [Predictive]
◦ Deviation Detection [Predictive]
Data Mining Models and Tasks
Decisions in Data Mining
Databases to be mined
◦ Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
Knowledge to be mined
◦ Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
◦ Multiple/integrated functions and mining at multiple levels
Techniques utilized
◦ Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
Applications adapted
◦ Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Knowledge Discovery (KDD) Process
KDD Process: Several Key Steps
Learning the application domain
◦ relevant prior knowledge and goals of application
Identifying a target data set: data selection
Data processing
◦ Data cleaning (remove noise and inconsistent data)
◦ Data integration (multiple data sources may be combined)
◦ Data selection (data relevant to the analysis task are retrieved from database)
◦ Data transformation (data transformed or consolidated into forms appropriate for
mining)
(Done with data preprocessing)
◦ Data mining (an essential process where intelligent methods are applied to extract
data patterns)
◦ Pattern evaluation (identify the truly interesting patterns)
◦ Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
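The steps above can be sketched as a chain of small functions, one per step. This is a toy outline under stated assumptions, not a real KDD system: the two data sources, the field names, the age-band transformation and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "age": 25, "item": "cd player"},
           {"id": 2, "age": None, "item": "tv"}]
store_b = [{"id": 3, "age": 27, "item": "cd player"},
           {"id": 4, "age": 61, "item": "radio"}]

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: replace the exact age with a ten-year age band."""
    return [{"age_band": f"{r['age'] // 10 * 10}s", "item": r["item"]}
            for r in records]

def mine(records):
    """Data mining: count (age band, item) pairs as candidate patterns."""
    return Counter((r["age_band"], r["item"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least `min_count` times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Knowledge presentation: render each pattern as a readable sentence."""
    return [f"Customers in their {band} bought '{item}' {count} times"
            for (band, item), count in sorted(patterns.items())]

combined = integrate(clean(store_a), clean(store_b))
data = transform(select(combined, ["age", "item"]))
report = present(evaluate(mine(data), min_count=2))
print(report)
```

In practice the process is iterative: results from pattern evaluation often send the analyst back to selection or transformation.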
Data Mining and Business Intelligence
Layers, from top to bottom (potential to support business decisions increases toward the top; typical role in parentheses):
◦ Decision Making (End User)
◦ Data Presentation: Visualization Techniques (Business Analyst)
◦ Data Mining: Information Discovery (Data Analyst)
◦ Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
◦ Data Preprocessing/Integration, Data Warehouses (DBA)
◦ Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
A typical DM System Architecture
Database, data warehouse, WWW or other information
repository (store data)
Database or data warehouse server (fetch and
combine data)
Knowledge base (turn data into meaningful groups
according to domain knowledge)
Data mining engine (perform mining tasks)
Pattern evaluation module (find interesting patterns)
User interface (interact with the user)
A typical DM System Architecture (2)
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics,
and database systems
Traditional techniques may be unsuitable due to
◦ Enormity of data
◦ High dimensionality of data
◦ Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Major Issues in Data Mining
Mining methodology and User interaction
◦ Mining different kinds of knowledge
DM should cover a wide spectrum of data analysis and knowledge discovery tasks
Enable the database to be used in different ways
Require the development of numerous data mining techniques
◦ Interactive mining of knowledge at multiple levels of abstraction
Difficult to know exactly what will be discovered
Allow users to focus the search, refine data mining requests
◦ Incorporation of background knowledge
Guide the discovery process
Allow discovered patterns to be expressed in concise terms and different levels of
abstraction
◦ Data mining query languages and ad hoc data mining
High-level query languages need to be developed
Should be integrated with a DB/DW query language
Major Issues in Data Mining (Contd..)
◦ Presentation and visualization of results
Knowledge should be easily understood and directly usable
High level languages, visual representations or other expressive forms
Require the DM system to adopt the above techniques
◦ Handling noisy or incomplete data
Require data cleaning methods and data analysis methods that can handle noise
◦ Pattern evaluation – the interestingness problem
How to develop techniques to assess the interestingness of discovered patterns, especially
with subjective measures based on user beliefs or expectations
Major Issues in Data Mining (contd..)
Performance Issues
◦ Efficiency and scalability
Huge amount of data
Running time must be predictable and acceptable
◦ Parallel, distributed and incremental mining algorithms
Divide the data into partitions and process them in parallel
Incorporate database updates without having to mine the entire data again from
scratch
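A minimal sketch of both ideas, using simple item counting as a stand-in for a real mining algorithm: the data is split into partitions that are counted separately and merged, and a later update is folded into the existing counts instead of recounting everything. The function names and data are hypothetical. Threads are used only to keep the example short; in CPython a process pool or a distributed framework would be needed for real CPU parallelism.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Count item occurrences within one partition of the transactions."""
    counts = Counter()
    for basket in partition:
        counts.update(basket)
    return counts

def mine_in_partitions(transactions, n_partitions):
    """Split the data, count each partition concurrently, then merge the partial counts."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

def apply_update(total, new_transactions):
    """Incremental step: fold new transactions into existing counts without a full rescan."""
    total.update(count_items(new_transactions))
    return total

# Hypothetical transactions and a later database update.
transactions = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
new_transactions = [["milk", "butter"]]

counts = mine_in_partitions(transactions, n_partitions=2)
before = dict(counts)
counts = apply_update(counts, new_transactions)
print(before, dict(counts))
```

Counts merge cleanly because they are additive across partitions; many mining algorithms need more care to combine partial results or to handle updates that invalidate earlier patterns.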
Diversity of Database Types
◦ Other databases that contain complex data objects, multimedia data,
spatial data, etc.
◦ Expect to have different DM systems for different kinds of data
◦ Heterogeneous databases and global information systems
Web mining becomes a very challenging and fast-evolving field in data mining
Thank you