Data Warehousing and OLAP

Description
DW collects information about subjects that span the entire organization, such as customers, products, sales, assets, and personnel. Its scope is enterprise-wide.

Data Warehousing and OLAP Data Warehousing and OLAP
Lecture 2/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia
Objectives Objectives
Motivation: Why data warehouse?
What is a data warehouse?
Why separate DW? y p
Conceptual modeling of DW
Data Mart Data Mart
Data Warehousing Architectures
Data Warehouse Development Data Warehouse Development
Data Warehouse Vendors
R l DW Real-time DW
2
Motivation: Why data warehouse? Motivation: Why data warehouse?
Construction of data warehouses (DW) involves data
cleaning and data integration important
preprocessing step for data mining (DM).
DW provide OLAP for the interactive analysis of
multidimensional data, which facilitates effective DM. ,
Data mining functions can be integrated with OLAP
operations to enhance interactive mining of knowledge. operations to enhance interactive mining of knowledge.
DW will provide an effective platform for DM.
Whil DW t i t t d DM DW t While DWs are not requirements to do DM, DW store
massive amounts of data that can be uses for DM. [DO]
3
What is a data warehouse? [JH] What is a data warehouse? [JH]
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately
from the organization’s ODB.
Support information processing by providing a solid platform
of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision-making process.” —
W. H. Inmon
Case Study 2: Continental Airlines flies high with its
real-time data warehouse
4
What is a data warehouse? [ET] What is a data warehouse? [ET]
Data warehouse
A physical repository where relational data are specially
organized to provide enterprise-wide, cleansed data in a
standardized format.
Characteristics
Subject oriented, Integrated, Time Variant, Non-volatile
Web-based, Relational/multidimensional, Client/server, Real-time
Include metadata
Data warehousing
Process of constructing and using data warehouses.
Requires data integration, data cleaning, and data consolidation.
5
Subject Oriented Subject Oriented
Organized around major subjects, such as Organized around major subjects, such as
customer, product, sales.
P id i l d i i d Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
Focusing on the modeling and analysis of data Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing transaction processing.
6
Integrated Integrated
Integrate multiple, heterogeneous data sources
Relational databases, flat-files, on-line transaction records
Data cleaning and data integration techniques are g g q
applied
Ensure consistency in naming conventions, encoding Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
Wh d i d h h i i d When data is moved to the warehouse, it is converted.
7
Time Variant Time Variant
The time horizon for the data warehouse is significantly
longer than that of operational systems.
Operational database: current value data. Operational database: current value data.
Data warehouse data: provide information from a historical
perspective (e g past 5-10 years) perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time
element”.
8
Non volatile Non-volatile
A physically separate store of data transformed from the p y y p
operational environment.
O i l d f d d i h d Operational update of data does not occur in the data
warehouse environment.
Does not require transaction processing, recovery, and
concurrency control mechanisms y
Requires only two operations in data accessing:
i i i l l di f d d f d initial loading of data and access of data.
9
Data Warehouse vs Heterogeneous DBMS Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
Build wrappers/mediators on top of multiple, heterogeneous databases.
Ex: IBM Data Joiner, Informix DataBlade
Q d i h Query driven approach:
When a query is posed to a client site, a metadata-dictionary is used
to translate the query into queries appropriate for the individual to translate the query into queries appropriate for the individual
heterogeneous sites involved. There queries are then mapped and sent
to local query processors. The results returned from the different
d l b l sites are integrated into a global answer set.
Complex information filtering and integration processes, compete for
resources resources.
Inefficient and potentially expensive for frequent queries, especially for
queries requireing aggregations. q q g gg g
10
Data Warehouse vs Heterogeneous DBMS (2) Data Warehouse vs. Heterogeneous DBMS (2)
Using DW update-driven approach
Information from multiple, heterogeneous sources is integrated in advance
and stored in a warehouse for direct querying and analysis.
Unlike OLTP DW do not contain the most current information Unlike OLTP, DW do not contain the most current information.
DW brings high performance to the integrated heterogeneous
DB system since data are copied preprocessed integrated DB system since data are copied, preprocessed, integrated,
annotated, summarized, and restructured into one data store.
Query processing in DW does not interfere with the processing Query processing in DW does not interfere with the processing
at local sources
DW can store and integrate historical information and support g pp
complex multidimensional queries.
11
DW vs ODB DW vs. ODB
Major task of ODB OLTP:
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
DW f d l i d d i i ki OLAP DWserve for data analysis and decision making OLAP
Distinct Features (OLTP vs. OLAP)
U d i i k User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs star + subject Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries Access patterns: update vs. read only but complex queries
12
OLTP vs OLAP OLTP vs OLAP
OLTP OLAP
users Clerk IT professional Knowledgeworker users Clerk, IT professional Knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date
detailed, flat relational
isolated
historical,
summarized, multidimensional
integrated, consolidated
usage repetitive ad-hoc
access read/write
index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds #users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
13
Why Separate DW? Why Separate DW?
High performance for both systems: g p y
DBMS — tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse —tuned for OLAP: complex OLAP queries,
computation of large groups of data at summarized levels,
multidimensional view, consolidation. ,
Processing OLAP queries in operational databases would
degrade the performance of operational tasks.
In ODB, concurrency control and recovery mechanisms
(locking, logging) are required to ensure the consistency
d b f i and robustness of transactions.
OLAP read only access. No need for concurrency
control and recovery control and recovery.
14
Why Separate DW? (2) Why Separate DW? (2)
Different functions and different data:
missing data: Decision support requires historical data which
operational DBs do not typically maintain. So, data in ODB is
usually far from complete for decision making. y p g
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources. ODB
t i d t il d d t (t ti ) hi h d t b contain detailed raw data (transactions) which need to be
consolidated before analysis.
data quality: different sources typically use inconsistent data q y yp y
representations, codes and formats which have to be
reconciled.
15
Conceptual Modeling of DW Conceptual Modeling of DW
Data Cube:
see TSBD Lecture Notes on Visualization of Data Cubes
M d li d t h di i & t Modeling data warehouses: dimensions & measurements
Star schema: A single object (fact table) in the middle connected
to a number of objects (dimension tables one for each to a number of objects (dimension tables, one for each
dimension).
Snowflake schema: A refinement of star schema where the
dimensional hierarchy is represented explicitly by normalizing
the dimension tables.
Fact constellations: Multiple fact tables share dimension tables.
Also known as galaxy schema
16
Example of Star Schema Example of Star Schema
Date
Product
Day
Month
Year
Sales Fact Table
Date
ProductNo
ProdName
ProdDesc
C
Date
Product
Store
Category
QOH
Store
CustId
C tN
Cust
Store
Customer
StoreID
City
State
CustName
CustCity
CustCountry
unit_sales
dollar_sales
State
Country
Region
Yen_sales
Measurements
Potensi Redundansi
Bandung, Bogor keduanya
17
ada di Jawa Barat
Snowflake Schema Snowflake Schema
Product Year
Day
Date
Sales Fact Table
ProductNo
ProdName
ProdDesc
Month
Year
Month
Year
Day
Month
Date
Product
ProdDesc
Category
QOH
Year
CustId
Cust
Store
Customer
City StoreID
Cit
Store
CustId
CustName
CustCity
CustCountry
unit_sales
dollar sales
City
State
State
State
City
CustCountry
_
Yen_sales
Country
Region
Country
State
Country
18
Measurements
Region
View of Warehouses and Hierarchies View of Warehouses and Hierarchies
Importing data
Table Browsing
Dimension creation
Dimension browsing
Cube buildingg
Cube browsing
19
Data Cube Data Cube
Total annual sales
D t
Total annual sales
of TV in U.S.A.
Date
sum
TV
PC
1Qtr 2Qtr
3Qtr 4Qtr
U S A
r
y
sum
VCR
PC U.S.A
Canada
C
o
u
n
t
Ca ada
Mexico
sum
20
Data Cube Data Cube
Visualization
OLAP capabilities
21
p
Interactive manipulation
Typical OLAP Operations Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data or from higher level summary to lower level summary or detailed data, or
introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
d ill i l i ( ) th f t t bl drill across: involving (across) more than one fact table.
drill through: through the bottom level to its back-end relational tables.
More info: More info:
www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org
22
Data Mart Data Mart
DW collects information about subjects that span the
entire organization, such as customers, products, sales, assets,
and personnel. Its scope is enterprise-wide.
For DW, fact constellation schema is commonly used
since it can model multiple, interrelated subjects.
Data Mart is a subset of a DW, focuses on a particular
subject. Its scope is department-wide. Typically, a data mart
f l b ( k consisting of a single subject area (e.g. marketing,
operations).
For Data Mart, star or snowflake schema are commonly
used since both are geared towards modeling single
bj t lth h th t h i l subjects, although the star schema is more popular.
23
Data Mart Data Mart
A data mart can be either dependent or independent.
A dependent data mart is a subset that is created directly
from the DW.
Consistent data model
Providing quality data
DW must be constructed first
Ensures that the user viewing the same version of the data that
d b ll h d h are accessed by all other data warehouse users
An independent data mart is a small warehouse designed
f d d i i EDW for a department, and its source is not an EDW.
24
Data Warehousing Process Overview Data Warehousing Process Overview
25
Data Warehousing Process Overview Data Warehousing Process Overview
The major components of a data warehousing process
Data sources
Legacy systems, external data providers (e.g. BPS), OLTP,
ERP Systems
Data extraction
Data loading
Comprehensive database
Metadata
Middleware tools
26
Data Warehousing Architectures Data Warehousing Architectures
27
Data Warehousing Architectures Data Warehousing Architectures
28
Data Warehousing Architectures Data Warehousing Architectures
29
Data Warehousing Architectures Data Warehousing Architectures
30
Data Integration and the ETL Process Data Integration and the ETL Process
Various integration technologies:
Enterprise Application Integration (EAI)
A technology that provides a vehicle for pushing data from source
t i t d t h systems into a data warehouse
Integrating application functionality and is focused on sharing
functionality across systems
Traditionally, API. Nowadays, SOA (web services).
Enterprise Information Integration (EII)
An evolving tool space that promises real-time data integration from
a variety of sources, such as relational databases, Web services, and
multidimensional databases
A mechanism for pulling data from source systems to satisfy a request
for information.
31
Data Integration and the ETL Process Data Integration and the ETL Process
ETL
60-70% of the time in a data-centric project.
Extraction: Reading data from one or more databases
Transformation Transformation
Converting the extracted data from its previous form into the form in
which it needs to be so that it can be placed into a DW
Load
Putting the
d data into
the DW
32
Data Warehouse Development Data Warehouse Development
Direct benefits
Allowing end users to perform extensive analysis in numerous
ways
A f ( f A consolidated view of corporate data (i.e a single version of
the truth)
Better and more timely information Better and more timely information
Enhanced system performance. DW frees production
processing because some operational system reporting processing because some operational system reporting
requirements are moved to DSS
Simplification of data access
33
Data Warehouse Development Data Warehouse Development
Some best practices for implementing a DW (Weir, 2002):
Project must fit with corporate strategy and business objectives
There must be complete buy-in to the project by executives,
managers and users managers, and users
It is important to manage user expectations about the completed
project
The data warehouse must be built incrementally
Build in adaptability
M d b b h IT d b i f i l Managed by both IT and business professionals
Develop a business/supplier relationship
O l l d d t th t h b l d d f lit Only load data that have been cleansed and are of a quality
understood by the organization
Do not overlook training requirements Do not overlook training requirements
Be politically aware
34
Data Warehouse Vendors Data Warehouse Vendors
Computer Associates Microsoft
DataMirror
Data Advantage Group
Oracle
SAS g p
Dell Computer
Embarcadero Technologies
Siemens
Sybase Embarcadero Technologies
Business Objects
HP
Sybase
Teradata
Please visit: HP
Hummingbird
H
Please visit:
Data Warehousing Institute
(tdwi com)
Hyperion
IBM
(tdwi.com)
DM Review (dmreview.com)
Informatica
35
Data Warehouse Vendors Data Warehouse Vendors
Six guidelines to considered when developing a g p g
vendor list:
1 Financial strength 1. Financial strength
2. ERP linkages
Q lifi d l 3. Qualified consultants
4. Market share
5. Industry experience
6. Established partnerships p p
36
Real time DW Real-time DW
Traditionally, updated on a weekly basis.
Unsuitable for some businesses.
Real-time (active) data warehousing ( ) g
The process of loading and providing data via a data
warehouse as they become available y
Levels of data warehouses:
1. Reports what happened 1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities, p p ,
4. Operationalization
5. Becomes capable of making events happen p g pp
37
Real time DW Real-time DW
38
Real time DW Real-time DW
39
From DW to DM [JH] From DW to DM [JH]
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
i i lt i i li ti t l mining results using visualization tools.
40
References References
[JH] Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, 2001.
[ET] Efraim Turban et al., Decision Support and Business
Intelligence Systems, Pearson, 2007.
[DO] David Olson and Yong Shi, Introduction to Business
Data Mining, McGraw-Hill, 2007.
41

doc_525793673.pdf
 

Attachments

Back
Top