Building a Business Intelligence System with the Pentaho BI Suite

Description
Virtual data is an ever increasing part of any large organization today. The data can be anything from transactional data coming from the day-to-day operation to customer and product information.




Building a Business Intelligence
System with the Pentaho BI Suite















J O H A N S Ö D E R L U N D










































Master of Science Thesis
Stockholm, Sweden 2011





Building a Business Intelligence
System with the Pentaho BI Suite









J O H A N S Ö D E R L U N D




Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology year 2011
Supervisor at CSC was Jens Lagergren
Examiner was Anders Lansner

TRITA-CSC-E 2011:106
ISRN-KTH/CSC/E--11/106--SE
ISSN-1653-5715





Royal Institute of Technology
School of Computer Science and Communication

KTH CSC
SE-100 44 Stockholm, Sweden

URL: www.kth.se/csc






Abstract
)irtual data is an ever increasing *art o+ any large organi,ation today. -he data can be anything
+rom transactional data coming +rom the day-to-day o*eration to customer and *roduct in+ormation.
Business Intelligence is an area that uses organi,ational data to *er+orm analyses and create re*orts.
-he re*orts and analyses are great resources and can be used to give a better overvie o+ the
organi,ation and hel* making better business decisions.
-he *ur*ose o+ the thesis as to investigate the *ossibility o+ creating a +ully +unctional Business
Intelligence-system by only using intelligent tools and not riting any *rogram code at all.
# Business Intelligence-system consists o+ several *arts and +or each there is a ide variety o+ tools
available both +ree and +or a cost. -he Pentaho BI Suite as chosen to build the system ith. It is a
com*lete BI suite that contains tools +or all *arts o+ a BI-system. -he suite is available both as a
commercial enter*rise edition and a +ree community edition. -he +ree community edition as used.
-he investigation covered +ive issues in de*th.
• /as it *ossible at all to create a BI-system ithout riting any code0
• /hat knoledge and skills as re1uired to set u* the system0
• /hat knoledge and skills as re1uired +or an end user0
• /hat ere the major challenges in designing and im*lementing the system0
• 2o as the *er+ormance o+ the system and ho could it be im*roved0
#t the end o+ the investigation a +ully +unctional and reliable BI system had been created. -he
system could success+ully be used to do com*le3 analyses and create in+ormative re*orts. -hese
could then in turn su**ly decision makers in an organi,ation ith +acts to hel* them make better
business decisions.
Referat
Konstruktion av ett Business Intelligence-system
med Pentaho BI Suite.
)irtuell data 4r n5got som allt mer ökar i m4ngd hos stora organisationer och +öretag idag. !ata kan
+innas i en m4ngd olika system. 63em*el 4r transaktionssystem7 kund- och *roduktdatabaser eller
system som t4cker andra delar av a++4rsverksamheten. Business Intelligence 4r ett omr5de som
anv4nder sig av tillg4nglig data i en organisation +ör att med hj4l* av den ut+öra avancerade
analyser och ska*a till+örlitliga ra**orter om verksamheten. #nalyserna och ra**orterna kan sedan
anv4ndas av besluts+attare inom organisationen som stöd n4r a++4rsbeslut tas.
!etta e3amensarbete hade som m5l att ska*a ett business intelligence-system med hj4l* av Pentaho
BI suite 8 en *aketlösning +ör Business Intelligence byggd *5 ö**en k4llkod. Id9n var att undersöka
om det var möjligt att konstruera ett +ungerande BI-system endast med hj4l* av smarta verktyg. 2elt
enkelt skulle inte en enda rad *rogramkod behöva skrivas. :em +r5gest4llningar undersöktes.
• ;r det möjligt att ska*a ett BI-system helt utan att skriva egen *rogramkod0
• )ad behövs det +ör kom*etens och kunska* +ör att ut+orma och konstruera systemet0
• )ad kr4vs av en slutanv4ndare +ör att kunna anv4nda sytemet0
• )ilka stora utmaningar +inns d5 systemet skall ut+ormas och konstrueras0
• )ilken *restenda skulle systemet +5 och *5 vilka s4tt kunde den +örb4ttras0
Slutresultatet var ett +unktionsdugligt och stabilt BI-system ska*at med de verktyg som +anns i
Pentaho BI suite. Systemet kunde anv4ndas +ör att göra kom*le3a analyser och ska*a in+ormativa
ra**orter. !essa kunde i sin tur ha anv4nts som stöd +ör a++4rsbeslut och +ör att ge tydliga
överblickar över a++4rsverksamheten.
Contents
Introduction.....................................................&
Background..................................................$
Pur*ose........................................................$
!elimitation.................................................(
-heory..............................................................<
-he !ata arehouse....................................=
6-"............................................................&&
>"#P and analysis....................................&$
?e*orting and analysis tools......................&<
Pentaho BI suite overvie.............................&=
!ata Integration.........................................&=
#nalysis.....................................................&=
?e*orting and *resentation........................&=
-he BI-server.............................................&@
Aethod...........................................................&B
System machine.........................................&B
!ata generation..........................................&B
!esigning the !ata /arehouse..................&C
Building the 6-" ith P!I.......................$%
>"#P schema and cube............................$%
-he system as a hole...............................$&
Im*lementation..............................................$$
!ata generation..........................................$$
!ata arehouse construction.....................$$
6-" jobs and trans+ormations...................$B
>"#P ith Schema /orkbench and
Pentaho #ggregation !esigner..................((
?esults...........................................................(B
?unning the system...................................(B
Per+ormance and *er+ormance tuning.......(B
!iscussion and conclusions...........................<(
:inal thoughts............................................<B
?e+erences.....................................................<'
#**endi3 #....................................................=%
#**endi3 B....................................................=(
#**endi3 D....................................................=B
Abbreviations and technical terms
Abbreviation ec!nical erm
BI Business Intelligence
PE#- Product Euality #ssessment -ool
!SS !ecision Su**ort System
6-" 63traction7 -rans+ormation and "oading
!/ !ata /arehouse
>"#P >nline #nalytical Processing
?>"#P ?elational >nline #nalytical Processing
A>"#P Aultidimensional >nline #nalytical Processing
P!I Pentaho !ata Integration
S/ Schema /orkbench
P#! Pentaho #ggregation !esigner
P?! Pentaho ?e*ort !esigner
!BAS !atabase Aanagement System
?!BAS ?elational !atabase Aanagement System
D!D Dhanged !ata Da*ture
A!F Aultidimensional 63*ressions
SE" Structured Euery "anguage
FA" 63tensible Aarku* "anguage
/#E? /eb #d 2oc Euery and ?e*orting Dlient
/GSI/GH /hat Gou See Is /hat Gou Het
Introduction
6very business organi,ation has some kind o+ in+ormation about itsel+ and their o*eration.
!ata about *roducts7 em*loyees7 customers7 sales and resources is today necessary to run a
success+ul business I&J. 63ternal data - such as market trends and demogra*hics - is also an
im*ortant asset +or an enter*rise.
-oday most data7 i+ not all7 is stored electronically. -he most common choice is to store it in
databases7 but other solutions7 such as storing it in *lain +iles7 also e3ists. /hile the
o*erational data is crucial +or an enter*rise to +unction since it su**orts the day-to-day
o*eration it can also be used +or analysis. #naly,ing o*erational and e3ternal data is a great
hel* +or managers7 e3ecutives and others hose *osition makes their decisions a++ect the
business. Being able to anser 1uestions about the business such as K2o much does a
*roduct sell in a certain area0L or K2o much did our sales cam*aign a++ect our *ro+it0L is
im*ortant hen making business decisions.
It is sa+e to assume that organi,ations have alays looked to their on data +or hel* in making
im*ortant business decisions7 but ithout a system to e++iciently store and analy,e the data
this is not an easy task. # com*uter-based system is re1uired to do com*le3 analyses o+ the
data. !ecision Su**ort Systems M!SSN has been around since the &C@%sI$J. # !SS calculates
ansers according to one or more de+ined models. -his makes it a good tool in cases here
only a model is su++icient but many business decisions re1uires analyses made only on the
organi,ations business data.
-he ability to e++iciently store and analy,e business data is today a necessity +or every large
organi,ationI(J. -he data available has continuously increased since com*anies +irst started
storing their business data electronically. >ne +actor is that kee*ing historical data is
im*ortant +or analyses I<J7 so historical data is not o+ten removed. #nother +actor is that more
and more data becomes available ith modern technological advancement. :urthermore
*eo*le in charge making decisions ant to be able to ask very com*le3 1uestions very +ast7
*robably asking more than just one 1uestion at a time. -o meet these demands Business
Intelligence MBIN emerged as an o++s*ring o+ the !SSs I=J.
-he term Business Intelligence as *ro*osed in &C'C by 2oard !resner7 he described it as
“a set of concepts and methods for improving the decision process, using support systems
based on facts”I@J.
Business Intelligence is today commonly seen as a term that re+ers to com*uter-based
methods7 tools and a**roaches used to analy,e and inter*ret business data IBJ.
&
Bac!round
>micron Deti #B is a consulting +irm based in Kista7 Seden. #**ro3imately +ive years ago
they develo*ed a 1uality analysis tool +or 6ricsson #B. -he system is called Product Euality
#ssessment -ool MPE#-N and is still used by 6ricsson today. -he *ur*ose o+ PE#- is to
*rovide an accurate analysis o+ returns and re*airs on 6ricssonOs *roducts based on their
*roduct and return data re*ositories. PE#- is a home gron system made entirely by
>micron +or this s*eci+ic *ur*ose. >micron is no interested in looking at other solutions +or
+uture business o**ortunities. /hat is im*ortant to mention here is that PE#-7 and other
*roduct 1uality systems can be im*lemented as a BI system Mthough a more narro one since
it only handles *roduct 1ualityN. -his is good since the *ur*ose o+ a BI system is to give
accurate and +ast ansers to various analyses7 hich is also the goal +or a *roduct 1uality
system.
/hy is it similar to a BI system then0 -he in*ut to the system is a lot o+ data +rom many
di++erent o*erational system concerning everything about a com*anyPs *roducts. -his data is
then loaded7 cleansed and stored in a re*ository used only by the *roduct 1uality system. -he
data is then used to give accurate analyses - in the +orm o+ chart7 gra*hs and numbers - to
*eo*le making decisions. #s ill be shon in the Concepts cha*ter o+ the thesis these
+unctions is e3actly hat makes u* a BI system.
"ur#ose
-he *ur*ose o+ the thesis as to investigate i+ a 1uality analysis tool could be made ithout
traditional *rogramming and instead make use o+ already e3isting BI-tools. -he +olloing
issues ere addressed in the thesis.
• /as it *ossible at all to create a BI-system ithout riting any code0
• /hat knoledge and skills ere re1uired to set u* the system0
• /hat knoledge and skills ere re1uired +or an end user0
• /hat ere the major challenges in designing and im*lementing the system0
• 2o as the *er+ormance o+ the system and ho could it be im*roved0
-he end goal o+ the BI-system as to *resent three business values in various analyses. -he
amount o+ units that has been shi**ed7 the amount o+ units that has been returned and the error
*ercentage o+ the units Mreturned units Q shi**ed unitsN.
-here are numerous BI-tools available today7 both +ree o*en source solutions and commercial
counter*arts. -he Pentaho BI suite as used to create the BI-system. It is an o*en source
system ith both a +ree community edition and an enter*rise edition. Being o*en source
makes it good +or this thesisPs *ur*ose7 since it uses a lot o+ the most *o*ular +reely available
BI-tools. It it also a com*lete BI suite meaning it contains tools +or all the *arts in a BI
system.
$
$elimitation
-he *ur*ose o+ the thesis is not to com*are the Pentaho BI suite to other BI-solutions that
e3ists. Reither as any other o*en source solutions tried as an alternative7 since that ould
greatly increase the orkload. -he thesis does not e3*lain in detail all the di++erent methods
and a**roaches that e3ists hen creating a BI system7 this is a hole thesis or book in itsel+
and many have been ritten regarding that subject Me.g. I'JICJN. #n overvie ill be given o+
ho a BI-system orks and hat *arts it contains. -his is re1uired to understand the
methodology choices7 design discussions and system im*lementations discussed later on in
the thesis.
(
Theor%
-his section ill give an overvie o+ ho a BI-system orks and hat the di++erent
com*onents are. -he o*eration o+ a BI-system can be vieed as a continuous *rocess hich
starts in the organi,ations o*erational systems and ends ith re*orts and analyses. -he
*rocess consists o+ several *arts and can be de+ined as +ollos.
Source systems S 6-" S !/ S #nalysis and ?e*orting tools S 6nd user environments
Figure 3-1 shos a good overvie o+ the *rocess. "ets e3*lain brie+ly ho the *arts connect
to each other and their *ur*ose.
-he source systems is ere data can be gathered +rom and consists o+ the organi,ation
o*erational systems and e3ternal data sources. # central *oint o+ BI is to gather all available
data that is o+ interest and store it in a centrali,ed re*ository. -he re*ository is called a Data
Warehouse (DW. -he basic idea o+ a !/ is not just to have all data in a single storage *oint
but also to have it error +ree7 con+ormed and named and +ormatted so that it is easy to
understand +or a business user. !ata has to be gathered7 trans+ormed and loaded into the !/
+rom the source systems and this is hat the !"# Mshort +or 63traction7 -rans+ormation and
"oadingN does. /hen all data is available in the !/ analyses and re*orts can be made ith it.
:or this *ur*ose there are analysis and re*orting tools. -he most common analytical
a**roaches used in BI are $n%ine &na%ytica% 'rocessing ($#&' and Data (ining. >"#P is a
method o+ translating relational data models into multidimensional models hich gives it the
ability to si+tly anser multidimensional analytical 1ueries. !ata Aining is an a**roach o+
analy,ing data to +ind hidden *atterns in it. -he end user environments are the gra*hical
inter+aces that *resent the re*orts and analysis result to the end users7 hich are mostly
decision makers or other business users. -his is ho a BI-system orks ith the di++erent
*arts orking together to create the in+ormation chain +rom the source systems to the end
user.
-his as a short overvie o+ a BI-system and the sections o+ this cha*ter ill cover each *art
o+ the BI-system in de*th.
<
The $ata &arehouse
# !ata /arehouse M!/N is a storage o+ the organi,ations data a*art +rom the o*erational
systems here data can be accessed +or 1uick analyses.
The #ur#ose o' a $(
# good 1uestion here is hy a !/ is necessary. DanPt the data just be accessed directly +rom
the o*erational system0 -o understand hy a !/ is im*ortant one can look at hat *roblems
arise i+ a !/ is not *resent hen a BI system is de*loyed and used.
#ssume that a com*any ants to use their data +rom their o*erational systems +or analyses.
-his is a very big ste* toards better business decisions +or the com*any but *utting the
orkload directly on the o*erational systems has many drabacks.
• Aany organi,ations have their data in more than just one re*ository7 and to be able to
get a correct analysis all relevant data has to be available. -his means that many
analyses re1uire data +rom many di++erent systems. -he storage method can also di++er
+rom re*ository to re*ository and can be anything +rom storage in *lain +iles to a
?!BA or other database storage solutions I<J" -his creates a serious *roblem hen
collecting the data since the analysis tools ould need to get the data +rom many
di++erent systems that may store the data very di++erently +rom each other. #nother
*roblem is that the *er+ormance o+ storage systems can vary a lot7 hich can result in
one system being a bottleneck in every analysis its data is used in.
• -he data needs to be available instantly hen a user is doing an analysis. 2aving data
accessed and collected +rom the o*erational system every time a 1uery is made +rom
the analysis tools ould *ut e3tra *ressure on those systems and severely im*act their
=
Figure 3-1) $vervie* of a +, system -'1.
*er+ormance. Sloing don the o*erational systems used all the time in the day-to-
day o*eration is not good +or an enter*rise. -he o*erational systems are also not
designed to handle the com*le3 1ueries coming +rom the analysis tools. -hese 1ueries
re1uire big amounts o+ data and o*erational systems are o*timi,ed +or data
transactions7 not large retrievalsI&%J.
• >*erational systems are o+ten not interested in storing historical data since it is not
needed to run the day-to-day o*eration. -here+or many analyses ould give incorrect
or insu++icient re*resentations o+ the business due to the lack o+ historical data.
• -he data gathered +rom the o*erational systems can o+ten be named in a non-
descri*tive ay and many times only makes sense to the designers and users o+ those
systems. :or e3am*le a column o+ data named KrT*rodT1L ould make no sense to
someone ho doesnPt kno in advance hat the column is +or. Business users ill not
be able to understand these de+initions o+ the data ithout having to ask those ho
designed the systems.
-he !/ is designed to take care o+ all o+ these *roblems. It is designed to handle retrieval o+
data e++iciently and can in theory store in+inite amounts o+ data7 thereby su**orting the storage
o+ all *resent as ell as all historical data +rom all o+ the o*erational systems. -he idea is to
collect all data +rom all o+ the organi,ationPs systems - those here it has been deemed
necessary to collect data +rom - and store it in the !/. -he data is con+ormed to +ormatting
and naming standards created by the organi,ation and clear and uni+ormed de+initions o+ hat
di++erent data re*resents e3ists. -his *rovides a sole centrali,ed storage +or all data that can be
used in analyses. -he data is also *resented in a standardi,ed and understandable ay to an
end user.
$esi!nin! a $( ) Inmon vs Kimball
-here are to major schools concerning the design o+ a data arehouse. >ne as described
by Bill Inmon in &CC% hen he *ublished his book +ui%ding the Data Warehouse I'J . ?al*h
Kimball had another a**roach in mind and *ublished his on book "he Data Warehouse
#ifecyc%e "oo%/it ICJ in &CC@. -he major di++erences in these to teachings can be summari,ed
as +ollos.
Data marts: /hile Inmon sa the data arehouse as a monolithic storage +or all the data that
an enter*rise has Kimball suggested the use o+ data marts. # data mart is a *rojection o+ *art
o+ the data that e3ists. -he idea is to have one data mart *er business +unction. :or e3am*le
one data mart could be *roduction hile another ould be sales. -his creates an easy ay to
select and administer hat data di++erent *arts o+ an enter*rise needs and should have access
to. -his a**roach also makes use o+ conformed dimensions7 hich means that several data
marts can use the same dimension such as a *roduction and sales data mart ould both need a
date dimension. Instead o+ having to date dimensions they can use the same shared date
dimension.
@
Normali#ed vs dimensional modeling: /hether or not to have data marts is not the only
thing Inmon and Kimball disagrees on. #nother di++erence is ho the data arehouse should
be *hysically structured. In InmonPs case the data arehouse should be a standard normali,ed
database. Kimball introduced the idea o+ using de-normali,ed dimension tables7 making
dimension tables hold every in+ormation needed to describe the content in them. :or an
e3am*le in a normali,ed a**roach a dimension table holding in+ormation about customers
ould have e3tra tables describing attributes such as city and country. In KimballPs a**roach
all o+ these tables ould be merged in one customer dimension table making it ithhold every
in+ormation about customers. -his ill o+ course lead to redundancy in in+ormation since +or
e3am*le a +actory dimension table ould also need to contain in+ormation about a +actoryPs
location.
-his thesis ill use KimballPs a**roach hen designing a !/ - hy this a**roach is chosen
ill be e3*lained in the methodology section 8 and thus that a**roach ill be e3*lained in
more detail in the +olloing sections.
B
Figure 3-0) ,nmon1s approach on the %eft and 2imba%%1s on the right
$imensional modelin!
-he idea o+ dimensional modeling is to store data in a multidimensional model instead o+ a
traditional normali,ed one o+ten used in transactional o*erational systems. -he dimensional
model consists o+ to ty*es o+ tables7 facts tab%es and dimension tab%es ICJ.
-he dimension tables holds *ure in+ormation about di++erent areas7 such as *roducts7
customers or time.
:act tables contains measures Malso called facts or metricsN and keys to dimension tables.
-he measures are business values that the organi,ation ants to measure7 +or e3am*le revenue
or number o+ units sold. -he dimension tables de+ines ho these values should be measured.
-o make the e3*lanation easier lets assume that e have a com*any called Computer 3tore.
Dom*uter store sells com*uters and at the moment has one store. -he business has just started
and they only sells three *roducts 8 three di++erent la*to*s. -he store has success+ully sold
three *roducts since the start. Figure 3-3 shos Dom*uter Stores transactional database that
kee*s track o+ their sales.
Ro Dom*uter Store ants to create a data arehouse to use +or +uture analyses. -he
measure they are interested in7 as a start7 is the revenue +or sales. /ith a measure s*eci+ied the
ne3t ste* is to look at hat dimensions are avaiable.
'
Figure 3-3) Computer 3tore1s operationa% database
-he transactional database contains in+ormation about *roducts7 stores and sales. -his can be
translated into three dimensions.
• # *roduct dimension ith in+ormation about the *roducts they are selling.
• # store dimension ith in+ormation about their stores.
• # date dimension to identi+y hen sales have occurred.
Figure 3-4 shos a !/ using these three dimensions to measure revenue.
-he dimension tables are date7 store and *roduct. -he +act table is made u* o+ +ive columns.
5evenue, 6nits73o%d, Date7,d, 3tore7,d and 'roduct7,d. 5evenue is a measure and gives the
total revenue. -he other three columns holds keys to ros in the dimension tables.
-he value o+ 6nits73o%d is determined by the com*osition o+ the dimension keys. "ets do a
brie+ e3am*le to give a clearer image o+ ho the data is re*resented in the !/. #ssume that
e have to stores Mstore& and store$N7 to *roducts M*roduct& and *roduct$N and three sales
registered in the o*erational database. -o o+ the sales Msale& and sale$N ere made at store&
+or *roduct$ and one sale Msale(N as made at store$ +or *roduct&. Sale& occurred $%&&-%&-
%B7 sale $ occurred $%&&-%&-&% and sale( occurred $%&&-%&-&&. -his data ould then be
*resent in the o*erational database. -he !/ is at this *oint em*ty and data gets co*ied to it
ith the 6-" tool. "ab%es 3-&7 (-0 and 3-3 shos ho the data ould look like in the !/
a+ter the trans+er. -he date dimension tables data is not interesting since it just contains auto-
generated dates. "ets say it has dates covering the hole o+ $%&&.
C
Figure 3-4) Computer 3tore1s DW
Store$%e& Store$'d Store$(it& Store$Region Store$(ountr&
& & Stockholm S" Seden
$ $ Stockholm S" Seden
"ab%e 3-1) "he data in the store dimension tab%e (dim7store
)roduct$%e& )roduct$'d )roduct$Name )roduct$*roup
& & !ell $&%% PD
$ $ iPad A#D
"ab%e 3-0) "he data in the product dimension tab%e (dim7product
+act$Sales$'d )roduct$%e& Store$%e& Date$%e& Revenue
& $ & $%&&%&%B @%%
$ $ & $%&&%&&% @%%
( & $ $%&&%&&& &$%%
"ab%e 3-0) "he data in the fact tab%e (fact7sa%es
Ro +or e3am*le i+ someone at com*uter store anted to kno ho much the total revenue
+or store& as they ould just +ilter the +act table so that only ros here the ProductTKey
ere e1ual to & ould be selected and then sum the revenue +ield. #nother e3am*le is the
total revenue +or a s*eci+ic *roduct +rom a s*eci+ic store. -hat ould be achieved by a +ilter
using both the *roduct key column and the store key column7 hich is very easy to do. -his
shos the *otential o+ a good !/ - im*ortant 1uestions about the business can be ansered
ith little e++ort.
Star and sno&'lae schema
-here are to di++erent models that a data arehouse can be modeled a+ter. Star schema and
Sno*f%a/e schema. -he common identi+ier +or both models is the +act table hich in both
models holds the measures and keys to the dimension tables. /hat di++ers the to models
+rom each other is that in a star schema every dimension table is com*letely denormali,ed
and covers a hole area o+ the business7 the Dom*uter StorePs !/ model (Figure 3-4 is an
e3am*le o+ a star schema. # sno+lake schema on the other hand have di++erent degrees o+
normali,ed dimensions s*lit into multi*le related tables. Figure 3-8 shos Dom*uter StorePs
!/ modeled as a sno+lake schema.
&%
*ranularit%
Hranularity is im*ortant to mention hen talking about data arehouses because it
determines ho big the +act table ill be I&$J. -he granularity o+ a +act table is the loest
atomic level o+ hich a +act is de+ined I&(J. -he grain o+ a measure is a combination o+ the
granularity o+ the dimension tables. "etPs take the e3am*le o+ the revenue measure ith the
date and *roduct dimensions again. -he grain o+ the revenue measure ill be the grain o+ the
date and *roduct dimension combined. So i+ the loest atomic level o+ the date dimension is
day and +or the *roduct dimension it is *roduct subgrou* then the granularity o+ revenue ill
be revenue by day by *roduct subgrou*.
ET+
-he acronym 6-" Mshort +or 63traction7 -rans+ormation and "oadingN is the *art o+ the BI
system that handles the initial data coming +rom the di++erent source systems. It consists o+
three ste*s Mhence its nameN.
Extraction
-he actual data transaction +rom the source systems Mo+ten o*erational systems such as
?!BAS used in the day-to-day o*erationN to the !/. # very im*ortant *art o+ the e3traction
is to identi+y hich data has been changed7 inserted or deleted in the source system since the
last data u*date7 this *rocess is called Changed Data Capture (CDC I&<J. /hy is it so
im*ortant0 /ithout D!D the 6-" ould have to load all data +rom the source systems every
&&
Figure 3-8) Computer 3tore DW *ith a sno*f%a/e schema
time data is to be u*dated. -his is subo*timal and ould take too much time as ell as create
unnecessary *ressure on the source systems.
Trans'ormation
-here are several *roblems hen dealing ith the data +rom the source systems. # lot o+
systems have humans inserting the data7 and it is no secret that humans make mistakes
resulting in erroneous data. !u*lication o+ data is another *roblem7 +or e3am*le di++erent
databases can have the same *roduct in their data thus the *roduct is de+ined tice and
redundant in+ormation e3ists. :ormatting o+ data can also di++er +rom system to system
creating more *roblems. :or e3am*le a date can be +ormatted in many ays. yyyy-mm-dd7
ddQmm-yyyy7 mmQdd-yyyy are just three e3am*les.
#nother issue that needs to be addressed is the naming o+ the data. Aany times it is only clear
to the designer or the current users hat the data re*resents. :or e3am*le a column named
*rodTm marking the model o+ a *roduct could be renamed to *roductTmodel. -his is also a
goal o+ the !/ - to *resent data in a ay that makes it easy to understand.
#ll o+ these *roblems can be addressed directly in the 6-". -he goal o+ the trans+ormation
*art is to cleanse7 con+orm and alter data so that it is coherent7 clean and +its in the data model
used in the !/.
+oadin!
-his is the actual loading o+ the data into the !/. !imension keys needs to be looked u*
every time a +act ro is inserted. -hus loading data into the +act table is a resource consuming
*rocess.
Sta!in! area
# staging area is an alternative to trans+erring the data directly +rom the source systems to the
!/. -he staging area can be seen as a tem*orary storage beteen the source system and !/
and the data is co*ied straight as it is +rom the source systems to the staging area. -his
a**roach *uts the least *ressure *ossible on the source systems since the data is only e3tracted
be+ore it is *ut in the staging area. "ess *ressure on the source systems is alays a desirable
since they are o+ten critical to the organi,ations o*eration. /hen the data has been co*ied to
the staging area more major o*erations can be made7 such as trans+orming the data or
analy,ing it to create meta-data about erroneous entries.
,+A" and anal%sis
/hen it comes to creating re*orts based on the com*any data the !/ is a +antastic source +or
that *ur*ose. But hen doing com*le3 analyses it a !/ starts to lack in *er+ormance. #+ter
all it is just a +lat relational database7 even i+ it is modeled in a multidimensional +ashion.
$n%ine &na%ytica% 'rocessing ($#&' is a category o+ database *rocessing used +or
multidimensional analyses I&=J. -his gives >"#P the ca*abilities to do com*le3 analytical
and ad-hoc 1ueries very si+tly.
&$
>"#P systems are categori,ed de*ending on ho the underlying data is stored and
traditionally I&@J there are three variants.
,ultidimensional- ,./A) 8 !ata is stored in an o*timi,ed multidimensional array storage.
Relational- R./A) 8 !ata is stored in a relational database. Re tables are created to hold
aggregated values. !e+ined by schema +iles ritten in !9tensib%e (ar/up #anguage (:(#;
H&brid 8 !ata is divided beteen relational and s*eciali,ed storage.
-he hole idea o+ an >"#P system is to create $#&' cubes. -he cubes are hat is used hen
1uerying the system. /ithout a cube there is nothing that can be 1ueried.
Cubes, dimensions and measures
It as mentioned earlier that a star schema is a good choice +or a !/ since it translates ell
into an >"#P cube. -his is true since an >"#P cube is a collection o+ dimensions hich
de+ines values o+ measures. -he dimensions creates a multidimensional structure o+ cells in
hich each cell holds values o+ one or more measures. -he measures are the things you ant
to re*ort on such as revenue or number o+ sold units. 2ere it becomes clear that a star schema
in +act can be translated almost directly into an >"#P cube ith the dimensions tables being
the dimensions and the +act table measures being the cubes measures. /hat is im*ortant to
mention though is that the cube measures are aggregated values and not single atomic +acts.
2o the values are aggregated de*ends on ho the measure is de+ined. Dommon *rede+ined
aggregation +unctions are count, sum and average. Users can also de+ine their on
aggregation +unctions and thereby tailor the measures in any ay they ant.
# cube consists o+ dimensions and measures7 but it is also im*ortant to mention that the
dimensions has hierarchies. # hierarchy in a dimension describes ho the data should be
ordered by granularity. 6very hierarchy has levels7 the loest level is the highest granularity.
&(
Figure 3-8) &n $#&' cube
# dimension must contain at least one hierarchy but can have several. 2ere is an e3am*le to
make it more clear.
# date dimension e3ists in a cube. -he dimension has to hierarchies. >ne that describes
months7 and one that describes eeks. -he levels +or the month hierarchy are in order o+
granularity +rom loest to highest. Gear - Aonth - !ay o+ Aonth.
#nd +or the eek hierarchy it looks like this. Gear 8 /eek 8 !ay o+ eek.
-his means o+ course that the database table the data is gathered +rom must have +ields ith
in+ormation about the year7 month7 eek7 day o+ month and day o+ eek.
/ith these hierarchies de+ined the >"#P server knos ho to order the data in the cube.
!ata +rom >"#P cubes are retrieved ith a 1uerying language called A!F7 abbreviation +or
(u%tidimensiona% !9pressions. It as originally develo*ed by Aicroso+t and is today a de
+acto standard +or 1uerying >"#P databases.
:urther in+ormation about A!F and its synta3 can be +ound here.
htt*.QQ.iccube.comQsu**ortQdocumentationQmd3TtutorialQgentleTintroduction.html
htt*.QQen.iki*edia.orgQikiQAultidimensionalT63*ressions
-e#ortin! and anal%sis tools
-here are many re*orting and analysis tools out there but they all have common
characteristics. -he *ur*ose o+ the re*orting tools is to be able to connect them to a database7
*re+erably a !/7 and retrieve anted data to structure a re*ort ith. >+ten a storage +unction
is *rovided to store already made re*orts and use them again as the data is u*dated. >ther
+eatures that are o+ten *resent are scheduling o+ re*ort creation and the ability to create
gra*hs.
-he analysis tools o+ten has to do ith visuali,ation o+ >"#P cubes and data mining. -he
conce*t o+ data mining ill not be discussed but +urther reading can be +ound here.
htt*.QQen.iki*edia.orgQikiQ!ataTmining
It is im*ortant to remember that a BI system is not a magic bo3 solving everything thatOs
missing in an organi,ation. In order to make a BI system orthhile7 the *eo*le using it must
kno the im*ortant 1uestions and ho to ask them. Selecting the data used in the BI system is
also very critical since it de+ines the ansers. # lack o+ data7 or absence o+ vital data can result
in +aulty analyses that are believed to be correct. I+ business decisions are based on these
+aulty analyses it has done more harm than good and the hole *ur*ose o+ the BI-system has
been lost.
&<
"entaho BI suite overvie&
-he Pentaho BI suite is made u* o+ several tools that can be linked together to +orm a
com*lete BI system. -his cha*ter ill give an overvie o+ the tools available since most o+
them ill be used to build the BI system.
$ata Inte!ration
-he tool available +or 6-" in the Pentaho BI suite is 'entaho Data ,ntegration ('D,7 also
codenamed as Kettle. P!I has a gra*hical user inter+ace here users can *ut together
everything +rom sim*le to com*le3 6-" solutions. P!I has the +olloing building blocks.
Step: # ste* has a data in*ut and a data out*ut and data is streamed through the ste*. 6very
ste* has a +unction and hen data is streamed through it it can be altered andQor +iltered or just
scanned to match *atterns. -here are many di++erent ste*s available in P!I and +or many users
they are enough to do the trick. I+ they are not enough there are scri*t ste*s available here
either SE" or JavaScri*t can be used.
ransformation: # trans+ormation is a container +or ste*s. /hen a trans+ormation is done
and all the ste*s are linked together and has a *ro*er in*ut and out*ut the user can run the
trans+ormation. 6rror logging is available to hel* the user see hat ent rong i+ the
trans+ormation +ails. /hat is im*ortant to mention here is that a trans+ormation can be seen as
a big stream. It has a start Min*ut to +ile7 table etcN and an end Mout*ut to +ile7 table etcN and in
beteen all o+ the data that has been +etched or created on the ay is streamed through the
hole trans+ormation in the order it as brought into the trans+ormation.
0ob: # job consists o+ one or more trans+ormations that ill be run in order de*ending o+ ho
the job is set u*. !ata ill not +lo +rom one trans+ormation to another7 they ill sim*ly be
e3ecuted in order. Jobs can be used to schedule tasks such as running all u*date
trans+ormations +or dimension tables in a data arehouse.
Anal%sis
-he >"#P server +ound in the Penatho BI suite is called Aondrian. Aondrian uses ?>"#P
technology and translates A!F 1ueries to SE" used on a multidimensional model. Aondrian
has caching and bu++ering ca*abilities hich o*timi,es *er+ormance. It also has su**ort +or
security roles that limits hat data users have access to. -he BI suite also contains (ondrian
Schema Wor/bench MS/N and 'entaho &ggregation Designer MP#!N; -he orkbench is a
gra*hical tool +or creating cube7 a schema +ile is generated according to ho the cube is
designed in the HUI. -he aggregation designer is used to add aggregate tables to s*eed u* the
cube *er+ormance. -ables can be added manually or the tool can suggest tables that it thinks
ould im*rove *er+ormance.
-e#ortin! and #resentation
-here are to re*orting tools available in the Pentaho BI suite. >ne is the eb-based Web &d
<oc =uery and 5eporting C%ient (W&=5. /#E? o++ers generation o+ re*orts though it can
&=
only use data *rede+ined by meta-data models. #lthough the /#E? is easy and 1uick to use
the data is limited to grou*ed lists and it does not su**ort charts or gra*hs. In that as*ect the
other re*orting tool7 'entaho 5eport Designer ('D57 is much more *oer+ul. /ith P!? a
user can customi,e their re*ort in almost any ay and it has su**ort +or charts and gra*hs.
P!? is not a W>3,W>? (What >ou 3ee ,s What >ou ?et editor so the user ill have to ork
in a design environment that doesnPt look like the +inal layout o+ the re*ort. -his re1uires some
learning and understanding o+ the editor +rom the user.
The BI.server
-he BI-server is a eb server that can be accessed via a eb inter+ace. -he server also has a
solution re*ository to hich solutions +rom the di++erent tools can be *ublished and accessed.
In P!I this means trans+ormations and jobs and +rom S/ cube schemas. -he re*ository can
also contain re*orts and dashboards created in the eb inter+ace. -he re*orts and dashboards
are created ith the /#E? mentioned in <.(. -he eb inter+ace also o++ers users to vie
*ivot tables o+ >"#P cubes submitted to the re*ository +rom S/. -he o*tion o+ running *ure
A!F 1ueries against cubes is also available. -he BI server uses a *recon+igured tomcat
instance. -omcat is an o*en source Java Servlet container develo*ed by -he #*ache So+tare
:oundation. -o be able to access the BI-server login is re1uired and it has +ull su**ort +or
de+ining users and user roles.
By de+ault the solutions re*ository7 user and authentication data is stored in a *recon+igured
2S"E!B Mhich is a ?elational !BASN" -he server can be recon+igured to use a user
de+ined !BAS instead.
&@
/ethod
-his section ill contain in+ormation about design choices and methods concerning the BI
system that ill be created ith the Pentaho BI suite.
S%stem machine
-he BI-system as run on a la*to* com*uter ith the +olloing s*eci+ications.
)rocessor: Intel Dore $ !uo -='=% M$.&@ H2,N
S&stem memor&: (.%% HB !!?$
.S: /indos B @<-bit version.
:our di++erent Database (anagement 3ystems (D+(3s ere tried as *lat+orms +or the data
arehouse. AySE"7 PostgreSE"7 "ucid!B and Aonet!B. #ll o+ these are +ree o*en source
!BAS ith AySE" and PostgresSE" being the most ell knon. Both "ucid!B and
Aonet!B are develo*ed ith a +ocus on data arehousing and business intelligence.
#s +or trying to increase the *er+ormance o+ the system no im*rovements ere made directly
in the >S. -he im*rovements that ere made during the ork ere done in the tools +rom the
Pentaho BI suite and in the !BASs.
$ata !eneration
Since the data used in PE#- ere not *ublic in+ormation test data had to be created. -he data
model resembled but as not an e3act co*y o+ the one +ound in PE#-.
-he data as *laced in an o*erational database. >+ course the o*erational database ould not
have any real users since it as solely created +or the *ur*ose o+ the thesis.
-he database as designed according to hat can be e3*ected o+ a *roduction database and
the in+ormation available about PE#-s database structure. -he database contained
in+ormation about *roducts7 customers7 +actories and error ty*es. -he database is shon
belo in Figure 8-1.
-he lastTmodi+ied +ield +ound in every table as used to im*ly hen a ro as last changed.
-his +ield as re1uired +or the D!D to ork in the 6-"7 more on that in the ,mp%ementation
cha*ter.
&B
Several co*ies o+ the database ere created. 6ach ith a di++erent amount o+ *roduced units.
-he amount o+ returns as set to &%V o+ the amount o+ units in all databases. -he high
*ercent as chosen to make the analyses easier to *er+orm since one o+ the 1uality measures
e ere interested in as the *ercent o+ units returned.
-he *ur*ose o+ having di++erent amounts o+ units as to be able to evaluate the *er+ormance
o+ the 6-" and the >"#P cube. -he *ur*ose o+ having co*ies o+ the same database but ith
di++erent amounts o+ units as because it *roved easier than having to add or erase data every
time another 1uantity o+ units as to be tested.
-he si3 co*ies o+ the database that ere created and the amount o+ units and returns inserted
are shon in "ab%e 8-1 belo.
&'
Figure 8-1) "he operationa% database
Name 12nits 1Returns 3456 of t!e units7
Prototy*e%= =%%%%%% M= millionN =%%%%% M%.= millionN
Prototy*e&% &%%%%%%% M&% millionN &%%%%%% M& millionN
Prototy*e$% $%%%%%%% M$% millionN $%%%%%% M$ millionN
Prototy*e(% (%%%%%%% M(% millionN (%%%%%% M( millionN
Prototy*e<% <%%%%%%% M<% millionN <%%%%%% M< millionN
Prototy*e=% =%%%%%%% M=% millionN =%%%%%% M= millionN
"ab%e 8-1) "he various copies of the production database
$esi!nin! the $ata (arehouse
-he !/ design ere created using KimballPs a**roach. -his as because the BI system ere
covering data +or a certain *art o+ an enter*rise 8 a *roduction environment. -his made it ideal
to design the !/ as one data mart. Since data marts according to Kimball should re*resent
e3actly one *art o+ a business o*eration one data mart as su++icient to cover a *roduction
system. #nother reason +or choosing KimballPs a**roach as the denormali,ed design
suggested by Kimball. 2aving a normali,ed !/7 as Inmon suggested7 ere not ideal since the
+ocus o+ the !/ as use it +or analysis and build >"#P cubes +rom it. -his made a
denormali,ed design the better choice since as going to yield better *er+ormance to the cube.
-he 6-" ere also going to be +aster ith a denormali,ed design since key looku*s necessary
+or inserting data into the !/ ould be +aster ith less tables to *er+orm the looku*s in.
-he ne3t design choice as ether to use a star schema or a sno+lake schema. # star schema
as chosen +or the same reasons KimballPs denormali,ed a**roach as chosen. "ess key
looku*s +or +aster 6-" and easier translation to an >"#P cube. In KimballPs o*inion
sno+laking should only be used in s*ecial circumstancesICJ. >ne could argue that +or
e3am*le geogra*hic data +or customers and +actories should be *ut in itPs on table. -his
ould make the to dimension tables hich has geogra*hic data Mthe customer table and
+actory tableN smaller and u*dates to geogra*hic data ould only occur in one *lace. -his
creates de*endencies +or the to dimension tables to a table holding geogra*hic in+ormation
and ould slo don both the 6-" and analyses and thus that design as not used.
-he ne3t ste* as to determine ho many dimension tables there should be and ho they
ould look. -he +olloing areas as determined to be o+ im*ortance to analyses. *roduct
in+ormation7 customer in+ormation7 +actory in+ormation7 error in+ormation and time
in+ormation.
#+ter the dimensions had been decided the +acts as the ne3t ste*. -he 1uality measures -
+acts - anted ere as mentioned earlier. number o+ *roduced units7 number o+ returned units
&C
and the error *ercentage. -hese three +acts ill be re+erenced to as units7 returns and ep in the
remainder o+ the te3t.
/ith the decision o+ using KimballPs method +or modeling and having identi+ied the
dimensions and +acts the !/ as ready to be im*lemented.
It as decided that the database model +or the !/ ere going to be done in (y3=#
Wor/bench.
-he !/ also needed a orking D!D-solution. -he D!D as going to be cross-im*lemented
beteen the 6-" and the !/ ith both o+ the systems being able to communicate data
u*dates.
Buildin! the ET+ &ith "$I
-he hole 6-"-*rocess as designed and im*lemented ith the P!I. -he major design
choices ere hether to use a staging area or not and ho the D!D and data cleansing should
be constructed.
-he data trans+er +rom the o*erational database to the !/ ere done ith one trans+ormation
*er dimension table. -he +act table needed to trans+ormations to load all data correctly7 more
on this later on.
Using a staging area as not necessary in the BI system since the o*erational databases ere
not used by anyone else and e3tra *ressure on them +rom the 6-" tool as not a *roblem.
# D!D-solution on the other hand as necessary to make the loading o+ data selective. -he
D!D-solution ere im*lemented in the P!I trans+ormations.
Notes on data cleansing
!ata cleansing is really *art o+ a broader to*ic 8 data @ua%ity 8 hich in turn is a *art o+ the
subject data governance. It is u* +or major discussion hether or not data cleansing should
take *lace in the 6-" or not. #rguably the data should be cleansed and con+ormed at the root
in the source systems. 2aving *er+ect data in the source systems is a very admirable goal7 but
in reality this is very hard. Aany systems have manual in*uts and as e all kno *eo*le make
mistakes. It is alays easy to blame the human +actor hen looking +or sca*egoats but
automatic systems can also create incorrect or +alse data due to various reasons such as
mal+unctions or bugs. -his means that having correct data at the root is very tough ork and
not many organi,ations can manage that. So data cleansing is an im*ortant *art o+ the 6-" i+
an organi,ation ants to be sure that the data used +or analyses is correct and can be trusted.
In the BI system built +or this thesis no data cleansing as done since all o+ the data in the
o*erational systems as automatically generated or *rede+ined and thus could be considered
sa+e. :or the *ur*ose o+ totality7 challenges involving design and im*lementation o+ data
cleansing ill be discussed later.
,+A" schema and cube
-he analysis *art o+ the system consisted o+ cubes run on the Aondrian >"#P server. -he
cube schema ere created ith 3chema Wor/bench; 'entaho &ggregation Designer ere used
to im*rove the *er+ormance o+ the cube. Both 3chema Wor/bench and 'entaho &ggregation
Designer are tools +ound in the Pentaho BI Suite.
$%
Since the !/ ere designed as a star schema the cube looked very much the same as the !/
itsel+. -he measures in the cube ere the same as the +acts in the !/ and the dimensions in
the cube ere the same as the dimension tables in the !/. -his as one o+ the reasons hy a
star schema as used in the +irst *lace7 a star schema translates almost seamlessly into an
>"#P cube.
-o test i+ the cube orked satis+actory and to test its *er+ormance7 telve analysis 1ueries o+
varying com*le3ity ere made. -hese ere run against the cube to measure the time it took to
get an anser. -he test cases can be vieed in #**endi3 #. # java *rogram ere ritten to
1uery the cube7 this also made it very easy to measure the res*onse times.
The s%stem as a &hole
/hen the system had gotten its +inal design and been im*lemented each o+ the test databases
Mith their varying number o+ *roduced unitsN ere used to test the system. Starting ith
moving data ith the 6-" +rom the o*erational system to the data arehouse7 +olloed by
creating >"#P cubes ith Aondrian in order to anser a variety o+ analysis 1ueries. -he
*er+ormance o+ both the 6-" and Aondrian ere measured in time.
$&
Im#lementation
-his cha*ter ill go through ho the di++erent *arts o+ the *roject ere designed and
im*lemented starting +rom the generation o+ the test data and ending ith ho the >"#P
cubes ere made. 6very time a tool +rom the Pentaho BI suite has been used a brie+
e3*lanation o+ ho the tool orks ill be given so that the reader can +ollo the design
*rocess easier.
$ata !eneration
-he region7 countries7 cities and +actories as manually ritten into scri*t +iles and inserted
into the !B7 as ere the *roduct grou*s and error ty*es. -he customers ere generated ith
the hel* o+ the o*en source *rogram S*aner !ata Henerator.
:or the generation o+ *roduced units +or the test databases a java *rogram as ritten. It used
J!BD7 the java #PI +or database connection7 to connect to the o*erational !B. -he *rogram
goes through the customer table and generates beteen &-&% orders +rom that customer. 6ach
order is +or a s*eci+ic *roduct and beteen &% and =%%% units are ordered. 6ach unit gets a
uni1ue randomi,ed serial number &% characters long7 alloed characters are #-W and %-C.
#+ter that a starting *roduction date in the time-s*an $%%=-%&-%& to $%&%-&$-(& is generated.
-his is hen the +irst unit in the order is *roduced7 the rest o+ the units are randomly given a
*roduction date in the time-s*an starting date to Mstarting date X &%%N. #ll o+ the units in each
order are considered to be *roduced +rom the same +actory7 hich is also randomly *icked +or
each order.
-he same *rogram also *o*ulates the returned table by selecting a s*eci+ied number o+ units
+rom the unit table and inserting them in the returned table. -he return date is set to the
*roduction date X M&-&%C=N days7 &%C= days being roughly three years.
$ata &arehouse construction
In the design decisions +ive dimension tables ere identi+ied. -he +olloing list shos ho
the dimension tables as im*lemented and describes their data and level o+ granularity.
Product 8 2olds in+ormation about the *roduct. It as determined that the granularity here
should be at the *roduct name level. !ata ill be gathered +rom the product7group7
product7name and unit tables in the o*erational database.
Dustomer 8 2olds in+ormation about all customers that have bought *roducts. -his table holds
the customer name as ell as geogra*hic in+ormation about the customer. -his results in a
granularity at the customer name level. !ata ill be gathered +rom the customer7 city7 country
and continent tables in the o*erational database.
:actory 8 2olds in+ormation about the +actories that *roduces units. Same in+ormation as the
customer dimension. :actory name and geogra*hic location. Heogra*hic in+ormation here is
im*ortant to be able to see ho much certain regions have *roduced. !ata ill be gathered
+rom the factory7 city7 country and continent tables in the o*erational database.
$$
6rror 8 Dontains all o+ the *ossible errors a unit can be tagged ith. Dontains the error
number and the te3tual name o+ the error. !ata ill be gathered +rom the errors7type table in
the o*erational database.
!ate - -he data in this table ill re*resent time ith a granularity o+ uni1ue dates. -he data
ill be generated ith P!I and then inserted into the table. # s*an o+ &% years ill be
su++icient to cover the *roduction and return dates used in the generated test data in the
o*erational databases. -he dates ill range +rom $%%=-%&-%& to $%&=-%&-%&.
6very dimension table e3ce*t the date dimension used auto-generated *rimary keys7 such
keys are called surrogate /eys. -he date dimension made use o+ a natura% *rimary key in the
+orm GGGGAA!! re*resenting the date the ro holds in+ormation about. Since dates are
uni1ue this creates a great *rimary key and as e ill see later the use o+ this natural key ill
hel* s*eed u* the 6-" signi+icantly. /ith the dimension tables ready it as time to decide on
the +act table designs.
:ive columns in the +act table ere automatically determined by the dimensions tables and
ould hold +oreign keys relating to them. /hat as le+t to add as hat the +act or +acts
should be.. -his means that there has to be a ay to count the number o+ *roduced unit and
the number o+ returned units. -he error *ercentage can be derived +rom those to values by
dividing returned units ith *roduced units. -hough be+ore any decisions involving the +act
table design as made another *roblem as noticed. there as something missing in the
dimensions to get the *ro*er grain +or the 1uality measure.
-he granularity o+ the units as at this *oint only at the unit name level as derived +rom the
*roduct dimension. 2aving the accuracy o+ the 1uality measures as high as *ossible is vital
since that is the core *ur*ose o+ the system. So the higher the granularity o+ the *roduct
dimension the more accurate the 1uality measure ould be. -he highest granularity *ossible
+or *roducts are uni1ue *roduced units. -he *roblem that arises i+ e *ut uni1ue units as the
granularity in the *roduct dimension is that it ill be as long as the +act table in regard o+ the
number o+ ros. -his is not the *ur*ose o+ a dimension table7 they are su**osed to be as short
as *ossible. #nd having it as long as the +act table ould create an enormous join each time
+iltering ith the *roduct dimension as re1uired.
Instead o+ *utting the uni1ue units in the *roduct dimension a single column dimension as
added to the +act table. -his single column dimension held the serial numbers +or all units ever
*roduced. -his solution did kee* the *roduct dimension +airly short ith the granularity at
*roduct name and the +act table became as long the amount o+ *roduced units.
/ith this solution all o+ the needed dimensions ith correct granularity ere in *lace and it
as time to decide ho the +acts ere to be re*resented.
-here ere actually to solutions tried7 one as later discarded due to the second one being
better. But letPs take a look at both o+ them to see hy one as chosen and the other one
discarded.
$(
-he +irst design used to +act tables7 one holding *roduced units and the customer ho
bought them and one holding all the units that had been returned. -his meant to data marts
ith +ive shared dimensions. -he units +act as calculated by counting the number o+ ros in
the unitT+act table since every ro re*resents a uni1ue unit. -he returns +act as calculated in
the same ay +rom the +actTreturned table. #s mentioned earlier the third +act7 the ep7 could
be derived +rom the to *revious ones so all +acts ere no available as ready to be used in a
cube. -he design can be vieed in Figure A-1;
#t +irst this design looked good but it *roved to be di++icult hen the cubes got taken into
consideration. Since there ere to +act tables7 there also had to be to cubes - one +or each
+act table. -o be able to get the most im*ortant 1uality measure Merror *ercentage -
totalQreturnsN the to cubes created +rom each +act table had to be joined as a virtual cube. >r
$<
Figure A-1) "he first DW design
the data - number o+ units and number o+ returns - had to be retrieved +rom both cubes each
time and then divided. Dreating a virtual cube or having to cubes at all +elt too com*le3
considering that the underlying !/ really asnPt that advanced in its design. -he design as
remade.
Instead a single +act table as used. -he im*roved design can be seen in Figure A-0.
-he +act table no contained key columns to all dimensions tables Mto +or the date
dimensionN7 the to %ast7updated columns +rom the *revious +act tables7 and to ne
columns. -he to ne columns7 @uantity and returned as the ne +act columns. Euantity
as alays set to & and returned as set to % or & de*ending i+ the unit had been returned or
not. Ro the sum +unction could be used on those to columns to get the desired +acts units
and returns. -his design as mainly done to counter some issues hen building the >"#P
cube later on but that ill be discussed in the >"#P section o+ this cha*ter.
/ith this ne design a ne *roblem arose. I+ a unit had not been returned the keys columns
return7date7/ey and error7/ey ould be null. Rull values in the +act table are not desirable
because it causes inconsistency in data. >nce again design ideas +rom Kimball ere taken.
Kimball states that instead o+ having null values in dimension key columns they should *oint
to a s*eci+ic ro in the dimension tableI&BJ.
$=
Figure A-0) "he second and fina% DW design
-his lead to the insertion o+ a dummy ro in both the dim7error and the dim7date table ith
+i3ed *rimary keys. #ll units that had not been returned re+erred to these ros in their
return7date7/ey and error7/ey columns.
6arlier designs o+ the +act table contained the serial numbers as identi+iers +or the units7 this
as changed to the unit7id +ield +rom the o*erational database instead since it takes u* less
s*ace and is easier to do looku*s on. #nd the serial numbers arenPt really needed hen doing
the analyses.
-o be able to handle the D!D it as decided that every table in the !/ should have a column
called lastTmodi+ied. -he column ty*e as set to timestamp7 hich contains a date and a time
o+ day. -he %ast7modified columns *ur*ose is to be u*dated each time ne ros are inserted
och ros are u*dated in the table. By having this column and a counter*art in the o*erational
system the 6-" can take the ma3imum o+ the lastTmodi+ied column in the !/7 hich ould
result in the latest time stam*7 and then only select ros +rom the o*erational system that
succeeds that time stam*. /ith this a +unctional D!D can be im*lemented in the 6-".
It is hoever im*ortant to note here that in many cases o+ building a BI system the designer
does not have total control over the source systems7 the more common case is that they have
no control over them at all. -his means that i+ some o+ the source systems does not have any
indication o+ hen in+ormation as last changed constructing a D!D ould be a lot more
com*licated. -he D!D +or a source system ithout that in+ormation ould have to record itPs
on in+ormation o+ hat data has been gathered on *revious data e3tractions. >ne solution to
accom*lish this has been discussed earlier and that is to use a staging area.
>ne last thing to comment on in the +inal +act table im*lementation is the necessity o+ to
lastTmodi+ied columns. -he *roductionTlastTmodi+ied is *retty straight +orard7 it tells the
D!D hen the unit as either inserted or hen its in+ormation as last u*dated. -he *roblem
ith this is that i+ there ere only this lastTu*dated +ield no in+ormation about hen a unit
as returned ould e3ist. >+ course i+ a unit as returned the *roductionTlastTmodi+ied +ield
could be u*dated to the time hen the return as recorded but this ould mean that the 6-"
ouldnPt kno i+ the time stam* signi+ies an in+ormation u*date or that the unit has been
returned. -his ould create e3tra ork +or the 6-" since it ould need to com*are
o*erational data ith its on to see hich change has occurred. So instead o+ +orcing this +ail-
sa+e it as decided that adding the returnedTlastTu*dated +ield as better. -his column tells
the 6-" hen in+ormation about a return as last made.
$@
ET+ 0obs and trans'ormations
#s mentioned be+ore the P!I orks ith Bobs7 transformations and steps. It as decided that
the e3traction and loading o+ the data +rom the o*erational database to the !/ ould be done
ith one trans+ormation *er dimension table. -he +act table loading consisted o+ to
trans+ormations7 but letPs start ith the dimension table trans+ormations.
+oadin! the dimension tables
Figure A-3 shos the P!I job that loaded all data +rom the o*erational database to the
corres*onding dimension tables in the !/.
It consisted o+ si3 trans+ormations 8 +ive +or the +ive dimension tables and one +or inserting
the dummy ros into the date and error dimensions. -he trans+ormations +or loading the
dimension tables - %oad'roducts7 %oadCustomers7 %oadFactories and %oad!rrors - all looked
very similar and to not get over-e3*licit a closer look ill only be taken at the %oadCustomer
trans+ormation.
-he loadDustomers trans+ormations loaded all o+ the customer data into the customer
dimensions table. -he trans+ormation design is shon in Figure A-4;
$B
Figure A-3) "he dimensions %oad Bob
-he +irst ste* called KdimTcustomer last u*datedL is the D!D solution used and ill be
discussed later.
-he ne3t ste* is o+ course to get the data +rom the o*erational database. -his ere done ith
the "ab%e input ste*. # database connection must be de+ined as the in*ut source +or this ste*.
-o 1uery the database SE" 1ueries can be ritten manually in the ste* or a database e3*lorer
can be used to select hich table to im*ort data +rom. -he ste* ill *ass along the +etched
ros +rom the table as its out*ut. -he table in*ut ste* in %oadCustomers connected to the
*rototy*e !B and selected all o+ the +ields +rom the customer table.
-he geogra*hic data about the customers Mcontinent7 country and cityN didnPt reside in the
customer table but in its on tables. -his data also had to be gathered +or the customer
dimension table since it as totally denormali,ed and held all available in+ormation about a
customer. -he customer table in the *rototy*e !B contained +oreign keys ith in+ormation
about hich city the customer ere based in. -he city table in turn contained +oreign keys to
hich country the city as located in. #nd the country table contained +oreign keys as to
hich continents the countries belonged to. -his means that the in+ormation needed to be
+etched +rom the di++erent tables ith the use o+ the +oreign keys. :or this the Database
#oo/up ste* as used in the %oadCustomers trans+ormation. -he looku* ste* re1uires as in*ut
a table name7 a key column in the table and a value +rom the stream in P!I to do the matching
against. # user can then choose hich columns to +etch +rom the table i+ a looku* is
success+ul. In %oadCustomers it as done7 in order7 as +ollos.
City %oo/up. :etch city name and country id +rom city table7 match on city id +rom stream
ith city id in table.
$'
Figure A-4) "he customers %oad transformation
Country %oo/up. :etch country name and continent id +rom country table7 match on country id
+rom stream ith country id in table.
Continent %oo/up. :etch continent name +rom continent table7 match on continent id +rom
stream ith continent id in table.
#+ter the the looku*s all o+ the necessary data ere available. -he 3e%ect va%ues ste* ere
used to +ilter out unanted data in the stream7 such as the +oreign keys. -hey ere not *art o+
the data anted in the customer dimension table and thus ere discarded a+ter use.
-he last ste* ere to inset the data into the customer dimension table. -his ere done ith the
,nsert C 6pdate ste*. -his ste* needs a database connection7 a table and a key column. -he
key column is used to match ith a +ield +rom the stream. I+ a match is +ound that ro in the
table is u*dated. I+ no match is +ound a ne ro is inserted. /hich columns that should be
u*dated i+ a match is +ound can also be s*eci+ied. In %oadCustomers the customer dimension
table +rom the !/ ere the table and the customer7id +ield ere used to match ith the
customer7id +ield +rom the stream7 hich originated +rom the customer table in the
o*erational !B. #ll o+ the columns in the dimension table ere set to be u*dated i+ a match
ere +ound7 e3ce*t the customerTid +ield since that ere used +or the matching and should not
be changed.
-his is ho the loading o+ the customer dimension table as im*lemented in P!I. #s
mentioned be+ore all o+ the dimension table loading trans+ormations ere very similar and
they all consisted o+ a table in*ut7 some table looku*s and an insert Q u*date out*ut. #ll o+
them also had the +irst ste* hich has not yet been e3*lained - the D!D ste*. In the
loadDustomers trans+ormation it as called KdimTcustomer last u*datedL. It as a table in*ut
ste* hich contained manually ritten SE".
SELECT COALESCE( MAX(customer_last_modified),'197!1!1 ""') AS
ma#_dim_customer_last_u$date %&OM dim_customer
-he A#F +unction returns the greatest value +rom the customer7%ast7modified +ield in the
customer table +rom the !/. Since it as a time stam* +ield the greatest value ould be the
latest time stam*. -he D>#"6SD6 +unction returns the +irst non null value +rom its
*arameters. -his means that i+ there as any data at all in the customer dimension table the
resulting value ould alays be the greatest value o+ the customer7%ast7modified column. I+
the customer table as em*ty then the P&CB%-%&-%& %%.%%.%%P time stam* ould be returned.
-he value +rom the SE" 1uery as +orarded to the tab%e input ste* used to +etch data +rom
the o*erational database. In loadDustomers the SE" +or +etching data +rom the o*erational
database looked like this.
SELECT customer_id, customer_'ame, customer_cit(_id, last_modified
%&OM customer
)*E&E last_modified + ,
6verything e3ce*t the /26?6 statement ere auto-generated ith the database e3*lorer. -he
K0L in the /26?6 statement as the in*ut +rom the *revious ste* 8 the time stam* +rom the
1uery in the dim7customer %ast updated ste*. -his meant that every time the %oadCustomers
$C
trans+ormation as run the latest time stam* in the dimension table as +etched and only the
ros +rom the o*erational database that had been changed or inserted a+ter that time stam*
ere +etched. -his meant a orking D!D had been im*lemented.
-he date dimension trans+ormation di++ers +rom the others because it as not +etching data
+rom any database. Instead dates ere generated in P!I and inserted into the date dimension
table.
-he date trans+ormation is shon in Figure A-8. # descri*tion o+ the most im*ortant ste*s ill
be made to clari+y ho the trans+ormation is done.
Start date and range: -his ste* takes a limit *arameter and column data. -he limit *arameter
determines ho many ros that ill be created7 this ill be the total number o+ dates
generated. 6ach ro ill have one column named start7day hich is the +irst date in the
anted date range. Since a +ull range o+ dates +rom $%%=-%&-%& to $%&=-%&-%& M&% yearsN as
the goal +or the date dimension the start date as set to $%%=-%&-%&. -he limit as set to (@=$
hich is &% years Y (@= days X the to e3tra days in :ebruary +rom the lea* year o+ $%%' and
$%&$ hich is the lea* years in that time *eriod.
Da&s$since: -his is just a se1uence that starts at % and gets incremented by one each time.
-his means that it ill simulate the days that has *assed since the start date each time a ne
startTday ro is created and *assed along in the stream.
(alc date: -he most im*ortant ste* in the trans+ormation. -his ste* is a Ca%cu%ator ste*. -he
calculator ste* consists o+ +unctions7 *arameters and results. # +unction is chosen7 and there a
ide variety o+ numeric7 string and date +unctions available. Some +unctions re1uire
(%
Figure A-8) "he date generation transformation
*arameters and they can either be ritten directly in the ste* or taken +rom the steam. #ll o+
the +unctions returns a result7 hich is *assed along in the stream. -he ste* in this *articular
trans+ormation makes use o+ some date +unctions7 the most im*ortant one being Date & D +
days hich takes as *arameters a date # and an integer B. -he date # is the startTday column
+rom the steam and the integer B is the value +rom the Days7since ste*. -his results in every
date beteen $%%=-%&-%& to M$%%=-%&-%& X (@=$N days to be created. -his +unction also
handles lea* years.
-he rest o+ the ste*s ere there to determine hich 1uarter a date belongs to and hat month
name and eekday name a date had. -he last ste* ere a "ab%e output ste* that *ut the
generated dates into the date dimension table.
-he last trans+ormation in the dimensions job ere the insert dummy ro*s trans+ormation and
as really sim*le. It as just static hard coded in+ormation inserted into the date dimension
and the errors dimension.
So that as ho the trans+ormations to *o*ulate the dimension tables in the !/ as
designed. Ro a closer look ill be taken at ho the +act table as *o*ulated.
+oadin! the 'act table
-he job +or *o*ulating the +act table as divided into to trans+ormations. -he +irst one7
called popu%ate'roductionFacts7 inserted the data about every *roduced unit. -he second one7
called popu%ate5eturnFacts7 u*dated the columns im*lying the return o+ a unit +or units that
had been returned. popu%ate'roductionFacts is shon in Figure A-A.
(&
Figure A-A) "he units %oad transformation
-he trans+ormation started as all o+ the other trans+ormations ith the D!D table in*ut ste*7
+olloed by the o*erational database table in*ut. #+ter that there ere +our keys that needed
to be looked u*7 one +or each dimension table. 6ach *roduced unit ere going to be a ro in
the +act table and every unit ro contained keys to corres*onding ros in every dimension
table. -he +irst design o+ this trans+ormation had a table looku* +or the date dimension as ell.
But since the date dimension makes use o+ smart *rimary keys MyyyyAAddN the key can be
derived +rom the *roduction date and a table looku* is ski**ed. -his made a vast
im*rovement to the run time o+ the trans+ormation since the table dimension is the biggest o+
the +our. So the Fi9 date/ey and se%ect va%ues ste* +ilters out needed values in the stream and
converts the *roduction date +ormat MyyyyQAAQdd hh.mm.ssN to a date key +ormat
MyyyyAAddN.
-he &dd constants ste* added the key values in the columns returns7date /ey and error7/ey
+or the dummy ros in the date dimension and error dimension. I+ the unit had been returned
these values as going to be changed but that is hat the popu%ate5eturnFacts trans+ormation
as +or. -here as no need to set the returned column to a constant value since it as set to
de+ault to % in the database. -he ?et current datetime ste* e3ecuteed the R>/MN +unction in
P!I hich returns the systems current date and time. -his value as then used +or the column
production7%ast7updated.
-he last trans+ormation as popu%ate5eturnFacts and is shon in Figure A-E.
It started ith the D!D and a table in*ut +rom the returns table in the o*erational database.
-hen the date key +or the returns7date7/ey column as +ormatted correctly by altering the
returns7date +ield +rom the return table. #+ter that a table looku* as made to get the key +or
($
Figure A-E) "he returns %oad transformation
the error ty*e. -he ?et current date time ste* as the same as in popu%ate'roductionFacts7
but this time the date as intended +or the returns7%ast7updated column. -he &dd constants
ste* set returned to & and +inally an 6pdate ste* ere used to alter the ros ith units that
had been returned. -he u*date ste* as used instead o+ an insert Q u*date ste* since nothing
as being inserted7 ros ere just being u*dated.
-hat concludes the design o+ the 6-" in P!I. Ro code as needed to design the 6-" at all.
Some manual SE" ere ritten but that as only to do the D!D ste* and to generate the
dummy ros +or the error and date dimension tables.
,+A" &ith Schema (orbench and "entaho A!!re!ation $esi!ner
/ith the !/ and 6-" all done the ne3t ste* as to design a cube. -he cube as the most
im*ortant *art +or the end result since all analyses made as going to be 1ueries run against it.
# cube is described by an FA"-+ile hich describes the dimensions7 measures and
hierarchies in the cube. #s mentioned *reviously 3chema Wor/bench MS/N as used to
construct the cube. /hat S/ really does is that it gives the user a gra*hical inter+ace in hich
the cube can be designed. -he schema +ile is automatically generated according to the users
in*uts in the HUI. -his means that no knoledge about the rules and standards o+ schema
+iles are necessary. #ll the user needs to kno is that a cube consists o+ dimensions7 measures
and hierarchies and ho these building blocks are connected. /ith that knoledge it is really
easy to design a cube in the S/. Figure A-F shos ho S/ looks. In the +igure the +inal cube
design +or the *rototy*e cube is also loaded in S/ +or the *ur*ose o+ having some content. So
lets go through ho the *rototy*e cube as designed ith S/ and also look at hat design
decisions ere made during the *rocess.
((
Figure A-F) "he 3chema Wor/bench
-he S/ consists o+ a toolbar +or commonly used +unctions and to content *anes. -he le+t
*ane is the obBect e9p%orer hich shos the contents o+ a Aondrian schema. -he content is
the di++erent obBects that e3ists in a schema such as cubes and their corres*onding dimensions
and measures. -he right *ane is the options-pane and shos the *arameters available +or an
object hen it is select in the object e3*lorer. #ll terms used above should not be considered
o++icial ones instilled by Pentaho Dor*oration - they are de+ined by the author o+ the thesis.
-he +irst thing to do in S/ is to s*eci+y a database connection. S/ connects to the database
and then the database can be used to build a cube. -o build the cube there are a variety o+
+unctions available in the S/.
-he +unctions that ere used in S/ to create the cube as the +olloing.
&dd cube 8 -his adds an em*ty cube the the FA" schema. 2as *arameters +or meta data such
as name and descri*tion but none o+ the *arameters are re1uired to create a cube. But it needs
dimensions and measures added to it. :or measures to be added a +act table must +irst be
s*eci+ied. -his is done by selecting a table +rom the database connection. #+ter that measures
can be added ith the add measure +unction. I+ that doesnPt cover the intended measures a
user has in mind ca%cu%ated member can be used instead7 more on these later.
&dd dimension 8 #dds a dimension to a cube. -he dimension itsel+ has7 just as the cube7
*arameters +or meta data but instead o+ dimensions it re1uires at least one hierarchy to be
s*eci+ied and +or the hierarchy to be able to have levels a database table is also needed. #+ter a
table is s*eci+ied +rom the database connection level can be added. -he levels must *oint to
columns in the added dimension table.
&dd measure - #dds a measure to the cube. -he measure is de+ined by using *rede+ined
+unctions Mcount7 distinct count7 sum7 min7 ma3 and avgN on one column in the +act table.
&dd ca%cu%ated member 8 # measure that is calculated by a user de+ined +ormula. -he +ormula
is s*eci+ied in A!F and the data is gathered +rom the cube.
/ith the +unctions at hand the Aondrian schema +or the *rototy*e system could then be built.
It had been decided earlier that one cube as enough so a cube as added to the schema ith
the add cube +unction. #+ter that the +ive dimensions ere added 8 !ate7 Product7 6rror7
Dustomer and :actory. 6ach o+ the dimensions had one hierarchy added to them e3ce*t !ate7
hich got to. #gain the star schema design o+ the !/ shined through here as each
dimension table could be directly connected to the dimensions. #dding levels as also very
straight +orard since the !/ had been design ith great care and ith cubes in mind. -he
levels in the dimension hierarchies looked e3actly like the internal ordering in the dimension
tables. -he +olloing list shos ho the dimensions and hierarchies turned out.
(<
Date dimension
Hierarc!& name: Aonths
/evels: Gear7 Euarter7 Aonth7 !ay o+ month
Hierarc!&: /eeks
/evels: Gear7 Euarter7 /eek7 !ay o+ eek
)roduct dimension
Hierarc!& name: Product 2ierarchy
/evels: Product Hrou*7 Product Rame7 )ersion nr
Error dimension
Hierarc!& name: 6rror 2ierarchy
/evels: 6rror nr
(ustomer dimension
Hierarc!& name: Dustomer 2ierarchy
/evels: Dontinent7 Dountry7 Dity7 Dustomer Rame
+actor& dimension
Hierarc!& name: :actory 2ierarchy
/evels: Dontinent7 Dountry7 Dity7 :actory Rame
/ith the dimensions done it as time to add the measures. "ets look at ho they ere
im*lemented. -hree measures ere anted 8 the number o+ units7 the number o+ returned
units and the error *ercentage +or the units. It as decided to call the measures 6nits7 5eturns
and 'ercent. #s discussed in the data arehouse section o+ this cha*ter the arehouse as
later redesign to solve some issues ith the cube design. #t +irst the cube measures ere
calculated in the +olloing manner.
Units 8 ?eally very straight+orard7 the count +unction as used on the unitTid +ield in the
+act table. Since unitTid as a column dimension it ould alays be *resent in a result even i+
none o+ the other dimensions ere used in the 1uery. So counting the number o+ unitTid ros
in a result set gives the amount o+ units.
?eturns 8 -his as added as a ca%cu%ated member and the +ormula counted the number o+
ros in the result set here the errorTkey +ield as not e1ual to $==. I+ you recall the ro in
the dimTerror table as the dummy ro inserted that non returned units could re+erence to in
the errorTkey +ield in the +act table. So counting every ros that had a key not e1ual to $==
ould result in the amount o+ returns in the result set.
Percent 8 -his as also a calculated member and the +ormula as sim*ly.
-(easures.;-!rror. C -(easures.;-6nits.
1
&=IAeasuresJ.IAeasure nameJ is ho you select a measure +rom a cube in A!F.
(=
#+ter some testing and consideration this +elt like the rong a**roach. !oing the calculation
o+ the unit amounts and returned amounts each time a 1uestion as asked to the cube as
both *er+ormance and time consuming. -he +inal a**roach as to add the to additional
+ields to the +act table that as in the +inal !/ design - 1uantity and returned. -his gave the
o**ortunity to calculate the measures like this.
Units 8 Still a normal measure but instead o+ counting the unitTid +ield the sum +unction is
used on the 1uantity +ield.
?eturns 8 Same solution as +or the units but the sum +unction is used on the returned +ield.
-he calculated member to give the Percent measure as ke*t since itPs just a sim*le division7
but the members o+ the division became a lot +aster to calculate.
-his concludes this section and also the cha*ter. -he hole system had no been designed
and im*lemented and as ready to be tested. -he big 1uestions no as i+ it as going to
ork7 ho good it ould ork and hat the *er+ormance ould be like.
(@
-esults
-unnin! the s%stem
-he system as started and the 6-" ran very smoothly ithout any errors ith the AySE"
and PostgresSE" !BASs. Both "ucid!B and Aonet!B encountered *roblems in the 6-".
-he *roblems as encountered in the bu%/ %oading ste*s available in P!I +or "ucid!B and
Aonet!B. # standard tab%e output ste* as tried instead but that did not ork either. -he
bu%/ %oading ste*s as recently moved by Pentaho Dor*oration +rom the e3*erimental status
to stable status and little documentation as available. #+ter several di++erent a**roaches in
solving this a solution as not +ound and both !BASs ere discarded due to time constraints.
#+ter the !/ had been *o*ulated ith the data +rom the source system a cube as created
and the 1ueries made against it got correct result ithout any incidents.
-he BI-server as also tested. -he de+ault 2SE"!B storage +or the BI-servers data as
changed to the AySE" server running the *rototy*e databases. :inding ho to do this and
ere to make changes took some time and the +olloing +orum guide +ound in the o++icial
Pentaho +orums ere used.
htt*.QQ+orums.*entaho.comQshothread.*h*0@@C%&-Pentaho-BI-Server-(.=-AySE"-
PostgreSE"-and->racle-+or-/indos-am*-"inu3--utorial
#+ter +olloing the guide the BI-servers databases ere success+ully run +rom the AySE"
server. #ccessing the solution re*ository +rom both P!I and S/ orked as intended and
trans+ormations7 jobs and schemas ere success+ully u*loaded and donloaded directly in the
tools. -he *ivot tables +or >"#P cubes ere tested and orked as intended. ?e*orts ere
created and submitted to the solutions re*ository ithout any di++iculties. In other ord the
BI-server orked as described in the in+ormation +rom Pentaho Dor*oration.
"er'ormance and #er'ormance tunin!
The ET+
#t +irst the 6-" as just run ith the de+ault *ro*erties on every ste* in the trans+ormations.
#s mentioned in B.& this orked +ine and no *roblems ere detected so the 1uestion no as
i+ the time run times could be loered. -he +ocus ill be on im*roving the +act table loading7
hich includes the popu%ate'roductionFacts and %oad popu%ate5eturnFacts trans+ormations.
-he loading o+ the dimension tables as so +ast that it really didnPt need any teaking Mlongest
run time as belo &% secondsN and the teaking done ith the +act table loadings can be
a**lied to those too i+ one chooses to.
# number o+ tuning o*tions ere identi+ied in P!I. :irst o+ all the commit si,e in the tab%e
output ste* could be altered. -he de+ault value here as &%%. -his means that every time the
stream has delivered &%% ros to the out*ut ste* a those ros are inserted into the database.
-he +act table trans+ormation +or units ere run ith di++erent commit si,es to see hat
commit si,e yielded the best result. -he amount o+ data as +irst set to =%%.%%% and then
=.%%%.%%% ros. -he results are shon in the "ab%e E-1 belo.
(B
(ommit si#e 317 )ostgreS8/
95% ro:s 3s7
,&S8/
95% ro:s 3s7
)ostgreS8/
9, ro:s 3s7
,&S8/
9, ro:s 3s7
4 - 45 C$& @B$ C<'@ @'CB
455 <'< (@= =%&( (='C
4555 <'B (C% =&&B (@&(
45555 =%% <%$ =$&' (B<@
455555 =%% <%< =('' ('$(
"ab%e E-1) Commit siGe run times
#s the results sho the best value +or the commit si,e as the de+ault value o+ &%%.
-he ne3t *er+ormance adjustment to try as *reloading o+ the tables used +or key looku*s in
the trans+ormation. -he looku* ste*s had a checkbo3 that i+ marked ould *reload the hole
table used +or the looku* hen the trans+ormation started. -his *roved to be an im*rovement
to the trans+ormation run time and it as decided to use it in the +inal im*lementation. It is
im*ortant to note though that i+ *reloading a a table is used there has to bee enough system
memory available to load the table into. So very large dimension tables ill use a lot o+
memory hen using this a**roach and it needs to be taken into consideration hen designing
trans+ormations. -he dimension tables used +or the *rototy*e system as +airly small and they
ere all able to be *reloaded ithout taking u* too much memory. -he e3ce*tion though as
that the dimTdate dimension table as not *reloaded since it as too big to *reload. -his
leads to the third and last *er+ormance adjustment +or the +act table loading. Since the date
dimension used a natura% /ey MyyyyAAddN as *rimary key the key is alays knon. -his
means that the key can be generated directly in the 6-" trans+ormation. #s e3*lained in
3ection A;3;0. the date keys ere generated +rom the *roduction date and the return date +ields
gathered +rom the o*erational database. Henerating date keys instead o+ doing looku*s in the
date dimension *roved to be the best im*rovement to the +act table loading. Belo is a short
list o+ test times ith and ithout the adjustments hen loading the +act table.
Run times for populateProductionFacts :it! 9 million units
Preloading and generation o+ date keys. =C minutes
Ro *reloading7 generation o+ date keys. &%C minutes
Ro *reloading7 no generation o+ date keys. $%& minutes
Run times for populateReturnFacts :it! ; million returned units
Preloading and generation o+ date keys. $= minutes
Ro *reloading7 generation o+ date keys. (% minutes
Ro *reloading7 no generation o+ date keys. =C minutes
('
It can be derived +rom the results that the date key generation as a signi+icant im*rovement
to the run time. -he *reloading o+ looku* tables as also an im*rovement but didnPt have the
same im*act on *er+ormance as the key generation had.
/ith the adjustments made to the +act table trans+ormations they ere ready to be tested. -he
trans+ormations ere run on all o+ the source databases. -his meant load si,es o+ =7 &%7 $%7
(% 7<% and =% million ros +or the units trans+ormation and %.=7 &7 $7 (7 < and = million ros
+or the returns trans+ormation. Both !BASs ere also tested. -he results are shon belo
both as a table in "ab%e E-0 and as a gra*h in Figure E-1;
/oad units to fact
table 3Seconds7
)ostgreS8/
/oad units to fact
table 3Seconds7
,&S8/
/oad returns to fact
table 3Seconds7
)ostgreS8/
/oad returns to fact
table 3Seconds7
,&S8/
=%&( (='C =B' <($
&%<&' B$C@ &$%= &%$'
$&@C( &<'<B $=B' $&(%
(&(B< $($C< (B%' ($B'
<%'&& (&@<& <C'B <'<'
==<<B <%='( @<(& @$%C
"ab%e E-0) 5un times for %oading the fact tab%e *ith the !"# too%
(C
#s as e3*ected the longest run times occurred hen the highest 1uantity o+ ros7 =% million7
ere loaded. "oading o+ =% million units +rom the o*erational database to the data arehouse
using the AySE" databases took && hours and &@ minutes. /ith PostgreSE" it took &= hours
and $< minutes. "oading = million returned units took & hour and <( minutes ith AySE"
and & hours and <B minutes ith PostgreSE".
<%
Figure E-1
0 5 10 15 20 25 30 35 40 45 50 55
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
55000
60000
432
1028
2130
3278
4848
6209
3589
7296
14847
23294
31641
40583
578
1205
2578
3708
4987
6431
5013
10418
21693
31374
40811
55447
ETL run times
Load Units (PostgreSQL
Load !eturns (PostgreSQL
Load Units ("#SQL
Load !eturns ("#SQL
Units (mi$$ion
T
i
m
e

(
s

/ondrian
-o test Aondrian *ro*erly a sim*le java *rogram as created to start the server an run a client
toards it. # cube as created +rom the schema designed in S/ and then the telve test
1ueries +ound in #**endi3 # as run against it. -his as done +or all o+ the !/ si,es and +or
both o+ the !BASs used. 6very test as timed and the results can be seen in #**endi3 B.
>ne ay to im*rove 1uery results is to use aggregated tables. #n aggregate table holds
measures that are calculated be+ore hand. /hat the values +or the measures are is decided by
the aggregation levels. -he aggregation levels are levels +rom the dimension hierarchies and
an aggregation table can have one level +rom each e3isting dimension. :or e3am*le i+ a
common 1uestion +or the *rototy*e system is K2o many returns is there in a s*eci+ic
month0L. #n aggregate table +or this ty*e o+ 1uestion ould need aggregation on the level
month in the date dimension. -o accom*lish this a table ould need to be created in the !/
to hold the data and then aggregation must be de+ined in the schema +ile +or the cube in order
to let Aondrian kno it e3ists. -he table created in the !/ ould look like "ab%e E-1 belo.
dim$date$&ea
r<
dim$date$=uart
er$number
dim$date$mont!
$number
fact$=ualit&$uni
ts
fact$=ualit&$ret
urns
fact$=ualit&$fac
t$count
$%%= & & (<=@ $B< (<=@
$%%= & $ $'B@ &<= $'B@
... ... ... ... ... ...
$%%' ( B ($C' $=< ($C'
... ... ... ... ... ...
"ab%e E-1) &n aggregation tab%e e9amp%e
-he +irst three columns are the year47 @uarter7number7 month7number columns +rom the date
dimension and is the aggregation level e anted. -he ne3t to columns are calculated
values +or the measures units and returns. Since the measure percent is a calculated member it
is not *resent in the aggregation table. -he fact7@ua%ity7fact7count column is so the cube
knos ho many ros has been aggregated. In our case this value ill be the same as the unit
measure since that measure is just a sum o+ the 1uantity column +rom the +act table.
So lets say this aggregation had been im*lemented and someone anted an anser to the
1uestion K2o many units ere returned in July $%%'0L. Aondrian ould then see that an
aggregation table e3isted ith the aggregation on the level month +or the date dimension and
get the value +rom the aggregate table. -his means that the result takes as long as it takes the
database to get the correct ro MyearZ$%%'7 monthZBN +rom the aggregated table.
<&
#ggregate tables that as s*eci+ically matched to the telve test 1ueries ere made ith
'entaho &ggregation Designer. -he aggregations made are shon in a table in #**endi3 D.
-he run times o+ the test 1ueries ith the cube no using the ne aggregate tables is shon
in #**endi3 B. #lmost all o+ the 1ueries got instantaneous results. -his is really hat to be
e3*ected since the ork load has shi+ted +rom calculating measures by joining dimensions
tables ith each other to just +etching the right ro in an aggregation table. Still it *roves the
*oer o+ using cubes ith aggregation tables and i+ you kno hat the most common 1ueries
are aggregation tables can be used to im*rove the time it takes to run analyses.
<$
$iscussion and conclusions
-his cha*ter ill address the issues raised in the beginning o+ the thesis and at the same time
discuss hat the investigation resulted in and hat conclusion can be dran +rom the ork.
>as it possible at all to create a ?'-s&stem :it!out :riting an& code@
-he sim*le anser to this 1uestion is yes7 it is *ossible to set u* a BI system ithout even
having to bother about ho things ork in the background. #s this thesis has *roven ith the
use o+ the Pentaho BI suite a hole BI system as set u* and used ithout any major
*roblems. -ools ere available +or every *art in the system. -he tools did a great job o+ giving
the user an intuitively gra*hical inter+ace to ork in. #nd orking in the tools led to ell
+unctioning *arts hich as a hole made u* the BI system. Rone o+ the tools needed any
knoledge about *rogramming or *rogram code to be used *ro*erly.
>!at kno:ledge and skills :ere re=uired to set up t!e s&stem@
-his is ere it gets more interesting. :irst o+ all setting u* a BI system re1uires knoledge
about hat BI is and ho it orks. Aany major design choices needs to be made and the
better the understanding o+ BI is the better the choices ill get and the system ill be more
tailored to a s*eci+ic organi,ations needs. Knoledge and insight in BI is the +irst ste* to
create a ell orking BI system.
-hat knoledge involves being able to identi+y hat the organi,ation ants to do ith a BI
system. /hich re*orts are the organi,ation interested in0 /hat are the analyses re1uired to
create those re*orts and is there data available in the organi,ation to su**ort those analyses0
Identi+ying hat data the organi,ation has and hat it can be used to anser is crucial hen
creating a BI system. Aaking analyses on insu++icient data or even the rong data is a major
error. #nd i+ decision makers take the results as true and decisions are made based on them
that could lead to a disaster. -his means that creating a BI system also re1uires insight in hat
data is available to be used in the system and hat ty*e o+ analyses can be done ith the data.
-he !/ is a database and has to be created and maintained in a !BAS. -his re1uires
knoledge about databases7 SE" and at least one reliable !ABS. -he 6-" also has some
section ere knoledge o+ SE" is vital. #*art +rom knoing the ins and outs o+ designing a
BI system as a hole designing the !/ is the second most im*ortant *art7 hy ill be
discussed in the *er+ormance 1uestion.
/hen it comes to >"#P cubes S/ and P#! does most o+ the ork +or the user but
knoledge about ho cubes are modeled is re1uired to be able to create them in S/.
2ere is a list o+ hat is re1uired to setu* and manage the di++erent *arts in the system.
<(
S&stem as a :!ole:
• Knoledge and insight in BI theory. -he better knoledge the better the design o+ the
system ill be.
• !atabase knoledge
Data :are!ouse:
• !atabase knoledge
• Being able to handle at least one !BAS to create the !/ in.
E/:
• Small SE" knoledge
Sc!ema >orkbenc! and )enta!o Aggregation Designer:
• Knoledge about cube structures and aggregation tables
Reporting tools:
• -o be able to de+ine data sets to be used in re*orts SE" is re1uired i+ data is gathered
directly +rom the !/.
• I+ data is gathered +rom an >"#P cube A!F knoledge is re1uired.
>!at kno:ledge and skills :ere re=uired for an end user@
-raining in ho to use the re*orting tools and the ability to *rocess manuals. -o make it even
easier meta data can be s*eci+ied to data sets and the end user can be *resented ith the meta
data hen selecting data to re*orts. :or e3am*le a manager o+ the system can s*eci+y a data
set called KShi**ed units *er yearL hich collects number o+ shi**ed units *er year +rom a
cube ith A!F. -he end user then only has to select KShi**ed units *er yearL an add it to his
re*ort to get the numbers anted.
>!at :ere t!e maAor c!allenges in designing and implementing t!e s&stem@
#lmost all o+ the major challenges lies be+ore the system is even im*lemented. # BI-system is
not a magic bo3 solving any unansered business 1uestion an organi,ation has. -here must
+irst be a clear and sound de+inition o+ hat the *ur*ose o+ the system is. -he out*ut o+ a BI-
system is hat comes out at the end o+ the *rocess. business analyses and business re*orts.
/hat do the organi,ation ant to analy,e and hat re*orts are they interested in having0 # lot
o+ thought should be *ut into this. Second the data re1uirements +or making the analyses and
re*orts must be identi+ied.
#nd hen the re1uired data is knon there must be a *ro*er analysis o+ data available. -he
organi,ation must be sure that the data ill be available to the system. -his includes
<<
su**lying the correct data as ell as the re1uired 1uantity o+ data. #ddressing these issues is
the +irst major challenge.
-he ne3t challenge lies in designing the system. # BI-system consists o+ several *arts and it is
im*ortant that each *art is designed *ro*erly. Starting ith the 6-" there are three to*ics to
cover. !ata cleansing7 D!D and staging areas.
"ets begin ith data cleansing7 hich is *robably one o+ the *arts in a BI system that
organi,ations *uts the most ork hours intoI&<J. !ata cleansing is e3tremely hard to do and
the ideal scenario ould be that the source systems contained correct and *ro*erly +ormatted
data to begin ith. -his is as discussed in (.$ usually not the case. !ata cleansing can be
made in P!I ith various available tools. -he big 1uestion here is hether to *ut a big e++ort
in getting data as clean as *ossible or be satis+ied ith more or less incorrect data sli**ing into
the !/. 2aving e3tensive data cleansing in the 6-" ill o+ course slo it don. #
recommendation here is to run the system ithout acce*ting any incorrect data at all. Instead
in+ormation - meta-data - about the erroneous data is logged and stored. -he logged meta-
data can then be analy,ed to give a good vie o+ hich source systems contains incorrect data
and here it is located. -hen the organi,ation have the choice to address the issue directly in
the source systems. >+ course sim*ler data cleansing mechanics should be *resent to correct
minor *roblems such as sim*ler +ormatting errors or miss*ellings. In the end it is all a
1uestion o+ ho much time - and in the end money7 it is orth to s*end on having correct
con+ormed data in the system.
2aving a staging area or not is really a 1uestion o+ ho much *ressure the organi,ation ants
to *ut on the source systems hen e3tracting data. -he more com*le3 the 6-" *rocess is the
heavier the burden ill be on the source systems. -he amount o+ data to be e3tracted ill
determine the length o+ the e3traction7 thus determine the time +rame in hich the 6-" ill
a++ect *er+ormance. -he number o+ available D!D solutions is closely connected to the
staging area. Since a D!D solution can be *ut in the source systems7 in the staging area7 in the
!/ or s*lit amongst any combination o+ the three not having a staging area ill limit the
o*tions. Using time stam*s7se1uences or triggers to determine hen data as last changed are
the best choices7 but here to *ut the in+ormation is a bigger challenge. 2aving the
in+ormation in the source system is a good choice since the time stam* or se1uences used can
just be com*ared to the ones in the staging area or !/. But *utting this kind o+ in+ormation in
the source systems is not alays *ossible7 or orth the ork. Aany transactional systems
have auto generated-data hich means that the system or so+tare that generates the data
must be changed in order +or it to account +or chronology. In some cases this might *rove to
be very sim*le but in other cases it might mean a lot o+ ork and in large organi,ations ith
many di++erent source systems7 all ith their on auto-generating so+tare7 the amount o+
ork increases greatly. -he alternative is to use a log-based solution7 to let the !BASPs in the
source systems log every data change and then the logs are used to look u* hat data has
changed. Dhoosing D!D-solution re1uires some serious thought and can even be
im*lemented di++erently +or each source system i+ it is necessary. # staging area can be o+
great hel* but is not necessary to solve the issue o+ D!D.
<=
-he biggest advantage is the di++erence in orkload the 6-" can *ut on the source system and
the staging area. Since all o+ the data is e3tracted to the staging area the issue o+ *utting e3tra
*ressure on the source systems ill only have to do ith the e3traction *rocess. Aore
*er+ormance-demanding o*erations7 such as analy,ing the data or trans+orming it7 can be done
hen it has been trans+erred to the staging area. -he staging area ill not have the same
restrictions on the orkload since the 6-" is the only *rocess that ill access it.
Ho: :as t!e performance of t!e s&stem and !o: could it be improved@
-he *er+ormance o+ the system can be vieed in Dhapter E. Donsidering the machine the
system as created and run on7 these values could be vastly im*roved on a machine dedicated
to running a BI-system Me.g. a dedicated server machine ith >S and !BAS *ro*erly
con+igured to boost the *er+ormance o+ that *articular systemN. -he longest run time +or the
6-" as to *o*ulate the +act table and i+ e add the to run times +or the unit
trans+ormations and return trans+ormation e land at a total o+ &$ h =C min <C s +or =%
million ros. -ake into consideration that =% million ros is a very big load and this as to
initiali,e the system. :uture 6-" loadings ould only need to trans+er data about nely added
ros or ro u*dates. /ith a dedicated server machine ith tailored con+igurations the 6-"
jobs could easily be run over night hen most o*erational systems has their loest ork load.
It is very im*ortant to discuss ho the design o+ the !/ and the con+iguration o+ the !BAS
im*acts system *er+ormance.
-he *er+ormance o+ the system relies heavily on the !/ *er+ormance. -he 6-" loads data
into the !/7 cubes are created +rom the !/ and re*orts uses data either directly +rom the
!/ o+ +rom the cubes. 2o the !/ is modeled ill mostly a++ect the >"#P cubes. I+ the
!/ is designed ith a cube in mind as as did in this thesis it ill be very easy to design
cubes since the structure is already laid out. #lthough the !/ model needs much
consideration and needs to be tailored according to the goals +or the *articular BI system this
is not all. -he !BAS itsel+ must be tailored to su**ort the system as best as *ossible i+ good
*er+ormance is a goal. Ro in-de*th e3*lanation o+ ho to set u* a !BAS +or a BI system ill
be given but some major things ill be *ointed out.
• Since most analyses uses joins on large database tables a large tem*orary table si,e is
a huge im*rovement to cube *er+ormances.
• -he !/ ill have many inserts each time the 6-" is run. 2aving good inde3es ill
hel* in cube 1ueries but slo don inserts. Priority considerations must be taken here.
-he cube *er+ormance ithout the use o+ aggregate tables as on the other hand a slight
disa**ointment. >ne can argue that the test 1ueries run against the cube maybe isnPt the most
common 1ueries that ill be used in re*orts but that is no e3cuse +or the most com*le3 1uery
ith =% million ros in the !/ taking & h $= min +or the AySE" !/. PostgreSE" +ared
much better ith the same 1uery taking roughly &C min. Some e++ort ere *ut into trying to
con+igure both !BASs to su**ort the cube analyses better and as seen in the results this as
most success+ul ith PostgreSE". I+ PostgresSE" is better suited +or !/s and >"#P cubes
is a matter +or a more thorough analysis. -oo little data is available to dra any major
<@
conclusions about this in the thesis. It as very un+ortunate that the !/ and cube Mor the 6-"
+or that matterN could not be tried ith "ucid!B and Aonet!B. Since the goal both these
!BASs is to be used +or data arehousing and business intelligence the *er+ormance ould
have *robably been much better ith them than ith AySE" and PostgreSE". #lthough the
analysis run times ere disa**ointing this as as said ithout the aggregated tables. /ith the
aggregated tables the run times ere belo one second on almost all o+ the test 1ueries. -he
most com*le3 one took &-$ minutes de*ending on !BAS. So7 ith a system ith *ro*er
aggregated tables the analysis s*eed is astonishingly +ast. #gain it is im*ortant to note that the
con+iguration and o*timi,ation o+ the !BAS has a huge im*act on the *er+ormance o+ t he
system.
1inal thou!hts
-he Pentaho BI suite is a very strong alternative to creating the system ith your on code.
/ith a dedicated server machine to run the system on and a ell o*timi,ed7 *ro*erly
con+igured !BAS7 the result is a very *oer+ul system that can be a great assistance to
decision makers in an organi,ation.
<B
-e'erences
I&J Borking7 K.7 !anielsson7 A.7 6kenberg7 ".7 "arsson7 #.7 Ide+eldt7 J. +ortom +usiness
,nte%%igence. Second edition. 6landers Sverige #B7 Seden7 $%&%.
I$J Poer7 !.J. & +rief <istory of Decision 3upport 3ystems. !SS?esources.D>A7 /orld
/ide /eb7 htt*.QQ!SS?esources.D>AQhistoryQdsshistory.html7 version $.'7 Aay (&7 $%%(.
I(J ?ivest S.7 et al. 3$#&' techno%ogy) (erging business inte%%igence *ith geospatia%
techno%ogy for interactive spatio-tempora% e9p%oration and ana%ysis of data. ISP?S Journal o+
Photogrammetry [ ?emote Sensing7 issue @%7 *. &B8((7 $%%=.
I<J Bouman7 ?.7 van !ongen7 J. 'entaho 3o%utions - +usiness ,nte%%igence and Data
Warehousing *ith 'entaho and (y3=#. /iley Publishing7 Inc.7 Indiana*olis7 US#7 $%%C.
I=J #,evedo7 #.7 Santos7 A.:.7 +usiness inte%%igence) 3tate of the art, trends, and open issues;
&st International Don+erence on Knoledge Aanagement and In+ormation Sharing7 *. $C@-
(%%7 :uncahl7 Aadeira7 $%%C.
I@J )elicanu7 A.7 Aatei7 H. & Fe* ,mp%ementation 3o%utions for +usiness ,nte%%igence.
In+ormatica 6conomic\7 issue (7 *. &('-&<@7 $%%'.
IBJ !ayal7 U.7 Dastellanos7 A.7 Simitsis7 #.7 [ /ilkinson7 K. Data integration f%o*s for
+usiness ,nte%%igence. &$th International Don+erence on 63tending !atabase -echnology.
#dvances in !atabase -echnology M*. &-&&N. St. Petersburg7 ?ussia7 $%%C.
I'J Inmon7 /. 2.7 +ui%ding the Data Warehouse7 :irst 6dition. E6!7 /ellesley7 A#7 &CC%.
ICJ Kimball7 ?.7 et al. "he Data Warehouse #ifecyc%e "oo%/it; /iley Publishing7 Inc7 Re
Gork7 US#7 &CC'.
I&%J ?ivest7 S.7 et al. 3$#&' techno%ogy) (erging business inte%%igence *ith geospatia%
techno%ogy for interactive spatio-tempora% e9p%oration and ana%ysis of data; ISP?S Journal o+
Photogrammetry [ ?emote Sensing7 issue @%7 *. &B-((7 $%%=.
I&&J "evene7 A.7 "oi,ou7 H. Why is the 3no*Ha/e 3chema a ?ood Data Warehouse DesignI;
In+ormation Systems7 )olume $'7 Issue (7 $%%(.
I&$J Utley7 D. Designing the 3tar 3chema Database. DI>Brie+ings.D>A7 /orld /ide /eb7
htt*.QQ.ciobrie+ings.comQPublicationsQ/hitePa*ersQ!esigningtheStarSchema!atabaseQtab
idQ&%&Q!e+ault.as*3 7 version &.&7 July &B7 $%%'.
I&(J IBA Dor*oration7 ,+( ,nformi9 Database Design and ,mp%ementation ?uide7
IBA.D>A7 /orld /ide /eb7
htt*.QQ*ublib.boulder.ibm.comQin+ocenterQidshel*Qv&%Qinde3.js*0
to*icZQcom.ibm.ddi.docQddi$$'.htm7 Rovember $7 $%%=.
<'
I&<J Dasters7 A.7 Bouman7 ?.7 van !ongen7 J. 'entaho 2ett%e 3o%utions J +ui%ding $pen
3ource !"# 3o%utions *ith 'entaho Data ,ntegration; /iley Publishing7 Inc.7 Indiana*olis7
US#7 $%&%.
I&=J Dodd7 6.:.7 Dodd7 S.B.7 Salley7 D.-. 'roviding $#&' to 6ser-&na%ysts J &n ," (andate.
6. :. Dodd [ #ssociates7 &CC(.
I&@J Aicroso+t Dor*oration7 3=# 3erver &na%ysis 3ervices - (u%tidimensiona% Data7
AID?>S>:-.D>A7 /orld /ide /eb7 htt*.QQmsdn.microso+t.comQen-
usQlibraryQms&B<C&=.as*37 January &7 $%&&
I&BJ -hornthaite7 /. 2imba%% Design "ip K43) Dea%ing With Lu%%s ,n "he Dimensiona%
(ode%7 Kimball Hrou*7 /orld /ide /eb7
htt*.QQ.kimballgrou*.comQhtmlQdesignti*sP!:Q!esign-i*s$%%(QKimball!-<(!ealing/it
h.*d+0-rkI!Z!-&$'T!-<(7 :ebruary @7 $%%(.
+igure references:
IP&J !ayal7 U.7 Dastellanos7 A.7 Simitsis7 #.7 [ /ilkinson7 K. Data integration f%o*s for
+usiness ,nte%%igence. &$th International Don+erence on 63tending !atabase -echnology.
#dvances in !atabase -echnology M*. &-&&N. St. Petersburg7 ?ussia7 $%%C.
<C
A##endix A
-his a**endi3 lists the telve test 1ueries that ere run to test the *er+ormance o+ the created
>"#P cube.
est case 4:
Euestion. Euality measure *er *roduct grou* *er +actoryQtotal and all years *er 1uarter
:ilters. Gear7 *roduct grou*7 +actory name
Aeasure. Percent
#ggregations. Production !ate.Gear.Euarter7 Product.Hrou*7 :actory.Rame
est case ;:
Euestion. Euality measure *er *roduct grou* *er +actoryQs*eci+ic time +rame
:ilters. S*eci+ic year7 month7 *roduct grou*7 +actory name
Aeasure. Percent
#ggregations. Production !ate.Gear.Aonth7 Product.Hrou*7 :actory.Rame
est case B:
Euestion. Shi**ed unitsQcountry7 total and all years *er 1uarter
:ilters. Gear and 1uarter7 customer country
Aeasure. Units
#ggregations. Dustomer.Dountry7 Production !ate.Gear.Euarter
est case <:
Euestion. Shi**ed unitsQcity7 total and all years *er 1uarter
:ilters. Gear and 1uarter7 customer city
Aeasure. Units
#ggregations. Dustomer.Dity7 Production !ate.Gear.Euarter
est case 9:
Euestion. 6rror *ercent *er *roduct name and itPs versionsQtotal and years and months
:ilters. Gear and month7 *roduct name and version
Aeasure. Percent
#ggregations. Production !ate.Gear.Aonth7 Product.?-state
est case C:
Euestion. 6rror *ercent *er *roduct name and itPs versionsQs*eci+ic time +rame
:ilters. S*eci+ic year7 1uarter and month7 *roduct name and version
=%
Aeasure. Percent
#ggregations. Production !ate.Gear.Euarter.Aonth7 Product.?-state -- S#A6 #S =.
est case D:
Euestion. 6rror ty*e *ercentages on *roduct grou*Qtotal7 all years *er 1uarter
:ilters. Gear and 1uarter7 error7 s*eci+ic *roduct grou*
Aeasure. Percent
#ggregation. 6rror.6rror-y*e7 Production !ate.Gear.Euarter7 Product.Hrou*
est case E:
Euestion. 6rror ty*e *ercentages on *roduct grou*Qs*eci+ic time +rame
:ilters. S*eci+ic year7 1uarter and month7 error7 s*eci+ic *roduct grou*
Aeasure. Percent
#ggregation. 6rror.6rror-y*e7 Production !ate.Gear.Euarter.Aonth7 Product.Hrou* MSame as
-est case BN
est case F:
Euestion. 6rror ty*e *ercentages +or s*eci+ic *roduct nameQtotal and year and 1uarters
:ilters. Gear and 1uarter7 error7 s*eci+ic *roduct name
Aeasure. Percent
#ggregation. 6rror.6rror-y*e7 Production !ate.Gear.Euarter7 Product.Rame
est case 45:
Euestion. 6rror ty*e *ercentages +or all *roduct names and r-states *er +actoryQtotal and years
*er 1uarter
:ilters. Gear and 1uarter7 error7 *roduct name and version7 +actory name
Aeasure. Percent
#ggregation. 6rror.6rror-y*e7 Production !ate.Gear.Euarter7 Product.?-state7 :actory.Rame
est case 44:
Euestion. 6rror *ercent +or all *roduct names *er customerQGears and 1uarters and total
:ilters. Gear and 1uarter7 *roduct name7 customer name
Aeasure. Percent
#ggregation. Production !ate.Gear.Euarter7 Product.Rame7 Dustomer.Rame
est case 4;:
Euestion. 6rror *ercent +or all customersQGears and 1uarters and total
:ilters. Gear and 1uarter7 customer name
Aeasure. Percent
=&
#grregation. Production !ate.Gear.Euarter7 Dustomer.Rame
=$
A##endix B
-his a**endi3 shos the run times +or the test cases in #**endi3 # run against the >"#P
cube. All cell values are in seconds.
,&S8/ G No aggregation tables used:
Number of units 3million7
9 45 ;5 B5 <5 95
Run time
for 4
&C< <@= C'= &=B$ $%C( ($@C
Run time
for ;
&&( $'B @<B &%%C &<@$ &B&B
Run time
for B
=@ &%= $&@ ($$ <=$ &<=<
Run time
for <
=B &&% $$C (<& <B= &<CC
Run time
for 9
@= &&C $=$ (=B <C< &@$$
Run time
for C
(% =B &&C &B$ $(B (&$
Run time
for D
&$$ $(@ =B$ B'' &&=C '&&
Run time
for E
BB &<= ('' <($ @%% =$=
Run time
for F
CB &C< <'C ==@ '%C @B=
Run time
for 45
<'@ C=% &'(& $B&C <<$% =%C@
Run time
for 44
$$$ <$( <$$ @$$ BB( $=@&
Run time
for 4;
=( &%= $%< (($ <(= &%<(
=(
)ostgreS8/ G No aggregation tables used:
Number of units 3million7
9 45 ;5 B5 <5 95
Run time
for 4
(@ &%@ $%B B== &%&' &<==
Run time
for ;
&& <( =B (&B <(( <C'
Run time
for B
&( (= =< &<= $%@ $(B
Run time
for <
&< (' =C &=& $&= $<@
Run time
for 9
$& <= B< &B& $($ $@'
Run time
for C
$ < ' $& (= <&
Run time
for D
@ &' (% &=( (%< $($
Run time
for E
% % % & & $
Run time
for F
% % % % % %
Run time
for 45
&=( $C$ =<C B=$ CB& &&$C
Run time
for 44
&=& &CB $'' (<B <%' <B$
Run time
for 4;
&% $$ <B B@ CB &&=
=<
,&S8/ G Aggregation tables used:
Number of units 3million7
9 45 ;5 B5 <5 95
Run time
for 4
% % % % % %
Run time
for ;
% % % % % %
Run time
for B
% % % % % %
Run time
for <
% % % % % %
Run time
for 9
= B B ' C C
Run time
for C
% % % % % %
Run time
for D
% % % % % %
Run time
for E
% % % % % %
Run time
for F
% % % % % %
Run time
for 45
B( &%& &$% &$' &$' &$'
Run time
for 44
&$$ &$% &&C &$& &$% &$&
Run time
for 4;
% % % % % %
==
)ostgreS8/ G Aggregation tables used:
Number of units 3million7
9 45 ;5 B5 <5 95
Run time
for 4
% % % % % %
Run time
for ;
% % % % % %
Run time
for B
% % % % % %
Run time
for <
% % % % % %
Run time
for 9
& $ $ < = =
Run time
for C
% % % % % %
Run time
for D
% % % % % %
Run time
for E
% % % % % %
Run time
for F
% % % % % %
Run time
for 45
$( =% == == =@ =@
Run time
for 44
=$ =$ =( =$ =& =$
Run time
for 4;
% % % % % %
=@
A##endix C
-his table shos the aggregation tables made ith #ggregation !esginer. 6very ros is on
aggregation table and the columns indicates at hat level the dimensions ere aggregated in
that table. -he synta3 is 2ierarchyM"evelN. I+ there is only one hierarchy *resent in the
dimension - hich is the case o+ all but the date dimension - the hierarchy name ill be
omitted. -he M#llN level means that no aggregation +or that dimension has been done.
)roduction date )roduct (ustomer Error +actor&
4 AonthMEuarterN MHrou*N MHrou*N M#llN M#llN
; AonthMAonthN MHrou*N MHrou*N M#llN M#llN
B AonthMEuarterN M#llN M#llN MDountryN M#llN
< AonthMEuarterN M#llN M#llN MDityN M#llN
9 AonthMAonthN M)ersionN M)ersionN M#llN M#llN
C AonthMEuarterN MHrou*N MHrou*N M#llN M-y*eN
D AonthMAonthN MHrou*N M#llN M-y*eN M#llN
E AonthMEuarterN MRameN M#llN M-y*eN M#llN
F AonthMEuarterN M)ersionN M#llN M#llN MRameN
45 AonthMAonthN M#llN MRameN M#llN M#llN
44 AonthMEuarterN M#llN MRameN M#llN M#llN
=B
TRITA-CSC-E 2011:106
ISRN-KTH/CSC/E--11/106-SE
ISSN-1653-5715






































www.kth.se

doc_704236494.pdf
 

Attachments

Back
Top