A Data Analysis Model For Business Intelligence

Int. J. Internet and Enterprise Management, Vol. 1, No. 1, 2003 7
Copyright © 2003 Inderscience Enterprises Ltd.

A data analysis model for business intelligence
Arie Segev*
Haas School of Business, University of California, Berkeley,
California, USA
E-mail: [email protected]
*Corresponding author
Shin-Chung Shao
Jason Technology Ltd., Hsi-Chih, Taipei, Taiwan, ROC
E-mail: [email protected]
J. Leon Zhao
Department of MIS, University of Arizona, Tucson,
Arizona, USA
E-mail: [email protected]
Abstract: Business intelligence (BI) has become an integral part of e-business
activities of companies. The success of such effort is dependent on integration
and management of terabytes of transactional data in a large-scale data
warehouse to support various analyses. In the last two years, web-oriented BI
and outsourcing have led to more challenging problems of integrating these
new types of data with the internal transactional data. Furthermore, recent
trends in enabling the real-time company have placed new performance
demands on BI software. Significant research has been carried out in
optimising multidimensional aggregation in data warehousing that serves as the
base data for BI analysis software. However, virtually all of this work has
focused on either simple aggregates or on specialised analysis such as in human
genome and other scientific applications. In this paper, we focus on statistical
analysis models and techniques to reduce the volume of base data and enable
real-time BI analysis. We introduce a new data aggregation model, referred to
as the multivariate and multidimensional aggregated data model (M²AD), for
supporting statistical computing in a data warehouse environment.
Keywords: Business intelligence; data aggregation; data mining; data
warehouse; decision making; statistical analysis; summary data.
Reference to this paper should be made as follows: Segev, A., Shao, S-C. and
Zhao, J.L. (2003) ‘A data analysis model for business intelligence’, Int. J.
Internet and Enterprise Management, Vol. 1, No. 1, pp.7–30.
Biographical notes: Dr. Arie Segev is a Professor of Business and Director of
the Fisher Center for Information Technology and Marketplace Transformation
(CITM) at the Haas School of Business, University of California, Berkeley.
Recently, Dr. Segev’s research has focused on enabling e-business, including e-
business process transformation, B2B procurement, information management
and quality, support for procurement of services, customer interaction, e-
negotiations and contracting and e-collaboration and e-coordination. Professor
Segev has published over 100 papers on these research topics in leading
journals and conferences, consulted in government and industry, and is the
recipient of major research grants. He holds BSc and MSc degrees in industrial
engineering and management from the Technion – Israel Institute of
Technology, and an MS in operations research and a PhD in computers and
information systems from the University of Rochester.
Dr. Shin-Chung Shao is a principal at Jason Technology Ltd where his primary
focus has been on software development in the area of workflow systems and
business intelligence. Dr. Shao also taught at the Department of Information
Management at NanHwa Management College, Fo-Kuan University, Taiwan.
From 1995 to 1998 he was an Associate Professor and chairman of the
Computer and Information Science Department of Chinese Military Academy
in Taiwan. Dr. Shao received his PhD in business administration from Haas
School of Business, University of California, Berkeley in 1995.
Dr. J. Leon Zhao is an Associate Professor in the Department of Management
Information Systems, University of Arizona and taught in the Hong Kong
University of Science and Technology and the College of William and Mary,
respectively. He holds a PhD degree in business administration from Haas
School of Business, University of California, Berkeley, an MS degree in
agricultural engineering from the College of Engineering, University of
California, Davis, and a BS in engineering from Beijing Institute of
Agricultural Mechanization. His current research focuses on the development
of database and workflow technologies and their applications in electronic
commerce, knowledge management, and organisational process automation. He
has published over 20 research articles in major journals and over 30 refereed
conference papers. He is a co-founding director of the AIS Special Interest
Group on Process Automation and Management (www.sigpam.com).

1 Introduction
Business intelligence (BI) has become an integral part of e-business activities of
companies. The success of such effort is dependent on integration and management of
terabytes of transactional data in a large-scale data warehouse to support various
analyses. In the last couple of years, web-oriented BI and outsourcing have led to more
challenging problems of integrating these new types of data with the internal
transactional data. Furthermore, recent trends in enabling the real-time company have
placed new performance demands on BI software.
Significant research has been carried out in optimising multidimensional aggregation
in data warehousing that serves as the base data for BI analysis software [1-8]. However,
virtually all of this work has either focused on simple aggregates or on specialised
analysis such as in human genome and other scientific applications. In this paper, we
focus on statistical analysis models and techniques to reduce the volume of base data. We
introduce a new data aggregation model, referred to as the multivariate and
multidimensional aggregated data model (M²AD), for supporting statistical computing in
a data warehouse environment.
A data warehouse repository usually contains aggregated data, called materialised
views, which are computed and integrated from multiple data sources. A data warehouse
system usually has two main components: the integration component, responsible for
collecting and maintaining the materialised views, and the query and analysis component,
responsible for fulfilling the information and analysis needs of specific end users [9]. Our
focus in this paper is on the query and analysis component. In particular, we investigate
techniques for supporting statistical computing in more advanced data analyses used in
business planning and decision making.
To facilitate data analyses and visualisation, the data warehouse typically supports
a multidimensional data model, also referred to as summary table or data cube
[1-3,10-12]. Conceptually, a summary table can be viewed as a materialised aggregate
view defined by an SQL query. For example, in the data warehouse of a retail chain, a
summary table may be computed from the Sales transaction table to contain the Total
Quantity Sold and the Average Price grouped by Store, Month, and Item ID. The
following relational schemas show the Sales transaction table and the summary table:
Sales(Store, Date, Transaction ID, Item ID, Quantity, Price)
Sales_Summary(Store, Month, Item ID, Quantity, Average Price)
The Sales_Summary table can be defined by SELECT Store, Month, Item ID,
SUM(Quantity) AS Quantity, AVERAGE(Price) as Average Price FROM Sales GROUP
BY Store, Month, Item ID. In this aggregate view, SUM() and AVERAGE() are
aggregate functions, Quantity and Price are measure attributes, and Store, Month, and
Item ID are dimension attributes. Because the number of dimension attributes is more
than two, the summary table is considered multidimensional.
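As a concrete illustration, the aggregate view above can be materialised with standard SQL. The following minimal sketch uses Python's sqlite3 with a few made-up rows; note that SQLite spells the paper's AVERAGE() as AVG(), and we assume Month can be derived from Date (here via strftime):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Sales (
    Store TEXT, Date TEXT, TransactionID INTEGER,
    ItemID TEXT, Quantity INTEGER, Price REAL)""")
rows = [
    ("S1", "2003-01-05", 1, "0001", 8, 3.25),
    ("S1", "2003-01-06", 2, "0001", 9, 3.10),
    ("S1", "2003-02-03", 3, "0001", 5, 3.25),
]
con.executemany("INSERT INTO Sales VALUES (?,?,?,?,?,?)", rows)

# Group by Store, Month, Item ID as in the Sales_Summary definition.
summary = con.execute("""
    SELECT Store, strftime('%Y-%m', Date) AS Month, ItemID,
           SUM(Quantity) AS Quantity, AVG(Price) AS AveragePrice
    FROM Sales
    GROUP BY Store, Month, ItemID""").fetchall()
for r in summary:
    print(r)
```

The two January transactions collapse into a single summary row, which is the size reduction the paper relies on.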
Whilst summary tables are useful for supporting various data analysis and business
reports, they do not support more advanced data analyses such as statistical regression. In
this paper, we show that a summary table must satisfy certain criteria to support statistical
computing. We then introduce a multivariate and multidimensional aggregated data
(M²AD) model, whose data is derived from transactional data that may be stored in
multiple data sources. M²AD has three nice properties. First, M²AD is compact: the sizes
of M²AD instances can be much smaller than those of the source data. Second, M²AD is
distributive, thus its contents can be updated using incremental update mechanisms
[13-15]. Third, M²AD is sufficient, i.e., its instances provide sufficient inputs to the
statistical computing processes of many statistical methods.
The main contributions of this paper are:
• The development of a theoretical model of the M²AD approach by presenting the
data structure and its main features, and demonstrating how it can support advanced
statistical applications such as multivariate regression and real-time data mining BI
applications.
• Investigating the impact of applying the M²AD approach to statistical analysis in a
data warehouse environment by analysing the potential savings in storage and
transmission delays. As will be discussed in detail in the conclusion of the paper,
these two performance measures, which were de-emphasised in high-performance
computing, are back in the spotlight due to the increase in mobile e-business, where
the use of hand-held devices and wireless communications is often associated with
low speed and limited storage. The model introduced in this section can enable and
reduce the cost of statistical BI applications in those environments, in particular due
to the emerging trend back to increased client-side functionality and load
assignment. This latter trend is due to advances in dynamic HTML and
web services that are often deployed in environments with highly loaded web
servers.


• Presenting high-level algorithms for the derivation and maintenance of M²AD tables
and for queries against the M²AD table using SQL commands and stored
procedures.
The remaining sections are structured as follows: Section 2 motivates the problem of
supporting statistical analysis in data warehouses and presents the concepts of the M²AD
approach. In Section 3, we define the M²AD data model and the algorithms for
maintaining and querying the M²AD tables. In Section 4, we show how M²AD supports
statistical computing, such as univariate, multivariate, categorical data analyses and time
series analysis. Section 5 provides examples of using our model to bridge very large
databases and advanced data analyses. In Section 6, we investigate the advantages of
applying the M²AD approach in a data warehouse environment. Finally, Section 7
summarises the paper and outlines our future research directions.
2 Problem statement
In this section, we motivate the data aggregation problem for supporting statistical
computing in a data warehouse. We use an example to illustrate the basic ideas and
develop the concept of multidimensional and multivariate aggregated data for storing
summary data in data warehouses. More formal analysis will be given in later sections.
Suppose a retail store maintains a database containing sales transaction data. The
store manager wants to estimate the demand function that relates quantities sold to unit
prices, in order to formulate pricing policy or promotion plans. For simplicity, we
consider the demand function of a single item (assuming ItemID = “0001”). Table 1
tabulates individual sales records of that item. Due to different discount rates and
coupons used, the actual sale prices may vary.
Table 1 Source sales data
Date Quantity Price Date Quantity Price
1 8 3.25 2 8 3.00
1 9 3.10 2 8 3.20
1 9 3.00 2 4 3.50
1 5 3.25 3 3 3.20
1 4 3.50 3 4 3.40
A simple linear regression is used to estimate the demand function Q = a + bP, where Q
is the quantity sold per transaction and P is the price per unit. The unbiased least squares
estimators for the intercept a and the slope b of the regression line are given by [16]:

$$b = \frac{n \sum_i p_i q_i - \sum_i p_i \sum_i q_i}{n \sum_i p_i^2 - \left(\sum_i p_i\right)^2} \qquad (2.1)$$

$$a = \bar{Q} - b\bar{P} = \frac{\sum_i q_i}{n} - b \frac{\sum_i p_i}{n} \qquad (2.2)$$


Note that P and Q are (n × 1) column vectors with elements denoted by p_i and q_i,
respectively. Given Table 1 and treating Quantity as Q and Price as P, the resulting
regression equation is f₁: Q = 37.95 − 9.80 · P.
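The estimators (2.1) and (2.2) can be checked directly against the Table 1 data; the following minimal Python sketch reproduces f₁:

```python
# Table 1 transactions for item "0001" as (Quantity, Price) pairs.
data = [(8, 3.25), (9, 3.10), (9, 3.00), (5, 3.25), (4, 3.50),
        (8, 3.00), (8, 3.20), (4, 3.50), (3, 3.20), (4, 3.40)]
n = len(data)
sum_q  = sum(q for q, p in data)
sum_p  = sum(p for q, p in data)
sum_pq = sum(p * q for q, p in data)
sum_pp = sum(p * p for q, p in data)

# Equations (2.1) and (2.2).
b = (n * sum_pq - sum_p * sum_q) / (n * sum_pp - sum_p ** 2)   # slope
a = sum_q / n - b * sum_p / n                                   # intercept
print(round(a, 2), round(b, 2))   # → 37.95 -9.8
```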
Consider that a retail chain consists of 50 retail stores and each store may incur
thousands of transactions. The resulting volume of transactional data for the retail chain
can be in the order of gigabytes per day. Furthermore, in order to perform a statistical
analysis such as linear regression, the retail chain must gather the transactional data from
all retail stores, transmit it to a central location, store the data in a data warehouse, and
then carry out the statistical computation. This is not a simple task because the distributed
nature of transactional data and the sheer volume of the data require an elaborate process
to gather data from potentially heterogeneous databases into a single data warehouse. In
general, the data analysts who perform statistical computing may not be located in the
same place as the data warehouse. As a result, the process of statistical computing will
involve the transmission of the transactional data to a new location before a statistical
package can be applied to the data.
One common way to reduce the size of data for ease of transmission and computation
is to convert the transactional table into a summary table such as the one shown in
Table 2, which contains, for each day, the transaction count, the total quantity sold and
the sum of the unit prices paid.
Table 2 Summary table of daily sales
Date Count Sum(Quantity) Sum(Price)
1 5 35 16.1
2 3 20 9.7
3 2 7 6.6
Using the summary data in Table 2 [17], another demand function can be derived through
linear regression (note that although Table 2 does not contain enough records for deriving
an accurate linear function, it suffices for an illustration of basic concepts). By treating
Sum(Quantity) as Q and Sum(Price)/Count as P, we get f₂: Q = 996 − 300 · P, which
estimates the daily sales as a function of average price per unit. Whilst the summary table
in Table 2 is much smaller than the transactional table in Table 1, and the new demand
function f₂ is useful for certain analyses, the summary table can no longer be used to
derive the demand function f₁. Therefore, this summary table is incomplete for
supporting statistical computing at both the transactional and aggregated levels. Our goal
is to develop a data aggregation model that preserves the statistical analysis of
transactional data.
Notice that the computing formulas of (2.1) and (2.2) include several types of
aggregation functions, i.e., count, sums, sums of squares (SS) and sums of
cross-products (SCP). These functions produce special purpose summary data for
statistical computing. This hints that if we pre-aggregate this data, store it in a summary
table (as shown in Table 3), and apply formulas (2.1) and (2.2) to the last row (Total) in
Table 3, then we will be able to derive function f₁. That is, using this alternative
summary table, we can preserve the linear regression analysis of the transactional data.
Furthermore, demand functions can also be derived for different time periods by selecting
data in different rows of Table 3. In addition, the demand function f₂ can also be derived
from Table 3, since it contains Table 2.
Table 3 is referred to as a summary table containing multidimensional and
multivariate aggregated data (M²AD), or simply an M²AD table. An M²AD table includes a
count column, p summation columns, and p(p+1)/2 number of cross-product columns,
where p is the number of measure attributes. In Table 3, the measure attributes are P and
Q. Therefore, p = 2 and p(p+1)/2 = 3. Note that the cross-product matrix of measure
attributes can be denoted by X’X, where X denotes the data matrix of measure attributes,
X’ is the transpose of X. The X’X matrix contains sums of squares as its diagonal
elements and sum of cross-products as its off-diagonal elements. The size of X’X is given
by (p × p). Since X’X is symmetric, we need to store only its upper or lower triangular
matrix in the summary table.
Table 3 An alternative summary table
Date Count Sum(Q) Sum(P) SS(Q) SS(P) SCP(Q, P)
1 5 35 16.1 267 51.99 111.15
2 3 20 9.7 144 31.49 63.60
3 2 7 6.6 25 21.80 23.20
Total 10 62 32.4 436 105.28 197.95
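The rows of Table 3 and the regression recovered from the Total row can be reproduced with a short Python sketch: the per-date aggregates are computed from the Table 1 transactions, and the Total row is just their columnwise sum (the distributive property discussed below).

```python
# Table 1 transactions as (Date, Quantity, Price).
data = [(1, 8, 3.25), (1, 9, 3.10), (1, 9, 3.00), (1, 5, 3.25), (1, 4, 3.50),
        (2, 8, 3.00), (2, 8, 3.20), (2, 4, 3.50), (3, 3, 3.20), (3, 4, 3.40)]

m2ad = {}  # date -> [count, sum_q, sum_p, ss_q, ss_p, scp_qp]
for d, q, p in data:
    c = m2ad.setdefault(d, [0, 0.0, 0.0, 0.0, 0.0, 0.0])
    c[0] += 1; c[1] += q; c[2] += p
    c[3] += q * q; c[4] += p * p; c[5] += q * p

# The Total row of Table 3 is the columnwise sum of the per-date rows.
total = [sum(row[i] for row in m2ad.values()) for i in range(6)]
n, sq, sp, ssq, ssp, scp = total

# Applying (2.1) and (2.2) to the Total row recovers f1.
b = (n * scp - sp * sq) / (n * ssp - sp ** 2)
a = sq / n - b * sp / n
print(round(a, 2), round(b, 2))   # → 37.95 -9.8
```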
The M²AD summary table approach has important implications for statistical computing
in a data warehouse environment:
• First, the M²AD table can be used to support statistical analysis at both the
transactional and aggregated levels. Therefore, we say that the M²AD table is
statistically complete whilst conventional summary tables are not. Another difference
is that we introduce more columns, representing X'X, to the M²AD summary table.
Thus, the cost of maintaining an M²AD summary table is higher than that of
maintaining a conventional summary table.
• Second, the main advantage of the M²AD table is that it can be many times smaller
than the transactional data. In addition, the M²AD table can support many types of
statistical analysis that the transactional table can. A formal analysis of this
advantage will be given in a later section.
• Third, the cross-product matrix X'X, like count and sum, is distributive, as will be
shown in the next section. Thus, when data is inserted into or deleted from the
transactional databases, the M²AD table can be updated incrementally.
• Fourth, the M²AD table supports more statistical methods than a conventional
summary table. In fact, the M²AD table supports all statistical methods whose
computing formulas can be rewritten as functions of counts, sums, and X'X. In
Section 4, we will show how the M²AD summary table supports many commonly
used univariate, multivariate, and categorical statistical methods.


3 Multidimensional and multivariate aggregated data (M²AD)
This section introduces the definition and maintenance algorithms of the M²AD table as
well as its relational implementation. M²AD tables can be implemented in a relational
database system or a proprietary cube-based system [4,11,18]. The relational
representation of M²AD is conceptual and is used to demonstrate the modelling concept
rather than the physical structure or access methods. Other implementation schemes are
feasible and possibly more efficient.
3.1 Definition and creation of the M²AD summary table
An M²AD summary table is defined by a quadruple ⟨T, D, X, A⟩, where T denotes a table
containing source data, D is a set of m dimension attributes, X is a set of p numeric-
valued measure attributes, and A is a composite data object containing aggregated data.
Let X denote the transactional data matrix contained in T, i.e., X = π_X(t), where π denotes
the generalised (duplicates allowed) projection operator and t is an instance of T. D
partitions X according to the grouping G_D(π_X(t)), where G denotes the group-by
operator with D as operand. Thus, X is partitioned into (d_1 × d_2 × … × d_m) submatrices,
where d_k, k = 1, …, m, is the number of distinct values in D_k ∈ D. For each submatrix
X_l, a set of aggregated data, denoted by A_l, is derived by:

$$A_l = \{\, \mathrm{COUNT}(*),\ \mathrm{SUM}(x_{il}),\ X_l'X_l \mid x_i \text{ is a column in } X_l,\ i = 1, \dots, p \,\}$$

The collection of all A_l makes up A. That is, A contains (d_1 × d_2 × … × d_m) component
objects, each containing a count, p summations, and a cross-product matrix derived from
a partitioned submatrix.
Suppose T is a table defined in a relational database; then the creation and
materialisation of M²AD can be easily implemented using SQL commands having the
following syntax:

SELECT D, Count(*), Sum(X_1), …, Sum(X_1 * X_1), …, Sum(X_1 * X_2), …
INTO M²AD_T
FROM T
GROUP BY D
For convenience, we set a naming convention for table and attribute names. We use
M²AD_T as the name of the M²AD summary table created from T. For dimension
attributes, we use the same names as they are defined in T. For aggregated attributes, we
use _Count, Sum_X_i, SS_X_i (sum of squares of X_i) and SCP_X_i_X_j (sum of
cross-products of X_i and X_j) to denote Count(*), Sum(X_i), Sum(X_i * X_i) and
Sum(X_i * X_j), respectively. We assume that X_i and X_j are arranged alphabetically
when they appear in SCP_X_i_X_j.
Example 1
Suppose the database of a retail chain maintains a sales transaction table defined by
Sales = ⟨Store, Date, TransactionID, ItemID, Quantity, Price⟩. Let D = {Store, Date,
ItemID} and X = {Quantity, Price}; then M²AD_Sales can be created by the following
SQL command:


SELECT Store, Date, ItemID, Count(*) as _Count, Sum(Quantity) as
Sum_Quantity, Sum(Price) as Sum_Price, Sum(Quantity*Quantity)
as SS_Quantity, Sum(Price*Price) as SS_Price, Sum(Price*Quantity)
as SCP_Price_Quantity
INTO M²AD_Sales
FROM Sales
GROUP BY Store, Date, ItemID
Assuming that the retail chain has 50 branch stores and stocks 1000 items, then (50 * 1000)
records will be appended to M²AD_Sales each day, regardless of how many transactions
are made daily. It is obvious that conventional summary tables such as periodic (daily,
weekly, monthly, etc.) sales reports, categorised by stores, items or dates, can be derived
from M²AD_Sales.
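Example 1 can be tried directly in SQLite, which uses CREATE TABLE ... AS SELECT in place of SELECT ... INTO; the rows below are illustrative, not from the paper:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales (Store, Date, ItemID, Quantity, Price)")
con.executemany("INSERT INTO Sales VALUES (?,?,?,?,?)", [
    ("S1", 1, "0001", 8, 3.25), ("S1", 1, "0001", 9, 3.10),
    ("S1", 2, "0001", 4, 3.50), ("S2", 1, "0001", 5, 3.25)])

# The Example 1 aggregation: one M2AD row per (Store, Date, ItemID) group.
con.execute("""
  CREATE TABLE M2AD_Sales AS
  SELECT Store, Date, ItemID, COUNT(*) AS _Count,
         SUM(Quantity) AS Sum_Quantity, SUM(Price) AS Sum_Price,
         SUM(Quantity*Quantity) AS SS_Quantity,
         SUM(Price*Price) AS SS_Price,
         SUM(Price*Quantity) AS SCP_Price_Quantity
  FROM Sales GROUP BY Store, Date, ItemID""")
rows = con.execute(
    "SELECT * FROM M2AD_Sales ORDER BY Store, Date").fetchall()
print(rows)
```

The two (S1, 1, "0001") transactions fold into one row carrying count, sums, sums of squares and the cross-product, which is all later sections need.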
3.2 Data manipulation
Before we show how we use SQL commands to manipulate M²AD summary tables, we
first show that M²AD is distributive. An aggregate function f( ) is said to be distributive if
f(X) = f(X_1) + f(X_2) + ... + f(X_r), where X' = [X_1' | X_2' | ... | X_r'], i.e., X_1, X_2, ..., X_r
vertically partition X. It has been shown in [13] that aggregate functions such as COUNT
and SUM are distributive, so that aggregated data in different cells can be added up and
incremental update of data cubes can be supported. Recalling that X'X contains sums of
squares as its diagonal elements and sums of cross-products as its off-diagonal elements,
it is easy to see that:

$$\sum_i x_i^2 = \sum_i x_{1i}^2 + \sum_i x_{2i}^2 + \dots + \sum_i x_{ri}^2$$

$$\sum_i x_i y_i = \sum_i x_{1i} y_{1i} + \sum_i x_{2i} y_{2i} + \dots + \sum_i x_{ri} y_{ri}$$

where x_i and y_i denote the i-th elements of two different columns in X, and x_ji and y_ji,
j = 1, 2, ..., r, denote the i-th elements of the corresponding columns in the submatrix X_j.
From these formulas, it follows that the cross-product matrix X'X is also distributive;
thus we may combine X'X entries in different rows, and incremental updates of X'X are
also supported. For example, suppose X_1 and X_2 vertically partition X. Then we have
X'X = X_1'X_1 + X_2'X_2, where + denotes the matrix addition operator.
On the other hand, consider incremental update. Let X_o and X_n denote the old version
(before update) and the new version (after update) of X, respectively, so that
X_n = X_o ∪ X_I − X_D, where ∪ and − denote the set union (duplicates not removed) and
set difference operators, and X_I and X_D denote the data inserted into and deleted from
X_o, respectively. Then we have X_n'X_n = X_o'X_o + X_I'X_I − X_D'X_D. That is, we may
incrementally update X'X using only the modifications made to X between two
consecutive updates of the M²AD summary table. We defer the discussion of incremental
update to Section 3.3. In this subsection, we only show how to use the distributive
property to manipulate records in M²AD.
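A minimal Python sketch of the two properties above, using a two-column X and representing the symmetric X'X by its upper triangle (sxx, sxy, syy):

```python
def cross_product(rows):
    """Upper triangle of X'X for a list of (x, y) rows."""
    sxx = sum(x * x for x, y in rows)
    sxy = sum(x * y for x, y in rows)
    syy = sum(y * y for x, y in rows)
    return (sxx, sxy, syy)

def add(a, b):
    """Elementwise (matrix) addition of two triangles."""
    return tuple(u + v for u, v in zip(a, b))

x1 = [(1.0, 2.0), (3.0, 4.0)]
x2 = [(5.0, 6.0)]

# Distributivity: X'X over the whole data equals the sum over the partition.
assert cross_product(x1 + x2) == add(cross_product(x1), cross_product(x2))

# Incremental update (insert-only case): X_n'X_n = X_o'X_o + X_I'X_I.
old, inserted = cross_product(x1), cross_product(x2)
assert add(old, inserted) == cross_product(x1 + x2)
print(cross_product(x1 + x2))
```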
The M²AD table can also be used for further data analysis. By the distributive
property, the following SQL command syntax can be used to retrieve a set of data from
the M²AD table:

SELECT [D'], Sum(A_i), …
FROM M²AD_T
[WHERE <condition>]
[GROUP BY D']
[HAVING <condition>]

where D' ⊆ D and A_i denotes an aggregated attribute in M²AD_T. Notice that in this
command, we use Sum( ) to represent any aggregated attribute, i.e., count, sum, SS, and
SCP.
Example 2
Table 2 (the daily sales report of all stores for a given item) and the last record in Table 3
illustrated in Section 2 can be derived from M²AD_Sales by the following commands,
respectively:

SELECT Date, Sum(_Count), Sum(Sum_Quantity), Sum(Sum_Price)
FROM M²AD_Sales
WHERE ItemID = “0001”
GROUP BY Date

SELECT Sum(_Count), Sum(Sum_Quantity), Sum(Sum_Price),
Sum(SS_Quantity), Sum(SS_Price), Sum(SCP_Price_Quantity)
FROM M²AD_Sales
WHERE ItemID = “0001”
In a data warehouse environment, some users may not know the existence of all
pre-defined M²AD tables. Therefore, casual users may directly issue aggregate queries to
source tables [19]. This may result in recomputing aggregated data already existing in the
data warehouse. To avoid this inefficiency, a query rewrite algorithm is presented below.
This algorithm converts aggregate queries against T into queries against M²AD_T so that
recomputation of existing aggregated data is avoided.
Let T be a source table and M²AD_T be an M²AD table defined over T with dimension
attribute set D and measure attribute set X. Consider an aggregate query to T, denoted by
Query(T), having the syntax:

SELECT [D'], Aggr(Y), ...
FROM T
[WHERE <condition>]
[GROUP BY D']
[HAVING <condition>]

where D' ⊆ D, Y ⊆ X, and Aggr ∈ {Count, Sum, Avg, SS, SCP}. Notice that we introduce
two additional aggregate operators, SS (a unary operator) and SCP (a binary operator), to
SQL to represent sum of squares and sum of cross-products, respectively. It can be
shown that there exists an aggregate query to M²AD_T, denoted by Query(M²AD_T),
such that Query(T) is equivalent to Query(M²AD_T) in the sense that they produce the
same results when evaluated against any instance of T. In particular, Query(M²AD_T)
can be constructed using the query rewrite algorithm illustrated in Figure 1.


Figure 1 Aggregate query rewrite algorithm
Query rewrite algorithm:
Input: Query(T) such that D' ⊆ D, Y ⊆ X, and
Aggr ∈ {Count, Sum, Avg, SS, SCP}
{
For each Aggr(Y) in the SELECT and in the HAVING clause {
If Aggr(Y) = COUNT(*)
replace COUNT(*) by Sum(_Count)
else if Aggr ∈ {Sum, SS}
replace Aggr(Y) by Sum(Aggr_Y)
else if Aggr = Avg
replace Avg(Y) by (Sum(Sum_Y) / Sum(_Count))
else replace SCP(Y1, Y2) by Sum(SCP_Y1_Y2)
}
replace T by M²AD_T in the FROM clause
} /* the WHERE clause and the GROUP BY clause remain unchanged */
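The per-term rewrite rules of Figure 1 can be sketched as a small Python function; this is pure string rewriting over a single aggregate term, not a full SQL parser, with attribute names following the paper's conventions:

```python
def rewrite(aggr, *cols):
    """Rewrite one aggregate term over T into its M2AD_T equivalent."""
    if aggr == "COUNT":                      # COUNT(*) -> Sum(_Count)
        return "Sum(_Count)"
    if aggr in ("SUM", "SS"):                # Aggr(Y) -> Sum(Aggr_Y)
        name = "Sum" if aggr == "SUM" else "SS"
        return f"Sum({name}_{cols[0]})"
    if aggr == "AVG":                        # Avg(Y) -> Sum(Sum_Y)/Sum(_Count)
        return f"(Sum(Sum_{cols[0]}) / Sum(_Count))"
    if aggr == "SCP":                        # SCP(Y1,Y2) -> Sum(SCP_Y1_Y2)
        return f"Sum(SCP_{cols[0]}_{cols[1]})"
    raise ValueError(f"unsupported aggregate: {aggr}")

print(rewrite("AVG", "Quantity"))   # → (Sum(Sum_Quantity) / Sum(_Count))
print(rewrite("SCP", "Price", "Quantity"))
```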

Example 3
The first query can be converted into the second query as shown below.
(1) SELECT Store, Avg(Quantity), Avg(Price), SS(Price)
FROM Sales
WHERE ItemID = “0001”
AND Date Between “1/1/1997” and “12/31/1997”
GROUP BY Store
HAVING Avg(Quantity) >= 5
(2) SELECT Store, (Sum(Sum_Quantity) / Sum(_Count)), (Sum(Sum_Price) /
Sum(_Count)), Sum(SS_Price)
FROM M²AD_Sales
WHERE ItemID = “0001”
AND Date Between “1/1/1997” and “12/31/1997”
GROUP BY Store
HAVING (Sum(Sum_Quantity) / Sum(_Count)) >= 5
3.3 Maintenance of the M²AD summary table
When the source data contained in T has been modified, M²AD_T must be updated to
maintain consistency between the source data and its corresponding aggregated data. In
the context of very large databases, aggregated data is often updated using incremental
update mechanisms. Incremental update refers to updating derived data using only
modifications made to its source data. This subsection shows how to incrementally
update M²AD_T using SQL commands and triggers. For simplicity, we assume that the
source table T is subject to insert operations only.
Recall that in Section 3.2, we showed that X_n'X_n = X_o'X_o + X_I'X_I, where X_o'X_o
and X_n'X_n denote the before-update and after-update versions of X'X, respectively, and
X_I is the matrix inserted into X between two consecutive updates of M²AD_T.
Incremental update of M²AD_T can be implemented as a two-phase process. In the first
phase, records inserted into T are maintained in a temporary table in the course of
database operations. When M²AD_T is to be updated, the second phase of updating
M²AD is invoked. Let T_inserted be the temporary table holding records inserted into T
(T_inserted includes the same attribute set as T). The first phase can be implemented as
the following trigger:
create trigger insert_T
on T
for insert
as
insert into T_inserted select * from inserted
where inserted denotes the records inserted into T by an insert transaction.
The second phase, which is the actual update of M²AD_T, can be implemented using the
batch of commands depicted in Figure 2.
Figure 2 Incremental update commands
(1) SELECT D, count(*) as _count, Sum(X1) as Sum_X1, ...,
Sum(X1 * X1) as SS_X1, ..., Sum(X1 * X2) as SCP_X1_X2, ...
INTO Temp1
FROM T_Inserted
GROUP BY D
(2) INSERT INTO M²AD_T
SELECT * FROM Temp1
(3) SELECT D, Sum(_Count) as _Count, Sum(Sum_X1) as Sum_X1, ...,
Sum(SS_X1) as SS_X1, ..., Sum(SCP_X1_X2) as SCP_X1_X2, ...
INTO Temp2
FROM M²AD_T
GROUP BY D
(4) DROP M²AD_T
RENAME Temp2 M²AD_T
DELETE T_Inserted
DELETE T_Inserted

In Figure 2, the first command creates a temporary table Temp1 from the inserted data
T_Inserted. The second command then merges Temp1 into the M²AD table. The
resulting table is then reaggregated using the third command to form the updated version
of the M²AD table. The remaining commands perform housekeeping tasks. The purpose
of these commands is to show how the M²AD table can be incrementally updated using
simple SQL. A more efficient implementation is possible, but is beyond the focus of this
paper.
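The two-phase scheme (trigger plus the Figure 2 batch) can be exercised end to end in SQLite; here the schema is cut down to one dimension and one measure, and the data is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE Sales (Date, Quantity, Price);
  CREATE TABLE Sales_inserted (Date, Quantity, Price);
  CREATE TABLE M2AD_Sales (Date, _Count, Sum_Q, SS_Q);
  -- Phase 1: log every insert into the temporary table.
  CREATE TRIGGER insert_Sales AFTER INSERT ON Sales BEGIN
    INSERT INTO Sales_inserted VALUES (NEW.Date, NEW.Quantity, NEW.Price);
  END;
""")
con.executemany("INSERT INTO Sales VALUES (?,?,?)",
                [(1, 8, 3.25), (1, 9, 3.10), (2, 4, 3.50)])

# Phase 2 (Figure 2): aggregate the log, merge, reaggregate, clean up.
con.executescript("""
  INSERT INTO M2AD_Sales
    SELECT Date, COUNT(*), SUM(Quantity), SUM(Quantity*Quantity)
    FROM Sales_inserted GROUP BY Date;
  CREATE TABLE Temp2 AS
    SELECT Date, SUM(_Count) AS _Count, SUM(Sum_Q) AS Sum_Q,
           SUM(SS_Q) AS SS_Q
    FROM M2AD_Sales GROUP BY Date;
  DROP TABLE M2AD_Sales;
  ALTER TABLE Temp2 RENAME TO M2AD_Sales;
  DELETE FROM Sales_inserted;
""")
rows = con.execute("SELECT * FROM M2AD_Sales ORDER BY Date").fetchall()
print(rows)   # → [(1, 2, 17, 145), (2, 1, 4, 16)]
```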


4 Supporting statistical computing
In this section, we will show that many commonly used statistical methods can be
accurately supported by M²AD summary tables in the sense that the computing formulas
of their relevant statistics can be expressed in terms of Count (n), summations, and X'X
(sums of squares and sums of cross-products).
4.1 Univariate data analysis
In describing or making inferences about univariate sample data, a user may require
computing the sample average, variance, standard deviation, confidence intervals (about
the population mean, standard deviation, differences of means of two samples, etc.), or
test statistics (e.g., z-test, t-test, χ²-test, F-test, or p-value). From basic knowledge of
statistical computing, we know that all the above statistics can be derived from univariate
aggregated data including n, the summation, and the sum of squares. For example,
consider the computing formulas of the sample variance and the (1 − α)·100%
confidence interval of the population mean µ, which are given by:

Sample variance:

$$S^2 = \frac{\sum_i x_i^2 - \left(\sum_i x_i\right)^2 / n}{n - 1}$$

(1 − α)·100% confidence interval of the population mean µ:

$$\bar{X} \pm t_{\alpha/2}\,\frac{S}{\sqrt{n}}, \quad \text{where } \bar{X} = \frac{\sum_i x_i}{n},\ S = \sqrt{S^2},$$

and t_{α/2} can be obtained from a t-table [20].
Since conventional summary tables do not include sums of squares, the above statistical
methods are not supported. For example, using Table 2 in Section 2, we cannot calculate
the standard deviation of quantities purchased in different sales transactions.
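A short Python check that the sample variance, and hence a confidence half-width, computed from the aggregates (n, Σx, Σx²) alone matches the direct computation; the t-quantile 2.262 for 9 degrees of freedom is taken from a standard t-table:

```python
import math

xs = [8, 9, 9, 5, 4, 8, 8, 4, 3, 4]        # quantities from Table 1
n, sx, ssx = len(xs), sum(xs), sum(x * x for x in xs)

# S^2 from the aggregates only, per the formula above.
var_agg = (ssx - sx * sx / n) / (n - 1)

# Direct computation on the raw data for comparison.
mean = sx / n
var_raw = sum((x - mean) ** 2 for x in xs) / (n - 1)
assert abs(var_agg - var_raw) < 1e-9

t = 2.262                                   # t_{0.025, 9} from a t-table
half_width = t * math.sqrt(var_agg / n)     # mean ± half_width is a 95% CI
print(round(mean, 2), round(var_agg, 3), round(half_width, 2))
```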
4.2 Multivariate data analysis
We generalise our discussion to the case of multivariate data sets and show that many
multivariate data analyses can be supported by M²AD tables. Let X denote an (n × p)
matrix, Y denote an (n × 1) column vector, and S_X denote the (p × 1) column sum
vector of X, i.e., $S_{X_j} = \sum_{i=1}^{n} x_{ij}$, j = 1, ..., p. To perform a multiple
linear regression of Y on X, i.e., Y = Xβ + ε, the unbiased least squares estimator of the
regression coefficients β is given by ([16], see also [21]):

$$\hat{\beta} = (X'X)^{-1}(X'Y)$$

Other statistics relevant to linear regression, e.g., the adjusted R-squared and the variance
of the error terms, can also be derived from X'X and X'Y, without referring to the very
large X and Y. For the computing formulas of these statistics, see [16,22].
A measure of goodness-of-fit of the model, denoted by R̄² and called the adjusted
R-squared, can be calculated by the following formula [16]:

$$\bar{R}^2 = 1 - \frac{\hat{\varepsilon}'\hat{\varepsilon}\,/\,(n - p)}{\tilde{y}'\tilde{y}\,/\,(n - 1)}$$

where ? ?
ˆ
X - Y ˆ = is an (n × 1) column vector denoting residuals, and

( ) ( ) ? ? ? ? ? ? ? ?
ˆ
X X
ˆ
Y X
ˆ
-
ˆ
X Y - Y Y
ˆ
X - Y
ˆ
X - Y ˆ ˆ ? ? + ? ? ? ? =
?
= ?
denotes sum of squares of residuals (SSR) , and
2
Y - Y Y n y y ? = ? . In the above
formulas, if we include Y as a measure attribute in defining our M
2
AD table, then Y’Y,
X’Y are also included in that table. In particular, Y’Y is a sum of squares, and X’Y is a
vector of sums of cross-products. Therefore, M
2
AD table supports multiple regression
analysis.
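To make the aggregate-based regression concrete, here is a small Python sketch (our own illustration, not the paper's implementation) that fits a simple linear regression y = b0 + b1·x from only M2AD-style aggregates: the count, the sums, the sum of squares and the sum of cross-products. With an intercept, β̂ = (X'X)⁻¹X'Y collapses to the closed form in the comments; the data values are made up.

```python
# Hypothetical price/quantity pairs; in practice only the aggregates below
# would be read from the summary table, never the raw rows.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [8.9, 7.1, 4.8, 3.2, 1.0]

n = len(prices)
sum_x = sum(prices)
sum_y = sum(quantities)
ss_x = sum(x * x for x in prices)                        # an entry of X'X
scp_xy = sum(x * y for x, y in zip(prices, quantities))  # an entry of X'Y

# With an intercept column of ones, beta_hat = (X'X)^{-1} X'Y reduces to the
# familiar closed form for slope and intercept of simple linear regression.
slope = (scp_xy - sum_x * sum_y / n) / (ss_x - sum_x ** 2 / n)
intercept = sum_y / n - slope * sum_x / n

print(intercept, slope)  # fitted line: Quantity = intercept + slope * Price
```

The same five aggregates (n, Σx, Σy, Σx², Σxy) are precisely what an M2AD cell stores for a measure pair, so the fit needs no access to X or Y themselves.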
Moreover, this summary table also supports the calculation of the sample covariance
matrix of X, denoted by $\Sigma_X$, by the following formula [22]:

$$\Sigma_X = \frac{X'X - S_X S_X'/n}{n - 1}$$
Indeed, by maintaining the above aggregated data types, most multivariate statistical
methods, e.g., conditional covariance matrix, partial correlation analysis, principal
component analysis, canonical correlation analysis, and factor analysis, can be supported
[4,11]. For example, the eigenvalues and eigenvectors of X’X can be used to achieve
dimension reduction by defining new variables called principal components, and thus
support principal component analysis and factor analysis [22]. The principal components
are designed to retain most of the variation described by X’X, while reducing the number
of dimensions or variables. Therefore, a better strategy to support multivariate data
analyses using the above statistical methods is to include the count, sums and the cross
product matrix in a summary table and allow the computing processes of these analyses
to share this summary table.
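As a sketch of the covariance formula above (our illustration; the data and variable names are assumptions), only n, the column sums S_X and the cross-product matrix X'X are needed:

```python
# Two measure attributes observed over n hypothetical records.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
n = len(rows)
p = 2

# Aggregates as stored in an M2AD cell: column sums S_X and the
# cross-product matrix X'X.
s = [sum(r[j] for r in rows) for j in range(p)]
xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]

# Sample covariance matrix: (X'X - S_X S_X' / n) / (n - 1),
# computed entry by entry without revisiting the raw rows.
cov = [[(xtx[j][k] - s[j] * s[k] / n) / (n - 1) for k in range(p)]
       for j in range(p)]
print(cov)
```

Because sums and cross-products are distributive, the same cells can be rolled up first and the covariance computed afterwards, which is what makes sharing one summary table across analyses practical.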
4.3 Categorical data analysis and time series analysis
Due to its m-dimensional tabular structure, M2AD supports many categorical data
analyses in a natural way. Indeed, an M2AD can be viewed as a combination of
m-dimensional count tables, sum tables, sum-of-squares tables, and sums-of-cross-products
tables, from which m-dimensional average, variance and standard deviation tables can be
derived. Users can further derive the above tables in lower dimensions by manipulating an
existing M2AD [23].
Using an M2AD table and its derivatives, categorical data analyses such as
contingency table analysis, also referred to as cross-tabulation analysis, can be supported,
since their computing formulas involve only cell counts, relative frequencies and absolute
frequencies. Similarly, the test of goodness-of-fit (the χ² test), which requires the same
aggregated data as contingency table analysis, can be supported. On the other hand, since
M2AD contains sums of squares and sums of cross-products in its cells, analysis of
variance (ANOVA) and multivariate analysis of variance (MANOVA) are also supported
[22,23]. Finally, if one of the dimension attributes is a time attribute, then a time series
can be derived from an M2AD table. Thus, all statistical methods supporting time series
analysis, e.g., ARIMA (auto-regressive and moving average of differenced time series
data) and exponential smoothing, are supported by M2AD. For example, the following
command to Sales can be rewritten (using the query rewrite algorithm in Section 3.2) into
a command to M2AD_Sales that retrieves a time series, namely, the daily sales of a
particular item, from the M2AD_Sales table:

SELECT Date, Sum(Quantity)
FROM Sales
WHERE ItemID = “0001”
GROUP BY Date
The resulting time series data can then be used as inputs to a statistical package for
further analysis.
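The effect of the rewritten query can be sketched in Python (our own illustration; the cell layout and field names such as Sum_Quantity are assumptions): summing the pre-aggregated Sum_Quantity over the remaining dimensions for each date reproduces the daily series without touching the transactional rows.

```python
from collections import defaultdict

# Hypothetical M2AD_Sales cells: one row per (Store, Date, ItemID) repeating
# group, carrying the pre-aggregated sum of Quantity for that group.
m2ad_sales = [
    {"Store": "A", "Date": "1997-01-01", "ItemID": "0001", "Sum_Quantity": 12},
    {"Store": "B", "Date": "1997-01-01", "ItemID": "0001", "Sum_Quantity": 7},
    {"Store": "A", "Date": "1997-01-02", "ItemID": "0001", "Sum_Quantity": 9},
    {"Store": "A", "Date": "1997-01-01", "ItemID": "0002", "Sum_Quantity": 30},
]

# Equivalent of: SELECT Date, Sum(Quantity) ... WHERE ItemID = '0001'
# GROUP BY Date, answered from the summary cells alone.
daily = defaultdict(int)
for cell in m2ad_sales:
    if cell["ItemID"] == "0001":
        daily[cell["Date"]] += cell["Sum_Quantity"]

series = sorted(daily.items())
print(series)  # daily sales time series for item 0001
```

The resulting list of (date, total) pairs is the input a time series package would consume.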
5 Bridging data warehouse and data analysis
In this section we present two examples of using M2AD to bridge a database and data
analyses. We extend the syntax of SQL to include support for data analysis. We will
show how to convert commands following the extended SQL into traditional SQL
commands and then apply appropriate formulas to the resulting aggregated data.
Our first example considers the support of multiple regression analysis. The syntax of
performing a least-square regression is given by:

REGRESS Y ON X
FROM <table>
[WHERE <condition>]

where Y is an attribute denoting the dependent variable and X is a set of attributes denoting
the independent variables. Y and X are defined in the table referred to in the FROM clause. For
example, the following command performs a simple linear regression of Quantity on
Price:

REGRESS Quantity ON Price
FROM Sales
WHERE ItemID = “0001” and Date >= “1/1/1997”
The above command can be easily converted into an SQL command on Sales, which is
then converted into a query to M2AD_Sales to fetch aggregated data. The resulting data is
then input to formulas (2.1) and (2.2) to derive the estimated regression coefficients. Note
that the REGRESS clause above is an extended SQL syntax that can be implemented in
any universal database management system (such as Informix, Oracle, IBM, Sybase, etc.)
through the extensible abstract data type and function definition capability.
Our next example is the application of M2AD to support data mining [24]. In
particular, we consider mining regression rules and regression trees [25]. A regression
rule can be represented as: Pred(D) → Y = Xβ, where D denotes a set of categorical
variables, i.e., dimension attributes, Pred(D) denotes a conjunctive predicate over D, and
Y = Xβ is a regression equation. The regression equation is conditioned on a conjunctive
predicate defined over D. Regression rules are useful for representing relationships hidden in
mixtures of categorical data and numeric data, while other kinds of rules, e.g.,
classification rules and association rules [24], deal mainly with categorical data. An
example of a regression rule in our Sales data is: If (Store = “Berkeley”) and (Month =
“January”) and (ItemID = “0001”) then Quantity = 45 – 8 * P.
In general, there may exist a set of such regression rules. They can be presented as a
tree graph, called the regression tree, to assist understanding, interpreting, and applying
these rules. In a regression tree, each non-leaf node represents a dimension attribute and
each edge represents a distinct value in the domain of that attribute. Each leaf node links
to a regression equation. A path from the root node to a leaf node specifies a regression
rule. Figure 3 is an example of the regression tree mined from our sales transaction
database example. In Figure 3, the left-most path specifies the rule: If (Store = “A”) and
(ItemID = “0001”) and (Month = January) Then (Quantity = 31 – 4.25 * Price). Notice
that we also include “ALL” as an edge of each node to represent all values (including
missing values) of that attribute. Therefore, the right-most path denotes the rule given by:
(unconditioned) Quantity = 10.5 – 3.2 * Price, which is equivalent to the results of
applying linear regression on all quantity and price data in the Sales table.
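Such a tree can be represented as a rule table keyed by dimension values, with “ALL” acting as a wildcard edge. The following Python sketch (our own illustration; the coefficients and lookup order are assumptions) applies the most specific matching rule, widening to “ALL” edges when no exact match exists:

```python
# Hypothetical mined rules: (Store, ItemID, Month) -> (intercept, slope),
# with "ALL" standing for any value (including missing) on that attribute.
rules = {
    ("A", "0001", "JAN"): (31.0, -4.25),
    ("B", "0001", "JAN"): (34.0, -4.5),
    ("ALL", "ALL", "ALL"): (10.5, -3.2),
}

def predict_quantity(store, item, month, price):
    """Apply the most specific regression rule matching the dimension values,
    trying exact values first and widening to "ALL" edge by edge."""
    for key in [(store, item, month),
                (store, item, "ALL"),
                (store, "ALL", "ALL"),
                ("ALL", "ALL", "ALL")]:
        if key in rules:
            intercept, slope = rules[key]
            return intercept + slope * price
    raise KeyError("no applicable rule")

# Left-most path of the example tree: Q = 31 - 4.25 * P.
print(predict_quantity("A", "0001", "JAN", 2.0))
# Unmatched dimensions fall back to the unconditioned rule Q = 10.5 - 3.2 * P.
print(predict_quantity("C", "0009", "MAR", 2.0))
```

A path from the root to a leaf corresponds to one key in this table, mirroring how a path in the regression tree specifies one rule.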
Figure 3 An example of regression trees [tree: root node Store with edges A, B and ALL;
second-level nodes ItemID with edges 0001, 0002 and ALL; third-level nodes Month with
edges JAN, FEB and ALL; example leaf equations: Q = 31 − 4.25 * P, Q = 34 − 4.5 * P
and Q = 10.5 − 3.2 * P]

The command syntax for mining regression rules from data stored in a relational database
is:
MINE REGRESSION RULE Y ~ X
FROM <table>
[WHERE <condition>]
WITH CONFIDENCE c
GROUP BY D
where c denotes a threshold used to filter out insignificant rules. A regression
equation with measure of goodness-of-fit r, e.g., the R-squared in Section 4.2, is said to be
significant if r ≥ c; otherwise, it is said to be insignificant. Notice that the above
command syntax is similar to the SQL SELECT command. Therefore, it can be easily
converted into a command on the transactional data, which is then converted into a
command on M2AD. As an example, the following command mines the regression
rules/tree illustrated in Figure 3:

MINE REGRESSION RULE Quantity ~ Price
FROM Sales
WITH CONFIDENCE 0.7
GROUP BY Store, month(Date) as Month, ItemID
The process of mining regression rules/trees can be divided into three stages: aggregated
data derivation, search space reduction, and estimation. The M2AD table provides necessary
and sufficient data for the last two stages, so there is no need to reaggregate data when an
appropriate M2AD table exists. In the first stage, the command is converted into an
aggregate query to M2AD_Sales. This can be done by an obvious generalisation of the
query rewrite algorithm presented in Section 3.2. Due to space limitations, we do not show
the modified query rewrite algorithm. The above command can be converted into:

SELECT Store, month(Date) Month, ItemID, Sum(_Count),
Sum(Sum_Quantity), Sum(Sum_Price),
Sum(SS_Quantity), Sum(SS_Price), Sum(SCP_Quantity_Price)
FROM M2AD_Sales
GROUP BY Store, Month, ItemID

Each record in the resulting table contains a count, two sums, and an X'X, and
corresponds to a leaf node in Figure 3. Since the number of regression rules grows
exponentially with the number of dimension attributes, in the second stage a series of
homogeneity tests is conducted to reduce the search space by identifying independent
categorical variable(s) [22,25]. Let X be vertically partitioned by a categorical attribute
D_i, i.e., X = [X_1' | X_2' | ... | X_k'], where k is the number of elements in the domain of
D_i. Let $\bar{X}_j$ and $\Sigma_j$, j = 1, 2, ..., k, denote the mean vector and the
correlation matrix of X_j, respectively; then D_i is said to be independent of X if:

$$\bar{X}_1 = \bar{X}_2 = \cdots = \bar{X}_k \quad \text{and} \quad \Sigma_1 = \Sigma_2 = \cdots = \Sigma_k$$
Under the normality assumption, if D_i is independent of X, then we may remove all
nodes of D_i from the regression tree; thus the regression tree shrinks and the number
of regression rules to be estimated in the third stage may be significantly reduced. The
test of an independent categorical attribute can be done by homogeneity tests, which require
the count, sums, and X'X as inputs. Therefore, the resulting table of the above aggregate
query to M2AD_Sales provides sufficient information to conduct homogeneity tests.
In the third stage, regression equations are estimated by applying the formulae illustrated
in Section 2 to each record in the resulting table.

6 The advantages of the M2AD approach

In this section, we illustrate the potential savings of the M2AD approach in terms of
storage and transmission costs.
6.1 Potential storage savings
Let m be the number of dimension attributes and d_i be the number of values in the i-th
dimension attribute. Let p be the number of measure attributes and B be the number of
bytes per attribute. Furthermore, let I be the average number of records per repeating
group, where a repeating group contains records that have the same values on their dimension
attributes. The storage sizes of the transactional data and the M2AD table are estimated as
follows:

• The number of attributes in the transactional table is equal to m + p, and the size of a
transactional record is B × (m + p). The number of records in the transactional table
is equal to the number of records per repeating group, I, times the number of repeating
groups, ∏d_i. Consequently, Equation (6.1) is derived as follows:

The size of the transactional data = B × (m + p) × I × (∏d_i). (6.1)

• The number of attributes in the M2AD table is equal to (m + p) + p × (p + 1)/2, where
the second term takes into account the cross-products. The size of a record is B ×
[(m + p) + p × (p + 1)/2]. The number of records in the M2AD table is the
number of repeating groups, ∏d_i. Therefore, we have the next equation:

The size of the M2AD table = B × [(m + p) + p × (p + 1)/2] × (∏d_i). (6.2)
Based on the two equations above, the relative storage sizes of the two types of tables
are affected greatly by the number of measure attributes (p) and the number of records
per repeating group (I). The number of measure attributes tends to increase the size of the
M2AD table more than that of the transactional table, while the number of records
per repeating group does the opposite. Next, we examine the effects of these two
variables on the relative storage sizes of the transactional table and the M2AD table.
The following default parameter values are used in the comparison: B = 2 bytes, I = 10,
m = 3, d_1 = 50 (stores), d_2 = 3,650 (days), d_3 = 1,000 (items), p = 3.
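Equations (6.1) and (6.2) with these default parameter values can be checked directly; the short Python sketch below (our own illustration) uses only the formulas and defaults stated in the text.

```python
from math import prod

# Default parameter values from Section 6.1.
B = 2                  # bytes per attribute
I = 10                 # average number of records per repeating group
m, p = 3, 3            # dimension and measure attributes
d = [50, 3650, 1000]   # stores, days, items

groups = prod(d)       # number of repeating groups

# Equation (6.1): transactional data size in bytes.
size_trans = B * (m + p) * I * groups

# Equation (6.2): M2AD table size in bytes; the extra p(p+1)/2 attributes
# hold the sums of squares and cross-products.
size_m2ad = B * ((m + p) + p * (p + 1) // 2) * groups

print(size_trans / 1e9, size_m2ad / 1e9)  # 21.9 GB vs 4.38 GB
```

With the defaults, the transactional table is five times the size of the M2AD table, consistent with the gap shown in Figure 4.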
Figure 4 illustrates that the size of the M2AD table is significantly smaller than
the transactional table for the given default parameter values. When the number of
measure attributes increases, the relative savings in storage size shrink due to the
quadratic effect of the number of measure attributes in Equation (6.2).

Figure 4 Storage sizes as a function of the number of measure attributes [line chart;
x-axis: number of measure attributes, 2–8; y-axis: storage size in Gigabytes, 0–45;
series: Trans. Table and M2AD Table]

Figure 5 shows that the M2AD table turns out to be larger than the
transactional table when the average number of records per repeating group is very small
(less than 2). However, as this value increases, the transactional table can be ten or more
times larger than the M2AD table.
Figure 5 Storage sizes as a function of the number of records per repeating group
[line chart, logarithmic scale; x-axis: number of records per repeating group, 0–100;
y-axis: storage size in Gigabytes, 1–1000; series: Trans. Table and M2AD Table]

The analysis results indicate that the M2AD table can be advantageous in many cases,
except when the average number of records per repeating group is very small.
Next, we investigate the impact of the M2AD table on the transmission cost of data
warehouse applications.
6.2 Potential transmission savings

Due to the potential savings in storage size, the M2AD approach can also have a significant
impact on data transmission. We argue that whilst very high speed infrastructure exists,
the reality is that many users’ sessions are subject to 20 Kbps–1 Mbps. The lower range
corresponds to voice dial-ups and mobile digital links for laptops and hand-held devices;
the upper range corresponds to cable/DSL speeds or to the effective rate for loaded
local area subnets. Consider the retail chain example given previously. Firstly, the
transactional data is transmitted periodically from all retail stores to the headquarters, in a
location remote from most stores. Secondly, data analysis is not usually carried out in the
data warehouse location, and therefore, transmission is often needed before computation.
Consequently, there are two types of transmission in this scenario. Whilst both types put
pressure on network costs for the company, the first type of transmission is usually carried
out periodically and incrementally, and consequently, it is not very critical in terms of the
transmission time per cycle. Therefore, we focus on the second type of transmission cost.
A comparison of transmission costs is done for both the transactional data approach
and the M2AD summary approach, using the same default parameter values as in Section
6.1. We also use two additional parameters:

• for each data analysis, only one tenth of a percent of the data in the table is needed

• the default transmission rate ρ is 100 Kilobytes per second (equivalent to about
1 Mbps including transmission overhead), which is a relatively high value for the
cases mentioned above.

The transmission times can then be obtained easily based on Equations (6.1) and (6.2).

The transmission time for the transactional data
= 0.001 × B × (m + p) × I × (∏d_i) / ρ. (6.3)

The transmission time for the M2AD data
= 0.001 × B × [(m + p) + p × (p + 1)/2] × (∏d_i) / ρ. (6.4)
We use Figure 6 to illustrate a comparison of transmission times with respect to the
number of measure attributes. One can see that the potential savings of transmission time
are significant. At the lower end (the number of measure attributes is fewer than eight),
the transmission time can be reduced by over 50%.
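Equations (6.3) and (6.4) can likewise be evaluated for the default parameters; in the sketch below (our own illustration), the 0.1% fraction and the 100 Kilobytes-per-second rate are the defaults stated above.

```python
from math import prod

# Defaults from Sections 6.1 and 6.2.
B, I, m, p = 2, 10, 3, 3
d = [50, 3650, 1000]
fraction = 0.001     # 0.1% of the table is needed per data analysis
rate = 100_000       # transmission rate: 100 Kilobytes per second, in bytes/s

groups = prod(d)

# Equations (6.3) and (6.4): transmission times in seconds.
t_trans = fraction * B * (m + p) * I * groups / rate
t_m2ad = fraction * B * ((m + p) + p * (p + 1) // 2) * groups / rate

print(t_trans / 60, t_m2ad / 60)  # minutes: about 3.65 vs 0.73
```

The M2AD transfer finishes in well under a minute while the transactional transfer takes several minutes, matching the savings visible in Figure 6.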

Figure 6 Transmission times as a function of the number of measure attributes
[line chart; x-axis: number of measure attributes, 2–12; y-axis: transmission time in
minutes, 0–10; series: Trans. Table and M2AD Table]

Figure 7 looks very similar to Figure 5 because of the linear relationship between storage
size and transmission time. Nevertheless, the two figures contain different information.
Figure 7 indicates that when the number of records per repeating group becomes larger
than 20, the transmission time becomes too large even for transmitting only 0.1% of the
total transactional data. On the other hand, the M2AD approach has no such problem
since it is independent of this parameter. This leads to an interesting insight. Even just for
transmission purposes, the M2AD table should be computed before sending the data to
another location for data analysis. In case some portion of the transactional data is needed
frequently, it should then be materialised using the M2AD approach. This again indicates
the value of the data aggregation model based on the M2AD table.
Figure 7 Transmission delays as a function of the number of records per repeating group
[line chart, logarithmic scale; x-axis: number of records per repeating group, 0–100;
y-axis: transmission time in minutes, 0.1–100; series: Trans. Table and M2AD Table]

To illustrate the impact of the transmission rate on the savings of using the M2AD approach,
we plot the transmission costs in Figure 8. The chart indicates that even for a higher
effective transmission rate of 250 Kilobytes per second (about a 5 Mbps nominal rate, which
can correspond to the rate for an individual session in a busy 100 Mbps LAN), the savings
are still significant. This indicates that, in many cases, the M2AD approach offers a major
advantage even when the data analysis is carried out in the same location as the data
warehouse.
Figure 8 Transmission times as a function of transmission rate [line chart, logarithmic
scale; x-axis: transmission rate in Kilobytes/sec, 0–250; y-axis: transmission time in
minutes, 0.1–100; series: Trans. Table and M2AD Table]

7 Conclusion

In this paper, we have introduced a new data aggregation model to support statistical data
analyses in a data warehouse environment that serves as the basis for business
intelligence applications. A new aggregated data structure, namely the cross-product
matrix X'X, has been added to conventional summary tables. We have shown that by
adding this new data type, more advanced statistical methods can be supported in an
efficient manner. Moreover, similar to univariate aggregated data types such as count and
sum, the cross-product matrix X'X is distributive, so that an efficient incremental update
mechanism can be used to update M2AD tables.
A simple yet practical relational implementation has also been described. An efficient
query rewrite algorithm has been developed to support querying of aggregated data on
M2AD tables in the data warehouse. Two cases of using the M2AD model to support
advanced statistical data analyses have been presented, i.e., multiple regression analysis
and data mining.
We have also conducted an impact analysis of applying the M2AD approach. The two
most important parameters are the number of measure attributes and the average number
of records per repeating group. Our quantitative results indicate that in many cases M2AD
tables can be used to achieve significant savings in storage and transmission costs, except
when the number of records per repeating group is very small (fewer than two).
Nevertheless, the analysis concludes that the M2AD approach provides critical
advantages for large-scale data sets.
Potential applications of our M2AD approach include the support of data analyses in
multi-user and very large data warehouse environments such as stock or foreign exchange
transactions, nationwide or worldwide chains’ sales transactions, or
large-scale statistical and scientific databases. By adopting the proposed approach,
companies can potentially save both fixed and variable costs and improve operational
efficiency. The paper has shown that M2AD can save over 50% of the storage space
for the same application and can thus significantly reduce the total storage requirements
of the company. Furthermore, when the warehouse data is duplicated in
various data marts to improve availability, the potential savings can be even higher. The
proposed approach can also improve operational efficiency since less data needs to be
transferred. This in turn will reduce the stress on the company’s communication networks
and delay the need for upgrading to a higher bandwidth, thus leading to more potential
savings.
Also, due to advances in storage and communication technology, there is often a
reduced emphasis on storage and communication costs. The reality, however, is
different: at the terabyte and petabyte level, high-availability storage can be very
expensive, and as more microlevel data is collected and more data warehouse
aggregations take place, the techniques introduced in this paper become more beneficial.
On the communication costs front, increasing multimedia content load on local area
networks can lead to effective user-session rates of 1–5 Mbps even in high-speed LANs.
In addition, the increase in mobile e-business and the use of hand-held devices and
wireless communications have brought to the forefront old optimisation trade-offs
associated with low speed and limited storage. The model introduced in this paper can
enable, and reduce the cost of, statistical BI applications in such environments, in
particular because of the emerging trend back towards increased client-side functionality
and load assignment. This latter trend is due to advances in dynamic HTML and web
services that are often deployed in environments with highly loaded web servers.
The techniques introduced in this paper can either be implemented by sophisticated
users or by vendors in their BI products. In the latter case, it is also likely that various
configuration options will be provided and it is incumbent on IT managers to understand
the trade-offs in product selection and deployment. In further research, we will
investigate the usefulness of specialised data structures and materialisation strategies (for
example, extending the approach of selective aggregate materialisation of [18] to the
statistical environment presented in this paper).
References and Notes
1 Gray, J., Bosworth, A., Layman, A. and Pirahesh, H. (1996) ‘Data cube: a relational
aggregation operator generalizing group-by, cross-tab, and sub-total’, in Proc. of the Twelfth
IEEE International Conference on Data Engineering, New Orleans, LA, Feb., pp.152-159.
2 Gupta, A., Harinarayan, V. and Quass, D. (1995) ‘Aggregate-query processing in data
warehousing’, in Proc. of the 21st VLDB Conference, Zurich, Switzerland, pp.358-369.
3 Harinarayan, V., Rajaraman, A. and Ullman, J. (1996) ‘Implementing data cubes efficiently’,
in Proc. of the 1996 ACM SIGMOD International Conference on Management of Data,
Montreal, Canada, June, pp.205-216.


4 Ho, C.T., Agrawal, R., Megiddo, N. and Srikant, R. (1997) ‘Range queries in OLAP data
cubes’, in Proc. of the 1997 ACM SIGMOD International Conference on Management of
Data, Tucson, Arizona, USA, May, pp.73-88.
5 Li, J., Rotem, D. and Srivastava, J. (1999) ‘Aggregation algorithms for very large compressed
data warehouses’, in Proc. of the 25th VLDB Conference, Edinburgh, Scotland, pp.651-662.
6 Sarawagi, S. (2000) ‘User-adaptive exploration of multidimensional data’, in Proc. of the 26th
Int’l Conference on Very Large Databases (VLDB).
7 Sathe G. and Sarawagi, S. (2001) ‘Intelligent rollups in multidimensional OLAP data’, in
Proc. of the 27th Int’l Conference on Very Large Databases (VLDB).
8 Segev, A. and Zhao, J.L. (1998) ‘Selective view materialization in data warehousing systems’,
Technical Report, Lawrence Berkeley National Lab, Berkeley, CA 94720, USA, January.
9 Labio, W.J., Zhuge, Y., Wiener, J.L., Gupta, H., Garcia-Molina, H. and Widom, J. (1997)
‘The WHIPS prototype for data warehouse creation and maintenance’, in Proc. of the 1997
ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA,
May, pp.557-559.
10 Agarwal, S., Agrawal, R., Deshpande, P.M., Gupta, A., Naughton, J.F., Ramakrishnan, R. and
Sarawagi, S. (1996) ‘On the computation of multidimensional aggregates’, in Proc. of the
22nd VLDB Conference, Mumbai, India, pp.506-521.
11 Chaudhuri, S. and Dayal, U. (1997) ‘Data warehousing and OLAP for decision support’, in
Proc. of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson,
Arizona, USA, May, pp.507-508.
12 Sher, B.Y., Shao, S.C. and Hsieh, W.S. (1998) ‘Mining regression rules and regression tree’,
to appear in Proc. of PAKDD-98, Lecture Notes in Artificial Intelligence, Springer-Verlag,
Australia, April.
13 Agrawal, D., El Abbadi, A., Singh, A. and Yurek, T. (1997) ‘Efficient view maintenance
warehouses’, in Proc. of the 1997 ACM SIGMOD International Conference on Management
of Data, Tucson, Arizona, USA, May, pp.417-427.
14 Mumick, S., Quass, D. and Mumick, B.S. (1997) ‘Maintenance of data cubes and summary
tables in a warehouse’, in Proc. of the 1997 ACM SIGMOD International Conference on
Management of Data, Tucson, Arizona, USA, May, pp.100-111.
15 Quass, D. and Widom, J. (1997) ‘Online warehouse view maintenance’, in Proc. of the
1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona,
USA, May, pp.393-404.
16 Pindyck, R.S. and Rubinfeld, D.L. (1991) Econometric Models and Economic Forecasts,
Third Edition, McGraw-Hill Inc.
17 Note that although Table 2 does not contain enough records for deriving an accurate linear
function, it suffices for an illustration of basic concepts.
18 Roussopoulos, N., Kotidis, Y. and Roussopoulos, M. (1997) ‘Cubtree: organization of and
bulk incremental updates on the data cube’, in Proc. of the 1997 ACM SIGMOD International
Conference on Management of Data, Tucson, Arizona, USA, May, pp.89-99.
19 Srivastava, D., Dar, S., Jagadish, H.V. and Levy, A.Y. (1996) ‘Answering queries with
aggregation using views’, in Proc. of the 22nd VLDB Conference, Mumbai, India,
pp.318-329.
20 Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, Second Edition,
New York, Wiley.
21 Other statistics relevant to linear regression, e.g., adjusted R-squared, variance of error terms,
can also be derived from X'X and X'Y, without referring to very large X and Y.
22 Jobson, J.D. (1992) Applied Multivariate Data Analysis, Vol II: Categorical and Multivariate
Methods, Springer-Verlag.
23 Agresti, A. (1990) Categorical Data Analysis, John Wiley & Sons, Inc.


24 Fayyad, U., Shapiro, G.P., Smyth, P. and Uthurusamy, R. (Eds.) (1996) Advances in
Knowledge Discovery and Data Mining, AAAI Press/The MIT Press.
25 Zhao, Y., Deshpande, P. and Naughton, J.F. (1997) ‘An array-based algorithm for
simultaneous multidimensional aggregates’, in Proc. of the 1997 ACM SIGMOD International
Conference on Management of Data, Tucson, Arizona, USA, May, pp.159-170.
