White Paper on Importance of Skyline Attributes

Description
Querying databases with preferences is an important research problem. Among various approaches to querying with preferences, the skyline framework is one of the most popular. A well known deficiency of that framework is that all attributes are of the same importance in skyline preference relations.

Discovering Relative Importance of Skyline Attributes
Denis Mindolin
Dept. of Computer Science and Engineering
University at Buffalo
Buffalo, NY 14260-2000
[email protected]
Jan Chomicki
Dept. of Computer Science and Engineering
University at Buffalo
Buffalo, NY 14260-2000
[email protected]
ABSTRACT
Querying databases with preferences is an important re-
search problem. Among various approaches to querying with
preferences, the skyline framework is one of the most pop-
ular. A well known de?ciency of that framework is that
all attributes are of the same importance in skyline pref-
erence relations. Consequently, the size of the results of
skyline queries may grow exponentially with the number of
skyline attributes. Here we propose the framework called
p-skylines which enriches skylines with the notion of at-
tribute importance. It turns out that incorporating relative
attribute importance in skylines allows for reduction in the
corresponding query result sizes. We propose an approach
to discovering importance relationships of attributes, based
on user-selected sets of superior and inferior examples. We
show that the problem of checking the existence of and the
problem of computing an optimal p-skyline preference rela-
tion covering a given set of examples are NP-complete and
FNP-complete, respectively. However, we also show that a
restricted version of the discovery problem – using only su-
perior examples to discover attribute importance – can be
solved e?ciently in polynomial time. Our experiments show
that the proposed importance discovery algorithm has high
accuracy and good scalability.
1. INTRODUCTION
E?cient user preference management is a crucial part of
any successful sales-oriented business. Knowing what cus-
tomers like and more importantly why they like that and
what they will like in the future is an essential part of the
modern risk management process.
One of the most popular preference handling models is
preferences as skyline relations [2]. The skyline preference
relation is de?ned on top of a set of preferences over in-
dividual attributes. It represents the Pareto improvement
principle: a tuple o
1
is preferred to a tuple o
2
i? o
1
is as
good as o
2
in all the attributes, and o
1
is strictly better than
o
2
in at least one attribute. The set of the most preferred
tuples according to this principle is called a skyline.
Permission to copy without fee all or part of this material is granted provided
that the copies are not made or distributed for direct commercial advantage,
the VLDBcopyright notice and the title of the publication and its date appear,
and notice is given that copying is by permission of the Very Large Data
Base Endowment. To copy otherwise, or to republish, to post on servers
or to redistribute to lists, requires a fee and/or special permission from the
publisher, ACM.
VLDB ‘09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.
Example 1 Assume we have the following ?ve cars on sale.
id make price year
t
1
ford 30k 2007
t
2
bmw 45k 2008
t
3
kia 20k 2007
t
4
ford 40k 2008
t
5
bmw 50k 2006
Assume also Mary wants to buy a car and her preferences
over automobile attributes are as follows.

make
BMW is better than Ford, Ford is better than Kia

year
the car should be as new as possible

price
the car should be as cheap as possible
This skyline preference relation is denoted by
1
. Then
the corresponding skyline is
id make price year
t
1
ford 30k 2007
t
2
bmw 45k 2008
t
3
kia 20k 2007
t
4
ford 40k 2008
An important de?ciency of the skyline framework is its
inability to represent di?erences in the relative importance
of attributes. In real life scenarios, it is often the case that
bene?ts in one attribute may outweigh losses in one or more
attributes. For instance, given cars that di?er in age and
price, for some people the age is crucial while the price is
secondary. Therefore, the price has to be considered only
when bene?ts in age cannot be obtained, i.e., when the ages
are equal. However, according to the Pareto improvement
principle, the age and the price are of equal importance and
thus have to be considered together when comparing tuples.
Another drawback of the skyline framework is that the
size of a skyline may be exponential in the number of at-
tribute preferences involved [10]. This computational issue
is caused by the looseness of the Pareto improvement princi-
ple. Pareto improvement implies that if a tuple o
1
is better
than o
2
in one attribute, then the existence of any attribute
in which o
2
is better than o
1
makes the tuples incompara-
ble. Thus, every additional attribute increases the number
of such incomparable tuples. It has been shown that skyline
sizes become unmanageable for large data sets [1].
Example 2 Assume that Mary decides that year is more
important than make and price, which in turn are equally
important. Thus, regardless of the values of make and price,
a newer car is always better than an old one. At the same
time, given two cars of the same age, one needs to compare
their make and price to determine the better one. Denote
this preference relation as
2
. The most preferred tuples
according to
2
are
id make price year
t
2
bmw 45k 2008
t
4
ford 40k 2008
Namely, t
2
and t
4
are better than all other tuples in year,
t
2
is better than t
4
in make, and t
4
is better than t
2
in price.
Here we propose the p-skyline framework which general-
izes the skyline framework, and addresses its computational
and semantical limitations. The skyline semantics is en-
riched with the notion of attribute importance in a natural
way. Given an attribute A being more important then an
attribute B, we assume that a tuple with a better value of A
is unconditionally preferred to all tuples with worse values
of A, regardless of their values of B. However, given a tuple
with the same value of A, the one with a better value of B is
preferred. When dealing with equally important attributes,
we use the Pareto improvement principle. Therefore, skyline
queries are also representable in this framework.
It turns out that adding attribute importance to skyline
relations makes more tuples comparable. This results in a
signi?cant reduction in the size of the corresponding query
results. Therefore, incorporating attribute importance into
skyline relations allows not only to model user preferences
more accurately but also to make the size of the correspond-
ing query results more manageable.
So far, we have illustrated only the drawbacks of the sky-
line semantics. However, one of its main properties, which
has made it so popular, is the simplicity of representing pref-
erences. Namely, one needs to provide only a set of atomic
preferences to specify the corresponding preference query.
For p-skylines, an additional piece of information – the rela-
tive importance of the attributes – has to be provided by the
user. However, requiring users to describe attribute impor-
tance explicitly seems impractical for several reasons. First,
the number of required comparisons may be rather large:
one has to compare all attributes pairwise. The second is-
sue is even more serious – users themselves may be not fully
aware of their own preferences.
The major contribution of our paper is an alternative
approach to discovering attribute importance relationships,
based on user feedback. The type of feedback we discuss
here consists of two sets of tuples belonging to a given set:
superior examples [14], i.e., the desirable tuples, and infe-
rior examples [14] i.e., the undesirable tuples. Given the
sets of superior and inferior examples, we consider the prob-
lem of the existence of a p-skyline relation which covers the
corresponding example sets. The next problem we tackle is
computing a p-skyline relation that covers the example sets
and ?ts user preferences optimally. We show that the exis-
tence problem is NP-complete in general, and the problem of
computing optimal p-skyline relations is FNP-complete.
However, restricted versions of these problems – using
only superior examples to discover p-skyline relations – bring
computational bene?ts. First, we show a polynomial-time
approach for checking the existence of a p-skyline relation
covering a given set of superior examples. Second, we pro-
vide a polynomial time algorithm for computing p-skyline
relations which optimally ?t user preferences. We also pro-
vide techniques for improving the algorithm performance.
The experimental evaluation of the algorithms on real-life
and synthetic data sets demonstrates their high accuracy
and scalability. Proofs of the results are provided in [20].
2. BASIC NOTATIONS
Below we describe some concepts of a variant of the binary
relation preference framework [6] which we adopt here.
Let A = {A
1
, ..., A
n
} be a ?nite set of attributes (a re-
lation schema). Let every attribute A
i
? A be associated
with an in?nite domain D
A
i
. The domains considered here
are rationals and uninterpreted constants (numerical or cat-
egorical). Let U =

A
i
?A
D
A
i
. Given a tuple o ? U, we
denote the value of its attribute A
i
as o.A
i
.
De?nition 1 Let A be an attribute from the set A. Then
an atomic preference relation over A is a total order
A
which is a subset of D
A
×D
A
.
De?nition 2 A preference relation is a strict partial order
(SPO) binary relation which is a subset of U × U. A pref-
erence formula is a quanti?er-free ?rst-order formula with
two free tuple variables, built from comparisons (=, =, >, ?).
Given a preference formula f, the preference relation it rep-
resents is denoted as
f
.
Notice that we do not require preference relations to be
?nite. An example of a preference formula is
o
1

1
o
2
? o
1
.year ? o
2
.year ? o
1
.price ? o
2
.price ?
(o
1
.make = BMW ? o
2
.make = Ford ? o
1
.make = Ford ?
o
2
.make = Kia ? o
1
.make = BMW ? o
2
.make = Kia?
o
1
.make = o
2
.make) ? (o
1
.year = o
2
.year ?
o
1
.price = o
2
.price ? o
1
.make = o
2
.make)
representing
1
from Example 3.
Working with preferences, the two most common tasks
are 1) dominance testing: checking if a tuple is preferred
to another one, according to a given preference relation, 2)
computing the most preferred tuples in a given set, according
to a given preference relation. The former problem is solved
easily here by checking if the formula representing the pref-
erence relation evaluates to true for the given pair of tuples.
To deal with the latter problem, the winnow relational al-
gebra operator has been proposed [6, 15].
De?nition 3 Let be preference relation over U. Then
the winnow operator is written as w

(A), and for every
instance r of A (i.e., a subset of U):
w

(r) = {t ? r | ¬?t

? r . t

t},
2.1 p-skyline relations
Here we describe an extension of the skyline framework
[2]. Let H be the set containing the atomic preference
relation
A
for every attribute A ? A. Preferences in
this framework are combined using Pareto accumulation [15]
(which represents equal importance of the preference re-
lations being combined) and prioritized accumulation [15]
(which represents di?erent importance of the preference re-
lations being combined). A preference relation combined
using these operators is called a prioritized skyline prefer-
ence relation or simply a p-skyline relation.
De?nition 4 For a subset S of A, let the relation I
S
be the
set of all pairs of tuples whose values of the attributes in S
are equal, i.e.,
I
S
= {(o, o

) | o, o

? U ? ?X ? S . o.X = o

.X}.
Let also V ar() be the set of all relevant attributes. A pref-
erence relation is a p-skyline relation if one of the follow-
ing holds:
1. is induced by an atomic preference relation
A
? H
? {(o, o

) | o, o

? U . o.A
A
o

.A}.
Then V ar() = {A}.
2. is a prioritized accumulation
c1
&
c2
of p-skyline relations
c1
and
c2
, de?ned as
? {(o, o

) | o, o

? U . o
c1
o

?
I
V ar(
c1
)
(o, o

) ? o
c2
o

},
where V ar(
c1
) ? V ar(
c2
) = ?. Then V ar() =
V ar(
c1
) ? V ar(
c2
).
3. is a Pareto accumulation
c1
?
c2
of p-skyline
relations
c1
and
c2
, de?ned as
? {(o, o

) | o, o

? U . o
c1
o

? I
V ar(
c2
)
(o, o

)?
o
c2
o

? I
V ar(
c1
)
(o, o

) ? o
c1
o

? o
c2
o

},
where V ar(
c1
) ? V ar(
c2
) = ?. Then V ar() =
V ar(
c1
) ? V ar(
c2
).
According to this de?nition, with each p-skyline relation
we associate the set of relevant attributes V ar which are
used to compose the entire relation. It follows from the def-
inition that every atomic preference may appear only once
in a p-skyline relation de?nition. [16] shows that p-skyline
relations are preference relations, i.e., strict partial orders.
Proposition 1 [15] The operators ? and & are associa-
tive. The operator ? is symmetric.
Since the accumulation operators are associative, we ex-
tend them from binary to n-ary operators. Denote the set of
all p-skyline relations, each composed from all members of
H, by F
H
. Such relations are called full p-skyline relations.
All p-skyline relations we deal with further are considered
to be full. It is easy to see that the skyline preference rela-
tion (denoted as sky
H
) is a p-skyline relation constructed as
a Pareto accumulation of the atomic preferences H. Thus,
p-skylines are indeed a generalization of skylines.
Example 3 The preference relations
1
and
2
from Ex-
amples 1 and 2, resp., can be represented as the following
p-skyline relations

1
=
year
?
price
?
make
,

2
=
year
& (
price
?
make
).
The main di?erence between the skyline and the p-skyline
frameworks is the ability to represent relative importance of
atomic preferences (or the corresponding attributes) in the
latter one. It is intuitive to represent a p-skyline relation as
a directed graph showing relative importance of the atomic
preferences used to compose the p-skyline relation. We call
such graphs p-graphs. Their nodes are members of the set
of attributes A, and directed edges go from more important
to less important attributes. Formally, p-graphs are de?ned
as follows.
De?nition 5 Let be a p-skyline relation. Then the p-
graph representing (denoted as ?

) is a binary relation
over A. The sets of nodes N(?

) and edges E(?

) of ?

are de?ned inductively as follows:
1. if is induced by an atomic preference over A, then
N(?

) = {A}, E(?

) = ?;
2. if = (
2
&
2
), then N(?

) = N(?

1
)?N(?

2
),
and E(?

) = E(?

1
) ? E(?

2
) ? {(A, B) | A ?
N(?

1
), B ? N(?

2
)};
3. if = (
1
?
2
), then N(?

) = N(?

1
)?N(?

2
),
and E(?

) = E(?

1
) ? E(?

2
).
Example 4 The relations
1
and
2
(Example 3) are p-
skyline relations, and their p-graphs are shown in Figure 1.
year price make
(a) ?

1
year
price make
(b) ?

2
Figure 1: p-graphs of
1
and
2
Given a p-graph ?

of a p-skyline relation , denote the
set of all children of a node A ? A as Ch
?

(A). We denote
an edge from X to Y in ?

as ?

|= X ?Y .
It turns out that p-graphs not only naturally capture the
notion of relative attribute importance induced by p-skyline
relations, but also can be e?ciently used to describe essen-
tial properties of such relations. We use these properties
extensively in the study of the problem of p-skyline relation
discovery discussed further on. The following theorem shows
how to check if a directed graph is a p-graph.
THEOREM 1 (SPO+Envelope)
A directed graph ? with the node set N(?) = A is a p-graph
of some p-skyline relation i?
1. ? is transitive and irre?exive, i.e., a strict partial order
(SPO), and
2. ? satis?es the Envelope property:
?A, B, C, D ? A, all di?erent
? |= A ?B ? ? |= C ?D ? ? |= C ?B ?
? |= C ?A? ? |= A ?D ? ? |= D ?B
B D
A C
Figure 2: The Envelope property
The graph properties shown in Theorem 1 are justi?ed as
follows. Recall that an edge in a p-graph captures the dif-
ference in the importance of two atomic preferences. Thus,
irre?exivity of p-graphs implies that an atomic preference
relation cannot be more important than itself. Transitivity
of p-graphs captures the transitivity of the importance rela-
tionship. The Envelope property follows from the fact that
each atomic preference relation can have only one occurrence
in the de?nition of a p-skyline preference relation.
The p-skyline relation corresponding to a p-graph ? can
be constructed by a recursive decomposition of ? into pairs
of disjoint subgraphs (Pareto accumulation) or pairs of sub-
graphs in which every node of the ?rst one has an outgoing
edge to every node of the second one (prioritized accumula-
tion).
The next theorem shows that checking the containment
of full p-skyline relations can be reduced to the problem of
checking the containment of the corresponding p-graphs.
THEOREM 2 Let
1
,
2
be two members of F
H
. Then
1.
1
=
2
? E(?

1
) = E(?

2
)
2.
1
?
2
? E(?

1
) ? E(?

2
).
The graph ?
sky
H
, containing no edges, is the least among
all p-graphs corresponding to p-skyline relations in F
H
. The-
orem 2 implies that sky
H
is the least among all p-skyline
relations in F
H
viewed as sets of pairs.
2.2 p-skyline query evaluation
For completeness, we outline here some methods of p-
skyline query evaluation. We have shown that the p-skyline
framework is a generalization of the framework of skylines.
Therefore, some skyline algorithms can be adopted to eval-
uate p-skyline queries: BNL [2] and SFS [7]. The former
algorithm is suitable for any preference relation, and thus
can be applied to p-skyline relations as well. To apply the
latter algorithm, one needs to construct a weak order of tu-
ples compatible with the p-skyline relation. This can be
done by constructing a p-graph T

which is a total order of
attributes containing the original p-graph ?

. By Theorem
2, ?

, and it is easy to check that

is a weak order.
The de?nition of p-skyline relations is based on the accu-
mulation operators which are the foundation of the Prefer-
ence SQL language [17]. Thus, to evaluate a p-skyline query,
one could convert it to a Preference SQL statement, using
certain base preference constructors to represent the atomic
preference relations. A number of optimization techniques
have been developed for such queries [12].
3. P-SKYLINE DISCOVERY
In this section, we focus on the problem of discovering
p-skyline preference relations, given user feedback. This
problem can be decomposed into two subproblems. The
?rst consists of discovering the relevant attributes in A and
the corresponding atomic preference relations. The second
problem consists of discovering the structure of the p-skyline
relation that captures user preferences, given a set of atomic
preference relations. Namely, we need to ?nd an appropri-
ate combination of the accumulation operators needed to
construct the appropriate p-skyline relation. Here we focus
on the second problem. Some methods of solving the ?rst
problem are discussed in Section 5.
3.1 Discovering relative attribute importance
Preference discovery is based on some form of feedback
from a target user. We use the following scenario [14]. Given
a ?nite set of tuples O ? U, a user has to partition it into
disjoint subsets: the set G of tuples she con?dently likes
(superior examples), the set W of tuples she con?dently
dislikes (inferior examples), and the set of remaining tuples
about which she is not sure. We assume here that if a tuple o
is inferior, then there is at least one superior example which
the user likes more than o. This assumption is justi?ed
by a general principle that a user considers something bad
because she knows of a better alternative.
Now having subsets G and W of a set of tuples O, we want
to ?nd a possible p-skyline relation which implies G being
superior and W being inferior examples in O. Formally, we
want to compute ? F such that 1) G ? w

(O) (i.e.,
the tuples in G are among the most preferred tuples in O
according to ), 2) for every tuple o

in W, there is a tuple
o in G such that o o

(i.e., o

is an inferior example). Such
a p-skyline relation is called favoring G and disfavoring
W in O.
Thus, the ?rst problem we consider is the existence of
favoring/disfavoring p-skyline relations.
Problem 1. Given a set of atomic preference relations H,
a set of superior examples G, and a set of inferior examples
W in a set O, determine if there exists a p-skyline relation
? F
H
favoring G and disfavoring W in O.
Example 5 Take the car inventory data and the atomic
preferences H = {
make
,
year
,
price
} from Example 1.
Assume G = {t
4
}, W = {t
3
}. We discover ? F
H
favoring
G and disfavoring W. First,
make
cannot be more impor-
tant than all other atomic preferences since then t
2
and t
5
dominate t
4
, and thus t
4
is not superior. Moreover,
price
cannot be more important than the other atomic preferences
because then t
3
and t
1
dominate t
4
. However, if
year
is
more important than the other atomic preferences, then t
4
dominates t
3
and no other tuple dominates t
4
in
year
. At
the same time, both t
2
and t
4
are the best according to
year
,
but t
2
may dominate t
4
in
make
. Therefore,
make
should
not be more important than
price
. Thus, the following p-
skyline relation favors G and disfavors W in O

2
=
year
& (
price
?
make
)
The set of the best tuples in O according to
2
is {t
2
, t
4
}.
Generally there may be zero or more members of F
H
fa-
voring G and disfavoring W in O. When more than one
of them exist, we pick a maximal one (in the set theoretic
sense). Larger preference relations imply more dominated
tuples and fewer most preferred ones. Consequently, the re-
sult of w

(O) gets more manageable due to its decreasing
size. Moreover, maximizing corresponds to minimizing
w

(O) ? G, which implies more precise correspondence of
to the real user preferences. A maximal p-skyline relation
favoring G and disfavoring W in O is called optimal.
Thus, the second problem considered here is computing
an optimal p-skyline relation favoring G and disfavoring W.
Problem 2. Given a set of atomic preference relations
H, a set of superior examples G, and a set of inferior ex-
amples W in a set O, discover an optimal p-skyline relation
? F
H
favoring G and disfavoring W in O.
Example 6 Take G, W, and
2
from Example 5. Note
that in order to make t
4
dominate t
2
, we need to make price
more important than year. As a result, the relation

3
=
year
&
price
&
make
also favors G and disfavors W in O, but the set of best tuples
in O according to
3
is {t
4
}. Moreover,
3
is optimal.
This is because no other p-skyline relation favoring G and
disfavoring W contains
3
since the p-graph of
3
is a
total order of {year, price, make} and thus is a maximal
SPO.
Below we show the constraints which should be satis?ed
by the p-graph ?

of a p-skyline relation favoring G and
disfavoring W in O. Let us consider the notion of favoring
G in O ?rst. For any member o

? G to be in the set of
the most preferred tuples of O, o

must not be dominated
by any tuple in O. That is,
?o ? O, o

? G . o o

(1)
Dominance testing can be done easily.
THEOREM 3 Let be a member of F
H
and o, o

be dif-
ferent members of U. Let B(o
1
, o
2
) = {A | o
1
.A
A
o
2
.A}.
Then
o o

?Ch
?

(B(o, o

)) ? B(o

, o)
In other words, o is preferred to o

i? every attribute in
which o

is better than o is less important than at least one
attribute in which o is better than o

.
Note that all less important attributes for an attribute are
its descendants in the p-graph. Since p-graphs are transitive,
all the descendants of an attribute are also its children.
Using Theorem 3, we can rewrite (1) as
?o ? O, o

? G . Ch
?

(B(o, o

)) ? B(o

, o) (2)
Note that no tuple can be preferred to itself by irre?exivity
of . Therefore, a p-skyline relation favoring G in O should
satisfy (|O| ?1) · |G| negative constraints ? in the form:
? : Ch
?

(L
?
) ? R
?
where L
?
= B(o, o

), R
?
= B(o

, o). We denote this set of
constraints as N(G, O).
Example 7 Take Example 1. Then any p-skyline relation
? F
H
favoring G = {t
3
} in O has to satisfy each negative
constraint below
t
1
t
3
Ch
?

({make}) ? {price}
t
2
t
3
Ch
?

({make, year}) ? {price}
t
4
t
3
Ch
?

({make, year}) ? {price}
t
5
t
3
Ch
?

({make}) ? {price,year}
Now consider the notion of disfavoring W in O. According
to the de?nition, a p-skyline relation disfavors W in O i?
the following holds
?o

? W ?o ? G . o o

(3)
Following Theorem 3, it can be rewritten as a set of positive
constraints P(W, G)
?o

? W

o
i
?G
Ch
?

(B(o
i
, o

)) ? B(o

, o
i
). (4)
Therefore, in order for to disfavor W in O, its p-graph
?

has to satisfy |W| positive disjunctive constraints.
Example 8 Take Example 1. Then any p-skyline relation
? F
H
favoring G = {t
1
, t
3
} and disfavoring W = {t
4
} in
O has to satisfy the constraint
t
1
t
4
? t
3
t
4
which is equivalent to the following positive constraint
Ch
?

({price}) ? {year} ? Ch
?

({price}) ? {year,make}
Summarizing: in order to construct a p-skyline relation
favoring G and disfavoring W in O, we need to construct
an attribute importance graph ?

satisfying SPO+Envelope,
N(G, O), and P(W, G). By Theorem 2, a p-graph of an opti-
mal is maximal among all graphs satisfying SPO+Envelope,
N(G, O), and P(W, G).
3.2 Using superior and inferior examples
In this section, we study the problem of computing a fa-
voring/disfavoring p-skyline preference relation. We start
with the problem of its existence.
THEOREM 4 Problem 1 is NP-complete.
We prove Theorem 4 by a reduction from SAT. To address
Problem 2, we develop a method of computing minimal ex-
tensions of p-skyline relations discussed below.
De?nition 6 A relation

? F
H
is a minimal extension
of ? F
H
if ?

and there is no

? F
H
such that
?

?

.
Similarly, we can de?ne minimal extensions of ?

. We
note that by Theorems 1 and 2, a p-graph ?

of a minimal
extension

of ? F
H
is also a minimal extension satisfying
SPO+Envelope of the p-graph ?

.
3.2.1 Computing minimal extensions of a p-skyline
relation
Here we develop a set of rewriting rules which allow to
construct all minimal extensions of a given p-skyline rela-
tion. The rules are applied to a syntax tree of a p-skyline
relation. Given a p-skyline relation , its syntax tree T

corresponds to its de?nition using accumulation operators.
Every leaf node of such a syntax tree is an attribute. Each
non-leaf node is labeled with an accumulation operator and
corresponds to the result of applying this operator. Due
to associativity of ? and & and symmetry of ?, the same
p-skyline relation may have more than one syntax tree. Syn-
tax trees representing the same p-skyline relation are called
equivalent. A p-skyline syntax tree is called normalized if
every non-leaf node is labeled di?erently from its parent.
Clearly, for every p-skyline relation, there is a normalized
p-skyline syntax tree. However, a normalized syntax tree is
not unique for a p-skyline relation due to symmetry of ?.
Example 9 Let a p-skyline relation be

4
=
A
? (
B
&
C
) ? (
D
& (
E
?
F
))
Instances of its unnormalized and normalized syntax trees
are shown in Figure 3.
?
A &
B C
?
&
D ?
E F
(a) Unnormalized
?
A
&
B C
&
D ?
E F
(b) Normalized
Figure 3: p-skyline syntax tree of
4
Every node of a syntax tree is itself a root of another syn-
tax tree. Let us associate with every node C of a syntax tree
the set V ar(C) of attributes which are the leaf descendants
of C in the parse tree. Essentially, V ar(C) corresponds to
V ar(
C
) introduced in De?nition 4.
Note that we use two graph notations for p-skyline rela-
tions: p-graphs and p-skyline syntax trees. Although they
represent di?erent concepts, there is a relationship between
them, shown informally in Figure 4. For every pair of paths
in a syntax tree going from a ? -node to leaf nodes, no
edges exist between the nodes in the di?erent paths. For
every pair of paths in a syntax tree going from a & -node
to leaf nodes, the p-graph has edges going from all the leaf
nodes on the left path to all the leaf nodes on the right path.
These relationships follow from De?nition 5.
N
1
?
N
i
N
k
part of
a syntax tree
N
1
N
i
N
k
part of
the p-graph
N
1
&
N
i
N
k
part of
a syntax tree
N
1
N
i
N
k
part of
the p-graph
Figure 4: Syntax tree and p-graph correspondence
The method of computing all minimal extensions we pro-
pose here operates directly on p-skyline relations represented
as p-skyline syntax trees. In particular, we show a set of
rewriting rules of p-skyline syntax trees such that every
unique application of a rule from this set results in a unique
minimal extension of the given p-skyline relation. If all min-
imal extensions of a p-skyline relation are needed, then one
needs to apply to the syntax tree every rule in every possible
way.
Original tree part
?
C
1
. . .
&
C
i+1
. . .
C
k
N
1
. . .
N
m
R
C
i
Transformed tree part
?
C
1
. . .C
i?1
C
i+2
. . .
C
k
&
N
1
?
C
i+1 &
N
m
. . .
N
2
(a) Rule
1
(T

, C
i
, C
i+1
)
Original tree part
?
C
1
. . .
&
C
i+1
. . .
C
k
N
1
. . .
N
m
R
C
i
Transformed tree part
?
C
1
. . .C
i?1
C
i+2
. . .
C
k
&
N
m
?
C
i+1 &
N
m?1
. . .
N
1
(b) Rule
2
(T

, C
i
, C
i+1
)
Original tree part
?
C
1
. . .
C
i
C
i+1
. . .
C
k
R
Transformed tree part
?
C
1
. . .C
i?1
C
i+2
. . .
C
k
&
C
i
C
i+1
(c) Rule
3
(T

, C
i
, C
i+1
)
Original tree part
?
C
1
. . . . . .
C
k
&
N
1
. . .
N
m
&
M
1
. . .
M
n
R
C
i
C
i+1
Transformed tree part
?
C
1
C
i?1
. . . C
i+2
. . .
C
k
&
?
& &
N
s+1
. . .
N
m
M
t+1
. . .
M
n
?
& &
M
t
. . .
M
1
N
s
. . .
N
1
(d) Rule
4
(T

, C
i
, C
i+1
, s, t)
C
i - leaf node
C
i - leaf or non-leaf node
Figure 5: Syntax tree rewriting rules
The rewriting rules are shown in Figure 5. On the left
hand side, we show a part of a normalized syntax tree of the
original p-skyline relation. On the right hand side, we show
how the corresponding part is modi?ed in the syntax tree of
the resulting relation. We assume that the remaining part
of the syntax tree is left unchanged. All the rewriting rules
perform actions with two children C
i
and C
i+1
of a ?-node
R of a syntax tree. For the sake of simplicity, C
i
and C
i+1
are shown in Figure 5 as consecutive children. However, in
the rules we assume that they may be any pair of child nodes
of the same ?-node.
It can be easily checked (Figure 4) that all the rules only
add edges to the p-graph of the original preference relation.
That means (Theorem 2) that they extend the original p-
skyline relation.
Observation 1 Given a p-skyline relation ? F
H
, a single
application of a rule from {Rule
1
, . . . , Rule
4
} to a normal-
ized syntax tree T

adds the following edges to ?

:
• Rule
1
(T

, C
i
, C
i+1
):
{XY | X ? V ar(N
1
), Y ? V ar(C
i+1
)}
• Rule
2
(T

, C
i
, C
i+1
):
{XY | X ? V ar(C
i+1
), Y ? V ar(N
m
)}
• Rule
3
(T

, C
i
, C
i+1
): {(C
i
, C
i+1
)}
• Rule
4
(T

, C
i
, C
i+1
, s, t):
{XY |X ? ?
p?1...s
V ar(N
p
), Y ? ?
q?t+1...n
V ar(M
q
)} ?
{XY |X ? ?
p?1...t
V ar(M
p
), Y ? ?
q?s+1...m
V ar(N
q
)}.
We note that every & - and ? -node in a p-skyline syntax
tree has to have at least two child nodes. This is because
the operators & and ? must have at least two arguments.
However, as a result of a rewriting rule application, some
& - and ? -nodes may have only one child node, i.e., the
syntax tree becomes invalid. To make it valid, we remove all
nodes C with a single child and connect the child directly
to the parent of C, as shown in Figure 6.
Before single-child
node elimination
?
N
? ? { & , ? }
After single-child
node elimination
N
Figure 6: Single-child node elimination
An important property of the set of rules in Figure 5 is
shown in the next proposition. It says that an application
of any rule in Figure 5 results in a minimal extension of
the p-skyline relation, and every minimal extension may be
obtained using a single application of one of these rules.
Proposition 2 Let ? F
H
, and T

be a normalized syn-
tax tree of . Then

is a minimal extension of i? a
syntax tree T

of

is obtained from T

by a single appli-
cation of a rule Rule
1
, . . . , Rule
4
.
Proposition 2 has several corollaries which we use further.
Corollary 1 A syntax tree of every minimal extension of
? F
H
may be computed in time O(|A|).
In Corollary 1, we assume the adjacency-list representa-
tion of syntax trees. The total number of nodes in a tree is
linear in the number of its leaf nodes |A|. Thus the number
of edges in T

is O(|A|). The rewriting of T

using any rule
requires removing and adding O(|A|) edges of T

.
Corollary 2 Every ? F
H
has at most O(|A|
4
) di?erent
minimal extensions.
The justi?cation for Corollary 2 is as follows. The set
of minimal-extension rules is complete due to Proposition
2. Every rule operates with two nodes – C
i
and C
i+1

of a syntax tree. Hence, the number of such node pairs
is O(|A|
2
). Rule
4
also relies on some partitioning of the
sequence of child nodes of C
i
and C
i+1
. The total number of
such partitionings is O(|A|
2
). As a result, the total number
of di?erent rule applications is O(|A|
4
).
THEOREM 5 Problem 2 is FNP-complete
We note that Problem 2 is a functional problem. The
complexity class FNP for functional problems is a counterpart
of the class NP for decision problems. The membership of
Problem 2 in FNP follows from Corollaries 1 and 2. To prove
its hardness, we use a reduction from FSAT.
In view of Theorems 4 and 5, we consider next more re-
stricted versions of Problems 1 and 2. We will assume that
there are no inferior examples (W = ?).
3.3 Using superior examples only
By the de?nition of the problem of discovering favoring/
disfavoring p-skyline relations, a user is required to iden-
tify both superior and inferior examples in a set of tuples.
In many scenarios, users are only indirectly involved in the
process of identifying such classes. For instance, the click-
through rate may be used to measure the popularity of prod-
ucts. Using this metric, it is easy to ?nd the superior exam-
ples – the tuples with the highest click-through rate. How-
ever, the problem of identifying inferior examples – those
which the user con?dently dislikes – is harder. Namely, low
click-through rate may mean that a tuple is inferior, as well
as that the user does not know about it, or it does not sat-
isfy the search criteria. Thus, there is a need for discovering
p-skyline relations based on superior examples only.
In this section, we consider restrictions of Problems 1 and
2 assuming that W = ?.
Problem 1

. Given a set of atomic preference relations
H and a set of superior examples G in a set O, determine
if there exists ? F
H
favoring G in O.
Problem 2

. Given a set of atomic preference relations
H and a set of superior examples G in a set O, discover an
optimal p-skyline relation ? F
H
favoring G in O.
As we show further, these problems are computationally
simpler than the corresponding problems for favoring/disfa-
voring p-skyline relations (if P = NP). We show polynomial
time algorithms for both of them.
Consider Problem 1

?rst. Theorem 2 implies that the
skyline preference relation is the least in the class F
H
. It
is also well known that smaller preference relations result
in larger sets of the most preferred tuples. Therefore, the
relation in F
H
which results in the largest set w

(O) is
the skyline preference relation sky
H
.
Proposition 3 There exists a p-skyline relation ? F
H
favoring G in O i? G ? w
sky
H
(O).
Proposition 3 implies that to solve Problem 1

, one needs
to evaluate a skyline algorithm over O and check if the result
contains G. This clearly can be done in polynomial time.
In order to e?ciently solve Problem 2

, we present a poly-
nomial time algorithm below. Before starting to describe
the algorithm, let us consider the constraints which should
be satis?ed by the p-graph ?

of such a discovered relation
. Since we do not deal with inferior examples here, the
set of constraints su?cient to guarantee favoring G in O is
SPO+Envelope and N(G, O). Furthermore, to make the re-
lation optimal favoring G in O, ?

has to be a maximal
graph satisfying these constraints.
The approach of constructing optimal favoring p-skyline
relations we propose here is based on iterative rewriting of
normalized syntax trees. We use a subset of the rewriting
rules shown in Figure 5. We assume that the provided set
of superior examples G satis?es Proposition 3, i.e., G ?
w
sky
H
(O). First, we generate the set of negative constraints
N(G, O). The p-skyline relation we start with is sky
H
since
it is the least relation in F
H
favoring G in O. In every
iteration of the algorithm, we pick an atomic preference in
H and apply to the current p-skyline relation’s syntax tree a
?xed set of rewriting rules. As a result, we obtain a “locally
maximal” p-skyline relation satisfying the given set N(G, O)
of negative constraints. Eventually, this technique produces
a maximal p-skyline relation satisfying N(G, O).
Let us describe what we mean by “locally maximal”.
De?nition 7 Let M be a nonempty subset of A. A p-
skyline relation ? F
H
which favors G in O such that
E(?

) ? M ×M is called M-favoring G in O. A maximal
relation among all of such relations is called optimal.
By De?nition 7, the edge set of a p-graph of every optimal
M-favoring relation is maximal among all the p-graphs of
M-favoring relations. Note that if M is a singleton, the edge
set of a p-graph ?

of an optimal M-favoring relation is
empty, i.e., = sky
H
. If M = A, then an optimal p-skyline
relation M-favoring G in O is also optimal favoring G in O.
Thus, if we had a method of transforming an optimal M-
favoring to an optimal (M?{A})-favoring p-skyline relation
for any attribute A, we could construct an optimal p-skyline
relation favoring G in O by induction.
To do this transformation, we use the following subset
of the rules shown in Figure 5: 1) Rule
1
if C
i+1
= A or
N
1
= A; 2) Rule
2
if C
i+1
= A or N
m
= A; and 3) Rule
3
if
C
i
= A or C
i+1
= A. These rules have the following prop-
erties. First, given a p-skyline relation , they add only
the edges to ?

which go to or from the node A. Second,
any optimal (M ? {A})-favoring p-skyline relation

can
be constructed from any optimal M-favoring p-skyline re-
lation (s.t. ?

) by some consecutive application of
these rules. Notice that Rule
4
is not used here because any
its application adds to a p-graph edges going from at least
two and to at least two attributes. At the same time, not
every consecutive application of the rules above results in
an optimal (M ? {A})-favoring relation. Further in Algo-
rithm 4 we show a rule application strategy which allows
to construct optimal (M?{A})-favoring p-skyline relations.
That strategy is exhaustive, i.e., we apply the rules while
the resulting relation satis?es N(G, O).
3.3.1 Ef?cient constraint checking
We have shown in Observation 1 that the edge set of the
p-graph of a transformed p-skyline relation monotonically
increases. Therefore, while we transform a p-skyline relation
, for every negative constraint ?, we can just drop the
elements of R
?
which already have incoming edges from L
?
.
Proposition 4 Let ? F
H
satisfy a system of negative
constraints N. Construct the system of negative constraints
N

from N in which every ?

? N

is created from a member
? of N in the following way
• L
?
= L
?
• R
?
= R
?
?{Y ? R
?
| ?X ? L
?
. ?

|= X ?Y }
Then any

? F
H
, which is a superset of , satis?es N i?

satis?es N

.
A system of negative constraints N

obtained as a result of
the transformation shown in Proposition 4 is called minimal
w.r.t. . The next proposition summarizes the constraint
checking rules over a minimal system of negative constraints.
Proposition 5 Let ? F
H
satisfy a system of negative
constraints N, and N be minimal w.r.t. . Let

? F
H
be obtained from using one of the rewriting rules. Denote
the added parents and children of A in ?

as P
A
and C
A
correspondingly. Then

violates N i? there is a constraint
? ? N s.t.
• R
?
= {A} ? P
A
? L
?
= ?, or
• A ? L
?
? R
?
? C
A
3.3.2 Algorithm for p-skyline relation discovery
Here we present an algorithm for constructing an optimal
p-skyline relation favoring G in O. This algorithm can only
be used if G ? w
sky
H
(O), otherwise, by Proposition 3, no
such p-skyline relation exists.
The algorithm provides a strategy for applying the p-
skyline syntax tree rewriting rules discussed above. The
function push (Algorithm 1) takes four arguments: a set M
of attributes, a normalized syntax tree T of an M-favoring
p-skyline relation, the current attribute A, and a system
of negative constraints N minimal w.r.t. ( here is a
p-skyline relation represented by the syntax tree T). It re-
turns true if a rewriting rule has been applied to T without
violating N, and false if no rewriting rule can be applied
to T without violating N. When the function returns, N
is minimal w.r.t. the p-skyline relation represented by the
modi?ed syntax tree, and T is normalized.
The goal of push is to ?nd an appropriate rewriting rule
which adds to the current p-graph edges going from M to A
or vice versa. The function has two branches: the parent of
the node A in the syntax tree T is a &-node (i.e., we may
apply Rule
1
where N
1
is A or Rule
2
where N
m
is A), or
a ?-node (i.e., we may apply Rule
1
or Rule
2
where C
i+1
is A, or Rule
3
where C
i
or C
i+1
is A). In the ?rst branch
(line 2-14), we distinguish between applying Rule
1
(line 3-8)
and Rule
2
(line 9-14). It is easy to notice that with the pa-
rameters speci?ed above, the rules are exclusive. However,
the application patterns are similar. First, we ?nd an ap-
propriate child C
i+1
of R (lines 4 and 10). It is important
for V ar(C
i+1
) to be a subset of M because we want to add
edges going from M to A or from A to M. Then we check
if the corresponding rule application will not violate N. To
do that, we use the function checkConstr (lines 5 and 11)
as per Proposition 5. If a rule application does not violate
N, we apply the corresponding rule (lines 6 and 12) and
minimize N w.r.t. the p-skyline which is the result of the
rewriting (Proposition 4). To do that, we use the function
minimize.
The second branch of push is similar to the ?rst one and
di?erent in only the rewriting rules applied. As a result, it is
easy to notice that push checks every possible rule applica-
tion not violating N and adds to the p-graph only the edges
going from A to the elements of M or vice versa.
The function discover (Algorithm 4) is the entry point
of the entire algorithm. It takes four arguments: a set of
superior examples G, the entire set of tuples O, the set H of
atomic preference relations, and the set of relevant attributes
A. It iteratively constructs an optimal p-skyline relation
favoring G in O. In each iteration, it calls push repeatedly
until it returns false, i.e., until every possible application of
rewriting rules violates N. After every rule application, it
performs normalization of the modi?ed syntax tree, which is
required by Proposition 2. Thus, the syntax tree T obtained
in the repeat loop represents an optimal (M?{A})-favoring
p-skyline relation. Note that the repeat loop eventually
terminates since every rewriting rule pushes A down the
syntax tree, and the number of non-leaf nodes in a p-skyline
syntax tree cannot be more than |A| ?1.
THEOREM 6 The function discover returns a syntax-
tree of a maximal p-skyline relation satisfying the system of
negative constraints N. Its running time is O(|N| · |A|
3
).
Theorem 6 implies that if N is N(G, O) then discover
returns an optimal p-skyline relation favoring G in O.
Algorithm 1 push(T, M, A, N)
Require: T is normalized
1: if the parent of A in T is of type &
2: C
i
:= parent of A in T; R := parent of C
i
in T;
3: if A is the ?rst child of C
i
4: for each child C
i+1
of R s.t. V ar(C
i+1
) ? M
5: if checkConstr(N, A, ?, V ar(C
i+1
))
6: apply Rule
1
(T, C
i
, C
i+1
)
7: N := minimize(N, V ar(A), V ar(C
i+1
))
8: return true
9: else if A is the last child of C
i
10: for each child C
i+1
of R s.t. V ar(C
i+1
) ? M
11: if checkConstr(N, A, V ar(C
i+1
), ?)
12: apply Rule
2
(T, C
i
, C
i+1
)
13: N := minimize(N, V ar(C
i+1
), V ar(A))
14: return true
15: else // the parent of A in T is of type ?
16: R := parent of A in T;
17: for each child C
i
of R s.t. V ar(C
i
) ? M
18: if C
i
is of type &
19: N
1
:= ?rst child of C
i
, N
m
:= last child of C
i
20: if checkConstr(N, A, V ar(N
1
), ?)
21: apply Rule
1
(T, C
i
, A)
22: N :=minimize(N, V ar(N
1
), V ar(A))
23: return true
24: else if checkConstr(N, A, ?, V ar(N
m
))
25: apply Rule
2
(T, C
i
, A)
26: N := minimize(N, V ar(A), V ar(N
m
))
27: return true
28: else // C
i
is a leaf node, since T is normalized
29: if checkConstr(N, A, V ar(C
i
), ?)
30: apply Rule
3
(T, C
i
, A)
31: N :=minimize(N, V ar(C
i
), V ar(A))
32: return true
33: else if checkConstr(N, A, ?, V ar(C
i
)
34: apply Rule
3
(T, A, C
i
)
35: N :=minimize(N, V ar(A), V ar(C
i
))
36: return true
37: return false
Algorithm 2 checkConstr(N, A, P
A
, C
A
)
1: for each ? ? N
2: if R
?
= {A} ? P
A
? L
?
= ? ? A ? L
?
? R
?
? C
A
3: return false
4: return true
In our implementation of the algorithm, all sets of at-
tributes are represented as bitmaps of ?xed size |A|. Simi-
larly, every negative constraint ? is represented as a pair of
bitmaps corresponding to L
?
and R
?
.
3.3.3 Reducing the number of negative constraints
As we showed in Theorem 6, the running time of the func-
tion discover linearly depends on the size of the system of
negative constraints N. If we construct it as shown in Sec-
tion 3.1, it will contain (|O| ? 1) · |G| constraints. In this
section, we propose two methods of reducing the size of N.
Both methods are based on applying the skyline operator.
Algorithm 3 minimize(N, U, D)
1: for each constraint ? in N
2: if U ? L
?
= ?
3: R
?
?R
?
?D
4: return N
Algorithm 4 discover(G, O, H, A)
Require: G ? w
sky
H
(O)
1: N = N(G, O)
2: T = a normalized syntax tree of sky
H
3: M = any attribute from A
4: for each attribute A in A?M
5: repeat
6: r = push(T, M, A, N);
7: Normalize the syntax tree T
8: until r is false
9: M = M ? {A}
10: return T
However, the ?rst method applies it to the tuple set O, and
the second to the set of constraints itself.
Note that each negative constraint is used to show that a
tuple should not be preferred to a superior example. We also
know that the relation sky
H
is the least in F
H
. By de?nition
of the winnow operator, for every o

? O ? w
sky
H
(O) there
is a tuple o ? w
sky
H
(O) s.t. o is preferred to o

according to
sky
H
. Since sky
H
is the least relation in F
H
, the same o is
preferred to o

according to every member of F
H
. Thus, in
order to guarantee favoring G in O, the system of negative
constraints needs to contain only the constraints showing
that the tuples in w
sky
H
(O) are not preferred to the superior
examples. The size of such a system of negative constraints
is (|w
sky
H
(O)| ?1) · |G|.
Another way to reduce the size of a system of negative
constraints is based on the following fact. Let us take two
negative constraints ?, ?

? N such that L
?
? L
?
, R
?
?
R
?
, and at least one of these relationships is strict. It is
easy to check that ? strictly implies ?

. Thus, the constraint
?

is redundant and may be deleted from N. This idea can
also be expressed as follows:
? strictly implies ?

i? L
?
? L
?
? (A?R
?
) ? (A?R
?
)
and at least one containment is strict.
Represent ? as the bitmap representation of (A?R
?
) ap-
pended to the bitmap representation of L
?
. We assume that
a bit is set to 1 i? the corresponding attribute is in the cor-
responding set. Denote such a representation as bitmap(?).
Example 10 Let L
?
= {A
1
, A
3
, A
5
}, R
?
= {A
2
}, L
?
=
{A
1
, A
5
}, R
?
= {A
2
, A
4
}. Let A = {A
1
, . . . , A
5
}. Then
bitmap(?) = 10101 10111 and bitmap(?

) = 10001 10101.
Consider bitmap(?) as a vector with 2 · |A| dimensions.
From the negative constraint implication rule, it follows that
? strictly implies ?

i? bitmap(?) and bitmap(?

) satisfy the
Pareto improvement principle, i.e., the value of every di-
mension of bitmap(?) is greater or equal to the correspond-
ing value in bitmap(?

), and there is at least one dimen-
sion whose value in bitmap(?) is greater than in bitmap(?

).
Therefore, the set of all non-redundant constraints in N cor-
responds to the skyline of the set of bitmap representations
of all constraints in N. Moveover, bitmap(?) can have only
two values in every dimension. Thus, an algorithm for com-
puting skylines over low cardinality domains (e.g. [21]) can
be used to compute the set of non-redundant constraints.
Example 11 Take O and H from Example 1, and G from
Example 7. Let us ?nd an optimal ? F
H
favoring G in
O by running discover. Among the constraints in Exam-
ple 7, only N = {? : Ch
?

({make, year}) ? {price}} is
nonredundant. Moreover, it cannot be further minimized be-
?
price make
year
(a)
?
&
year
price make
(b)
&
? price
make
year
(c)
&
price year
make
(d)
Figure 7: Example 11
cause R
?
is a singleton. Consider the attributes in the order:
make, price, and year.
We run discover. The tree T (line 2) is shown in Fig-
ure 7(a). The initial value of M is {make}. First, we call
push(T, M, price, N). The parent of price is the ?-node
(Figure 7(a)), so we go to line 16 of push, where R is set
to the ?-node (Figure 7(a)). After C
i
is set to the node
make in line 17, we go to line 29 because it is a leaf node.
The checkConstr test in line 29 fails because N prohibits
the edge make ? price. Hence, we go to line 33 where the
checkConstr test succeeds. We apply Rule
3
(T, price, C
i
),
push returns true, and the resulting syntax tree T is shown
in Figure 7(b). Next time we call push(T, M, price, N) (line
6 of discover), we get to line 4 of push. Since year ? M, we
immediately go to line 37 and return false. In discover we
set M to {make, price} and call push(T, M, year, N). There
we go to line 16 (R is set to the ?-node in Figure 7(b)), C
i
is
set to the &-node (Figure 7(b)), we apply Rule
1
(T, C
i
, year)
(the resulting tree T is show in Figure 7(c)), and true is re-
turned. When push(T, M, year, N) is called next time, we
?rst go to line 16, R is set to the ?-node (Figure 7(c)), and
C
i
to the node make. Then Rule
3
(T, C
i
, year) is applied
(line 30) resulting in the tree T shown in Figure 7(d), and
true is returned. push(T, M, year, N) is called once again
from discover, but it returns false, and thus the tree in
Figure 7(d) is the ?nal one. According to the corresponding
p-skyline relation, t
3
dominates all other tuples in O.
The ?nal p-skyline relation constructed in Example 11 is
a prioritized accumulation of all the atomic preference rela-
tions. This is due to the fact that N contained only one con-
straint. When more constraints are involved, a discovered
p-skyline relation generally also has occurrences of Pareto
accumulation.
4. EXPERIMENTS
We have performed extensive experimental study of the
proposed framework. The algorithms were implemented in
Java. The experiments were ran on Intel Core 2 Duo CPU
2.1 GHz with 2.0GB RAM. The experiments were run on
four data sets: one real-life and three synthetic.
4.1 Experiments with real-life data
In this section, we focus on experimenting with the accu-
racy of the p-skyline relation discovery algorithm and the
reduction of skyline sizes achieved by modeling user prefer-
ences using p-skyline relations. We used a data set O which
stores statistics of NHL players [22] and contains 9395 tu-
ples. We used three sets of relevant attributes A of 12, 9,
and 6 attributes, respectively. The sizes of the correspond-
ing skylines were 568, 114, and 33. In this experiment, we
randomly generated 100 p-skyline relations
fav
and com-
puted the sets w

fav
(O). For each w

fav
(O), we randomly
picked 5 tuples from it, and used them as superior exam-
ples G
fav
to discover three di?erent optimal p-skyline rela-
tions favoring G
fav
in O using the proposed algorithm.
Out of those three relations, we picked the one resulting in
w

(O) of the smallest size. Then we added 5 more tuples
from w

fav
(O) to G
fav
and repeated the same procedure.
We kept adding tuples to G
fav
from w

fav
(O) until G
fav
reached w

fav
(O).
To measure the accuracy of the p-skyline discovery al-
gorithm, we computed the following three values: (i) the
measure of similarity of w

(O) and w

fav
(O)
acc-ratio =
|w

(O) ? w

fav
(O)|
|w

fav
(O)|
,
(ii) the ratio of tuples which should have not been in w

(O)
fp-ratio =
|w

(O) ?w

fav
(O)|
|w

fav
(O)|
,
and (iii) the ratio of tuples which should have been in w

(O)
fn-ratio =
|w

fav
(O) ?w

(O)|
|w

fav
(O)|
.
We plotted the average values of these measures in Figure 8.
As we can observe, when the sets of superior examples G
fav
used to discover were of size 15 or more, the average values
of acc-ratio were greater than 0.83 for every set A we used.
We note that acc-ratio converges to 1 faster for smaller A
due to the smaller size of the corresponding skyline. At the
same time, even for |A| = 12, acc-ratio becomes close to
1 (> 0.95) when the number of superior examples were 40
or more. The fact that fn-ratio starts from comparatively
large values can be justi?ed by the fact that the discovery
algorithm computes an optimal relation favoring G
fav
in O.
Thus, when G
fav
is small, it is not su?cient to exactly cap-
ture the corresponding preference relation
fav
. However,
when we increase the number of superior examples, the value
of fn-ratio consistently goes down.
5 10 15 20 25 30
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
# superior examples
acc-ratio fp-ratio
fn-ratio
(a) |A| = 6
0 20 40 60 80
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
# superior examples
acc-ratio fp-ratio
fn-ratio
(b) |A| = 9
0 20 40 60 80 100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
# superior examples
acc-ratio fp-ratio
fn-ratio
(c) |A| = 12
Figure 8: Accuracy of p-skyline discovery
In Figure 9, we use a similar setup to show how the size of
a p-skyline query result varies as a function of the number
of superior examples. Here we used two kinds of superior
example sets: sets G
fav
which were drawn from the sets of
the best tuples according to real p-skyline relations (as dis-
cussed above), and sets G
rand
which were drawn randomly
from the skyline. Hence, a set G
rand
may not be favored
by any p-skyline relation (besides sky
H
, of course). We use
these sets to discover p-skyline relations favoring them.
In Figure 9, we plot
ratio =
|w

(O)|
|w
sky
H
(O)|
,
which shows the di?erence in the sizes of results of p-skyline
and skyline queries. As this ?gure suggests, using larger
sets of relevant attributes generally results in smaller ratio
values. Moreover, p-skyline query results are smaller when
superior examples are similar (i.e., G
fav
) rather then arbi-
trary (i.e., G
rand
).
5 10 15 20 25 30
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
# superior examples
r
a
t
i
o
Gfav Grand
(a) |A| = 6
0 20 40 60 80
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
# superior examples
r
a
t
i
o
Gfav Grand
(b) |A| = 9
0 20 40 60 80 100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
# superior examples
r
a
t
i
o
Gfav Grand
(c) |A| = 12
Figure 9: p-skyline size reduction
4.2 Experiments with synthetic data
The main focus of this section is to experiment with the
scalability of the proposed approach and the reduction of
large skyline sizes, achieved by modeling user preferences
using p-skyline relations. We used three synthetic data sets
O here: correlated, anticorrelated, and uniform. Each of
them contained 50000 tuples. We used three di?erent sets
A of 10, 15, and 20 relevant attributes. For each of them,
we picked di?erent sets of superior examples G. We con-
structed such sets of similar tuples, where similarity was
measured in terms of Euclidean distance. Given a set G,
we used discover to compute optimal p-skyline relations
favoring G. In Figure 10, we plot dependency of the
average running time of discover on the number of supe-
rior examples |G| used to discover a p-skyline relation (Fig-
ure 10(a), |O| = 50000, |A| = 20), the size of O (Figure
10(b), |G| = 50, |A| = 20), and the number |A| of rele-
vant attributes (Figure 10(c), |O| = 50000, |G| = 50). The
measured time does not include the time to construct the
system of negative constraints and ?nd the non-redundant
constraints in it. According to our experiments, the prepro-
cessing time predominantly depends on the performance of
the skyline computation algorithm.
According to Figure 10(a), the algorithm running time in-
creases until the size of G reaches 20. After that, it does not
vary much. This is due to the fact that the algorithm perfor-
mance depends on the number of negative constraints used.
We use only non-redundant constraints for discovery. As we
show further, the dependence of the size of a system of non-
redundant constraints on the number of superior examples
has a pattern similar to Figure 10(a).
The growth of running time with the increase in the data
set size (Figure 10(b)) is justi?ed by the fact that the num-
ber of negative constraints depends on skyline size (Section
3.3.3). For the data sets used in the experiment, skyline sizes
grow with the sizes of the corresponding data sets. Similarly,
the running time of the algorithm grows with the number
of relevant attributes (Figure 10(c)), due to the increase in
the corresponding skyline size.
0 50 100 150
10
0
10
1
10
2
10
3
# superior examples
m
s
anticorr uniform
corr
(a) vs # superior examples
2 4
·10
4
10
1
10
2
data set size
m
s
anticorr uniform
corr
(b) vs data set size
10 15 20
10
0
10
1
10
2
10
3
|A|
m
s
anticorr uniform
corr
(c) vs |A|
Figure 10: Performance of p-skyline discovery
In Figure 11(a), we show how the size of the p-skyline
query result varies with the number of superior examples
used to discover a p-skyline relation . We compare it with
the size of the corresponding skyline and plot the value of
ratio de?ned in the previous section. Here we used the an-
ticorrelated, the uniform, and the correlated data sets of
50000 tuples each. The number of relevant attributes was
20. The size of the corresponding skylines was: 41716 (an-
ticorrelated), 37019 (uniform), and 33888 (correlated). For
the anticorrelated and the uniform data sets, the values of
ratio quickly reach a certain bound and then grow slowly
with the number of superior examples. This bound is ap-
proximately 1% of the skyline size (i.e., about 350 tuples)
for both data sets. At the same time, the growth of ratio
for the correlated data set is faster. Note that the values of
ratio are generally lower for synthetic data sets in compari-
son to the real-life data set in the previous section. This is
due to the larger set of relevant attributes and larger skyline
sizes in the synthetic case.
In Figure 11(b), we show how the number of negative
constraints depends on the number of superior examples
used to construct them. For every data set, we plot two
measures: the number of unique negative constraints in
N(G, w
sky
H
(O)) (anticorr-un, uniform-un, and corr-un,
resp.) and the number of unique non-redundant constraints
0 50 100 150
10
?2
10
?1
10
0
# superior examples
r
a
t
i
o
anticorr
uniform
corr
(a) p-skyline size reduction
0 50 100 150
10
2
10
4
10
6
# superior examples
anticorr-un anticorr-nrd
uniform-un uniform-nrd
corr-un corr-nrd
(b) Constraint # reduction
Figure 11: Synthetic data experiments
in the corresponding system (anticorr-nrd, uniform-nrd,
and corr-nrd, resp.). Note that the reduction in the con-
straint number achieved using the methods we proposed in
Section 3.3.3 is signi?cant. Namely, for the anticorrelated
data set and G of size 150, the total number of constraints in
N(G, O) was approximately 7.5 · 10
6
. Among them, about
5.5· 10
6
were unique in N(G, w
sky
H
(O)). However, less than
1% of them (about 1.2 · 10
4
) were non redundant.
The experiments show that incorporating attribute impor-
tance into skyline relations allows for signi?cant reduction
in query result size. The algorithm for p-skyline relation
discovery has good scalability in terms of the data set size
and the number of relevant attributes. The algorithm has a
high accuracy even for small sets of superior examples used
to discover p-skyline relations.
5. RELATED WORK
The p-skyline framework is based on the preference con-
struction approach introduced in [15]. There, preferences
are constructed from user-selected preference constructors
using Pareto and prioritized accumulation operators. In the
current work, we restrict them by requiring every atomic
preference to be a total order and to be used at most once
to construct a p-skyline relation. This allows for an intu-
itive way of capturing relative attribute importance using p-
graphs. As far as we know, the problem of relative attribute
importance in this framework has not been addressed yet.
An approach to mine preferences aggregated using the ac-
cumulation operators has been proposed in [13]. Web server
logs are used to discover atomic preferences and accumu-
lation operators used to aggregate them. The mining ap-
proach is based on statistical properties of log data – more
preferable tuples appear more frequently. The authors pro-
vide heuristics for discovering atomic preferences on cate-
gorical and numerical domains.
A variation of the skyline framework which addresses the
problem of large skyline size was considered in [23]. The
authors propose to use subspace skylines (i.e., skylines over
subsets of relevant attributes) whose sizes tend to be smaller
than the size of the skyline over the entire set of relevant at-
tributes. They also address the problem of ?nding subsets
of relevant attributes guaranteeing a set of superior tuples
to be in the corresponding subspace skyline. This work is
orthogonal to the approach we propose here. Namely, we
assume that the set of relevant attributes is ?xed and show
how to construct a maximal p-skyline relation favoring a set
of tuples. A similarity between the frameworks is that for a
certain class of p-skyline relations, the p-skyline queries may
be represented as pipelined subspace skyline queries. In par-
ticular, those are the p-skyline relations whose p-graphs are
weak orders (i.e., negatively transitive SPO). At the same
time, this does not apply to all p-skyline relations since the
class of SPO+Envelope is more general than weak orders (e.g.,

4
from Example 9).
A framework of preference discovery which is complemen-
tary to the approach we propose here was presented in [14].
[14] was the ?rst paper to introduce the scenario of mining
preferences using superior and inferior examples. In that
work, preferences are modeled as skyline relations. The au-
thors focus on mining unknown atomic preferences using su-
perior and inferior examples. They try to discover minimal
(in terms of relation size) ?nite atomic preference relations.
The authors show that the problem of existence of such re-
lations is NP-complete, and the problem of computing them
is NP-hard. They also provide two heuristics for computing
such preferences. We note that these heuristics are greedy
and may fail to return a satisfying set of preferences for
certain sets of superior/inferior tuples. That approach and
the approach we present here are di?erent in the following
sense. First, [14] deals with skyline relations, and thus all
atomic preferences are considered to be equally important.
The focus of our work is to discover di?erences in attribute
importance. Second, the authors of [14] focus on mining
minimal ?nite atomic preferences. In contrast, we are inter-
ested in computing maximal preference relations since this
guarantees a better ?t to a provided set of superior exam-
ples. At the same time, our work and [14] complement each
other. Namely, when atomic preferences are not provided
explicitly by the user, the approach [14] may be used to
discover them.
Another approach of preference relation discovery in the
skyline framework is proposed in [18]. That work is mo-
tivated by the problem of large skyline sizes. The authors
propose to reduce skyline sizes through revising skyline rela-
tions by supplying additional tuple relationships: preference
and equivalence. The relationships are obtained from user’s
answers to simple comparison questions. [18] provides an
algorithm for constructing such questions.
A method of representing attribute priorities similar to p-
graphs is presented in [4]. The authors identify two kinds of
importance edges between attributes: conditional and un-
conditional. Preferences are modeled as conditional ceteris
paribus statements.
In quantitative preference frameworks [8], attribute prior-
ities are represented as weight coe?cients in numeric utility
functions. A number of methods have been proposed to
elicit such utility functions [5, 3, 9]
6. CONCLUSIONS AND FUTURE WORK
In this paper, we studied the problem of discovering at-
tribute importance in skylines. We proposed a framework
which intuitively captures relative attribute importance us-
ing the notion of p-skyline preference relation. We showed
that the problem of discovering optimal p-skyline prefer-
ence relations based on superior and inferior examples is
intractable in general. We also proposed a polynomial time
algorithm for discovering such preference relations based on
superior examples only. The experimental evaluation of the
proposed framework using synthetic as well as real-life data
shows its practicality.
According to our experiments, modeling user preferences
as p-skyline relations allows for signi?cant reduction in sky-
line size. At the same time, the superior tuples a user has
to provide to discover such a relation have to accurately
describe it. However, when the skyline is large, expecting
that a user will be able to explore it entirely is impractical.
Hence, there is a need to develop methods of suggesting good
candidates for superior examples. Some promising measures
of goodness of a skyline tuple are its representativeness [19],
its entropy value [11], or the number of p-skyline relations
favoring the tuple. This direction has to be thoroughly in-
vestigated. Another interesting direction of future work is
to generalize our attribute importance discovery approach
to variable sets of relevant attributes.
7. REFERENCES
[1] W.-T. Balke, W. Siberski, and U. G¨ untzer. Getting prime
cuts from skylines over partially ordered domains. In
BTW, pages 64–81, 2007.
[2] S. B¨ orzs¨onyi, D. Kossmann, and K. Stocker. The skyline
operator. In ICDE, pages 421–430, 2001.
[3] C. Boutilier. A POMDP formulation of preference
elicitation problems. In AAAI/IAAI, pages 239–246, 2002.
[4] R. I. Brafman and C. Domshlak. Introducing variable
importance tradeo?s into CP-nets. In UAI, 2002.
[5] U. Chajewska, D. Koller, and R. Parr. Making rational
decisions using adaptive utility elicitation. In AAAI/IAAI,
pages 363–369, 2000.
[6] J. Chomicki. Preference formulas in relational queries.
ACM Trans. Database Syst., 28(4):427–466, 2003.
[7] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline
with presorting. In ICDE, pages 717–719, 2003.
[8] P. Fishburn. Utility Theory for Decision-Making. John
Wiley & Sons, New York, 1970.
[9] K. Gajos and D. S. Weld. Preference elicitation for
interface optimization. In UIST, pages 173–182, 2005.
[10] P. Godfrey. Skyline cardinality for relational processing. In
FoIKS, volume 2942, pages 78–97, 2004.
[11] P. Godfrey, R. Shipley, and J. Gryz. Algorithms and
analyses for maximal vector computation. VLDB J.,
16(1):5–28, 2007.
[12] B. Hafenrichter and W. Kießling. Optimization of
relational preference queries. In ADC, pages 175–184, 2005.
[13] S. Holland, M. Ester, and W. Kießling. Preference mining:
A novel approach on mining user preferences for
personalized applications. In PKDD, pages 204–216, 2003.
[14] B. Jiang, J. Pei, X. Lin, D. W. Cheung, and J. Han.
Mining preferences from superior and inferior examples. In
KDD, pages 390–398, 2008.
[15] W. Kießling. Foundations of preferences in database
systems. In VLDB, pages 311–322, 2002.
[16] W. Kießling. Preference queries with SV-semantics. In
COMAD, pages 15–26, 2005.
[17] W. Kießling and G. K¨ostler. Preference SQL - Design,
Implementation, Experiences. In VLDB, pages 990–1001,
2002.
[18] J. Lee, G. won You, S. won Hwang, J. Selke, and W.-T.
Balke. Optimal preference elicitation for skyline queries
over categorical domains. In DEXA, pages 610–624, 2008.
[19] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. Selecting stars:
The k most representative skyline operator. In ICDE,
pages 86–95. IEEE, April 2007.
[20] D. Mindolin and J. Chomicki. P-Skyline Preference
Relations. http://mindolin.info/pskyline/.
[21] M. D. Morse, J. M. Patel, and H. V. Jagadish. E?cient
skyline computation over low-cardinality domains. In
VLDB, pages 267–278, 2007.
[22] NHL.com stats.
http://www.nhl.com/ice/playerstats.htm.
[23] J. Pei, Y. Yuan, X. Lin, W. Jin, M. Ester, Q. Liu,
W. Wang, Y. Tao, J. X. Yu, and Q. Zhang. Towards
multidimensional subspace skyline analysis. ACM Trans.
Database Syst., 31(4):1335–1381, 2006.

doc_255966287.pdf
 

Attachments

Back
Top