Dissertation Study on Personalizable Knowledge Integration

Description
Personalization involves using technology to accommodate the differences between individuals.

Abstract
Title of dissertation: Personalizable Knowledge Integration
Maria Vanina Martinez,
Doctor of Philosophy, 2011
Dissertation directed by: Professor V.S. Subrahmanian
Department of Computer Science
Large repositories of data are used daily as knowledge bases (KBs) feeding computer systems that support decision making processes, such as in medical or financial applications. Unfortunately, the larger a KB is, the harder it is to ensure its consistency and completeness. The problem of handling KBs of this kind has been studied in the AI and databases communities, but most approaches focus on computing answers locally to the KB, assuming there is some single, epistemically correct solution. It is important to recognize that for some applications, as part of the decision making process, users consider far more knowledge than that which is contained in the knowledge base, and that sometimes inconsistent data may help in directing reasoning; for instance, inconsistency in taxpayer records can serve as evidence of possible fraud. Thus, the handling of this type of data needs to be context-sensitive, creating a synergy with the user in order to build useful, flexible data management systems.
Inconsistent and incomplete information is ubiquitous and presents a substantial problem when trying to reason about the data: how can we derive an adequate model of the world, from the point of view of a given user, from a KB that may be inconsistent or incomplete? In this thesis we argue that in many cases users need to bring their application-specific knowledge to bear in order to inform the data management process. Therefore, we provide different approaches to handle, in a personalized fashion, some of the most common issues that arise in knowledge management. Specifically, we focus on (1) inconsistency management in relational databases, general knowledge bases, and a special kind of knowledge base designed for news reports; (2) management of incomplete information in the form of different types of null values; and (3) answering queries in the presence of uncertain schema matchings. We allow users to define policies to manage both inconsistent and incomplete information in their application in a way that takes both the user's knowledge of his problem and his attitude to error/risk into account. Using the frameworks and tools proposed here, users can specify when and how they want to manage or resolve the issues that arise due to inconsistency and incompleteness in their data, in the way that best suits their needs.
PERSONALIZABLE
KNOWLEDGE INTEGRATION
by
Maria Vanina Martinez
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2011
Advisory Committee:
Professor V.S. Subrahmanian, Chair/Advisor
Professor John Grant
Professor Sarit Kraus
Professor Dana Nau
Professor Jonathan Wilkenfeld
© Copyright by
Maria Vanina Martinez
2011
Dedication
To: Gerardo, Cristina, Hector, Roman, and Mauricio.
Acknowledgements
In the first place, I would like to thank my family for having been there unconditionally for me throughout my studies. To my husband Gerardo, who supported and encouraged me to keep going forward in my good and bad moments, especially when being so far from my family was sometimes too hard. I am most grateful to my beloved parents, Cristina and Hector, and my brothers Mauricio and Roman. They were with me all this time, even in the distance, missing me very much, but always encouraging me to do what I wanted to do. They taught me the importance of learning and seeking knowledge, and that an education is one of the most important things that a person can have, independently of what path is taken in life. To them I owe who I am and what I have done so far. To the rest of my family: Gabriela, Guillermo, Amalia, and Patricio, who have also been there for me, and to my little nephew Sebastian.
I want to thank my advisor, Professor V.S. Subrahmanian, for giving me the opportunity to work with and learn from him, and for giving me the opportunity of joining his group even before officially starting the program. His expertise as a researcher in the area and his seemingly endless generation of ideas were crucial for me to start building my career. He taught me to be a good researcher, and I was lucky to have him as an advisor. Thanks are also due to Professors John Grant, Sarit Kraus, Dana Nau, and Jonathan Wilkenfeld for agreeing to serve on my thesis committee, which meant putting valuable time aside to review the manuscript and provide feedback. Also to Professors John Grant, Avigdor Gal, and Sarit Kraus, from whom I have learned so much by working with them.
During my time at UMD, I had the opportunity to meet many great people; I collaborated with some of them, and some became very good friends. My thanks then to Amy Sliva, Cristian Molinaro, Andrea Pugliese, Francesco Parisi, Massimiliano Albanese, Matthias Bröcheler, Paulo Shakarian, John Dickerson, and especially to Carlos Castillo and Gabriela Chavez, who became very good friends and with whom my husband and I spent so many good times.
Finally, I want to thank my friends in Argentina, the rest of my family there, and
those who supported my decision of leaving my country to embark on this great adventure
that was getting my PhD.
Contents
1 Introduction 1
1.1 Knowledge Integration:
Real World Application Issues . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work 11
2.1 Inconsistency Management . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Database Repairing and Consistent Query Answering . . . . . . . 15
2.2 Partial Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Entity Resolution and Deduplication . . . . . . . . . . . . . . . . . . . . 24
2.4 Schema Mappings and Data Exchange . . . . . . . . . . . . . . . . . . . 26
3 Inconsistency Management Policies 29
3.1 Introduction and Motivating Example . . . . . . . . . . . . . . . . . . . 29
3.2 Syntax and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Inconsistency Management Policies . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Singular IMPs . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Multi-Dependency Policies . . . . . . . . . . . . . . . . . . . . . 40
3.4 Specifying IMPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Relationship with belief change operators . . . . . . . . . . . . . . . . . 53
3.6 Relationship with preference-based approaches in
Consistent Query Answering . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6.1 Active Integrity Constraints . . . . . . . . . . . . . . . . . . . . 58
3.6.2 Prioritized Repairs and CQAs . . . . . . . . . . . . . . . . . . . 61
3.7 Extensions of Classical Relational Algebra Operators with Multi-Dependency
Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8 Applying IMPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.8.1 Using DBMS-based Indexes . . . . . . . . . . . . . . . . . . . . 74
3.8.2 Cluster Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.8.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . 84
3.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 A General Framework for Reasoning about Inconsistency 90
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Tarski’s Abstract Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 A General Framework for Handling Inconsistency . . . . . . . . . . . . . 95
4.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5 Handling Inconsistency in Monotonic Logics . . . . . . . . . . . . . . . 117
4.5.1 Propositional Horn-clause Logic . . . . . . . . . . . . . . . . . . 117
4.5.2 Propositional Probabilistic Logic . . . . . . . . . . . . . . . . . . 119
4.5.3 Propositional Linear Temporal Logic . . . . . . . . . . . . . . . 126
4.5.4 Fuzzy Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.5.5 Belief Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.5.6 Spatial Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.6 Link with Existing Approaches . . . . . . . . . . . . . . . . . . . . . . . 138
4.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5 PLINI: A Probabilistic Logic for Inconsistent News Information 147
5.1 Introduction and Motivating Example . . . . . . . . . . . . . . . . . . . 147
5.2 What is an Event? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3 PLINI Wffs: Syntax and Semantics . . . . . . . . . . . . . . . . . . . . . 153
5.3.1 Syntax of Multi-sorted Wffs . . . . . . . . . . . . . . . . . . . . 153
5.4 Similarity Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.5 PLINI Probabilistic Logic Programs . . . . . . . . . . . . . . . . . . . . 172
5.6 Model Theory and Fixpoint Theory . . . . . . . . . . . . . . . . . . . . 177
5.7 Event Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.8 Implementation and Experiments . . . . . . . . . . . . . . . . . . . . . . 185
5.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6 Partial Information Policies 192
6.1 Introduction and Motivating Example . . . . . . . . . . . . . . . . . . . 192
6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.3 Partial Information Policies . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.4 Efficiently Applying PIPs . . . . . . . . . . . . . . . . . . . . . . . . 205
6.4.1 Tuple insertions . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.4.2 Tuple deletions . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.3 Tuple updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.4.4 Applying PIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.4.5 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.5 Relational Algebra and PIPs . . . . . . . . . . . . . . . . . . . . . . . . 217
6.5.1 Projection and PIPs . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.5.2 Selection and PIPs . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.5.3 Cartesian Product and PIPs . . . . . . . . . . . . . . . . . . . . . 221
6.5.4 Join and PIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.5.5 Union and PIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.5.6 Difference and PIPs . . . . . . . . . . . . . . . . . . . . . . . . 224
6.5.7 Intersection and PIPs . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6.1 Applying PIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.6.2 Updating the database . . . . . . . . . . . . . . . . . . . . . . . 229
6.6.3 Execution times under different loads . . . . . . . . . . . . . . . 230
6.6.4 Query answer quality . . . . . . . . . . . . . . . . . . . . . . . . 231
6.6.5 Relational Algebra operators and PIPs . . . . . . . . . . . . . . . 233
6.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
7 Query Answering under Uncertain Schema Mappings 237
7.1 Introduction and Motivating Example . . . . . . . . . . . . . . . . . . . 237
7.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
7.3.1 Semantics of Probabilistic Mappings . . . . . . . . . . . . . . . 244
7.4 Algorithms for Aggregate Query Answering . . . . . . . . . . . . . . . . 249
7.4.1 By-Table Semantics . . . . . . . . . . . . . . . . . . . . . . . . 249
7.4.2 By-Tuple Semantics . . . . . . . . . . . . . . . . . . . . . . . . 250
7.4.3 Summary of Complexity Results . . . . . . . . . . . . . . . . . . 267
7.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.6 Schema Mappings, Integrity Constraints, and Partial Information . . . . . 274
7.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
8 Conclusions 280
Bibliography 284
List of Tables
5.1 Examples of event descriptions . . . . . . . . . . . . . . . . . . . . . . . 153
5.2 Denotations for selected linguistically modified numeric predicates . . . . 158
5.3 Denotations for selected linguistically modified spatial predicates . . . . . 159
5.4 Denotations for selected linguistically modified temporal predicates . . . 160
5.5 Similarity functions example . . . . . . . . . . . . . . . . . . . . . . . . 165
5.6 Example of Event Database extracted from news sources . . . . . . . . . 173
5.7 Degrees of similarity as given by lfp(T_Π)(e_i ≈ e_j) . . . . . . . . . . . 183
5.8 Performance of J48 for different values of ? . . . . . . . . . . . . . . . . 187
5.9 Performance of JRIP for different values of ? . . . . . . . . . . . . . . . 187
5.10 Average performance of JRIP . . . . . . . . . . . . . . . . . . . . . . . . 189
7.1 An instance D_S1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.2 An instance D_S2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
7.3 New semantics for Aggregate Queries . . . . . . . . . . . . . . . . . . . 249
7.4 Trace of ByTupleRANGE for query Q1 . . . . . . . . . . . . . . . . . . 252
7.5 Trace of ByTuplePDCOUNT for query Q1 . . . . . . . . . . . . . . . . 254
7.6 Trace of ByTupleRANGE for query Q2’ . . . . . . . . . . . . . . . . . 257
7.7 Computing Q2_t under the by-tuple/expected value semantics . . . . . . 259
List of Figures
3.1 Partial order ≤_T for relational schema S . . . . . . . . . . . . . . . . . 43
3.2 Example relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3 Cluster Table Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4 Updating a cluster table after inserting a tuple . . . . . . . . . . . . . . . 79
3.5 Updating a cluster table after deleting a tuple . . . . . . . . . . . . . . . 80
3.6 Updating a cluster table after updating a tuple . . . . . . . . . . . . . . . 81
3.7 Applying an IMP using a cluster table . . . . . . . . . . . . . . . . . . . 83
3.8 Policy application running time 1-2M tuples . . . . . . . . . . . . . . . . 85
3.9 Policy application running time 0.1% inconsistency . . . . . . . . . . . . 86
3.10 Storage Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1 Two-dimensional examples for RCC-8 base relations. . . . . . . . . . . . 136
5.1 Architecture of the PLINI-system . . . . . . . . . . . . . . . . . . . . . . 151
5.2 Denotation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3 Example of the application of similarity functions for sort ConnectedPlace 163
5.4 Example of application of similarity functions for sort Space . . . . . . . 168
5.5 Some automatically learned PLINI-rules from T-REX data using JRIP . . 177
5.6 Algorithm PLINI-Cluster(?, c, ?) . . . . . . . . . . . . . . . . . . . . . 183
5.7 Recall/Precision and F-measure for J48 and JRIP . . . . . . . . . . . . . 188
6.1 Updating index structures after a tuple insertion. . . . . . . . . . . . . . . 211
6.2 Updating index structures after a tuple deletion. . . . . . . . . . . . . . . 212
6.3 Applying a PIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.4 PIPs examples from experiments . . . . . . . . . . . . . . . . . . . . . . 226
6.5 Policy application running time . . . . . . . . . . . . . . . . . . . . . . . 226
6.6 Policy application running times for index structure . . . . . . . . . . . . 227
6.7 Multiple policy application running time . . . . . . . . . . . . . . . . . . 228
6.8 Tuple insertion running time . . . . . . . . . . . . . . . . . . . . . . . . 229
6.9 Tuple deletion running time . . . . . . . . . . . . . . . . . . . . . . . . . 230
6.10 Tuple update running time . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.11 Execution times with different loads . . . . . . . . . . . . . . . . . . . . 232
6.12 Query answer quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
6.13 Running times of projection . . . . . . . . . . . . . . . . . . . . . . . . 234
6.14 Running times of join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.15 Running times of union . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.1 Generic by-table algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 249
7.2 Algorithm ByTupleRangeCOUNT . . . . . . . . . . . . . . . . . . . . . 250
7.3 Algorithm ByTuplePDCOUNT . . . . . . . . . . . . . . . . . . . . . . . 252
7.4 Algorithm ByTupleRangeSUM . . . . . . . . . . . . . . . . . . . . . . . 257
7.5 Algorithm ByTupleRangeMAX . . . . . . . . . . . . . . . . . . . . . . . 266
7.6 Summary of complexity for the different aggregates . . . . . . . . . . . . 268
7.7 Running times for small number of tuples . . . . . . . . . . . . . . . . . 268
7.8 Running times for small number of mappings . . . . . . . . . . . . . . . 269
7.9 Running times for medium number of tuples . . . . . . . . . . . . . . . . 271
7.10 Running times for medium number of mappings . . . . . . . . . . . . . . 272
7.11 Running times for large number of tuples . . . . . . . . . . . . . . . . . 273
7.12 Running times for very large number of tuples . . . . . . . . . . . . . . . 273
Chapter 1
Introduction
1.1 Knowledge Integration:
Real World Application Issues
The process of knowledge integration involves combining knowledge bases residing in different sources and providing users with a unified view of these data. Researchers in AI and databases, as well as in information retrieval, have been working on the problems that arise with the integration of heterogeneous knowledge bases for decades. The area of data integration is vast, and combines efforts from separate but related areas of computer science. Work in this area has focused on many issues at different levels of information integration over the last 30 years, starting shortly after the global adoption of relational databases, which naturally led to the need for sharing and/or merging existing data repositories. Nowadays, solving the problems associated with data integration applications is even more necessary given the information globalization produced by the WWW. A huge number of data sets are published and shared over the Internet daily, increasing the need for efficient and relatively accurate data integration tools.
The two main problems that arise when merging knowledge from different sources are uncertainty and inconsistency. In data integration applications, uncertainty can appear in different ways, but all of them can be traced to the subjectivity with which knowledge bases are built: knowledge bases model knowledge about the world, e.g., objects, their relevant properties, and relations among them, based on the requirements of the applications they were designed to serve. The same domain may be described in different, sometimes conflicting, ways. Differences may arise in the particular aspects of the domain they model, in the schemas designed for the domain (the structure of tables or relations, attribute names, and types), and in the naming conventions used for data objects. Inconsistency also appears as a result of integration. On the one hand, different sources of knowledge, each of which may be consistent in isolation, can contain information that is contradictory when the sources are considered together. On the other hand, since many data integration applications are based on structures that are automatically extracted from unstructured data, there may even be uncertainty about the data itself, since the extraction techniques are approximate at best. As data integration applications strive to offer a single, objective, and coherent integrated view of data sources, both uncertainty and inconsistency are bound to appear. These problems are especially pervasive in data integration applications whose goal is to offer access to a large number of heterogeneous sources, for instance when providing online services over the WWW. An everyday example of such a service is a search engine serving consumer travel needs (flights, rental cars, hotels, tourist packages, travel insurance, cruises, etc.); since different companies feed their data to such a service, the problems discussed above seem inevitable.
Furthermore, large repositories of data are readily accessible and used daily as knowledge bases (KBs) that feed a wide variety of computer systems. One important class of systems comprises those that support decision making processes, i.e., human users use them to interpret the data contained in a KB in order to make real world decisions. Examples of these applications are found in many environments such as medicine, finance, cultural behavioral analysis, crime prevention, etc. Unfortunately, the larger a KB is, the harder it is to ensure its consistency and completeness. Inconsistent and incomplete (or partial) information is ubiquitous and presents a substantial problem when trying to reason about the data: how can we derive an adequate model of the world from a KB that we know may be inconsistent or incomplete? This situation can arise in the real world when maintaining, for instance, a KB with information about vehicle position and status, such as in robotics where Unmanned Ground Vehicles (UGVs) sense and act in an environment. For many reasons, such as unreliable communication channels or redundancy, there can be contradictory pieces of information regarding a vehicle at a certain instant of time. Depending on how a data management system handles this inconsistency, the user of this KB may have different options at hand when he comes into contact with the data. In general, these applications are managed by power users: users who exploit advanced features of programs that are beyond the abilities of "normal" or "end" users, but who are not necessarily capable of programming or familiar with how database management systems work. An important aspect to take into account in these applications is that, as part of the decision making process, users consider far more knowledge than that which is contained in the KB; they also incorporate into the process their domain expertise and requirements, as well as their expectations.
The management of uncertain knowledge bases has been widely studied both in the AI community and in the database community. Extensive work has been done in AI regarding uncertain information for a long time now, especially in the areas of non-monotonic reasoning [Rei80a, MD80, McC87, Gab85, Moo88, KLM90] and the various topics studied in probabilistic reasoning [Hai84, Nil86b, Pea88, FHM90, KTG92, NS92, Poo93, FH94, Poo97, KIL04]. During the late 80's and 90's, proposals were made in the database community to incorporate probabilities into deductive and relational databases [CP87, BGMP92, LS94, NS94, LLRS97, FR97], each of them making different dependency assumptions with respect to probabilities. More recently, the database community has regained interest in probabilistic approaches, particularly in the areas of query answering [AFM06, DS07, CKP03, BDSH+08] and top-k querying [LCcI+04, RDS07, SI07]. There has also been interest in the uncertainty produced by the presence of null values or incomplete information [Gra80, IL84b, GJ86]. In this thesis we will focus on particular aspects of uncertainty related to inconsistency, schema matching, and incomplete information.
The problem of identifying and resolving inconsistency in knowledge bases has been studied for many years by many different researchers [Gra78, BKM91, BDP97, BS98, ABC99, BFFR05, CLR03, BC03, Cho07]. Traditionally, the artificial intelligence (AI) and database theory communities held the position that knowledge bases and software specifications should be completely free of inconsistent and incomplete information, and that inconsistency and incompleteness should be eradicated from them immediately. In the last two decades, however, these communities have recognized that for many interesting applications this position is obsolete: though approaches that allow inconsistency to persist in relational DBs and KBs have existed since the late 80s ([BS98, KS92, KL92, GS95, BKM91], etc.), there has been no method to date that gives the user the power to bring his knowledge of the domain, his preferences, and his risks and objectives to bear when reasoning about inconsistent data. In this thesis we argue that inconsistency can often be resolved in different ways based on what the user wants. In the case of the vehicle KB above, a data management system that ignores the inconsistency and gives an "a priori" solution for it may hide the inconsistency from the user; this can be a problem if it causes the user to make the wrong decision and, for instance, delay sending rescue or support to disabled vehicles, or send it to the wrong location. Furthermore, contradictory information can be used to detect faulty sensors or communication channels.
Consider now a simpler database example: a database containing data about employees in a company. We will use it to show the importance of giving the user the power to define and control the handling of uncertainty in his data.
       Name   Salary   Tax bracket   Source
  t1   John   70K      15            s1
  t2   John   80K      20            s2
  t3   John   70K      25            s3
  t4   Mary   90K      30            s1
Let us assume that salaries are uniquely determined by names, i.e., the functional dependency Name → Salary holds: any two records in the database with the exact same name should also have the exact same salary. Clearly, there is an inconsistency regarding employee John in the table above. In this case, a user may want to resolve the inconsistency about John's salary in many different ways. (C1) If he were considering John for a loan, he might want to choose the lowest possible salary of John to base his loan on. (C2) If he were assessing the amount of taxes John has to pay, he may choose the highest possible salary John may have. (C3) If he were just trying to estimate John's salary, he may choose some number between 70K and 80K (e.g., the average of the three reports of John's salary) as the number. (C4) If he had different degrees of confidence in the sources that provided these salaries, he might choose a weighted mean of these salaries. (C5) He might choose not to resolve the inconsistency at all, but to just let it persist until he can clear it up. (C6) He might simply consider all the data about John unreliable and might want to ignore it until it can be cleared up – this is the philosophy of throwing away all contaminated data.¹ The approaches in [BKM91, SA07, ABC99, BFFR05, CLR03, BC03, Cho07] can handle cases C1 and C2, but not the other cases.
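To make these cases concrete, here is a minimal sketch of how C1 through C4 could be computed over the table above. It is an illustration only, not the IMP framework developed in Chapter 3, and the source confidence weights used for C4 are invented for the example.

```python
# Sketch: applying resolution policies C1-C4 to the example table.
from collections import defaultdict

rows = [  # (name, salary in K, tax bracket, source)
    ("John", 70, 15, "s1"),
    ("John", 80, 20, "s2"),
    ("John", 70, 25, "s3"),
    ("Mary", 90, 30, "s1"),
]
confidence = {"s1": 0.5, "s2": 0.3, "s3": 0.2}  # assumed weights for C4

# Group by Name; the dependency Name -> Salary is violated whenever a
# group contains more than one distinct salary value.
groups = defaultdict(list)
for name, salary, _, src in rows:
    groups[name].append((salary, src))

resolutions = {}
for name, vals in groups.items():
    sals = [s for s, _ in vals]
    if len(set(sals)) > 1:  # inconsistency detected for this name
        resolutions[name] = {
            "C1": min(sals),                                 # lowest (loan)
            "C2": max(sals),                                 # highest (taxes)
            "C3": sum(sals) / len(sals),                     # average (estimate)
            "C4": sum(s * confidence[c] for s, c in vals),   # weighted mean
        }

print(resolutions)  # John's salary resolved in four different ways
```

Note that Mary's record never triggers a policy, since her group satisfies the dependency.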
1.2 Organization of this Thesis
In this thesis we propose to provide users with tools to manage their data in a personalized way in order to reason about it according to their needs. Given that it is important to enable users to bring their application-specific knowledge to bear when resolving inconsistency, we propose two different approaches to personalizable inconsistency management: inconsistency management policies for relational databases, and a general framework for handling inconsistent knowledge bases. For the first approach, we define the concept of a policy for managing inconsistency in relational databases with respect to functional dependencies; it generalizes other efforts in the database community by allowing policies to either remove inconsistency completely or to allow part or all of the inconsistency to persist, depending on the users' application needs. In the example above, each of the cases C1 through C6 reflects a policy that the user is applying to resolve inconsistencies. We will discuss inconsistency management policies (IMPs for short) in detail in Chapter 3.
Second, we propose a unified framework for reasoning about inconsistency that extends the work in [SA07]. This framework applies to any monotonic logic, including ones for which inconsistency management has not been well studied (e.g., temporal, spatial, and probabilistic logics), and its main goal is to allow end-users to bring their domain knowledge to bear by taking their preferences into account. In the example above, neither the bank manager nor the tax officer is making any attempt to find out the truth (thus far) about John's salary; however, both of them are making different decisions based on the same facts. The basic idea behind this framework is to construct what we call options, and then to use a preference relation defined by the user to compute the set of preferred options, which are intended to support the conclusions to be drawn from the inconsistent knowledge base. Intuitively, an option is a set of formulas that is both consistent and closed with respect to consequence in a given monotonic logic. In [SA07], preferred options are consistent subsets of a knowledge base, whereas here this is not necessarily the case, since a preferred option can be a consistent subset of the deductive closure of the knowledge base. We will present this framework in Chapter 4.
¹This is more likely to happen, for example, when there is a scientific experiment with inconsistent data or when there is a critical action that must be taken, but cannot be taken on the basis of inconsistent data.
Applications dealing with the collection and analysis of news reports are highly affected by integration techniques, especially since millions of reports can be extracted daily by automatic means from different web sources. Oftentimes, even the same news source may provide widely varying data over a period of time about the same event. Past work on inconsistency management and paraconsistent logics assumes that we have "clean" definitions of inconsistency. However, when reasoning about this type of data there is an extra layer of uncertainty that comes from the following two phenomena: (i) do two reports correspond to the same event or to different ones?; and (ii) what does it mean for two event descriptions to be mutually inconsistent, given that these events are often described using linguistic terms that do not always have a uniquely accepted formal semantics? We propose a probabilistic logic programming language called PLINI (Probabilistic Logic for Inconsistent News Information) within which users can write rules specifying what they mean by inconsistency in situation (ii) above. Extensive work has also been done in duplicate record identification and elimination [BD83, HS95, ME97, CR02, BM03, BG04, BGMM+09]. The main difference between our approach and previous work is the fact that the user is able to specify the notion of inconsistency that is of interest to him; furthermore, news reports are in general unstructured data containing complex linguistic modifiers which different users may interpret in different ways. We devote Chapter 5 to the treatment of this problem.
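One way to see why linguistic modifiers complicate the comparison of event reports is to give modified numeric phrases interval denotations and score two reports by interval overlap. The denotations and the overlap measure below are assumptions chosen for illustration; they are not PLINI's actual definitions from Chapter 5.

```python
# Sketch: interval denotations for linguistically modified numbers and
# an overlap-based similarity score between two reports.

def denote(modifier, n):
    # Hypothetical denotations as intervals [lo, hi].
    if modifier == "about":
        return (0.9 * n, 1.1 * n)
    if modifier == "at least":
        return (float(n), 2.0 * n)   # assumed upper cap, for illustration
    return (float(n), float(n))      # "exactly"

def similarity(a, b):
    # Jaccard-style overlap of two intervals, in [0, 1].
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    inter = max(0.0, hi - lo)
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 1.0

r1 = denote("about", 20)     # "about 20 people were killed"
r2 = denote("at least", 18)  # "at least 18 people were killed"
score = similarity(r1, r2)   # overlapping denotations -> nonzero score
print(score)
```

Different users could plug in different denotations for the same modifier, which is precisely the kind of personalization the rules of Chapter 5 allow.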
Another issue related to data integration is that of reasoning in the presence of
incomplete information, or null values, in knowledge bases. Incomplete information can
appear, just to give a common example, when merging knowledge bases with disparate
schemas into a global schema. The consolidated knowledge base often contains null
values for attributes that were not present in every source schema. Incomplete information
makes the process of reasoning much harder since, if not treated carefully, results can
present incorrect or biased information.
The problem of representing incomplete information in relational databases and
understanding its meaning has been extensively studied since the beginnings of relational
database theory. Early work on this problem appears in [Cod74, Gra77, Gra79, Gra80,
Lip79, Lip81]. Incomplete information is so widespread in today’s applications that
practically any data analysis tool has to deal with null values in some way. Many data
modeling and analysis techniques deal with missing values by removing whole records
from consideration if one of the attribute values is missing, or by using ad hoc methods
of estimation for such values. Even though a wide variety of methods to deal with incomplete
information have been proposed, which are in general highly tuned for particular
applications, no tools have been provided to allow end-users to easily specify different
ways of managing this type of data according to their needs and based on their expertise.
Consider another employee database and the following instance:
     Name  Year  Department  Salary  Category
t1   John  2008  CS          70K     B
t2   John  2009  CS          80K     B
t3   John  2010  Math        ⊥       A
t4   Mary  2010  Math        90K     A
This relation contains a record for employee John for year 2010, in which the attribute
Salary holds a null value, meaning that we do not know how much John earned
that year. The classical approach in data cleaning and query answering would be to dis-
card that record completely: since no information about the salary is provided, it is not
possible to reason with that data. However, in many applications, users fill in this type of
null values following strategies that are appropriate for the type of data, the type of appli-
cation, or the decision process the application is supporting. For instance, a user of this
database could decide to fill in the missing salary for John by using a regression model
with the data for other years we have for the same employee and extrapolate a value for
the missing year. Another user could decide to use the salary information from Mary,
who was also in the Math department in 2010 and had the same category as John.
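The two strategies just described can be sketched in a few lines of Python (an illustrative sketch only; the function names are hypothetical, and the salaries from the table are used as plain numbers):

```python
# Hypothetical sketch of two user-defined strategies for filling in John's
# missing 2010 salary; all names here are illustrative, not from the thesis.

rows = [  # (name, year, department, salary, category); None marks the null
    ("John", 2008, "CS",   70, "B"),
    ("John", 2009, "CS",   80, "B"),
    ("John", 2010, "Math", None, "A"),
    ("Mary", 2010, "Math", 90, "A"),
]

def fill_by_extrapolation(rows, name, year):
    """Linear extrapolation from the employee's own salary history."""
    hist = sorted((y, s) for n, y, _, s, _ in rows if n == name and s is not None)
    (y1, s1), (y2, s2) = hist[-2], hist[-1]
    slope = (s2 - s1) / (y2 - y1)
    return s2 + slope * (year - y2)

def fill_by_peers(rows, dept, year, category):
    """Average salary of colleagues with the same department/year/category."""
    peers = [s for _, y, d, s, c in rows
             if d == dept and y == year and c == category and s is not None]
    return sum(peers) / len(peers)

print(fill_by_extrapolation(rows, "John", 2010))  # 90.0 (70 -> 80 -> 90)
print(fill_by_peers(rows, "Math", 2010, "A"))     # 90.0 (Mary's salary)
```

Both strategies fill the same null with different justifications; under a PIP-style approach, the choice between them is made by the user, not the DBMS.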
In this thesis, we propose a policy based framework to allow end-users to personal-
ize the management of incomplete information by defining their own Partial Information
Policies (PIPs for short) without having to depend on decisions made by, for instance,
DBMS managers who might not know the data or the needs of the users. PIPs can be used
in combination with relational operators allowing the user to issue queries that perform
an assumption-based treatment of null values. This approach is developed in Chapter 6.
Finally, in the presence of structured and semi-structured knowledge bases such as
relational databases, RDF databases, ontologies, etc., one important issue in the design
of data integration systems is that of providing the users with a unified view of the different
sources that they can query, making the whole process of integration transparent
to the users. In such systems, the unified view is represented by a target or mediated
schema. One of the main tasks in the design of such systems is to establish the map-
ping between the source schemas and the target schema. There has been intense work
during the last few years on schema matching in order to answer queries over multiple
databases [LC00, MKIS00, MHH+01, DNH04, HMYW05, ESS05, Gal07, BMP+08,
CSD+08, BV08]. More recently, there has been a realization that methods to automatically
match schemas are uncertain. That is, when an algorithm for schema matching is
executed, it might say that there are many possible mappings between one schema and another,
and that a probability distribution over this set of mappings specifies the probability
that a specific mapping is the correct one [DHY07]. Work to date [DHY07, DSDH08] on
probabilistic schema mapping has studied the problem of answering SPJ (select-project-
join) queries over multiple databases. In Chapter 7 we extend previous work by focusing
on scalable algorithms for answering aggregate queries in the presence of probabilistic
schema mappings. We also study how the use of uncertain schema mappings relates to
inconsistency in the presence of integrity constraints.
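The semantics of an aggregate query under a probability distribution over mappings can be illustrated with a minimal sketch (hypothetical data and mappings; this is only the by-expectation idea, not the scalable algorithms of Chapter 7):

```python
# Illustrative sketch: under a probability distribution over candidate
# schema mappings, an aggregate query has a distribution of answers; one
# common semantics returns the expectation of that distribution.

source = [("a1", 10), ("a2", 25)]  # source tuples: (id, value)

# Two hypothetical candidate mappings of the source "value" column into
# the target schema; m2 happens to map only one of the rows.
def m1(rows):
    return [v for _, v in rows]               # maps every row's value

def m2(rows):
    return [v for i, v in rows if i == "a1"]  # maps only row a1

mappings = [(m1, 0.7), (m2, 0.3)]  # (mapping, probability)

# SUM under each mapping, then the expected answer over the distribution.
answers = [(sum(m(source)), p) for m, p in mappings]
expected = sum(a * p for a, p in answers)
print(answers)    # [(35, 0.7), (10, 0.3)]
print(expected)   # 27.5
```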
In summary, the goal of this thesis is to attack particular problems of knowledge
integration and provide personalizable approaches to handle them. Previous work on
uncertainty and inconsistency management tries to overcome these issues by assuming
that there is one correct way of handling them. Using the frameworks and tools proposed
here, users can specify when and how they want to manage or resolve the issues that the
integration of several heterogeneous knowledge bases yields, in the way that best suits
their needs.
Chapter 2
Related Work
In this chapter, we will review the literature that is related to the work developed
in this thesis. In Section 2.1, we will review the literature on inconsistency management
both in artificial intelligence and database theory. Section 2.2 focuses on the work on
partial information, particularly in relational databases. Section 2.3 will focus on the area
of entity resolution and deduplication, which is related to our work described in Chapter 5
for identifying inconsistency in news reports about events. Finally, in support of the work
presented in Chapter 7, Section 2.4 reviews the literature on the problem of data exchange
and integration, focusing on (probabilistic) schema matching.
2.1 Inconsistency Management
There has been a tremendous amount of work in inconsistency management since
the 60s and 70s, when paraconsistent logics were introduced [dC74, BS89] and logics of
inconsistency [Bel77, Gra78] were developed. In general, two kinds of approaches have
been proposed in the literature for solving the problem of inconsistency.
The first type focuses on revising the knowledge base and restoring its consistency.
This approach, initiated in [RM70], proposes to give up some formulas of the knowledge
base in order to get one or several consistent subbases. More specifically, [RM70] considers
expressing preferences among the maximal consistent subsets of the original (propositional)
knowledge base, so that preferred maximal consistent subsets are determined.
In the case of prioritized knowledge bases, Brewka [Bre89] has proposed a definition
of a preferred subbase. The basic idea is to take as much important information into
account as possible. More specifically, two generalizations of Poole’s approach [Poo88]
have been proposed: (1) stratified knowledge bases, in which formulas in the same stratum
are equally preferred, whereas formulas in a stratum are preferred over formulas in
lower strata; (2) the definition of a partial order among the formulas of a knowledge base.
Regarding inconsistency management based on a partial order on the formulas of a
knowledge base, in [Roo92] the author defines the concept of a reliability theory, based
on a partial reliability relation among the formulas in a first-order logic knowledge base.
In terms of this theory, the set of all most reliable consistent sets of premises is defined.
The set of theorems that can be proved from the theory is the set of propositions that
are logically entailed by every most reliable consistent set. The work defines a special-purpose
logic based on first-order calculus, and a deduction process to obtain the set of
premises that can be believed to be true. The deduction process is based on the computation
of justifications (premises used in the derivation of contradictions) for believing or
removing formulas, and it iteratively constructs and refines these justifications.
Priority-based management of inconsistent knowledge bases has been addressed
in [BCD+93] (see also [CLS95]). Specifically, propositional knowledge bases are considered
and a knowledge base is supposed to be organized into strata, where each stratum
has higher priority than the one directly below. Priorities in the knowledge base are used
to select preferred consistent subbases. Inferences are made from the preferred subbases
of the knowledge base; that is, the knowledge base entails a formula φ iff φ can be classically
inferred from every preferred subbase. In Section 4.6 we show how all of these
works can be expressed in our general framework for inconsistency management, defined
in Chapter 4.
The second approach to inconsistency accepts it and copes with it, prohibiting the
logic from deriving trivial inferences; this includes multi-valued, paraconsistent, and default
logics. The four-valued logic in [Bel77] was used for handling inconsistency in logic
programming [BS89] and was extended to the case of lattices [KS89, KS92] and bilattices
[Fit91]. Subsequently, [Rei80a] introduced default logic — a database DB with
integrity constraints IC could easily be written as a default logic theory Δ = (D, W),
where D = { :f / f | f ∈ DB } and W = IC — the “extensions” of the default theory
Δ correspond exactly to maximal consistent subsets of DB that are consistent with IC.
Later, [BKM91, BKMS91] applied these ideas, together with algorithms, to integrate
multiple knowledge bases. Kifer and Lozinskii [KL92] extended annotated logics of inconsistency
developed by Blair and Subrahmanian [BS89] to handle the full first-order case,
while [TK93] developed similar extensions to handle inheritance networks.
Finally, argumentation methods [SL92, PS97, AP07, BH05, AC02] have been used
for handling uncertainty and inconsistency by means of reasoning about how certain con-
tradictory arguments defeat each other. Argumentation frameworks range from abstract
argumentation to frameworks rooted in different types of logics. In abstract argumentation
frameworks, proposed by Dung in [Dun95], arguments consist simply of names, along
with an attack relation representing the fact that an argument is challenged by another,
where no indication of the nature of that challenge is described in the relation. Approaches
to logical argumentation include argumentation frameworks based
on classical first-order argumentation [BH05], default logics [BDKT97], and defeasible
logics [GS04, PS97], among others.
The area of belief change in artificial intelligence is closely related to the management
of inconsistent information. Belief change aims to adequately model changes in
knowledge in belief systems, i.e., modeling the dynamics of the knowledge that consti-
tutes the set of beliefs of an agent when new information is presented. There are different
kinds of belief changes (belief change operations): revision, contraction, and consolida-
tion.
It seems reasonable to think that inconsistency management techniques are mate-
rializations of certain variations of belief change methods. However, the seminal works
on belief change [AGM85] are done under the assumption that the knowledge base to be
contracted or revised is a belief set, i.e., a set of sentences that is deductively closed. In the
setting of a relational database, considering the logical closure of the first-order encoding
of a database instance as the belief set to be revised or consolidated is not really prac-
tical since we would be considering as part of the knowledge base elements that do not
really have independent standing in the relational instance. For instance, for every record
of the form salary(John, 80K), the sentence salary(John, 80K) ∨ φ, where φ is any
other sentence that can be formed from the language, will hold. The latter sentence is a
mere logical consequence that should have no standing of its own, especially in relational
databases where we only desire to reason about the literals explicitly present as records
in the database. Knowledge bases in this setting are therefore belief bases, i.e., sets of
sentences that are not closed under logical consequence.
The AGM model proposed in [AGM85] presents a set of postulates for contraction
and revision operations. In [Gar78], the basic set of postulates for contraction and revision
operations was first presented; later, in [AGM85], the set of postulates was extended
and partial meet contraction over belief sets was presented together with a representation
theorem that shows that partial meet contraction exactly characterizes the set of contraction
operators that satisfy the rationality postulates from [Gar78]. Unfortunately, not
all rationality postulates of the AGM model [AGM85, Gar88b] and its derivations, nor all
belief change operators that satisfy them, are applicable to belief bases. Alternatively,
Hansson [Han94, Han97] defines a different set of rationality postulates, and corresponding
partial meet revision and contraction operators, to deal exclusively with
belief bases. In Section 3.5, we discuss the relationship between belief revision in belief
bases and IMPs by studying the rationality postulates from [Han97] when an IMP is
considered as a belief revision operator. Partial meet contraction and revision as defined above
are actually applicable to both belief sets and belief bases. However, it is important to note
that K ⊥ α is the set of maximal subsets of K that do not imply α; it is not enough
that they do not contain α. Consider as an example the belief base K = {p ∨ q, p ∨ ¬q};
in this case K ⊥ p = {{p ∨ q}, {p ∨ ¬q}}. Repairing a relational database by keeping the
intersection of all maximal consistent subsets w.r.t. a set of integrity constraints can be
shown to be a special case of partial meet revision (i.e., full meet revision).
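The remainder set K ⊥ α from the example above can be computed by brute force over truth assignments (a didactic sketch; formulas are encoded as Boolean functions over the two atoms):

```python
from itertools import product, combinations

# Brute-force sketch of the remainder set K ⊥ α, using the example from
# the text: K = {p ∨ q, p ∨ ¬q} and α = p.
f1 = lambda p, q: p or q          # p ∨ q
f2 = lambda p, q: p or not q      # p ∨ ¬q
alpha = lambda p, q: p            # p
K = {"p∨q": f1, "p∨¬q": f2}

def entails(formulas, goal):
    """True iff every assignment satisfying all formulas satisfies goal."""
    return all(goal(p, q)
               for p, q in product([True, False], repeat=2)
               if all(f(p, q) for f in formulas))

# K ⊥ α: the maximal subsets of K that do NOT entail α.
candidates = [set(names) for r in range(len(K) + 1)
              for names in combinations(K, r)
              if not entails([K[n] for n in names], alpha)]
remainder = [s for s in candidates
             if not any(s < t for t in candidates)]
print(remainder)  # [{'p∨q'}, {'p∨¬q'}] — each singleton, but not both together
```

Note that neither formula *contains* p, yet the two together entail it, which is why {p ∨ q, p ∨ ¬q} itself is not in the remainder set.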
Finally, there are also several important works [Loz94, HK05, GH06, MPS+07] on
measuring the amount of inconsistency in a database or knowledge base that can be used
in combination with any of the inconsistency management techniques described above.
2.1.1 Database Repairing and Consistent Query Answering
In databases, the integrity constraints (ICs) capture the semantics of data with re-
spect to the external reality that the database is expected to model. Therefore, an incon-
sistent database instance, together with the integrity constraints, may be represented as an
inconsistent set of formulas, where integrity constraints are closed first-order formulas.
Examples of the most common types of ICs treated in the literature are: (1) Universal
integrity constraints: allow the expression of constraints such as “the salary of every employee
must be greater than or equal to zero”; (2) Denial constraints: allow the expression
of constraints such as “it is not possible for an employee to be married and single at the
same time”; (3) Functional Dependencies: allow expressions like “if two books have the
same ISBN then their titles must be the same”; and (4) Inclusion dependencies: allow ex-
pressions like “if a person is registered as a graduate student in the GradStudents relation,
then there must exist a record for that same person in the Students relation”.
A database instance I is said to be consistent w.r.t. a set of integrity constraints if it
satisfies the given set IC (in the standard model-theoretic sense), and inconsistent otherwise.
Methods to clean data and/or provide consistent query answers in the presence of
inconsistent data have been widely studied in the last decades [FUV83, FKUV86, JDR99,
SS03, Cho07, BFFR05]. [FUV83, FKUV86] introduced the use of maximal consistent
subsets (called “flocks”) to accommodate updates to a database that violate integrity constraints.
The field of database repairing and consistent query answering (CQA for short) has
gained much attention since the work of Arenas et al. [ABC99], which provided a model-
theoretic construct of a database repair: a repair of an inconsistent database is a model of
the set of ICs that is minimal, i.e., “as close as possible” to the original database. Clearly,
repairs may not be unique, and in the general case there can be a very large number of
them. The most widely accepted semantics for querying a possibly inconsistent database
is that of consistent answers. A consistent answer for a query over a possibly inconsistent
database is comprised of the set of tuples that appear in the answer to the query over every
possible repair. CQA enforces consistency at query time as an alternative to enforcing it
at the instance level as conventional data cleaning techniques do. This allows us to focus
on a smaller portion of the database for which repairs can be computed more easily.
Furthermore, techniques have been developed so it is not necessary to materialize every
possible repair.
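The repair and consistent-answer notions can be made concrete with a small sketch (tuple-deletion repairs under a single functional dependency; a didactic illustration, not the optimized techniques cited below):

```python
from itertools import combinations

# Toy sketch of subset repairs and consistent query answering under the
# functional dependency Name → Salary.
db = [("John", 70), ("John", 80), ("Mary", 90)]  # two conflicting facts for John

def consistent(instance):
    """Check the FD Name → Salary."""
    sal = {}
    for name, s in instance:
        if name in sal and sal[name] != s:
            return False
        sal[name] = s
    return True

# Repairs: maximal consistent subsets of the instance.
subsets = [list(c) for r in range(len(db), -1, -1)
           for c in combinations(db, r) if consistent(list(c))]
repairs = [s for s in subsets
           if not any(set(s) < set(t) for t in subsets)]

# Query: names of employees earning more than 60.  A tuple is a consistent
# answer iff it appears in the answer over *every* repair.
answers_per_repair = [{n for n, s in rep if s > 60} for rep in repairs]
consistent_answers = set.intersection(*answers_per_repair)
print(len(repairs))        # 2 — one repair per choice of John's salary
print(consistent_answers)  # {'John', 'Mary'}: John earns >60 in both repairs
```

Even though John's exact salary is uncertain, he is a consistent answer to the query, because every repair agrees that he earns more than 60.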
The work of [Cho07] addresses the basic concepts and results of the area of consistent
query answering. They consider universal and binary integrity constraints, denial
constraints, functional dependencies, and referential integrity constraints (e.g., foreign
key constraints). [BFFR05] presents a cost-based framework that allows finding
“good” repairs for databases that exhibit inconsistencies in the form of violations to ei-
ther functional or inclusion dependencies. They propose heuristic approaches to con-
structing repairs based on equivalence classes of attribute values; the algorithms pre-
sented are based on greedy selection of least repair cost, and a number of performance
optimizations are also explored. The technique based on query rewriting introduced
in [ABC99] for quantifier-free conjunctive queries and binary universal constraints was
extended in [FM05, FFM05] to work for a subclass of conjunctive queries in the pres-
ence of key constraints. The complexity of the consistent query answer problem was
investigated in [CLR03] in the presence of both functional and inclusion dependencies,
and further studied in [CM05] in the presence of denial constraints and inclusion de-
pendencies. The notion of consistent answer was extended to the case of aggregate
queries in [ABC+03b], where the evaluation of consistent answers of aggregate queries
was investigated in the presence of functional dependencies. The logic-based frameworks
in [ABC03a, GGZ03, BB03] assume that tuple insertions and deletions are the basic prim-
itives for repairing inconsistent data. Repairs also consisting of value-update operations
were considered in [FLPL+01, BBFL05, Wij03, Wij05, BFFR05, FFP05, FFP07]. In
particular, [Wij03, Wij05, FFP05] investigated the complexity of the consistent query an-
swering problem when the basic primitive for repairing data is the attribute-value update,
whereas [FLPL+01, BBFL05, BFFR05, FFP07] focused on the problem of computing
repairs rather than computing consistent answers.
In CQA, an answer to a given query posed to an inconsistent database is said to
be consistent if the same answer is obtained from every possible repair of the database.
Clearly, this is a very cautious approach, and it can yield a great loss of information.
Even though several works investigated the problem of repairing and querying inconsis-
tent data considering different classes of queries and constraints, only recently there have
been two proposals which shifted attention towards improving the quality of consistent
answers [CGZ09, SCM09]. These approaches developed more specific repairing strategies
that reduce the number of possible repairs to be considered and improve their quality
according to some criteria specified by the database administrator on the basis of users’
preferences. We study these two approaches in more depth in Section 3.6 in comparison
with IMPs, our approach to inconsistency management in relational databases developed
in Section 3.5.
Almost all past approaches described above proceeded under the assumption that
there was some “epistemically correct” way of resolving inconsistencies or reasoning
in the presence of inconsistency. More recently, [SA07] argued that inconsistency can
often be resolved in different ways based on what the user wants, and they provided a
mechanism to reason about maximal consistent subsets using objective functions that the
user gets to choose.
With the exception of [SA07], to the best of our knowledge all past work on in-
consistency management in databases assumes that the database designer knows more
than the user about how to handle inconsistency, and hence, inconsistencies are resolved
within the database infrastructure. Unfortunately, this is not wise. A database designer
for Oracle is not likely to know how, for example, an astronomy database was collected,
18
what assumptions were made when collecting it, and how inconsistencies should be han-
dled in the context of such knowledge. In the same vein, a database developer for other
major DB vendors may not understand how some epidemiological data in a government’s
Health Ministry was collected, and what the ramifications and risks are to a policy-maker
using that data to make decisions. The policy-maker may wish to make decisions on the
basis of some inconsistent data taking into account not only information about how the
data was collected, which sources were reliable, and so forth, but also about his own risk
if he makes the wrong decision. A database designer who has embedded some form of
inconsistency management within the DB infrastructure will not know this a priori. This
is the main difference between the approaches described in this section and our propos-
als of IMPs, in Chapter 3, and the general framework for inconsistency management in
Chapter 4.
2.2 Partial Information
Incomplete information can appear in knowledge bases for very different reasons.
For instance, people filling out forms or surveys often leave fields incomplete, in many
cases due to security or privacy issues. Data integration techniques are another source
of incomplete information, especially when performed automatically. For instance, when
merging knowledge bases with disparate schemas into a global schema, the consolidated
knowledge base often contains null values for attributes that were not present in ev-
ery source schema. The problem of representing incomplete information in relational
databases has been extensively studied since the introduction of the relational data model
[Cod74, Bis79, Cod79, Gra77, IJ81, IL84b, AM86, LL99, Lie82, Jr.79, Zan84]. Incomplete
information makes the process of reasoning much harder since, if not treated carefully,
results can present incorrect or biased information.
It was recognized early on that there are different types of null values, each of
which reflects different intuitions about why a particular piece of information is missing.
Different types of null values have been proposed in the literature. [CGS97a] presents the
following list of types of null values:
• Existential Null. The value for an attribute A exists but it is not known. The actual
value for an existential null could be any value within the domain of A.
• Maybe Null. The value for an attribute A may exist but it is not known at the
moment of the record’s creation or may not exist at all, and it is not known which
is actually the case.
• Place holder Null. An attribute might not be applicable for a certain record.
• Partial Nulls. There exists a value for an attribute but it might be one out of a set of
possible values. The set of possible values is a subset of the attribute’s domain.
• Partial Maybe Nulls. An attribute may or may not be applicable for a particular
record, but if it is applicable then there exists a value and it must fall within a
specified set.
There is a clear trade-off between the expressivity of the model for incomplete
information and the difficulty of answering queries. Allowing more than one kind of null
value makes reasoning over the data even more complex, especially when several relations
are put together, since there might be inconsistency at the null value level. For instance,
in one relation the value for attribute Phone for an employee appears as an existential
null whereas in another database the record for the same employee contains a null value
for Phone but as a place holder. [CGS97a] presents a unified framework to deal with
different kinds of null values.
No-information nulls were introduced in [Zan84] to deal with the case where it
is not known whether a missing value exists or not. They have also been considered
in [Lie82] and [AM86, HL10], where integrity constraints in the presence of no-information
nulls are studied.
A greater amount of work has been devoted to the study of databases containing
only unknown nulls. In this context, different problems have been addressed, such as
query answering [AKG91, Gra91, IL84a, Rei86, YC88], the characterization of consis-
tency in the presence of integrity constraints [Gra91, IL83, LL98, LL99, Vas80] and up-
dating the database [AG85, Gra91]. As stated in [IL84a], a condition for correctness is
that if a null value has a specified semantic interpretation, that is, if we assume that a
specified relation exists between a table with nulls and the real world, then this relation
should be similar for the tables that are arguments of a relational operator and for the
relation obtained as the result (this is the requirement for a strong representation system).
Codd’s approach [Cod74] is based on a three-valued logic with truth values True,
False, and Unknown. Under this approach, a condition over a relation containing null values
can evaluate to Unknown depending on the logical operators. For instance, Unknown ∨
Unknown evaluates to Unknown. Codd’s semantics has been criticized on semantic
grounds by Grant [Gra77] and Lipski [Lip79]. Other works such as [Lip79, Lip81, Vas79]
propose different semantics for other subsets of the relational operators. Through the def-
inition of a representation system, [IL84a] formally presents the conditions that must be
satisfied in any semantically meaningful extension of the relational algebra to handle incomplete
information.
More expressive data models where the values that (labeled) unknown nulls can
take can be constrained have been considered in [AKG91, Gra91, IL84b]. Codd tables,
V-tables, and C-tables are analyzed in these works as possible representation systems. In
Codd tables, null values are represented by a special symbol, and the semantics for that
symbol is that it is an unknown value. Codd tables are a strong representation system
for projection and selection. V-tables allow many different (“marked”) null values, or
variables. The meaning of a null value X is that the value is unknown but it is the same
value every time X appears. For this representation system, [IL84a] shows that V-tables
are a strong representation system for projection, positive selection, union, join, and re-
naming of attributes; therefore, arbitrary conjunctive queries can be processed correctly.
Finally, C-tables (or conditional tables) are V-tables with an additional column, condition,
containing a condition for the variables that represent null values. C-tables are a strong
representation system for projection, selection, union, join, and renaming.
The most commonly adopted semantics for this kind of database is given in terms
of completions, or possible worlds (we will provide the formal definition in Chapter 6).
Two largely accepted semantics of query answering are: (i) certain answers – a tuple is
a certain answer to a query Q if it is an answer to Q in every completion; (ii) possible
answers – a tuple is a possible answer to a query Q if it is an answer to Q in some
completion.
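The two semantics can be sketched over a toy Codd table (a didactic example; the domain is artificially restricted to two values so that the completions can be enumerated):

```python
# Sketch of certain vs. possible answers over the completions of a table
# with one unknown null; data and domain are hypothetical.
NULL = None
table = [("John", NULL), ("Mary", "Math")]   # (name, department)
domain = ["CS", "Math"]                      # possible values for the null

# Completions: one instance per way of replacing the null by a domain value.
completions = [[(n, d if d is not NULL else v) for n, d in table]
               for v in domain]

def query(instance):                         # names in the Math department
    return {n for n, d in instance if d == "Math"}

answers = [query(c) for c in completions]
certain = set.intersection(*answers)   # answer in EVERY completion
possible = set.union(*answers)         # answer in SOME completion
print(certain)   # {'Mary'}
print(possible)  # {'John', 'Mary'}
```

Mary is a certain answer because she appears regardless of the null's value; John is only a possible answer, since he qualifies in just one completion.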
The main difference between the aforementioned approaches and the treatment of
partial information that we propose in Chapter 6 is that in the former no attempt at resolving
incompleteness is made. On the contrary, in Chapter 6 we argue that the data stored
in the database and the knowledge the user has of it can be profitably exploited to add
valuable new information to the database.
Outside the database theory community, the fields of data mining, data warehousing,
and data management have addressed the problem of data quality in the presence
of incomplete information mainly providing approaches based on cleaning techniques.
In general, simple estimates for numeric and categorical data are used [MST94, Pyl99,
Qui93]. A more complex approach in classi?cation analysis was proposed in [BFOS84].
For numeric data analysis, methods based on regression techniques were also devel-
oped [FLMC02, Pyl99]. Probabilistic methods include probabilistic weighting methods
such as the one proposed in [Qui93] and Bayesian methods. In the last category, [CS86]
proposes a Bayesian procedure for estimating and replacing missing data based on some
prior knowledge about the distribution of the data; [CA03] estimates and replaces null val-
ues for categorical attributes using the uniform prior distribution and a Dirichlet posterior
distribution. Recently, in [Li09] the posterior probabilities of a missing value belonging
to a certain category are estimated using the simple Bayesian method and are used to
compute a replacement value.
The main difference between the works above and our approach is that the former
are ad-hoc statistical approaches tailored for particular domains and applications; in contrast,
we develop a more general formal framework that allows users to specify a wide
set of strategies, and in particular those that they believe are adequate for the specific data
and application at hand. Furthermore, all of these works are designed, in general, to
be applied as “cleaning tools” to a database and they do not study the interaction of the
methods with relational algebra operators as we do in this work. Another difference is
that we adopt a richer data model which allows us to express different kinds of partial
information, whereas all the approaches mentioned above deal with unknown nulls only.
Our work also addresses the problem of supporting the efficient application of incompleteness
resolution methods, by providing index structures that allow us to manage large
databases, which is an issue addressed in none of the works above.
Finally, we mention that our work in Chapter 6 is also related to hypothetical reasoning
[Res64], where one considers what follows from an assumption. Usually the assumption
is inconsistent with known information, requiring changes to the latter in order
to make it consistent with the assumption. In our case we may consider filling in a value
as a hypothesis.
Many works in AI have also tackled the issue of dealing with incompleteness.
These include logic program completions [Cla77], the closed world assumption [Rei78],
auto-epistemic logic [Moo85], default logic [Rei80b], and alternating-time temporal logic
with imperfect information [JD05, JD06], to mention a few. In these works, partial information
is defined in a different way, as a kind of uncertain information that does not deal
with null values.
2.3 Entity Resolution and Deduplication
The problem of event equivalence identification addressed in Chapter 5 is closely
related to a class of problems called entity resolution in machine learning [BG07]. Given
a set of (potentially different types of) entities, entity resolution asks for a partition of
this set such that entities are grouped together iff they are equivalent. In our problem,
the entities of primary interest are events, but we also reason about the equivalence of ac-
tors, locations, and objects, which are entities of secondary interest. Traditional machine
learning approaches to entity resolution focus on pairwise entity equivalence determi-
nation using established techniques such as Bayesian networks [Hec98], support vector
machines [CS00], or logistic regression [NJ02]. Hence, for each pair of entities a classifier
is used to determine equivalence. In a post-processing step, inconsistencies due to
violations of transitivity are resolved.
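The pairwise-then-post-process pipeline can be sketched as follows (the `same_entity` classifier is a trivial stand-in for a learned model; union-find enforces transitivity on its decisions):

```python
# Sketch of the two-stage approach described above: a pairwise equivalence
# "classifier" (a hypothetical surname check standing in for a learned
# model) followed by transitive closure via union-find.
names = ["Jon Smith", "John Smith", "J. Smith", "Mary Jones"]

def same_entity(a, b):
    """Hypothetical pairwise classifier: records share a surname token."""
    return a.split()[-1] == b.split()[-1]

parent = list(range(len(names)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

# Post-processing: merging pairwise "equivalent" records yields a partition,
# enforcing transitivity even if the classifier's decisions were not transitive.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if same_entity(names[i], names[j]):
            parent[find(i)] = find(j)

clusters = {}
for i, n in enumerate(names):
    clusters.setdefault(find(i), []).append(n)
print(list(clusters.values()))
# [['Jon Smith', 'John Smith', 'J. Smith'], ['Mary Jones']]
```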
Recent work has considered joint entity resolution in the case when entities are related [GD05]. Such approaches are termed relational, because they determine all event equivalences at once and take relationships between entities into account instead of making a series of independent decisions. Some proposals for relational entity resolution define a joint probability distribution over the space of entity equivalences and approximate the most likely configuration using sampling techniques, such as Gibbs sampling [SNB+08], or message-passing algorithms, such as loopy belief propagation [MWJ99]. While these approaches are based on a probabilistic model, they provide no convergence guarantee and allow little theoretical analysis. Other relational approaches are purely procedural in that they apply non-relational classifiers in an iterative fashion until some convergence criterion is met. While iterative methods are fast in practice, they are not amenable to theoretical analysis or convergence guarantees. All these approaches are feature-driven and do not have a formal semantics.
Recently, [BGMM+09] proposed a generic approach to entity resolution (ER). In this work the authors formalize the generic ER problem, treating the functions for comparing and merging records as black boxes. They identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms: idempotence, commutativity, associativity, and representativity (a record obtained from merging two records represents the original records, in the sense that any other record that would have matched the first two will also match it). Three efficient ER algorithms are developed: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh, which exploit the four properties. F-Swoosh in addition assumes knowledge of the "features" (e.g., attributes) used by the match function. R-Swoosh (and F-Swoosh) can be used also when the four match and merge properties do not hold, if an approximate result is acceptable.
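The match/merge interplay described above can be sketched as follows. This is our own toy illustration in the spirit of R-Swoosh, not the authors' algorithm; the `match` and `merge` functions below are invented stand-ins for the black-box comparison and merging functions.

```python
# Toy match-and-merge entity resolution in the R-Swoosh spirit: repeatedly
# find a matching pair, merge it, and reinsert the merged record until no
# matches remain. Assumes the four ICAR properties (idempotence,
# commutativity, associativity, representativity) hold, so merge order
# does not affect the final result.

def match(r1, r2):
    # Invented black-box matcher: records match if they share a name token.
    return bool(r1["names"] & r2["names"])

def merge(r1, r2):
    # Invented black-box merger: union the attribute sets.
    return {"names": r1["names"] | r2["names"],
            "phones": r1["phones"] | r2["phones"]}

def resolve(records):
    """Exhaustively apply match/merge until a fixpoint is reached."""
    result = []
    work = list(records)
    while work:
        r = work.pop()
        partner = next((s for s in result if match(r, s)), None)
        if partner is None:
            result.append(r)
        else:
            result.remove(partner)
            work.append(merge(r, partner))
    return result

records = [
    {"names": {"J. Smith"}, "phones": {"555-1234"}},
    {"names": {"J. Smith", "John Smith"}, "phones": set()},
    {"names": {"Mary Jones"}, "phones": {"555-9999"}},
]
resolved = resolve(records)
# The two J. Smith records merge into one; Mary Jones stays separate.
```

Because representativity lets the merged record stand in for its constituents, the merged record goes back into the work list so it can absorb further matches.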
2.4 Schema Mappings and Data Exchange
Data exchange is the problem of taking data structured under a source schema and translating it into an instance of a target schema that reflects the source data as accurately as possible. Data exchange is an old and recurrent problem in databases. The work of [FKMP05] reviews the principal components and solutions to the problem, focusing on query answering. In data exchange, it is assumed that there is a set of constraints that define the relationship between the source schema and the target schema. These constraints, called source-to-target dependencies, establish how and what source data should appear in the target. The data exchange problem can then be specified as: given an instance I over a source schema S, materialize an instance J over a target schema T such that J satisfies the set of source-to-target dependencies between S and T. The problem of query answering in a data exchange setting has also been studied in the last decade. [FKP05, GN06] focus on complexity results for computing (universal) certain answers in the presence of tuple-generating and equality-generating dependencies, in addition to a set of source-to-target dependencies.
A related problem is that of data integration, where the goal is to query heterogeneous data in different sources via a virtual global schema. There has been intense work during the last few years on schema matching in order to answer queries over multiple databases [LC00, MKIS00, MHH+01, DNH04, HMYW05, ESS05, Gal07, BMP+08, CSD+08, BV08]. The works mentioned above assume that it is possible to exactly map the translation. However, the work of [DHY07] argues that this is not necessarily always the case and that therefore uncertainty on how to map the source schema to the target schema may arise. For instance, methods to automatically match schemas are uncertain: when an algorithm for schema matching is executed, it might determine that there are many possible mappings between one schema and another, and that a probability distribution over this set of mappings specifies the probability that a specific mapping is the correct one [DHY07, Gal06]. For example, in a residential or commercial real estate web site that aggregates information from multiple realtors across the country, there is a need to find mappings between disparate schemas. In such a case, there may be many different ways of mapping one schema to another and a probability distribution might tell us the probability that a given mapping is correct. Alternatively, a web search engine doing a product search over the databases of multiple vendors needs to find mappings between the product database of one vendor and the product database of another vendor. Multiple ways of representing the data might lead to multiple possible schema mappings, together with a probability distribution over the set of mappings.
Work to date [DHY07, DSDH08] on probabilistic schema mapping has studied two semantics for answering queries over multiple databases using probabilistic schema matches: a "by-table" semantics and a "by-tuple" semantics. In the "by-table" semantics, a single mapping is applied to the entire set of tuples in the source relation. Each query is answered separately, and the answers are assigned a probability according to the mappings that produced them. In the "by-tuple" semantics, however, a choice of mapping must be made for each tuple, so it is necessary to consider all possible ways of assigning a mapping to a tuple and answer the query for each one of them. Clearly, the by-tuple semantics is more complex and, in the general case, difficult to compute.
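As a rough illustration of the by-table semantics (with an invented source relation, invented candidate mappings, and an invented function name), each candidate mapping is applied to the whole source relation, and each answer tuple accumulates the probability mass of the mappings that produce it:

```python
# Sketch of "by-table" query answering over a probabilistic schema mapping.
# Every tuple of the source is translated under one mapping at a time; an
# answer's probability is the total probability of the mappings yielding it.

from collections import defaultdict

# Source relation over attributes (A, B).
source = [("john", "70K"), ("mary", "90K")]

# Two candidate mappings to target attributes (Name, Salary),
# with a probability distribution over them.
mappings = [
    ({"A": "Name", "B": "Salary"}, 0.8),   # A is the name
    ({"A": "Salary", "B": "Name"}, 0.2),   # A is the salary
]

def by_table_answers(query_attr, value_pred):
    """Answers to 'select target tuples where value_pred(target[query_attr])',
    each paired with the probability that it is a correct answer."""
    probs = defaultdict(float)
    for mapping, p in mappings:
        for a, b in source:                      # one mapping per whole table
            target = {mapping["A"]: a, mapping["B"]: b}
            if value_pred(target[query_attr]):
                probs[tuple(sorted(target.items()))] += p
    return dict(probs)

# Query: target tuples whose Name is "john".
answers = by_table_answers("Name", lambda v: v == "john")
```

Under the by-tuple semantics one would instead enumerate every assignment of a mapping to each individual tuple, which is what makes that semantics combinatorially harder.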
When aggregate queries are considered, then in addition to whether a by-table or by-tuple semantics should be used, we need to consider the semantics of the aggregates themselves in the presence of uncertainty. Some work has been done on aggregates over uncertain data [RSG05, JKV07], yet none at all w.r.t. aggregate computations under probabilistic schema mapping. In Chapter 7 we study answering aggregate queries in this setting.
Chapter 3
Inconsistency Management Policies
The work described in this chapter appears in [MPP+08, MPP+10].
3.1 Introduction and Motivating Example
Almost all past approaches to inconsistency proceeded under the assumption that there was some "epistemically correct" way of resolving inconsistencies or reasoning in the presence of inconsistency. More recently, [SA07] argued that inconsistency can often be resolved in different ways based on what the user wants, and they provided a mechanism to reason about maximal consistent subsets (also called "repairs" by others such as [ABC99]) using objective functions that the user gets to choose.
With the exception of [GH92, GH93], and [SA07], all past work on inconsistency management in databases assumes that the database designer knows more than the user about how to handle inconsistency, and hence, inconsistencies are resolved within the database infrastructure. Unfortunately, this is not wise. A database designer for Oracle is not likely to know how, for example, an astronomy database was collected, what assumptions were made when collecting it, and how inconsistencies should be handled in the context of such knowledge. In the same vein, a database developer for other major DB vendors may not understand how some epidemiological data in a government's Health Ministry was collected, and what the ramifications and risks are to a policy maker using that data to make decisions. The policy maker may wish to make decisions on the basis of some inconsistent data taking into account not only information about how the data was collected, which sources were reliable, and so forth, but also his own risk if he makes the wrong decision. A database designer who has embedded some form of inconsistency management within the DB infrastructure will not know this a priori.
The work in this chapter aims to support the two types of users mentioned above. The main goal of this work is developing the theory and implementation support required in databases so that end users can bring both their application-specific knowledge as well as their own personal risk to bear when resolving inconsistencies. To reiterate why this is important, let us consider the following example.
Example 1. The Emp relation below represents data about employees, their salaries, and tax brackets; each tuple t_i in the relation is provided by a source s_j (which is annotated on the same row).

        Name   Salary   Tax_bracket   Source
  t_1   John   70K      15            s_1
  t_2   John   80K      20            s_2
  t_3   John   70K      25            s_3
  t_4   Mary   90K      30            s_1
Let us assume that salaries are uniquely determined by names. In this case, a user
may want to resolve the inconsistency about John’s salary in many different ways.
C1 If he were considering John for a loan, he might want to choose the lowest possible
salary of John to base his loan on. Here, the user’s decision is based on the risk he
would take if he gave John a loan amount based on a higher income.
C2 If he were assessing the amount of taxes John has to pay, he may choose the highest
possible salary John may have — in this case, the end user is basing his decision on
his own reward (assuming his salary is somehow based on the amount of additional
taxes he can collect).
C3 If he were just trying to estimate John’s salary, he may choose some number between
70K and 80K (e.g., the average of the three reports of John’s salary) as the number.
C4 If he has different degrees of confidence in the sources that provided these salaries,
he might choose a weighted mean of these salaries.
C5 He might choose not to resolve the inconsistency at all, but to just let it persist until
he can clear it up – there may be no need to deal with John’s salary with respect to
his current task.
C6 He might simply consider all the data about John unreliable and might want to ignore
it until it can be cleared up – this is the philosophy of throwing away all contami-
nated data.
Each of cases C1 through C6 reflects a policy that the user is using to resolve inconsistencies. We are not aware of a single piece of past work that can handle all six reasonable possibilities mentioned above. [BKM91, SA07, ABC99, BFFR05, CLR03, BC03, Cho07] can handle cases C1 and C2, but not the other cases. Much as a carpenter has tools like hammers and saws, we propose to put data cleaning policies in the hands of users so that they can use them as tools, when appropriate, for reasoning about the data. It is important to enable users to bring their application-specific knowledge to bear when resolving inconsistencies. Section 2.1 describes work in AI and databases that relates to the work developed in this chapter.
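The six choices can be viewed as interchangeable resolution functions applied to the cluster of John's conflicting salary reports. The sketch below is our own illustration, and the source weights used for C4 are invented:

```python
# Cases C1-C6 realized as interchangeable "policy" functions over the
# conflicting salary reports for John (salaries in thousands).

john_reports = [  # (salary, source, invented source weight for C4)
    (70, "s1", 0.5), (80, "s2", 0.3), (70, "s3", 0.2)
]
salaries = [r[0] for r in john_reports]

policies = {
    "C1_min": lambda: min(salaries),                   # loan: be cautious
    "C2_max": lambda: max(salaries),                   # taxes: assume the most
    "C3_avg": lambda: sum(salaries) / len(salaries),   # plain estimate
    "C4_weighted": lambda: sum(s * w for s, _, w in john_reports),
    "C5_keep": lambda: salaries,                       # let the conflict persist
    "C6_drop": lambda: None,                           # discard contaminated data
}

resolved = {name: f() for name, f in policies.items()}
```

The point of the chapter is precisely that no single one of these functions is "the" correct resolution; the user picks the tool that matches the task and the risk involved.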
The contributions of this chapter are as follows: in Section 3.3 we first define the concept of a policy for managing inconsistency. Our notion of an inconsistency management policy generalizes [BKM91, BKMS91, ABC99] by allowing policies that either remove inconsistency completely or allow part or all of the inconsistency to persist. Our notion of a policy accounts for all six cases above, and many more. We start with policies applicable to a single functional dependency (FD for short), one of the most common kinds of integrity constraints used in databases; then we discuss policies to manage multiple FDs. In Sections 3.5 and 3.6 we give a detailed overview and results of how our framework relates to postulates for the revision of belief bases [Han97] and to recent research in the area of consistent query answering [CGZ09, SCM09]. In Section 3.7 we show that our policies can be embedded as operators within the relational algebra in two different ways – one where the policy is applied first (before other relational operators) and another where it is applied last. We study the interaction of these policy usage methods together with the standard relational operators and provide several interesting results. In Section 3.8 we present several approaches to efficiently implement an IMP-based framework. Finally, Section 3.9 outlines conclusions.
3.2 Syntax and Notation
We assume the existence of relational schemas of the form S(A_1, . . . , A_n) [Ull88] where the A_i's are attributes. Each attribute A_i has an associated domain, dom(A_i). A tuple over S is a member of dom(A_1) × · · · × dom(A_n), and a set of such tuples is called a relation. We use t[A_i] to denote the value of the A_i attribute of tuple t. We use Attr(S) to denote the set of all attributes in S.
Given the relational schema S(A_1, . . . , A_n), a functional dependency (FD) fd over S is an expression of the form A_{t_1}, . . . , A_{t_k} → A_{t_{k+1}}, . . . , A_{t_m}, where {A_{t_1}, . . . , A_{t_m}} ⊆ Attr(S). A relation R over the schema S satisfies the above FD iff ∀ t_1, t_2 ∈ R, t_1[A_{t_1}] = t_2[A_{t_1}] ∧ . . . ∧ t_1[A_{t_k}] = t_2[A_{t_k}] ⇒ t_1[A_{t_{k+1}}] = t_2[A_{t_{k+1}}] ∧ . . . ∧ t_1[A_{t_m}] = t_2[A_{t_m}]. Without loss of generality, we assume that every functional dependency fd has exactly one attribute on the right-hand side (i.e., k + 1 = m) and denote this attribute as RHS(fd). Moreover, with a slight abuse of notation, we write that fd is defined over R.
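The satisfaction check in this definition can be sketched directly; this is our own illustration, and `satisfies_fd` and the toy relation are invented names:

```python
# Check whether a relation satisfies a functional dependency
# A_1, ..., A_k -> A_{k+1}: tuples agreeing on the left-hand side
# must also agree on the right-hand side.

def satisfies_fd(relation, lhs, rhs):
    """relation: list of dicts; lhs: list of attribute names; rhs: attribute.
    Returns True iff every two tuples agreeing on lhs agree on rhs."""
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in lhs)
        if key in seen and seen[key] != t[rhs]:
            return False
        seen[key] = t[rhs]
    return True

emp = [
    {"Name": "John", "Salary": "70K"},
    {"Name": "John", "Salary": "80K"},
    {"Name": "Mary", "Salary": "90K"},
]
# Name -> Salary is violated here: John appears with two salaries.
```

A single pass with a hash map suffices because a violation is always witnessed by two tuples sharing a left-hand-side key.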
Definition 1. Let R be a relation and T a set of functional dependencies. A culprit is a set c ⊆ R not satisfying T such that ∀ c′ ⊂ c, c′ satisfies T.
For instance, the culprits in the example of the Introduction are {t_1, t_2} and {t_2, t_3}. We use culprits(R, T) to denote the set of culprits in R w.r.t. T.
Definition 2. Let R be a relation and T a set of functional dependencies. Given two culprits c, c′ ∈ culprits(R, T), we say that c and c′ overlap, denoted c ∼ c′, iff c ∩ c′ ≠ ∅.
Definition 3. Let ∼* be the reflexive transitive closure of the relation ∼. A cluster is a set cl = ⋃_{c∈e} c where e is an equivalence class of ∼*.
In the example of the Introduction, the only cluster is {t_1, t_2, t_3}. We will denote the set of all clusters in R w.r.t. T as clusters(R, T).
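Definitions 1-3 can be illustrated with a small sketch of ours, using 0-based tuple indices to stand in for t_1, . . . , t_4; for a single FD, culprits are exactly the violating pairs, and clusters are the connected components of the overlap relation:

```python
# Culprits and clusters for a single FD Name -> Salary.

from itertools import combinations

def culprits(relation, lhs, rhs):
    """Minimal violating sets: for a single FD these are exactly the pairs
    of tuples that agree on lhs but differ on rhs (returned as index sets)."""
    out = []
    for i, j in combinations(range(len(relation)), 2):
        t1, t2 = relation[i], relation[j]
        if all(t1[a] == t2[a] for a in lhs) and t1[rhs] != t2[rhs]:
            out.append({i, j})
    return out

def clusters(culprit_sets):
    """Merge overlapping culprits into clusters (connected components)."""
    cls = []
    for c in culprit_sets:
        overlapping = [x for x in cls if x & c]
        merged = set(c).union(*overlapping)
        cls = [x for x in cls if not (x & c)] + [merged]
    return cls

emp = [
    {"Name": "John", "Salary": "70K"},  # t_1
    {"Name": "John", "Salary": "80K"},  # t_2
    {"Name": "John", "Salary": "70K"},  # t_3
    {"Name": "Mary", "Salary": "90K"},  # t_4
]
cs = culprits(emp, ["Name"], "Salary")   # {t_1,t_2} and {t_2,t_3} as {0,1}, {1,2}
cl = clusters(cs)                        # the single cluster {t_1,t_2,t_3}
```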
3.3 Inconsistency Management Policies
In this section, we introduce the concept of a policy for managing inconsistency in databases violating a given set of functional dependencies. Basically, applying an inconsistency management policy to a relation results in a new relation, with the intention of obtaining a lower degree of inconsistency.
Definition 4. An inconsistency management policy (IMP for short) for a relation R w.r.t. a set T of functional dependencies over R is a function γ_T from R to a relation R′ = γ_T(R) that satisfies the following axioms:
Axiom A1 If t ∈ R − ⋃_{c∈culprits(R,T)} c, then t ∈ R′. This axiom says that tuples that do not belong to any culprit cannot be eliminated or changed.
Axiom A2 If t′ ∈ R′ − R, then there exists a cluster cl and a tuple t ∈ cl such that for each attribute A not appearing in any fd ∈ T, t.A = t′.A. This axiom says that every tuple in R′ must somehow be linked to a tuple in R.
Axiom A3 ∀ fd ∈ T, |culprits(R, {fd})| ≥ |culprits(R′, {fd})|. This axiom says that the IMP cannot increase the number of culprits.
Axiom A4 |R| ≥ |R′|. This axiom says that the IMP cannot increase the cardinality of the relation.
γ_T is a singular IMP iff T is a singleton. When T = {fd} we write γ_fd instead of γ_{{fd}}.
It is important to note that Axioms A1 through A4 above are not meant to be exhaustive. They represent a minimal set of conditions that we believe any inconsistency management policy should satisfy. Specific policies may satisfy additional properties.
3.3.1 Singular IMPs
In this section, we define three important families of singular IMPs (tuple-based, value-based, and interval-based), which satisfy Axioms A1 through A4 and cover many possible real-world scenarios. Clearly, Definition 4 allows us to specify many other kinds of IMPs, based on the user's needs.
Definition 5 (tuple-based family of policies). An IMP γ_fd for a relation R w.r.t. a functional dependency fd is said to be a tuple-based policy if each cluster cl ∈ clusters(R, {fd}) is replaced by some cl′ ⊆ cl in γ_fd(R).
Tuple-based IMPs generalize the well-known notion of maximal consistent subsets [BKM91] and repairs [ABC99] by allowing a cluster to be replaced by any subset of the same cluster. Notice that tuple-based IMPs allow inconsistency to persist – a user may choose to retain all inconsistency (case C5) or retain part of the inconsistency. For instance, if the user believes only sources s_1, s_2 in Example 1, he might choose to replace the cluster {t_1, t_2, t_3} by the cluster {t_1, t_2} as shown below.
        Name   Salary   Tax_bracket
  t_1   John   70K      15
  t_2   John   80K      20
  t_4   Mary   90K      30
[BKM91, ABC99] do not allow this possibility. Observe that this kind of policy can cause some information to be lost as a side effect. In our example, although the Tax_bracket 25 is not involved in any FD, it is lost when the policy is applied. We now introduce two kinds of policies that avoid this problem. The first kind of policy is based on the notion of cluster simplification.
Definition 6. Given a cluster cl ∈ clusters(R, {fd}), cl′ is a cluster simplification of cl iff ∀ t_1, t_2 ∈ cl such that t_1[RHS(fd)] ≠ t_2[RHS(fd)], either t_1, t_2 ∈ cl′ or there exist t′_1, t′_2 ∈ cl′ obtained from tuples t_1, t_2 by replacing t_1[RHS(fd)] and t_2[RHS(fd)] with t_3[RHS(fd)], where t_3 ∈ cl.
A simplification allows replacement of values in tuples in the same cluster (in the attribute associated with the right-hand side of an FD).
Example 2. A cluster simplification of the cluster cl = {t_1, t_2, t_3} of Example 1 may be the cluster cl′ = {t′_1, t_2, t′_3}, where t′_1 and t′_3 are obtained from t_1 and t_3 by replacing t_1[Salary] = t_3[Salary] = 70K with the value t_2[Salary] = 80K.
This leads to the following kind of IMP.
Definition 7 (value-based family of policies). An IMP γ_fd for a relation R w.r.t. a functional dependency fd is said to be a value-based policy if each cluster cl ∈ clusters(R, {fd}) is replaced by a cluster simplification of cl in γ_fd(R).
Thus, a value-based IMP either leaves a cluster unchanged or reduces the number of distinct values for the attribute in the right-hand side of the functional dependency. A user may, for example, decide to use his knowledge that s_1 reflects more recent information than s_2 to reset the s_2 information to that provided by s_1. In this case, the relation returned by the value-based policy is:
        Name   Salary   Tax_bracket
  t_1   John   70K      15
  t_2   John   70K      20
  t_3   John   70K      25
  t_4   Mary   90K      30
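One concrete value-based policy of this kind might be sketched as follows. This is our own illustration; the `trust` ranking is an invented stand-in for the user's source preferences:

```python
# A value-based IMP: within each cluster, overwrite the right-hand-side
# attribute with the value supplied by the most trusted source. This is a
# cluster simplification: all tuples survive, only the RHS value changes.

trust = {"s1": 3, "s2": 2, "s3": 1}   # invented source preference (s1 wins)

def value_based_policy(cluster, rhs):
    """Return the cluster with rhs replaced by the value of the tuple
    coming from the most trusted source."""
    chosen = max(cluster, key=lambda t: trust[t["Source"]])[rhs]
    return [dict(t, **{rhs: chosen}) for t in cluster]

cluster = [
    {"Name": "John", "Salary": "70K", "Tax_bracket": 15, "Source": "s1"},
    {"Name": "John", "Salary": "80K", "Tax_bracket": 20, "Source": "s2"},
    {"Name": "John", "Salary": "70K", "Tax_bracket": 25, "Source": "s3"},
]
simplified = value_based_policy(cluster, "Salary")
# Every tuple now reports the s1 salary (70K); tax brackets are preserved.
```

Unlike the tuple-based policy above, no Tax_bracket value is lost: the cluster keeps all three tuples.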
We now show that value-based policies satisfy Axiom A3, by deriving the number of culprits in a cluster.
Theorem 1. Let R be a relation over the relational schema S(A_1, . . . , A_n) and fd : A_{t_1}, . . . , A_{t_k} → A_{t_{k+1}}, with {A_{t_1}, . . . , A_{t_{k+1}}} ⊆ Attr(S), an FD over S. For each cl ∈ clusters(R, {fd}), assume that the values t[A_{t_{k+1}}] of tuples t ∈ cl are the union of single-value multi-sets V_1, V_2, . . . , V_ℓ (where every multi-set V_i contains the single value v_i with cardinality C_i). Then:
1. |culprits(cl, {fd})| = Σ_{i<j} C_i · C_j

. . . for every ε > 0, there is a point p_1 ∈ a and a point p_2 ∉ a such that the distance between p and each of p_1, p_2 is less⁶

⁶ We are assuming there is one such rectangle; otherwise a more complex method is used.
morning, signature (Date, Date). Denotation: {(d_1, d_2) | d_1, d_2 ∈ U_Date ∧ GLB(d_1) ≤ d_2 ≤ (GLB(d_1) + LUB(d_1))/2}. Associated region: the entire first half of a day.

last month, signature (Date, Date). Denotation, for m = 1: {((m, d_0, y), z) | (m, d_0, y), z ∈ U_Date ∧ (∃i) s.t. (12, i, y−1) ∈ Date ∧ z ∈ (12, i, y−1)}; for m ≥ 2: {((m, d_0, y), z) | (m, d_0, y) ∈ U_Date, z ∈ U_Time ∧ (∃i) s.t. (m−1, i, y) ∈ Date ∧ z ∈ (m−1, i, y)}. Associated region: the denotation of the month immediately preceding m.

around, signature (Date, Real, Time Interval). Denotation: {((m, d_0, y), k, (z_s, z_e)) | (m, d_0, y) ∈ U_Date ∧ z_s, z_e ∈ U_Time ∧ k ∈ U_Real ∧ z_s = inf((m_s, d_s, y_s)) ∧ z_e = sup((m_e, d_e, y_e))}, where (m_s, d_s, y_s) and (m_e, d_e, y_e) refer to the days which are exactly k days before and after (m, d_0, y). Associated region: the time points which are within a few days of a given date.

shortly before, signature (Date, Real, Time Interval). Denotation: {((m, d_0, y), k, (z_s, z_e)) | (m, d_0, y) ∈ U_Date ∧ z_s, z_e ∈ U_Time ∧ k ∈ U_Real ∧ z_s = inf((m_s, d_s, y_s)) ∧ z_e = inf((m, d_0, y))}, where (m_s, d_s, y_s) refers to the day which is exactly k days before (m, d_0, y). Associated region: the period shortly before a given date.

shortly after, signature (Date, Real, Time Interval). Denotation: {((m, d_0, y), k, (z_s, z_e)) | (m, d_0, y) ∈ U_Date ∧ z_s, z_e ∈ U_Time ∧ k ∈ U_Real ∧ z_s = sup((m, d_0, y)) ∧ z_e = inf((m_e, d_e, y_e))}, where (m_e, d_e, y_e) refers to the day which is exactly k days after (m, d_0, y). Associated region: the period shortly after a given date.

Table 5.4: Denotations for selected linguistically modified temporal predicates
than ε. Using this definition, we see immediately that the point (1, 1) is on the boundary of the rectangle a′, but (1, 2) is not.
Now consider the predicate symbol nw, defining the northwest of a region (set of points). According to this definition, a point p is to the northwest of a region a w.r.t. cone-angle θ iff there exists a point p_0 in a such that p is in NWCone(θ, p_0). NWCone(θ, p_0)⁷ is defined to be the set of all points p′ obtained by (i) drawing a ray L_0 of slope −1 to the left of vertex p_0, (ii) drawing two rays with vertex p_0 at an angle of ±θ from L_0, and (iii) taking the points lying between the two rays in item (ii). Figure 5.2(a) shows this situation. Suppose a is the shaded region and θ = 20 (degrees). We see that p is to the northwest of this region according to the definition in Table 5.3.
5.4 Similarity Functions
We now propose similarity functions for many of the major sorts discussed in this chapter. We do not claim that these are the only definitions – many definitions are possible, often based on application needs. We merely provide a few in order to illustrate that reasonable definitions of this kind exist.

⁷ The other cones referenced in Table 5.3 can be similarly defined.
Figure 5.2: Example of (a) point p in the northwest of a region a; (b) application of sim_P1 and sim_P2
We assume the existence of an arbitrary but fixed denotation function for each sort. Given a sort s, a similarity function is a function sim_s : dom(s) × dom(s) → [0, 1], which assigns a degree of similarity to each pair of elements in dom(s). All similarity functions are required to satisfy two very basic axioms:

sim_s(a, a) = 1    (5.1)

sim_s(a, b) = sim_s(b, a)    (5.2)
Sort Point
Consider the sort Point, with denotation universe U_Point = R × R. Given two terms a and b of sort Point, we can define the similarity between a and b in any of the following ways.

sim_P1(a, b) = e^(−λ · d(a,b))    (5.3)

where d(a, b) is the distance in R × R between the denotations of a and b⁸, and λ is a factor that controls how fast the similarity decreases as the distance increases.

sim_P2(a, b) = 1 / (1 + λ · d(a, b))    (5.4)

where d() and λ have the same meaning as in Equation 5.3.

Example 40. Assuming that street addresses can be approximated as points, consider the points a = "8500 Main St." and b = "1100 River St." in Figure 5.2(b), with denotations (4, 8) and (9, 2) respectively. Assuming λ = 0.3, then d(a, b) = 7.81, sim_P1(a, b) = 0.096, and sim_P2(a, b) = 0.299.
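Equations 5.3 and 5.4 can be checked directly against the numbers of Example 40; in this sketch of ours, `lam` stands for the decay factor:

```python
# Point similarities of Equations 5.3 and 5.4, with the values of Example 40.

import math

def d(a, b):
    return math.dist(a, b)   # Euclidean distance between two points

def sim_p1(a, b, lam):
    return math.exp(-lam * d(a, b))     # Equation 5.3

def sim_p2(a, b, lam):
    return 1 / (1 + lam * d(a, b))      # Equation 5.4

a, b = (4, 8), (9, 2)   # "8500 Main St.", "1100 River St."
# d(a, b) ≈ 7.81; sim_p1 ≈ 0.096 and sim_p2 ≈ 0.299 for lam = 0.3
```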
Sort ConnectedPlace
Consider the sort ConnectedPlace, with denotation universe U_ConnectedPlace = {a ∈ U_Space | a is connected}. Given two terms a and b of sort ConnectedPlace, the similarity between a and b can be defined in any of the following ways.

sim_CP1(a, b) = e^(−λ · d(c(a),c(b)))    (5.5)

⁸ If elements in U_Point are pairs of latitude, longitude coordinates, then d() is the great-circle distance. We will assume that d() is the Euclidean distance, unless otherwise specified.
Figure 5.3: Example of the application of similarity functions for sort ConnectedPlace
where c(a), c(b) ∈ R × R are the centers of a and b respectively, d(c(a), c(b)) is the distance between them, and λ is a factor that controls how fast the similarity decreases as the distance between the centers of the two places increases. This similarity function works well when comparing geographic entities at the same level of granularity. When places can be approximated with points, it is equivalent to sim_P1(a, b).
sim_CP2(a, b) = 1 / (1 + λ · d(c(a), c(b)))    (5.6)

where c(), d() and λ have the same meaning as in Equation 5.5.
Example 41. Consider the two places a = "Lake District" and b = "School District" in Figure 5.3(a), and suppose their denotations are the two shaded rectangles in the figure. It is easy to observe that c(a) = (13, 7.5), c(b) = (9.5, 2.5), and d(c(a), c(b)) = 6.103. Hence, for λ = 0.3, sim_CP1(a, b) = 0.160 and sim_CP2(a, b) = 0.353.
Two other similarity functions can be defined in terms of the areas of the two regions.

sim_CP3(a, b) = A(a ∩ b) / A(a ∪ b)    (5.7)

where A(t) is a function that returns the area of t. Intuitively, this function uses the amount of overlap between the denotations of a and b as their similarity.
sim_CP4(a, b) = A(a ∩ b) / max_{t∈{a,b}} A(t)    (5.8)

where A(t) has the same meaning as in Equation 5.7.
Example 42. Consider the two connected places a = "Lake District" and b = "School District" in Figure 5.3(a), and their respective denotations. The intersection of the two denotations is the darker shaded region, whereas their union is the whole shaded region. It is straightforward to see that A(a) = 42, A(b) = 65, A(a ∩ b) = 6, and A(a ∪ b) = 101. Thus, sim_CP3(a, b) = 0.059 and sim_CP4(a, b) = 0.092.
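The area-based similarities of Equations 5.7 and 5.8 can be sketched for axis-aligned rectangles. The rectangles below are invented stand-ins chosen so that the areas match those of Example 42; they are not the actual shapes of Figure 5.3:

```python
# Overlap-based similarities (Equations 5.7 and 5.8) for axis-aligned
# rectangles represented as (x1, y1, x2, y2).

def area(r):
    x1, y1, x2, y2 = r
    return max(0, x2 - x1) * max(0, y2 - y1)

def intersection(r, s):
    # Intersection of two axis-aligned rectangles (may be empty/degenerate).
    return (max(r[0], s[0]), max(r[1], s[1]), min(r[2], s[2]), min(r[3], s[3]))

def sim_cp3(r, s):
    inter = area(intersection(r, s))
    union = area(r) + area(s) - inter    # inclusion-exclusion
    return inter / union if union else 1.0

def sim_cp4(r, s):
    return area(intersection(r, s)) / max(area(r), area(s))

a = (0, 0, 6, 7)    # area 42
b = (4, 4, 9, 17)   # area 65; intersection with a has area 6, union 101
```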
In order to better illustrate the great expressive power of our framework, we now consider a more complex scenario, where the terms being compared are linguistically modified terms. We show how the similarity of such terms depends on the specific denotations assumed by the user for each predicate symbol.
Example 43. Consider the two linguistically modified terms of sort ConnectedPlace a = "In the center of Weigfield" and b = "Northeast of Oak St. Starbucks", where Weigfield is the fictional city depicted in Figure 5.3. Assuming the denotations of center and ne shown in Table 5.3, we now compute the similarity between a and b for different values of α and θ. Figure 5.3(b) shows denotations of a for values of α of 0.2, 0.4, 0.6, and 0.8, and denotations of b for values of θ of 15°, 30°, and 45°. In order to simplify the similarity computation, we make the following assumptions (without loss of generality): (i) the term "Oak St. Starbucks" can be interpreted as a term of sort Point; (ii) the denotation of "Oak St. Starbucks" coincides with the geometrical center (8, 5.5) of the bounding box of "Weigfield"; (iii) the cones do not extend indefinitely, but rather within a fixed radius
          α = 0.2   α = 0.4   α = 0.6   α = 0.8
θ = 15°   0.0132    0.0276    0.0346    0.0380
θ = 30°   0.0157    0.0413    0.0593    0.0699
θ = 45°   0.0167    0.0494    0.0777    0.0970

Table 5.5: Value of sim_CP3(a, b) for different values of α and θ
(8 units in this example) from their vertex. Table 5.5 reports the value of sim_CP3(a, b) for different values of α and θ. The highest similarity corresponds to the case where α = 0.8 and θ = 45°, which maximizes the overlap between the two regions. Intuitively, this result tells us that a user with a very restrictive interpretation of center and ne (i.e., α ≪ 1 and θ ≪ 90° respectively) will consider a and b less similar than a user with a more relaxed interpretation of the same predicates.
Another similarity function can be defined in terms of the Hausdorff distance [Mun74].

sim_CP5(a, b) = e^(−λ · H(a,b))    (5.9)

where H(P, Q) = max(h(P, Q), h(Q, P)), with P, Q ∈ ℘(R × R), is the Hausdorff distance, and h(P, Q) = max_{p∈P} min_{q∈Q} d(p, q) is the distance between the point p ∈ P that is farthest from any point in Q and the point q ∈ Q that is closest to p. Intuitively, the Hausdorff distance is a measure of the mismatch between P and Q; if the Hausdorff distance is d, then every point of P is within distance d of some point of Q and vice versa.

Example 44. Consider again the two connected places a = "Lake District" and b = "School District" in Figure 5.3, and their respective denotations. In this example, the Hausdorff distance between a and b can be interpreted as the distance between the two points A and B shown in the figure. Therefore, H(a, b) = 8.062 and sim_CP5(a, b) = 0.089 for λ = 0.3. Exchanging the roles of a and b would lead to a shorter value of the distance, whereas H() selects the maximum.
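The Hausdorff distance of Equation 5.9 admits a direct implementation over finite point sets; this is a sketch of ours, and the two point sets below are invented:

```python
# Discrete Hausdorff distance and the similarity of Equation 5.9.

import math

def h(P, Q):
    # Directed distance: the point of P farthest from its nearest point in Q.
    return max(min(math.dist(p, q) for q in Q) for p in P)

def hausdorff(P, Q):
    # H(P, Q) = max(h(P, Q), h(Q, P)) -- symmetric by construction.
    return max(h(P, Q), h(Q, P))

def sim_cp5(P, Q, lam):
    return math.exp(-lam * hausdorff(P, Q))

P = [(0, 0), (0, 1), (1, 0), (1, 1)]   # corners of a unit square
Q = [(3, 0), (4, 0)]                   # a short segment to the east
# h(P, Q) = dist((0,1),(3,0)) = sqrt(10); h(Q, P) = dist((4,0),(1,0)) = 3
```

Note how the two directed distances differ, exactly as observed at the end of Example 44; H() keeps the larger one.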
sim_CP6(a, b) = e^(−λ · d(c(a),c(b))) · e^(−β · (1−o(a,b)))    (5.10)

where c(), d() and λ have the same meaning as in Equation 5.5, o(a, b) = A(a ∩ b) / A(a ∪ b) is the amount of overlap between a and b, and β is a factor that controls how fast the similarity decreases as the amount of overlap between the two places decreases⁹.

Example 45. Consider again the two connected places in Figure 5.3, and their respective denotations. In this example, sim_CP6(a, b) = 0.056 for λ = 0.3 and β = 0.5.
The similarity function sim′_CP1 below considers two places equivalent when their denotations are included in one another. We can define sim′_CP2, . . . , sim′_CP6 in a similar way by modifying sim_CP2, . . . , sim_CP6 analogously.

sim′_CP1(a, b) = 1 if a ⊆ b ∨ b ⊆ a, and sim_CP1(a, b) otherwise    (5.11)
Sort Space
Consider the sort Space, with denotation universe U_Space = ℘(R × R), where ℘(R × R) is the power set of R × R. Given a term a of sort Space, let P(a) denote a subset of U_ConnectedPlace such that ⋃_{x∈P(a)} x = a, and the elements in P(a) are pairwise disjoint and maximal, i.e., there is no y ∈ U_ConnectedPlace with x_1, x_2 ∈ P(a) s.t. y = x_1 ∪ x_2. Intuitively, P(a) is the set of the denotations of all the connected components of a. Given two terms a and

⁹ Alternatively, one could specify o(a, b) = A(a ∩ b) / max_{t∈{a,b}} A(t).
b of sort Space, the distance between a and b may be defined in many ways – two are shown below.

d_Sc(a, b) = avg_{a_i∈P(a), b_j∈P(b)} d(c(a_i), c(b_j))    (5.12)

where c() and d() have the same meaning as in Equation 5.5.

d_Sh(a, b) = avg_{a_i∈P(a), b_j∈P(b)} H(a_i, b_j)    (5.13)

where H() is the Hausdorff distance.

Intuitively, d_Sc and d_Sh measure the average distance between any two connected components of the two spaces being compared. Alternatively, the avg operator could be replaced by either min or max. As in the case of sort ConnectedPlace, a similarity function over sort Space can be defined in any of the following ways.

sim_S1(a, b) = e^(−λ · d_Sc(a,b))    (5.14)

sim_S2(a, b) = 1 / (1 + λ · d_Sc(a, b))    (5.15)

where d_Sc is the distance defined by Equation 5.12 and λ is a factor that controls how fast the similarity decreases as the distance increases.
Example 46. Consider the terms a = "City buildings" and b = "Schools" of sort Space in Figure 5.4, with denotations a = {a_1, a_2} and b = {b_1, b_2} respectively. By computing and averaging the distances between the centers of all pairs (a_i, b_j) ∈ P(a) × P(b) (see dashed lines in the figure), we obtain d_Sc(a, b) = 7.325, sim_S1(a, b) = 0.111, and sim_S2(a, b) = 0.313 for λ = 0.3.

Figure 5.4: Example of application of similarity functions for sort Space
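Equations 5.12 and 5.14 can be sketched as follows, under the simplifying assumption that the connected components are axis-aligned rectangles; the components below are invented, not those of Figure 5.4:

```python
# Equation 5.12: average the center-to-center distances over all pairs of
# connected components, then plug the result into Equation 5.14.

import math

def center(r):
    # Center of an axis-aligned rectangle (x1, y1, x2, y2).
    x1, y1, x2, y2 = r
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def d_sc(A, B):
    dists = [math.dist(center(a), center(b)) for a in A for b in B]
    return sum(dists) / len(dists)

def sim_s1(A, B, lam):
    return math.exp(-lam * d_sc(A, B))

A = [(0, 0, 2, 2), (10, 0, 12, 2)]   # two components, centers (1,1) and (11,1)
B = [(0, 4, 2, 6)]                   # one component, center (1,5)
# pair distances: 4 and sqrt(116); d_sc is their average
```

Replacing the average with `min` or `max` gives the alternative distances mentioned above.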
sim_S3(a, b) = A(a ∩ b) / A(a ∪ b)    (5.16)

sim_S4(a, b) = A(a ∩ b) / max_{t∈{a,b}} A(t)    (5.17)

where A(t) is a function that returns the area of t.

sim_S5(a, b) = e^(−λ · d_Sh(a,b))    (5.18)

where d_Sh is the distance defined by Equation 5.13 and λ is a factor that controls how fast the similarity decreases as the distance increases.
sim_S6(a, b) = e^(−λ · d_Sc(a,b)) · e^(−β · (1−o(a,b)))    (5.19)

where d_Sc is the distance defined by Equation 5.12, λ has the usual meaning, o(a, b) = A(a ∩ b) / A(a ∪ b) is the amount of overlap between a and b, and β is a factor that controls how fast the similarity decreases as the overlap between the two places decreases.

Example 47. Consider again the two terms of sort Space in Figure 5.4. It is straightforward to see that A(a) = 30, A(b) = 32.5, A(a ∩ b) = 4.5, and A(a ∪ b) = 58. Therefore, sim_S3(a, b) = 0.078, sim_S4(a, b) = 0.138, and sim_S6(a, b) = 0.044, for λ = 0.3 and β = 1.
Sort Time Interval

Consider the sort Time Interval, with denotation universe 𝒰_TimeInterval = {I ∈ ℘(Z) |
I is connected}.^10 Given two terms a and b of sort Time Interval, the similarity between
a and b can be defined in any of the following ways.

    sim_1^TI(a, b) = e^(−α·|c(a)−c(b)|)    (5.20)

where, for each time interval t ∈ dom(Time Interval), c(t) = avg_{z∈t} z is the center
of t, and α is a factor that controls how fast the similarity decreases as the distance
between the centers of the two time intervals increases.
    sim_2^TI(a, b) = 1 / (1 + α·|c(a) − c(b)|)    (5.21)

where c(·) and α have the same meaning as in Equation 5.20.
Example 48. Consider the two terms of sort Time Interval a = “around May 13, 2009”
and b = “shortly before May 16, 2009”, and assume that the denotation of around is a
time interval extending 4 days before and after the indeterminate date, and the denotation
of shortly before is the time interval extending 2 days before the indeterminate
date. Then, a is the time interval [05/09/09, 05/17/09] and b is the time interval
[05/14/09, 05/16/09]. Assuming a time granularity of days, we have c(a) = 05/13/09
and c(b) = 05/15/09.^11 Therefore, assuming α = 0.3, we conclude that sim_1^TI(a, b) =
0.549 and sim_2^TI(a, b) = 0.625.

^10 Each t ∈ Z encodes a point in time, i.e. the number of time units elapsed since the origin of the time
scale adopted by the user.
    sim_3^TI(a, b) = |a ∩ b| / |a ∪ b|    (5.22)

Intuitively, sim_3^TI is the ratio of the number of time units in the intersection of the denotations
of a and b to the number of time units in the union.

    sim_4^TI(a, b) = |a ∩ b| / max_{t∈{a,b}} |t|    (5.23)

    sim_5^TI(a, b) = e^(−α·H(a,b))    (5.24)

where H(P, Q) = max(h(P, Q), h(Q, P)), with P, Q ∈ ℘(Z), is the Hausdorff distance.
    sim_6^TI(a, b) = e^(−α·|c(a)−c(b)|) · e^(−β·(1−o(a,b)))    (5.25)

where c(·) and α have the same meaning as in Equation 5.20, o(a, b) = |a ∩ b| / |a ∪ b| is
the amount of overlap between a and b, and β is a factor that controls how fast the
similarity decreases as the amount of overlap between the two time intervals decreases.

Example 49. Consider again the two terms of sort Time Interval of Example 48. We
observe that |a| = 9, |b| = 3, |a ∩ b| = 3, and |a ∪ b| = 9. Therefore,
sim_3^TI(a, b) = 0.333 and sim_4^TI(a, b) = 0.333. In addition, H(a, b) = 5, which
implies sim_5^TI(a, b) = 0.22 and sim_6^TI(a, b) = 0.469, when α = 0.045 and β = 1.

^11 Since we are assuming a time granularity of days, we are abusing notation and using 05/13/09 instead
of the corresponding value z ∈ Z.
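For finite sets of time points, the Hausdorff distance H used in Equation 5.24 can be computed directly from its definition. A sketch, encoding the intervals of Example 48 as day-of-month integers (an assumption made purely for illustration):

```python
def h(P, Q):
    # Directed Hausdorff distance between finite sets of integers:
    # the farthest any point of P is from its nearest point in Q.
    return max(min(abs(p - q) for q in Q) for p in P)

def hausdorff(P, Q):
    # H(P, Q) = max(h(P, Q), h(Q, P)), as in Equation 5.24
    return max(h(P, Q), h(Q, P))

# Intervals of Example 48, encoded as days (assumption for illustration):
a = set(range(9, 18))   # 05/09/09 .. 05/17/09
b = set(range(14, 17))  # 05/14/09 .. 05/16/09
print(hausdorff(a, b))  # 5, matching H(a, b) = 5 in Example 49
```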
Sort NumericInterval

Consider the sort NumericInterval, with denotation universe 𝒰_NumericInterval = {I ∈
℘(N) | I is connected}.^12 As in the case of the sort Time Interval, given two terms a
and b of sort NumericInterval, the similarity between a and b can be defined in any of
the following ways.

    sim_1^NI(a, b) = e^(−α·|c(a)−c(b)|)    (5.26)

where, for each numeric interval t ∈ dom(NumericInterval), c(t) = avg_{n∈t} n is
the center of t, and α is a factor that controls how fast the similarity decreases as the
distance between the centers of the two numeric intervals increases.

    sim_2^NI(a, b) = 1 / (1 + α·|c(a) − c(b)|)    (5.27)

where c(·) and α have the same meaning as in Equation 5.26.
Example 50. Consider the two terms of sort NumericInterval a = “between 10 and 20”
and b = “at least 16”, and assume that the denotations of between and at least are those
shown in Table 5.2, with parameter values 0.1 and 0.5 respectively. Then, a is the interval [9, 22]
and b is the interval [16, 24]. We have c(a) = 16 and c(b) = 20. Therefore, for
α = 0.3, sim_1^NI(a, b) = 0.301 and sim_2^NI(a, b) = 0.455.

^12 This seems to be a natural denotation for indeterminate expressions such as “between 3 and 6”, “more
than 3”, etc. An exact quantity can also be represented as a singleton.
    sim_3^NI(a, b) = |a ∩ b| / |a ∪ b|    (5.28)

    sim_4^NI(a, b) = |a ∩ b| / max_{t∈{a,b}} |t|    (5.29)

    sim_5^NI(a, b) = e^(−α·H(a,b))    (5.30)

where H(P, Q) is the Hausdorff distance.
    sim_6^NI(a, b) = e^(−α·|c(a)−c(b)|) · e^(−β·(1−o(a,b)))    (5.31)

where c(·) and α have the same meaning as in Equation 5.26, o(a, b) = |a ∩ b| / |a ∪ b| is the
amount of overlap between a and b, and β controls how fast the similarity decreases
as the amount of overlap between the two numeric intervals decreases.
Example 51. Consider again the two terms of sort NumericInterval of Example 50. We
observe that |a| = 14, |b| = 9, |a ∩ b| = 7, and |a ∪ b| = 16. Therefore,
sim_3^NI(a, b) = 0.438 and sim_4^NI(a, b) = 0.5. Moreover, H(a, b) = 7, which implies
sim_5^NI(a, b) = 0.122 and sim_6^NI(a, b) = 0.447, when α = 0.045 and β = 1.
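Since numeric intervals denote finite sets of integers, the set-based similarities of Equations 5.28 and 5.29 reduce to set operations. A sketch reproducing the cardinalities of Example 51:

```python
def interval(lo, hi):
    # Integer interval [lo, hi] represented as a set of points
    return set(range(lo, hi + 1))

a = interval(9, 22)   # "between 10 and 20", widened as in Example 50
b = interval(16, 24)  # "at least 16", widened as in Example 50

inter, union = a & b, a | b
sim3 = len(inter) / len(union)           # Equation 5.28
sim4 = len(inter) / max(len(a), len(b))  # Equation 5.29

print(len(a), len(b), len(inter), len(union))  # 14 9 7 16
print(round(sim3, 3), sim4)                    # 0.438 0.5
```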
5.5 PLINI Probabilistic Logic Programs

In this section, we define the concepts of a PLINI-rule and a PLINI-program. Informally
speaking, a PLINI-rule states that when certain similarity-based conditions associated
with two events e_1, e_2 are true, then the two events are equivalent with some probability.
Thus, PLINI-rules can be used to determine when two event descriptions refer to
Event name   Property            Value
Event1       date                02/28/2005
             location            Hillah
             number of victims   125
             weapon              car bomb
Event2       location            Hilla, south of Baghdad
             number of victims   at Least 114
             victim              people
             weapon              massive car bomb
Event3       killer              twin suicide attack
             location            town of Hilla
             number of victims   at least 90
             victim              Shia pilgrims
Event4       date                02/28/2005
             weapon              suicide car bomb
             location            Hilla
             number of victims   125
Event5       killer              suicide car bomber
             location            Hillah
             number of victims   at least 100
Event6       location            Hillah
             number of victims   125
             victim              Iraqis
             weapon              suicide bomb
Event7       weapon              suicide bombs
             location            Hilla south of Baghdad
             number of victims   at least 27
Event8       date                2005/02/28
             location            Hilla
             number of victims   between 136 and 135
             victim              people queuing to obtain medical identification cards
             weapon              suicide car bomb
Event9       date                2005/03/28
             location            Between Hillah and Karbala
             number of victims   between 6 and 7
             victim              Shiite pilgrims
             weapon              Suicide car bomb

Table 5.6: Example of Event Database extracted from news sources
the same event, and when two event descriptions refer to different events. PLINI-rules are
variants of annotated logic programs [KS92] augmented with methods to handle similarity
between events, as well as similarities between properties of events. Table 5.6 shows
a small event database that was automatically extracted from news data by the T-REX
system [AS07]. We see here that an event can be represented as a set of (property, value)
pairs. Throughout this chapter, we assume the existence of some set ℰ of event names.

Definition 45. An event pair over sort s is a pair (p, v) where p is a property of sort s and
v ∈ dom(s). An event is a pair (e, EP) where e ∈ ℰ is an event name and EP is a finite
set of event pairs such that each event pair ep ∈ EP is over some sort s ∈ 𝒮.
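Definition 45 maps directly onto a pair of an event name and a finite set of (property, value) pairs. A minimal Python sketch (the PLINI prototype itself is in Java; the plain-string value encoding here is an illustration only), using Event1 from Table 5.6:

```python
# An event is a pair (event name, finite set of (property, value) pairs),
# mirroring Definition 45. Values would normally be terms of a given sort.
event1 = ("Event1", {
    ("date", "02/28/2005"),
    ("location", "Hillah"),
    ("number of victims", "125"),
    ("weapon", "car bomb"),
})

name, pairs = event1
props = {p for (p, _) in pairs}
print(name, sorted(props))
```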
We assume that a set 𝒜 of properties is given and use the notation eventname.property
to refer to the property of an event. We start by defining event-terms.

Definition 46 (Event-Term). Suppose ℰ is a finite set of event names and 𝒱 is a possibly
infinite set of variable symbols. An event-term is any member of ℰ ∪ 𝒱.

Example 52. Consider the event e_S3 presented in Section 5.2. Both e_S3 and v, where v is
a variable symbol, are event-terms.
We now define the concept of an equivalence atom. Intuitively, an equivalence atom
says that two events (or properties of events) are equivalent.

Definition 47 (Equivalence Atom). An equivalence atom is an expression of the form

• e_i ≈ e_j, where e_i and e_j are event-terms, or

• e_i.a_k ≈ e_j.a_l, where e_i, e_j are event-terms, a_k, a_l ∈ 𝒜, and a_k, a_l are both of sort
s ∈ 𝒮, or

• e_i.a_k ≈ b, where e_i is an event-term, a_k ∈ 𝒜 is an attribute whose associated sort
is s ∈ 𝒮, and b is a ground term of sort s.
Example 53. Let us return to the case of the events e_S1, e_S2, e_S3 from Table 5.1. Some
example equivalence atoms include:

    e_S1 ≈ e_S2
    e_S1.place ≈ e_S3.place
    e_S3.place ≈ Ahmedabad
Note that two events need not be exactly identical in order for them to be considered
equivalent. For instance, consider the events e_S1, e_S2, e_S3 given in Section 5.2. It is clear
that we want these three events to be considered equivalent, even though their associated
event pairs are somewhat different. In order to achieve this, we first need to state what
it means for terms over various sorts to be equivalent. This is done via the notion of a
PLINI-atom.

Definition 48 (PLINI-atom). If A is an equivalence atom and μ ∈ [0, 1], then A : μ is a
PLINI-atom.
The intuitive meaning of a PLINI-atom can best be illustrated via an example.

Example 54. The PLINI-atom (e_1.weapon ≈ e_2.weapon) : 0.683 says that the weapons
associated with events e_1 and e_2 are similar with a degree of at least 0.683. Likewise, the
PLINI-atom (e_1.date ≈ e_2.date) : 0.575 says that the dates associated with events e_1 and
e_2 are similar with a degree of at least 0.575.
When providing a semantics for PLINI, we will use the notion of similarity functions
for sorts as defined in Section 5.4. There we gave specific examples of similarity functions
for the numeric, spatial, and temporal domains. Our theory will be defined in terms of any
arbitrary but fixed set of such similarity functions. The heart of our method for identifying
inconsistency in news reports is the notion of PLINI-rules, which intuitively specify when
certain equivalence atoms are true.
Definition 49 (PLINI-rule). Suppose A is an equivalence atom, A_1 : μ_1, …, A_n : μ_n are
PLINI-atoms, and p ∈ [0, 1]. Then

    A ←[p] A_1 : μ_1 ∧ … ∧ A_n : μ_n

is a PLINI-rule. If n = 0 then the rule is called a PLINI-fact. A is called the head of the
rule, while A_1 : μ_1 ∧ … ∧ A_n : μ_n is called the body. A PLINI-rule is ground iff it
contains no variables.
Definition 50 (PLINI-program). A PLINI-program is a finite set of PLINI-rules in which no
rule may appear more than once with different probabilities.

Note that a PLINI-program is somewhat different in syntax from a probabilistic logic
program [NS92], as no probability intervals are involved. However, it is a variant of a generalized
annotated program due to [KS92]. In classical logic programming [Llo87], there
is a general assumption that logic programs are written by human (logic) programmers.
However, PLINI-programs can also be inferred automatically from
training data. For instance, we learned rules (semi-automatically) to recognize when certain
violent events were equivalent to other violent events in the event database generated
by the information extraction program T-REX [AS07] mentioned earlier. To do this, we
first collected a set of 110 events (“annotation corpus”) extracted by T-REX from news
events and then manually classified which of the resulting pairs of events from the annotation
corpus were equivalent. We then used two classical machine learning programs
called JRIP and J48 from the well known WEKA library^13 to learn PLINI-rules automatically
from the data. Figure 5.5 shows some of the rules we learned automatically using
JRIP.
We briefly explain the first two rules shown in Figure 5.5 that JRIP extracted automatically
from the T-REX annotated corpus. The first rule says that when the similarity
between the date fields of events e_1, e_2 is at least 95.5997%, and when the similarity between
the number of victims fields of e_1, e_2 is 100%, and the similarity between their
location fields is also 100%, then the probability that e_1 and e_2 are equivalent is 100%.

^13 http://www.cs.waikato.ac.nz/ml/weka/
e_1 ≈ e_2 ←[1.0] e_1.date ≈ e_2.date : 0.955997 ∧
               e_1.number_of_victims ≈ e_2.number_of_victims : 1 ∧
               e_1.location ≈ e_2.location : 1.

e_1 ≈ e_2 ←[0.75] e_1.date ≈ e_2.date : 1 ∧ e_1.killer ≈ e_2.killer : 0.574707.

e_1 ≈ e_2 ←[0.5833] e_1.date ≈ e_2.date : 1 ∧
               e_1.weapon ≈ e_2.weapon : 0.634663 ∧
               e_1.location ≈ e_2.location : 1.

Figure 5.5: Some automatically learned PLINI-rules from T-REX data using JRIP
The second rule says that when the dates of events e_1, e_2 are 100% similar, and the killer
fields are at least 57.4707% similar, then the probability that e_1 and e_2 are equivalent is 0.75.

We see from this example that PLINI-programs weave together notions of similarity
from different domains (within the annotations of equivalence atoms in the rule body)
and the notion of probability attached to a rule. We now recall the standard concept of a
substitution.
Definition 51 (Substitution). Suppose R is a PLINI-rule. A substitution θ = [X_1/e_1, …,
X_n/e_n] for R is a finite set of pairs of terms where each e_i is an event-term and X_i ≠ X_j
when i ≠ j.

A ground instance of R under θ is the result of simultaneously replacing all variables
X_i in R with the event-term e_i, where X_i/e_i ∈ θ.
5.6 Model Theory and Fixpoint Theory

In this section, we specify a formal model theory for PLINI-programs by leveraging
the semantics of generalized annotated programs [KS92]. For each sort s ∈ 𝒮, we assume
the existence of a similarity function sim_s : dom(s) × dom(s) → [0, 1]. Intuitively,
sim_s(v_1, v_2) returns 0 if domain values v_1, v_2 are completely different and returns 1 if
the two values are considered to be the same. We have already provided many possible
definitions for similarity functions in Section 5.4. We first need to define the Herbrand
Base.
Definition 52 (Herbrand Base). B_ℰ is the set of all ground equivalence atoms that can be
formed from the event-terms, attributes, and constant symbols associated with ℰ.

Clearly, B_ℰ is finite. We now define the concept of an interpretation.

Definition 53 (Interpretation). Any function I : B_ℰ → [0, 1] is called an interpretation.

Thus, an interpretation just assigns a number in [0, 1] to each ground equivalence
atom. We now define satisfaction of PLINI-rules by interpretations.
Definition 54 (Satisfaction). Let I be an interpretation, and let A, A_1, …, A_n ∈ B_ℰ. Then:

• I ⊨ A : μ iff I(A) ≥ μ.

• I ⊨ A_1 : μ_1 ∧ … ∧ A_n : μ_n iff I ⊨ A_i : μ_i for all 1 ≤ i ≤ n.

• I ⊨ A ←[p] A_1 : μ_1 ∧ … ∧ A_n : μ_n iff either I ⊭ A_1 : μ_1 ∧ … ∧ A_n : μ_n or
I(A) ≥ p.

I satisfies a non-ground rule iff it satisfies all ground instances of the rule. I satisfies a
PLINI-program Π iff it satisfies all PLINI-rules in Π.
The first part of the above definition says that for A : μ to be true w.r.t. an interpretation
I, we should just check that I(A) is greater than or equal to μ. Satisfaction of
conjunctions is defined in the obvious way. Satisfaction of a ground PLINI-rule is defined
in a more complex way: either the body of the rule must be false with respect to I, or I
must assign a value of at least p to the head. As usual, A : μ is a logical consequence of Π
iff every interpretation that satisfies Π also satisfies A : μ.
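For ground programs, the satisfaction conditions of Definition 54 can be checked mechanically. A sketch in which an interpretation is a dict from ground equivalence atoms (encoded as strings) to values in [0, 1], and a ground rule is a triple (head, p, body); all atom names here are hypothetical:

```python
def satisfies_rule(I, rule):
    # Definition 54: I |= (head <-[p]- body) iff the body fails under I,
    # or I assigns at least p to the head.
    head, p, body = rule
    body_holds = all(I.get(atom, 0.0) >= mu for (atom, mu) in body)
    return (not body_holds) or I.get(head, 0.0) >= p

# Hypothetical ground rule in the style of Figure 5.5:
# e1 ~ e2 <-[0.75]- (e1.date ~ e2.date) : 1, (e1.killer ~ e2.killer) : 0.57
rule = ("e1~e2", 0.75,
        [("e1.date~e2.date", 1.0), ("e1.killer~e2.killer", 0.57)])

I1 = {"e1.date~e2.date": 1.0, "e1.killer~e2.killer": 0.6, "e1~e2": 0.8}
I2 = {"e1.date~e2.date": 1.0, "e1.killer~e2.killer": 0.6, "e1~e2": 0.5}
I3 = {"e1.date~e2.date": 0.9}  # body fails, so the rule holds vacuously

print(satisfies_rule(I1, rule), satisfies_rule(I2, rule), satisfies_rule(I3, rule))
# True False True
```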
Definition 55. Suppose Π is a PLINI-program and we have a fixed set of similarity functions
sim_s for each sort s. The augmentation of Π with similarity information is the
PLINI-program Π_sim = Π ∪ { (x ≈ y) ←[sim_s(x,y)] | x, y are ground terms of sort s in Π }.

Throughout the rest of this chapter, we only consider the augmented program Π_sim.
We are interested in characterizing the set of ground equivalence atoms that are logical
consequences of Π_sim. Given a PLINI-program Π, we are now ready to associate with Π
a fixpoint operator T_Π which maps interpretations to interpretations.
Definition 56. T_Π(I)(A) = sup{ p | A ←[p] A_1 : μ_1 ∧ … ∧ A_n : μ_n is a ground
instance of a rule in Π and, for all 1 ≤ i ≤ n, either I(A_i) ≥ μ_i or A_i has the form
x_i ≈ y_i and sim_s(x_i, y_i) ≥ μ_i }.
The above definition says that in order to find the truth value assigned to a ground
atom A by the interpretation T_Π(I), we first need to look at all rules in Π that have A as
the head of a ground instance. To check whether the body of such a rule is
true w.r.t. I, we need to look at each ground equivalence atom A_i : μ_i where A_i has the
form (x_i ≈ y_i). This atom is satisfied if either I(x_i ≈ y_i) is greater than or equal to μ_i,
or if the similarity function for sort s (the sort of x_i, y_i) assigns a value greater than or
equal to μ_i to (x_i, y_i). Note that the T_Π operator works on the Π_sim program without
explicitly adding equivalence atoms of the form (x ≈ y) ←[sim_s(x,y)] to Π, thus ensuring a
potentially large saving.
It is easy to see that the set of all interpretations forms a complete lattice under the
following ordering: I_1 ≤ I_2 iff for all ground equivalence atoms A, I_1(A) ≤ I_2(A). We
can define the powers of T_Π as follows.

    T_Π ↑ 0 (A) = 0 for all ground equivalence atoms A.
    T_Π ↑ (j+1) (A) = T_Π(T_Π ↑ j)(A).
    T_Π ↑ ω (A) = sup{ T_Π ↑ j (A) | j ≥ 0 }.

The result below follows directly from similar results for generalized annotated programs
[KS92] and shows that the T_Π operator has some nice properties.
Proposition 26. Suppose Π is a PLINI-program and {sim_s} is a family of similarity functions
for a given set of sorts. Then:

1. T_Π is monotonic, i.e. I_1 ≤ I_2 implies T_Π(I_1) ≤ T_Π(I_2).

2. T_Π has a least fixpoint, denoted lfp(T_Π), which coincides with T_Π ↑ ω.

3. I satisfies Π_sim iff T_Π(I) ≤ I.

4. A : μ is a logical consequence of Π_sim iff lfp(T_Π)(A) ≥ μ.
5.7 Event Clustering Algorithm

Suppose e_1, e_2 are any two reported events. The least fixpoint of the T_Π operator
gives the probability that the two events are equivalent (i.e., that they refer to the same
real-world event). Alternatively, lfp(T_Π)(e_1 ≈ e_2) can be interpreted as the similarity
between e_1 and e_2, meaning that if the similarity between two events is high, they are
likely to refer to the same real-world event. In other words, the least fixpoint of the T_Π
operator gives us some information on the pairwise similarity between events. However,
we may have a situation where the similarity according to lfp(T_Π) between events e_1
and e_2 is 0.9, and between e_2 and e_3 is 0.7, but the similarity between e_1 and e_3 is 0.5. In
general, given a finite set ℰ of events, we would like to look at the results computed by
lfp(T_Π) and cluster the events into buckets of equivalent events. We then need to find a
partition 𝒫 = {P_1, …, P_k} of ℰ, such that similar events are assigned to the same partition
component and dissimilar events are assigned to different components. In this section, we define
the PLINI-Cluster algorithm, which can find a sub-optimal solution – w.r.t. the score
function defined below – in polynomial time.
Definition 57 (Score of Event Partition). Let Π be a PLINI-program and τ ∈ [0, 1] a
threshold. Let ℰ be a set of events and 𝒫 = {P_1, …, P_k} a partition of ℰ, i.e. ∪_{i=1}^{k} P_i =
ℰ, and (∀ i ≠ j) P_i ∩ P_j = ∅. We define the score of partition 𝒫 as

    S(𝒫) = S_i(𝒫) + S_e(𝒫)

where S_i(𝒫) is the internal score of partition 𝒫, given by

    S_i(𝒫) = Σ_{P_j ∈ 𝒫} Σ_{e_r, e_s ∈ P_j, e_r ≠ e_s} ( lfp(T_Π)(e_r ≈ e_s) − τ )

and S_e(𝒫) is the external score of partition 𝒫, given by

    S_e(𝒫) = Σ_{e_r, e_s ∈ ℰ, e_r ∈ P_u, e_s ∈ P_v, u ≠ v} ( τ − lfp(T_Π)(e_r ≈ e_s) )
Intuitively, S_i(𝒫) measures the similarity between objects within a partition component.
Two events e_1, e_2 in the same cluster contribute positively to the internal score if
their similarity is above the threshold τ; if instead their similarity is below the threshold,
the partition is penalized. Analogously, S_e(𝒫) measures the similarity across multiple
partition components. The external score of a partition is higher when events in different
clusters have lower similarity: two events e_1, e_2 in different clusters contribute positively
to the external score if their similarity is below the threshold τ.
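Note that both sums in Definition 57 range over ordered pairs, so each unordered pair of events contributes twice. A sketch computing S for a given partition, using the pairwise lfp values that appear later in Table 5.7:

```python
from itertools import combinations

def score(partition, sim, tau):
    # Definition 57. The sums range over ordered pairs (e_r, e_s), so each
    # unordered pair is counted twice, hence the factor of 2 below.
    S_i = sum(2 * (sim[frozenset((r, s))] - tau)
              for block in partition
              for r, s in combinations(sorted(block), 2))
    events = [e for block in partition for e in block]
    S_e = 0.0
    for r, s in combinations(sorted(events), 2):
        if not any(r in block and s in block for block in partition):
            S_e += 2 * (tau - sim[frozenset((r, s))])
    return S_i + S_e

# Pairwise lfp(T_Pi)(e_i ~ e_j) values from Table 5.7
sim = {frozenset(p): v for p, v in {
    ("e1", "e2"): 0.2, ("e1", "e3"): 0.9, ("e1", "e4"): 0.3,
    ("e2", "e3"): 0.0, ("e2", "e4"): 0.8, ("e3", "e4"): 0.6,
}.items()}

print(round(score([{"e1", "e3"}, {"e2", "e4"}], sim, 0.5), 6))  # 3.2
```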
Finding the partition 𝒫 = {P_1, …, P_k} which maximizes the score S defined above
gives rise to a combinatorial optimization problem. Note that k is a problem parameter
and is not known a priori, in contrast to common clustering problems in machine learning
and other areas. Ozcan et al. [OS04] were the first to determine how to partition a set of
entities under this intuition – they did so in the context of partitioning a set of activities
an agent is supposed to perform, so that the activities in each component of the partition
could be executed by leveraging commonalities of tasks amongst those activities. They
proved that the partitioning problem was NP-hard and provided efficient heuristics to
solve the problem. Later, Bansal et al. [BBC04] showed that finding the optimal partition
with respect to the score S is NP-complete. Demaine and Immorlica [DI03] presented an
efficiently computable O(log n) approximation based on integer program relaxation and
proved that the bound is optimal for such relaxations. We now present a simple, greedy,
hill-climbing algorithm for this task (Algorithm 5.6). The algorithm starts by assigning
each event to a different cluster. Then, at each iteration, it merges the two clusters that
provide the highest increase of the score. It stops when no further increase of the score
can be achieved by merging two clusters.
Example 55. Suppose we have 4 events e_1, e_2, e_3, e_4, and assume τ = 0.5. Table 5.7
shows the degree of similarity for each pair of events according to some fixed PLINI-program
Π, i.e. the values of lfp(T_Π)(e_i ≈ e_j). The algorithm starts by initializing
P_1 = {e_1}, P_2 = {e_2}, P_3 = {e_3}, P_4 = {e_4}. In the first iteration, P_1 and P_3 are
Algorithm PLINI-Cluster
Input: PLINI-program Π; set of events ℰ; threshold τ
1    𝒫 ← ∅
2    For all e_i ∈ ℰ,
3        𝒫 ← 𝒫 ∪ {{e_i}}
4    Repeat
5        For all P_i, P_j ∈ 𝒫 s.t. i ≠ j,
         /* Compute the change in the score of the partition if we merge clusters P_i, P_j */
6            s_{i,j} ← S((𝒫 \ {P_i, P_j}) ∪ {P_i ∪ P_j}) − S(𝒫)
                     = 2 · Σ_{e_r ∈ P_i, e_s ∈ P_j} ( lfp(T_Π)(e_r ≈ e_s) − τ )    /* Note that s_{i,j} = s_{j,i} */
7        s_{u,v} ← max_{i,j ∈ [1,|𝒫|], i ≠ j} { s_{i,j} }
8        If s_{u,v} > 0 then
9            𝒫 ← (𝒫 \ {P_u, P_v}) ∪ {P_u ∪ P_v}
10   Until s_{u,v} ≤ 0

Figure 5.6: Algorithm PLINI-Cluster(Π, ℰ, τ)
        e_1    e_2    e_3    e_4
e_1     1      0.2    0.9    0.3
e_2            1      0      0.8
e_3                   1      0.6
e_4                          1

Table 5.7: Degrees of similarity as given by lfp(T_Π)(e_i ≈ e_j)
merged, because this leads to the largest increase in S(𝒫), with s_{1,3} = 0.8. In the second
iteration, P_2 and P_4 are merged for the same reason, with s_{2,4} = 0.6. In the third iteration,
the algorithm terminates because no score increase can be achieved, as merging the
remaining two clusters would decrease the score by 1.8. Hence, the resulting partition is
{{e_1, e_3}, {e_2, e_4}}.
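The greedy merge loop of Figure 5.6 can be sketched in a few lines; run on the similarity values of Table 5.7 with τ = 0.5, it reproduces the partition derived in Example 55:

```python
from itertools import combinations

def plini_cluster(events, sim, tau):
    # Greedy hill-climbing of Figure 5.6: start from singleton clusters and
    # repeatedly merge the pair of clusters with the largest positive gain.
    partition = [{e} for e in events]
    while True:
        best_gain, best_pair = 0.0, None
        for Pi, Pj in combinations(partition, 2):
            # s_{i,j} = 2 * sum over cross pairs of (lfp(e_r ~ e_s) - tau)
            gain = 2 * sum(sim[frozenset((r, s))] - tau
                           for r in Pi for s in Pj)
            if gain > best_gain:
                best_gain, best_pair = gain, (Pi, Pj)
        if best_pair is None:  # no merge increases the score: stop
            return partition
        Pi, Pj = best_pair
        partition.remove(Pi)
        partition.remove(Pj)
        partition.append(Pi | Pj)

sim = {frozenset(p): v for p, v in {
    ("e1", "e2"): 0.2, ("e1", "e3"): 0.9, ("e1", "e4"): 0.3,
    ("e2", "e3"): 0.0, ("e2", "e4"): 0.8, ("e3", "e4"): 0.6,
}.items()}

result = plini_cluster(["e1", "e2", "e3", "e4"], sim, 0.5)
print(sorted(sorted(b) for b in result))  # [['e1', 'e3'], ['e2', 'e4']]
```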
The following result shows that the above algorithm finds a locally optimal partition
w.r.t. the score function and that its worst case runtime complexity is O(n³).

Proposition 27. Suppose Π is a PLINI-program, ℰ a set of events, and τ a threshold.
Then PLINI-Cluster(Π, ℰ, τ) finds a locally optimal partition of ℰ w.r.t. the score
function S, and its worst case complexity is O(n³), where n is the total number of events.
Proof. Let n denote the total number of events. We assume that we have an oracle to look
up or compute lfp(T_Π) for any pair of events in constant time. During each iteration of
the algorithm, either two clusters are merged or the algorithm terminates. The algorithm
starts with a partition of size n, and since a merger reduces the partition size by one, there
can be at most n iterations. Now we turn to the complexity of each iteration, namely the
cost of computing the s_{i,j}'s. For any given iteration, we have at most O(n) clusters of
constant size and a constant number of clusters of size O(n), since the total number of
elements (i.e. events) is n. If |P_i| ∈ O(1) and |P_j| ∈ O(1), then the cost of computing
s_{i,j} is O(1) as well. Similarly, if |P_i| ∈ O(n) and |P_j| ∈ O(1), this cost is O(n), and if
|P_i| ∈ O(n) and |P_j| ∈ O(n) the cost is O(n²). Since we compute s_{i,j} for all pairs of
clusters, the overall complexity is

    O(n)·O(n)·O(1) + O(n)·O(1)·O(n) + O(1)·O(1)·O(n²) = O(n²)

where the first summand is the complexity for pairs of constant size partitions, the second
summand for pairs of linear with constant size partitions, and the last summand for pairs
of linear size partitions. Hence, the complexity of each iteration is O(n²), and therefore
the overall runtime complexity of the event clustering algorithm is in O(n³).
Note that due to the sparsity of event similarities in real world datasets, we can
effectively prune a large number of partition comparisons. We can prune the search
space for the optimal merger even further by considering highly associated partitions
first. These optimizations do not impact the worst case runtime complexity, but render
our algorithm very efficient in practice.
5.8 Implementation and Experiments

Our experimental prototype PLINI system was implemented in approximately 5700
lines of Java code. In order to test the accuracy of PLINI, we developed a training data set
and a separate evaluation data set. We randomly selected a set of 110 event descriptions
from the millions automatically extracted from news sources by T-REX [AS07]. We then
generated all the 5,995 possible pairs of events from this set and asked human reviewers
to judge the equivalence of each such pair. The ground truth provided by the reviewers
was used to learn PLINI-programs for different combinations of learning algorithms and
similarity functions. Specifically, we considered 588 different combinations of similarity
functions and learned the corresponding 588 PLINI-programs using both JRIP and J48.
The evaluation data set was similarly created by selecting 240 event descriptions from
those extracted by T-REX.

All experiments were run on a machine with multiple multi-core Intel Xeon E5345
processors at 2.33GHz and 8GB of memory, running the Scientific Linux distribution of the
GNU/Linux operating system. However, the current implementation has not been parallelized
and uses only one processor and one core at a time.
PLINI-programs corresponding to each combination of algorithms and similarity
functions were run on the entire set of 28,680 possible pairs of events in the test set.
However, evaluation was conducted on subsets of pairs of a manageable size for human
reviewers. Specifically, we selected 3 human evaluators and assigned each of them two
subsets of pairs to evaluate. The first subset was common to all 3 reviewers and included
50 pairs that at least one program judged equivalent with confidence greater than 0.6 (i.e.,
T_Π returned over 0.6 for these pairs) and 100 pairs that no program judged equivalent
with probability greater than 0.6. The second subset was different for each reviewer, and
included 150 pairs, selected in the same way as the first set. Thus, altogether we evaluated
a total of 600 distinct pairs.
We then computed precision and recall as defined below. Suppose ℰ_p is the set of
event pairs being evaluated. We use e_1 ≈_h e_2 to denote that events e_1 and e_2 were judged
to be equivalent by a human reviewer. We use P(e_1 ≈ e_2) to denote the probability
assigned by the algorithm to the equivalence atom e_1 ≈ e_2. Given a threshold value
τ ∈ [0, 1], we define the following sets.
• TT_1^τ = { (e_1, e_2) ∈ ℰ_p | P(e_1 ≈ e_2) ≥ τ ∧ e_1 ≈_h e_2 } is the set of pairs flagged as
equivalent (probability at least the threshold τ) by the algorithm and actually
judged equivalent by human reviewers;

• TT_0^τ = { (e_1, e_2) ∈ ℰ_p | P(e_1 ≈ e_2) < τ ∧ e_1 ≉_h e_2 } is the set of pairs flagged as not
equivalent by the algorithm and actually judged not equivalent by human reviewers;

• T_1^τ = { (e_1, e_2) ∈ ℰ_p | P(e_1 ≈ e_2) ≥ τ } is the set of pairs flagged as equivalent by
the algorithm;

• T_0^τ = { (e_1, e_2) ∈ ℰ_p | P(e_1 ≈ e_2) < τ } is the set of pairs flagged as not equivalent
by the algorithm.
Given a threshold value τ ∈ [0, 1], we define precision, recall, and F-measure as follows.

    P_1^τ = |TT_1^τ| / |T_1^τ|        P_0^τ = |TT_0^τ| / |T_0^τ|        P^τ = (|TT_1^τ| + |TT_0^τ|) / |ℰ_p|

    R_1^τ = |TT_1^τ| / |{ (e_1, e_2) ∈ ℰ_p | e_1 ≈_h e_2 }|^14        F^τ = 2·P_1^τ·R_1^τ / (P_1^τ + R_1^τ)

^14 Given the nature of the problem, most pairs of event descriptions are not equivalent. Therefore, the
best indicators of our system performance are recall/precision w.r.t. equivalent pairs.
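The definitions above translate directly into set arithmetic. A sketch on toy data (the pairs and probabilities below are hypothetical, for illustration only):

```python
def metrics(pred, truth, pairs, tau):
    # pred: dict pair -> probability assigned by the program, P(e1 ~ e2)
    # truth: set of pairs judged equivalent by human reviewers (~_h)
    flagged = {p for p in pairs if pred.get(p, 0.0) >= tau}   # T_1^tau
    not_flagged = pairs - flagged                             # T_0^tau
    tt1 = {p for p in flagged if p in truth}                  # TT_1^tau
    P1 = len(tt1) / len(flagged) if flagged else 0.0
    R1 = len(tt1) / len(truth) if truth else 0.0
    F = 2 * P1 * R1 / (P1 + R1) if P1 + R1 > 0 else 0.0
    return P1, R1, F

# Hypothetical toy data
pairs = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
truth = {("a", "b"), ("b", "c")}
pred = {("a", "b"): 0.9, ("a", "c"): 0.7, ("b", "c"): 0.4, ("c", "d"): 0.1}

P1, R1, F = metrics(pred, truth, pairs, tau=0.6)
print(P1, R1, round(F, 3))  # 0.5 0.5 0.5
```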
τ       P_1^τ    P_0^τ    P^τ      R_1^τ    F^τ
0.50    83.5%    88.6%    88.2%    36.9%    0.512
0.60    84.4%    88.5%    88.2%    36.5%    0.510
0.70    84.5%    88.5%    88.2%    36.4%    0.509
0.80    89.7%    87.9%    88.1%    32.5%    0.477
0.90    92.3%    87.3%    87.6%    28.2%    0.432
0.95    91.8%    86.5%    86.7%    22.7%    0.364

Table 5.8: Performance of J48 for different values of τ
τ       P_1^τ    P_0^τ    P^τ      R_1^τ    F^τ
0.50    78.6%    94.8%    92.3%    74.0%    0.762
0.60    82.9%    94.2%    92.6%    70.6%    0.763
0.70    86.8%    92.0%    91.4%    57.9%    0.695
0.80    91.6%    86.6%    86.8%    23.5%    0.374
0.90    90.7%    86.0%    86.1%    19.3%    0.318
0.95    96.3%    85.2%    85.4%    13.7%    0.240

Table 5.9: Performance of JRIP for different values of τ
Note that all the quality measures defined above are parameterized by the threshold τ.

Accuracy results. Tables 5.8 and 5.9 report the overall performance of the PLINI-programs
derived using J48 and JRIP, respectively, for different values of the threshold
τ. Performance measures are averaged over all 588 combinations of similarity metrics
and over all the reviewers. Figure 5.7(a) shows precision/recall curves for both algorithms,
while Figure 5.7(b) plots the F-measure vs. different values of τ.

As expected, when τ increases, P_1^τ increases and R_1^τ decreases. However, R_1^τ decreases
more rapidly than P_1^τ increases, causing F^τ to degrade. In general, rules
extracted with JRIP outperform those extracted using J48, but it is also clear that the
performance of JRIP in terms of F-measure degrades more rapidly than that of J48 due to a drastic
drop in recall at higher thresholds. In fact, for thresholds above 0.75, J48-derived
PLINI-programs surpass JRIP-derived PLINI-programs in terms of F-measure. Also note
Figure 5.7: (a) Recall/Precision and (b) F-measure for J48 and JRIP
that when τ increases, P_0^τ decreases. This is because higher thresholds cause a larger number
of equivalent pairs to be classified as non-equivalent; therefore the fraction of pairs
flagged as not equivalent which are actually not equivalent decreases. However, precision
is very high. When J48 and JRIP are used, the optimal value of the F-measure is
achieved for τ = 0.6, which corresponds to 84% precision and 37% recall for J48, and
83% precision and 71% recall for JRIP.

JMAX (which we defined) assigns to each equivalence atom the maximum of the
probabilities assigned by JRIP-derived PLINI-rules and J48-derived PLINI-rules. Figure 5.7(a)
shows that JMAX produces significantly higher recall at the expense of a slight decrease
in precision. The overall gain in performance is more evident in Figure 5.7(b), which
shows that the F-measure for JMAX is higher than that of both J48 and JRIP for virtually
all values of the threshold τ. The optimal value of F is achieved for τ = 0.6, which
corresponds to 80% precision and 75% recall.

Table 5.10 reports the average performance of PLINI-programs derived using the
3 algorithms – J48, JRIP, JMAX – when individually compared with the ground truth provided
by each of the 3 reviewers enrolled for the evaluation. There is no significant
Algorithm   Reviewer   P_1^τ    P_0^τ    P^τ      R_1^τ    F^τ
J48         1          87.6%    87.9%    87.9%    33.6%    0.486
            2          74.4%    89.9%    88.8%    36.1%    0.486
            3          90.7%    87.7%    87.9%    39.3%    0.548
JRIP        1          84.6%    93.9%    92.6%    68.8%    0.759
            2          80.2%    95.9%    93.7%    76.1%    0.781
            3          83.9%    92.9%    91.6%    67.8%    0.750
JMAX        1          82.5%    94.8%    92.9%    73.8%    0.779
            2          74.3%    96.5%    93.0%    80.2%    0.771
            3          82.9%    94.6%    92.6%    75.8%    0.792

Table 5.10: Average performance of J48, JRIP, and JMAX for τ = 0.6 when compared with different
reviewers
difference between the reviewers and, in fact, they unanimously agreed in 138 out of the
150 common cases (92%).

We found that, in general, using both J48-based PLINI-rules and JRIP-based PLINI-rules
(encompassed in our JMAX strategy) offers the best performance, while using only
J48-derived PLINI-rules is the worst.
5.9 Concluding Remarks
The number of “formal” news sources on the Internet is mushrooming rapidly.
Google News alone covers thousands of news sources from around the world. If one adds
consumer generated content and informal news channels run by individual amateur news-
men and women who publish blogs about local or global items of interest, the number of
news sources reaches staggering numbers. As shown in the Introduction, inconsistencies
can occur for many reasons.
The goal of this work is not to resolve these inconsistencies, but to identify when event data reported in news sources is inconsistent. When information extraction programs are used to automatically mine event data from news information, the properties of the extracted events are often linguistically qualified. In this chapter, we have studied three kinds of linguistic modifiers typically used when such programs are employed: linguistic modifiers applied to numbers, spatial information, and temporal information. In each case, we have given a formal semantics to a number of linguistically modified terms.
In order to determine whether two events described in one or more news sources are the same, we need to be able to compare the attributes of these two events. This is done via similarity measures. Though similarity measures for numbers are readily available, no formal similarity mechanisms exist (to the best of our knowledge) for linguistically modified numbers. The same situation occurs in the case of linguistically modified temporal information and linguistically modified geospatial information. We provide formal definitions of similarity for many commonly used linguistically modified numeric, temporal, and spatial terms.
We subsequently introduce PLINI-programs as a variant of the well-known generalized annotated program (GAP) [KS92] framework. PLINI-programs can be learned automatically from a relatively small annotated corpus (as we showed) using standard machine learning algorithms like J48 and JRIP from the WEKA library. We showed that the least fixpoint of an operator associated with a PLINI-program tells us the degree of similarity between two events. Once such a least fixpoint has been computed, we present the PLINI-Cluster algorithm, which groups together events that are similar and separates events that are dissimilar.

We have experimentally evaluated our PLINI framework using many different similarity functions (for different sorts), many different threshold values, and three alternative ways of automatically deriving PLINI-programs from a small training corpus. Our experiments show that the PLINI framework produced high precision and recall when compared with human users evaluating whether two reports talked about the same event or not.

There is much work to be done in the future. PLINI-programs do not include negation. A variant of the stable models semantics [GL98] can be defined for PLINI-programs that include negation. However, the challenge will be to derive such programs automatically from a training corpus (standard machine learning algorithms do not do this) and to apply them as efficiently as we can apply PLINI-programs.
Chapter 6
Partial Information Policies
The work presented in this chapter is taken from [MMGS11].
6.1 Introduction and Motivating Example
Partial information arises when information is unavailable to users of a database when they enter new data. All commercial real-world relational database systems implement some fixed way of managing incomplete information; but neither the RDBMS nor the user has any say in how the partial information is interpreted. But does the user of a stock database really expect an RDBMS designer to understand his risks and his mission in managing the incomplete information? Likewise, does an epidemiologist collecting data about some disease have confidence that an RDBMS designer understands how his data was collected, why some data is missing, and what the implications of that missing data are for his disease models and applications? The answer is usually no. Database researchers have long understood the diversity of types of missing data [AM86, Bis79, CGS97b, Cod79, Gra91, IL84b, LL98, Zan84]: a value may exist but be unknown, as when we know someone has a phone but do not know the number; a value may not exist because the field in question is inapplicable, as when someone does not have a spouse, leading to an inapplicable null in a relation's spouse field; or we may have no information about whether a value exists at all, as when we do not know if someone has a cell phone. Nevertheless, the SQL standard only supports one type of unmarked null value, so RDBMSs force users to handle all partial information in the same way, even when there are differences.
We have worked with two data sets containing extensive partial information. A data set about education from the World Bank and UNESCO contains data for each of 221 countries, with over 4000 attributes per country. As the data was collected manually, there are many incomplete entries. The incompleteness is due to many factors (e.g., conflict in a country making collection difficult during certain time frames).
Example 56. The relation below shows a very small number of the attributes associated with Rwanda, for which conflict in the 90s led to a lot of incomplete information. The relation only shows a few of the 4000+ attributes (GER and UER stand for gross and under-age enrollment ratio, respectively). Even in this relatively small relation, we see there are a lot of unknown values (here the U_i's denote unknown values).
                % of female                    Net ODA from non-DAC
Country  Year   unemployment  GER     UER     donors (current US$)
Rwanda   1995   U_1           U_11    U_15    -260000
Rwanda   1996   0.0           U_12    U_16    1330000
Rwanda   1997   U_2           U_13    U_17    530000
Rwanda   1998   U_3           U_14    U_18    130000
Rwanda   1999   U_4           99.37   U_19    90000
Rwanda   2000   9.4           103.55  U_20    170000
Rwanda   2001   U_5           104.06  4.59    130000
Rwanda   2002   U_6           107.77  4.76    140000
Rwanda   2003   U_7           116.5   4.62    110000
Rwanda   2004   U_8           128.05  4.42    90000
Rwanda   2005   U_9           139.21  4.81    120000
Rwanda   2006   U_10          149.88  U_21    450000
Users may want to fill in the missing values in many possible ways. For instance, User A may want to fill in the under-age enrollment ratio (UER) column via linear regression. User B may fill in missing values by choosing the interval [4.81, 16.77], which says the missing value is unknown but lies in this interval. User C may require that the only possible values are the under-age enrollment ratios appearing in the tuples of the relation. User D may want to learn this value by studying its relationship with the ODA from non-DAC donors and extrapolating; this would occur when the user believes the under-age enrollment ratio is correlated with the ODA column, in which case he learns that UER is a function of the latter and uses this function for extrapolation. User E may want to overestimate a missing UER by replacing it with the maximum UER for the same year from the other countries. User F may want to replace a missing under-age enrollment ratio by looking at the gross enrollment ratios of the other countries for the same year and taking the under-age enrollment ratio corresponding to the average gross enrollment ratio. Users may wish to specify many other policies based on their application, their mission, their attitude to risk (of being wrong), the expectations of their bosses and customers, and other factors.
There are many queries of interest that an education analyst may want to pose over the data above. He may be interested in the years during which the % of female unemployment was above a certain threshold and want to know what the gross and under-age enrollment ratios were in those years. He may want to know the countries with the highest average UER in the 90s. It is easy to see that such queries would yield poor results when evaluated on the original database, whereas higher quality answers are obtained if the missing values are populated according to the knowledge the user has of the data.
Useful computing systems must support users' desires. Though the database theory literature counts several works on null values (e.g., [AKG91, CGS97b, Gra91, IL84b, Zan84]), all of them provide a fixed "a priori" semantics for nulls, allowing none of the flexibility required by users A, B, C, D, E, and F above. Other works in the fields of data mining, data warehousing, and data management, such as [Qui93, MST94, Pyl99, BFOS84, Li09], have proposed fixed approaches for replacing nulls that work in specific domains and applications and do not allow modeling different kinds of partial information and different ways of resolving incompleteness. In contrast, we want users to be able to specify policies to manage their partial information and then have the RDBMS directly answer queries in accordance with their PIP.

The principal novelty in this chapter is that partial information policies (PIPs) allow end-users the flexibility to specify how they want to handle partial information, something the above frameworks do not do.
The main contributions of this chapter are the following.

1. We propose the general notion of a partial information policy for resolving various kinds of incompleteness and give several useful and intuitive instances of PIPs.

2. We propose index structures to support the efficient application of PIPs and show how to maintain them incrementally as the database is updated.

3. We study the interaction between relational algebra operators and PIPs. Specifically, we identify conditions under which applying a PIP before or after a relational algebra operator yields the same result; this can be exploited for optimization purposes.

4. We experimentally assess the effectiveness of the proposed index structures with a real-world airline data set. Specifically, we compare an algorithm exploiting our index structures with a naive one not relying on them and show that the former greatly outperforms the latter and is able to manage very large databases. Moreover, we experimentally evaluate the effect of the index structures when PIPs are combined with relational algebra operators and study whether applying a policy before or after a relational algebra operator, under the conditions which guarantee the same result, may lead to better performance. Finally, we carry out an experimental assessment of the quality of query answers with and without PIPs.
In classical RDBMS architectures, users specify an SQL query which is typically converted into a relational algebra query. A cost model and a set of query rewrite rules allow an RDBMS query optimizer to rewrite the query into a minimal cost query plan which is then executed. Standard SELECT A_1,...,A_k FROM R_1,...,R_m WHERE cond_1,...,cond_n queries can be expanded easily to specify PIPs as well. A possible syntax could be

    SELECT A_1,...,A_k FROM R_1,...,R_m
    WHERE cond_1,...,cond_n
    USING POLICY ρ [LAST|FIRST]

where ρ is one of a library of PIPs in the system. The keyword at the end of the clause determines the semantics of the policy application. Choosing FIRST yields a policy-first semantics, which first applies ρ to all relations in the FROM clause and then executes the SELECT...FROM...WHERE... part of the query on the modified relation instances. Choosing LAST yields a policy-last semantics, which first executes the SELECT...FROM...WHERE... query and then applies the PIP ρ to the result. We consider both of these options in this work.
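The difference between the two application orders can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the dissertation's implementation: a relation is a list of dicts, `None` stands in for a null, `fill_with_average` is a toy stand-in for a PIP, and `query` is a hypothetical WHERE filter.

```python
# Minimal sketch of policy-first vs. policy-last semantics (illustrative only).
# A relation is a list of dicts; None stands in for a null value.

def fill_with_average(rel, attr):
    """A toy PIP: replace nulls in `attr` with the average of the known values."""
    known = [t[attr] for t in rel if t[attr] is not None]
    avg = sum(known) / len(known) if known else None
    return [dict(t, **{attr: avg}) if t[attr] is None else dict(t) for t in rel]

def query(rel):
    """A toy SELECT ... WHERE UER > 4.5."""
    return [t for t in rel if t["UER"] is not None and t["UER"] > 4.5]

rel = [
    {"Year": 2001, "UER": 4.59},
    {"Year": 2002, "UER": None},
    {"Year": 2003, "UER": 4.61},
]

# Policy-first: apply the PIP to the base relations, then run the query.
first = query(fill_with_average(rel, "UER"))
# Policy-last: run the query, then apply the PIP to the result.
last = fill_with_average(query(rel), "UER")

print(len(first), len(last))
```

Here the two semantics give different answers: under policy-first the null is filled with 4.6 and then survives the WHERE condition (three rows), while under policy-last the null row is filtered out before the policy can fill it (two rows).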
The rest of the chapter is organized as follows. In Section 6.2, we define the syntax and semantics of databases containing the three types of null values mentioned before. Then, in Section 6.3, we introduce the notion of a partial information policy and show different families of PIPs. In Section 6.4, we propose index structures to efficiently apply PIPs. In Section 6.5, we study the interaction between PIPs and relational algebra operators. Section 6.6 reports experimental results.
6.2 Preliminaries
Syntax. We assume the existence of a set ℛ of relation symbols and a set Att of attribute symbols. Each relation symbol r has an associated relation schema r(A_1, ..., A_n), where the A_i's are attribute symbols. Each attribute A ∈ Att has a domain dom(A) containing a distinguished value ⊥, called the inapplicable null¹. In addition, there are two infinite disjoint sets (also disjoint from dom(A)), U(A) and N(A), of variables associated with A. Intuitively, U(A) is a set of variables denoting unknown nulls, while N(A) is a set of variables denoting no-information nulls. We require that U(A) ∩ U(B) = ∅ if dom(A) ≠ dom(B) and U(A) = U(B) if dom(A) = dom(B), for any A, B ∈ Att. The same assumptions are made for the N(A_i)'s. We define Dom = ∪_{A ∈ Att} dom(A), U = ∪_{A ∈ Att} U(A), and N = ∪_{A ∈ Att} N(A). For each A ∈ Att we define dom⁻(A) = dom(A) ∖ {⊥}.

¹ Note that we treat an inapplicable null as a value in dom(A) since it does not represent uncertain information.

Given a relation schema S = r(A_1, ..., A_n), a tuple over S is an element of (dom(A_1) ∪ U(A_1) ∪ N(A_1)) × ... × (dom(A_n) ∪ U(A_n) ∪ N(A_n)); a relation over S is a finite multiset of tuples over S. A complete tuple belongs to dom(A_1) × ... × dom(A_n), and a relation R is complete iff every tuple in R is complete. The restriction of a tuple t to a set X of attributes (or a single attribute) is denoted by t[X]. The set of attributes of a relation schema S is denoted by Att(S).

A database schema DS is a set of relation schemas {S_1, ..., S_m}; a database instance (or simply database) I over DS is a set of relations {R_1, ..., R_m}, where each R_i is a relation over S_i. The set of all possible databases over a database schema DS is denoted by db(DS). Multiple occurrences of the same null may occur in a database.

We consider the relational algebra operators π (projection), σ (selection), × (cartesian product), ⋈ (join), ∪ (union), ∩ (intersection), and − (difference). Note that since relations are multisets, the multiset semantics is adopted for the operators (see [UW02], Ch. 5.1).
Semantics. We now provide a semantics for the types of databases described thus far.

A valuation is a mapping v : U ∪ N → Dom such that U_i ∈ U(A) implies v(U_i) ∈ dom⁻(A) and N_j ∈ N(A) implies v(N_j) ∈ dom(A). A valuation v can be applied to a tuple t, relation R, and database I in the obvious way; the result is denoted by v(t), v(R), and v(I), respectively.

Thus, for each attribute A, the application of a valuation replaces each no-information null with a value in dom(A) (⊥ allowed) and each unknown null with a value in dom⁻(A) (⊥ not allowed), with multiple occurrences of the same null replaced by the same value. The result of applying a valuation is a complete database.

Definition 58. The set of completions of a database I is

comp(I) = { v(I) | v is a valuation }.
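The behavior of valuations can be illustrated with a small Python sketch. The encoding below is ours, not the chapter's: nulls are tagged variables, every occurrence of the same null gets the same value, and ⊥ is a legal image only for no-information nulls.

```python
# Illustrative sketch of nulls and valuations (the encoding is ours).
BOTTOM = "⊥"  # the inapplicable null, a member of every dom(A)

# Nulls are tagged variables; occurrences of the same name denote the same null.
U1 = ("unknown", "U1")
N1 = ("noinfo", "N1")

relation = [("John", 2008, BOTTOM), ("John", 2010, U1), ("Carl", 2010, N1)]

def apply_valuation(rel, v):
    """Replace every null occurrence by v[null]. ⊥ is a legal image only for
    no-information nulls (unknown nulls must map into dom(A) minus {⊥})."""
    for null, value in v.items():
        kind, _ = null
        assert not (kind == "unknown" and value == BOTTOM)
    return [tuple(v.get(x, x) for x in t) for t in rel]

complete = apply_valuation(relation, {U1: "60K", N1: BOTTOM})
print(complete)
```

Enumerating `apply_valuation(relation, v)` over all valuations v would yield comp(I); any single valuation yields one completion.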
6.3 Partial Information Policies
In this section we introduce partial information policies, which allow users to make assumptions about missing data in a database, taking into account their own knowledge of how the data was collected, their attitude to risk, and their mission needs.

Definition 59. Given a database schema DS, a partial information policy (PIP) is a mapping ρ : db(DS) → 2^db(DS) s.t. ρ(I) is a non-empty subset of comp(I) for every I ∈ db(DS).

Thus, a PIP maps a database to a subset of its completions that we call preferred completions.
Example 57. The completions of the relation in Example 56 are the complete databases obtained by replacing every unknown value with an actual value. Each user has expressed preferences on which completions are of interest to him. The completions chosen as preferred by user A are those where each unknown under-age enrollment ratio is replaced with a value determined by linear regression; for user B the preferred completions are those where unknown under-age enrollment ratios are replaced with values in the range [4.81, 16.77]; and so forth for the other users.
Note that the preferred completions chosen by users A, D, E, F (but not B and C) can be represented with the data model of Section 6.2, that is, ∀I ∈ db(DS) ∃I' ∈ db(DS) (comp(I') = ρ(I)). This is so because the policies expressed by users A, D, E, F determine a single actual value for each null value, whereas the policies expressed by users B and C give a set of possible actual values for each null value. The important advantage of this property is that the result of applying a policy can be represented as a database in the same data model as the original database (i.e., the data model of Section 6.2), whereas policies that do not satisfy the property need more expressive and complex data models (e.g., c-tables [Gra91, IL84b]). We now present some families of PIPs which enjoy the aforementioned property (the next section defines index structures which allow us to efficiently apply policies to large datasets). In addition, these policies can be used as building blocks to define much more complex policies.

Henceforth, we assume that I is a database and R ∈ I is a relation over schema S; A, B ∈ Att(S) and X, Y, Z ⊆ Att(S), with A, B, and the attributes in Y having numeric domains; µ, ν, and λ are aggregate operators in {MIN, MAX, AVG, MEDIAN, MODE}.
Given a tuple t ∈ R, we define (i) the relation V(t, X, Z) = { t' | t' ∈ R ∧ t'[X] = t[X] ∧ ∀A_i ∈ Z (t'[A_i] ∈ dom(A_i)) }, that is, the multiset of tuples in R having the same X-value as t and a Z-value consisting of values in Dom, and (ii) the relation V⁻(t, X, Z) = { t' | t' ∈ R ∧ t'[X] = t[X] ∧ ∀A_i ∈ Z (t'[A_i] ∈ dom⁻(A_i)) }, that is, the multiset of tuples in R having the same X-value as t and a Z-value consisting of values in Dom ∖ {⊥}.
Family of Aggregate Policies. ρ_agg(µ, ν, A, X) is defined as follows. If t ∈ R and t[A] ∈ U(A), then V = V⁻(t, X, {A}); else if t[A] ∈ N(A), then V = V(t, X, {A}). If µ{ t'[A] | t' ∈ V } exists, then we say that it is a candidate value for t[A]. Let I' be the database obtained from I by replacing every occurrence of a null ξ ∈ N ∪ U appearing in π_A(R) with ν{v_1, ..., v_n} (if the latter exists), where the v_i's are the candidate values for ξ. The preferred completions of this policy are the completions of I'. Note that for each selection of µ, ν, A, X, this single definition defines a different PIP, all belonging to the family of aggregate policies.
Example 58. For the purpose of illustrating the roles of the different parameters of PIPs, consider the simple relation below.

Country  Year  GER     UER
Mali     1996  94.67   3.84
Mali     1997  94.83   U_1
Mali     1998  95.72   4.36
Rwanda   1996  98.84   4.67
Rwanda   1997  103.76  5.38
Rwanda   1998  105.24  U_1
Senegal  1997  93.14   4.52
Senegal  1998  95.72   4.87
Sudan    1997  102.83  5.03
Sudan    1998  103.76  5.12

Suppose we want to apply the policy ρ_agg(AVG, MAX, UER, {Country}). This policy looks at missing values under attribute UER (third parameter of the policy). When the first occurrence of U_1 is considered, a candidate value is computed as follows. Since the last parameter of the policy is Country, only tuples for Mali are considered and their average (first parameter) UER is a candidate value, i.e., 4.1. Likewise, when the second occurrence of U_1 is considered, the average UER for Rwanda, i.e. 5.025, is another candidate value. Eventually, the two occurrences of U_1 are replaced by 5.025, i.e. the maximum candidate value (as specified by the second parameter of the policy). If the relation above belongs to a database I, then every occurrence of U_1 elsewhere in I is replaced by 5.025.
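The aggregate-policy computation of Example 58 can be sketched directly in Python. This is an illustrative reimplementation under simplifying assumptions (a single relation, unknown nulls only, represented as strings such as "U1"); it is not the dissertation's code.

```python
# Sketch of the aggregate policy rho_agg(AVG, MAX, UER, {Country}) of Example 58.
# Nulls are strings like "U1"; the same string denotes the same null everywhere.

rows = [
    ("Mali", 1996, 3.84), ("Mali", 1997, "U1"), ("Mali", 1998, 4.36),
    ("Rwanda", 1996, 4.67), ("Rwanda", 1997, 5.38), ("Rwanda", 1998, "U1"),
    ("Senegal", 1997, 4.52), ("Senegal", 1998, 4.87),
    ("Sudan", 1997, 5.03), ("Sudan", 1998, 5.12),
]

def is_null(v):
    return isinstance(v, str)

def rho_agg(rows, mu=lambda xs: sum(xs) / len(xs), nu=max):
    candidates = {}  # null -> list of candidate values
    for country, year, uer in rows:
        if is_null(uer):
            # V: tuples in the same cluster (same Country) with a known UER
            known = [u for c, y, u in rows if c == country and not is_null(u)]
            if known:
                candidates.setdefault(uer, []).append(mu(known))
    # Replace every occurrence of a null with nu of its candidate values.
    repl = {null: nu(vals) for null, vals in candidates.items()}
    return [(c, y, repl.get(u, u) if is_null(u) else u) for c, y, u in rows]

completed = rho_agg(rows)
print(completed[1], completed[5])
```

With µ = AVG and ν = MAX, both occurrences of "U1" receive the candidates 4.1 (Mali) and 5.025 (Rwanda) and are therefore both replaced by 5.025, matching the example.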
Family of Regression Oriented Policies. ρ_reg(ν, A, X, Y) is defined as follows. If t ∈ R and t[A] is a null ξ ∈ N ∪ U, then T = { ⟨t'[Y], t'[A]⟩ | t' ∈ V⁻(t, X, Y ∪ {A}) }. Let f be a model built from T via linear regression², if T ≠ ∅, where the values on Y are the independent variables and the values on A are the dependent variables. If t[Y] consists of values in Dom ∖ {⊥} only, then f(t[Y]) is a candidate value for ξ. Suppose I' is the database obtained from I by replacing every occurrence of a null ξ ∈ N ∪ U in π_A(R) with ν{v_1, ..., v_n} (if the latter exists), where the v_i's are the candidate values for ξ. The preferred completions returned by this policy are the completions of I'. Note that this definition defines a very large family of policies, one for each possible way of instantiating ν, A, X, Y.

² For the sake of simplicity we restrict ourselves to linear regression, but other policies using different regression methods may be defined analogously.

Example 59. Consider the relation of Example 58 and suppose we want to apply the policy ρ_reg(AVG, UER, {Country}, {Year}). This policy looks at missing values under attribute UER (second parameter of the policy). When the first occurrence of U_1 is considered, a candidate value is computed as follows. As Country is specified as the third parameter of the policy, only tuples for Mali are considered. A linear model is built from T = {⟨1996, 3.84⟩, ⟨1998, 4.36⟩}. The independent variable of T is Year (last parameter of the policy). The UER corresponding to 1997 given by the linear model is 4.1, which is a candidate value. Likewise, when the second occurrence of U_1 is considered, a linear model is built from T = {⟨1996, 4.67⟩, ⟨1997, 5.38⟩} and the candidate value 6.09 is determined. The two occurrences of U_1 are replaced by the average (first parameter of the policy) of the two candidate values, i.e. 5.095.
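The regression computation of Example 59 can be sketched as follows. This is a minimal illustration assuming a single independent attribute (Year) and unknown nulls represented as "U..." strings; the least-squares fit is standard, not taken from the dissertation.

```python
# Sketch of the regression policy rho_reg(AVG, UER, {Country}, {Year}) of Example 59.

def fit_line(points):
    """Ordinary least-squares fit for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return lambda x: a * (x - mx) + my

rows = [
    ("Mali", 1996, 3.84), ("Mali", 1997, "U1"), ("Mali", 1998, 4.36),
    ("Rwanda", 1996, 4.67), ("Rwanda", 1997, 5.38), ("Rwanda", 1998, "U1"),
]

candidates = {}
for country, year, uer in rows:
    if isinstance(uer, str):  # a null: fit a model on the known tuples of the cluster
        pts = [(y, u) for c, y, u in rows if c == country and not isinstance(u, str)]
        if pts:
            candidates.setdefault(uer, []).append(fit_line(pts)(year))

# nu = AVG: each null is replaced by the average of its candidate values.
replacement = {null: sum(v) / len(v) for null, v in candidates.items()}
print(round(replacement["U1"], 3))
```

The Mali model predicts 4.1 for 1997 and the Rwanda model predicts 6.09 for 1998, so "U1" is replaced by their average, 5.095, as in the example.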
Family of Policies Based on Another Attribute. The policy ρ_att(µ, ν, λ, A, B, X) is defined as follows. If t ∈ R and t[A] ∈ U(A), then V = V⁻(t, X, {A}); else if t[A] ∈ N(A), then V = V(t, X, {A}). If β = µ{ t'[B] | t' ∈ V } exists, then let δ = min{ |t'[B] − β| : t' ∈ V }³ and V' = { t' | t' ∈ V ∧ |t'[B] − β| = δ }; we say that ν{ t'[A] | t' ∈ V' } is a candidate value for t[A]. Suppose I' is the database obtained from I by replacing every occurrence of a null ξ ∈ N ∪ U appearing in π_A(R) with λ{v_1, ..., v_n} (if the latter exists), where the v_i's are the candidate values for ξ. The preferred completions returned by this policy are the completions of I'. This definition also defines a very large family of policies, one for each possible way of instantiating µ, ν, λ, A, B, X.

³ When at least one of x and y is a null, if x ≠ y, then |x − y| = ∞, else |x − y| = 0.

Example 60. Consider again the relation of Example 58 and suppose we want to apply the policy ρ_att(MIN, AVG, MAX, UER, GER, {Year}). This policy looks at missing values under attribute UER (fourth parameter of the policy). A candidate value for the first occurrence of U_1 is determined as follows. Tuples referring to 1997 are considered because the last parameter of the policy is Year. Then, the minimum GER for such tuples is found (this is specified by the first and fifth parameters), i.e. 93.14, and the corresponding UER is a candidate value, viz. 4.52. Consider now the second occurrence of U_1. Tuples referring to 1998 are considered and the minimum GER is found among those tuples, i.e. 95.72. However, there are two tuples having such a value, so there are two corresponding UERs, i.e. 4.36 and 4.87. The second parameter of the policy states that their average is a candidate, i.e. 4.615. Every occurrence of U_1 is replaced by the maximum candidate value, i.e. 4.615, as specified by the third parameter of the policy.
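The attribute-based computation of Example 60 can also be sketched in Python. As before this is an illustrative reimplementation under simplifying assumptions (one relation, unknown nulls as "U..." strings), not the dissertation's code.

```python
# Sketch of rho_att(MIN, AVG, MAX, UER, GER, {Year}) from Example 60.
# mu=MIN picks a reference GER, nu=AVG breaks ties among the closest tuples,
# and lambda=MAX aggregates the candidates for each null.

rows = [  # (Country, Year, GER, UER)
    ("Mali", 1997, 94.83, "U1"), ("Rwanda", 1997, 103.76, 5.38),
    ("Senegal", 1997, 93.14, 4.52), ("Sudan", 1997, 102.83, 5.03),
    ("Mali", 1998, 95.72, 4.36), ("Rwanda", 1998, 105.24, "U1"),
    ("Senegal", 1998, 95.72, 4.87), ("Sudan", 1998, 103.76, 5.12),
]

def is_null(v):
    return isinstance(v, str)

candidates = {}
for country, year, ger, uer in rows:
    if is_null(uer):
        V = [(g, u) for _, y, g, u in rows if y == year and not is_null(u)]
        if not V:
            continue
        beta = min(g for g, _ in V)                # mu = MIN over GER values
        delta = min(abs(g - beta) for g, _ in V)   # distance of the closest GER
        closest = [u for g, u in V if abs(g - beta) == delta]
        candidates.setdefault(uer, []).append(sum(closest) / len(closest))  # nu = AVG

replacement = {null: max(vals) for null, vals in candidates.items()}  # lambda = MAX
print(replacement["U1"])
```

For 1997 the closest tuple to the minimum GER yields the candidate 4.52; for 1998 two tuples tie at GER 95.72, so their UERs are averaged to 4.615; λ = MAX then replaces every occurrence of "U1" with 4.615, matching the example.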
When applying a policy to a database, one relation is used to determine how to replace nulls; once the replacements have been determined, they are applied to the whole database, so different occurrences of the same null are replaced with the same value. Given a database I over schema DS, a relation schema S ∈ DS, and a policy ρ, we use ρ_S(I) to specify that ρ is applied to I and the relation in I over schema S is used to determine the replacements. Once again, note that the preferred completions ρ_S(I) obtained by applying the above policies can be represented by a database, i.e., there exists a database I' s.t. comp(I') = ρ_S(I); with a slight abuse of notation we use ρ_S(I) to denote I'.⁴
Example 61. The policies expressed by Users A, D, E, F in Example 56 can be respectively formulated in the following way:

1. a regression policy ρ_reg(ν, UER, {Country}, {Year}),

2. a regression policy ρ_reg(ν, UER, {Country}, {NetODA}),

3. an aggregate policy ρ_agg(MAX, ν, UER, {Year}), and

4. a policy based on another attribute ρ_att(AVG, ν, λ, UER, GER, {Year}).

In the PIPs above, ν determines how multiple candidate values are aggregated and λ is used as shown in Example 60. Different users may want to apply different PIPs depending on what they believe is more suitable for their purposes; depending on the chosen PIP and the input database, they may get different results.
The above policies are not exhaustive: they are basic policies that can be combined to obtain more complex ones, e.g., different aggregate policies (on different attributes) can be defined over the same relation schema, or an aggregate policy can be combined with a regression oriented policy, and so forth. Furthermore, PIPs can be combined with relational algebra operators, allowing users to express even more complex ways of managing their incomplete data; we deal with relational algebra and PIPs in Section 6.5.

⁴ I' itself need not be complete, as nulls may remain in the database in attributes not affected by ρ.
6.4 Efficiently Applying PIPs
In this section, we present index structures to efficiently apply policies and show how they can be incrementally maintained when the database is updated (Section 6.6 presents experimental results showing the indexes' effectiveness).

Given a PIP ρ of the form ρ_agg(µ, ν, A, X), ρ_reg(ν, A, X, Y), or ρ_att(µ, ν, λ, A, B, X), we call A the incomplete attribute of ρ, denoted inc(ρ), whereas X is the set of selection attributes of ρ, denoted sel(ρ). Throughout the chapter we use arrow notation to refer to pointers; thus, given a tuple t, →t denotes a pointer to t; likewise, given a set c of tuples, →c denotes the set of pointers to the tuples in c. We start by introducing the notion of cluster in the following definition.

Definition 60. Given a relation R over schema S and a set of attributes Z ⊆ Att(S), a cluster of R w.r.t. Z is a maximal subrelation c of R s.t. ∀t, t' ∈ c, t[Z] = t'[Z]. We write cluster(R, Z) to denote the set of clusters of R w.r.t. Z; it is the quotient multiset obtained from the identity on Z between tuples in R.
Example 62. Consider the relation salary below (throughout this section we use this simple relation as it allows us to clearly illustrate the use of indexes and has all types of incompleteness).

     Name   Year  Salary
t_1  John   2008  ⊥
t_2  John   2009  60K
t_3  John   2010  U_1
t_4  Alice  2009  70K
t_5  Alice  2010  U_2
t_6  Bob    2009  60K
t_7  Bob    2010  70K
t_8  Carl   2010  N_1

There are four clusters w.r.t. {Name}, namely c_1 = {t_1, t_2, t_3}, c_2 = {t_4, t_5}, c_3 = {t_6, t_7}, and c_4 = {t_8}.
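Clusters w.r.t. Z are simply groups of tuples sharing the same Z-value, so they can be computed by a single grouping pass. The sketch below (our illustration of Definition 60, with tuples as dicts) reproduces the four clusters of Example 62.

```python
# Computing cluster(R, Z) by grouping on the Z-value (a sketch of Definition 60).
from collections import defaultdict

salary = [
    {"Name": "John", "Year": 2008, "Salary": "⊥"},
    {"Name": "John", "Year": 2009, "Salary": "60K"},
    {"Name": "John", "Year": 2010, "Salary": "U1"},
    {"Name": "Alice", "Year": 2009, "Salary": "70K"},
    {"Name": "Alice", "Year": 2010, "Salary": "U2"},
    {"Name": "Bob", "Year": 2009, "Salary": "60K"},
    {"Name": "Bob", "Year": 2010, "Salary": "70K"},
    {"Name": "Carl", "Year": 2010, "Salary": "N1"},
]

def clusters(rel, Z):
    groups = defaultdict(list)
    for t in rel:
        groups[tuple(t[a] for a in Z)].append(t)
    return list(groups.values())

cs = clusters(salary, ["Name"])
print([len(c) for c in cs])
```

With Z = {Name} this yields four clusters of sizes 3, 2, 2, and 1, corresponding to c_1 through c_4 above.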
The next example shows the idea behind our index structures.
Example 63. Suppose we want to apply the policy ρ_agg(AVG, ν, Salary, {Name}), where ν is any aggregate operator, to the relation salary of Example 62. To determine how to replace missing salaries, we need to retrieve every cluster in cluster(salary, {Name}) which (i) contains at least one tuple having a missing salary, i.e., a salary in U ∪ N (otherwise there is no need to apply the policy to that cluster), and (ii) contains at least one tuple having a non-missing salary (otherwise there is no data from which to infer missing salaries). Clusters satisfying these conditions yield possible candidates; other clusters do not play a role and so can be ignored. In Example 62, we need to retrieve only clusters c_1 and c_2.
To leverage this idea, we associate counters with each cluster to keep track of the number of tuples in the cluster containing standard constants, unknown nulls, no-information nulls, and inapplicable nulls on a specific attribute; the role of these counters will be made clear shortly. Let R be a relation over schema S, Z ⊆ Att(S), and B ∈ Att(S). Given a cluster c ∈ cluster(R, Z), we define

• C_v(c, B) = |{ t | t ∈ c ∧ t[B] ∈ dom⁻(B) }|,
• C_⊥(c, B) = |{ t | t ∈ c ∧ t[B] = ⊥ }|,
• C_U(c, B) = |{ t | t ∈ c ∧ t[B] ∈ U(B) }|,
• C_N(c, B) = |{ t | t ∈ c ∧ t[B] ∈ N(B) }|.
We now introduce the first data structure that will be used for the efficient application of PIPs.

Definition 61. Let R and ρ be a relation and a PIP, respectively. Moreover, let X = sel(ρ) and A = inc(ρ). A cluster table for R and ρ is defined as follows:

ct(R, ρ) = { ⟨t[X], →c, C_v(c, A), C_⊥(c, A), C_U(c, A), C_N(c, A)⟩ s.t. c ∈ cluster(R, X) ∧ t ∈ c }
Example 64. The cluster table T for the relation salary of Example 62 and the policy ρ_agg(AVG, ν, Salary, {Name}), where ν is an arbitrary aggregate, is:

     t[{Name}]  →c                  C_v  C_⊥  C_U  C_N
s_1  John       {→t_1, →t_2, →t_3}  1    1    1    0
s_2  Alice      {→t_4, →t_5}        1    0    1    0
s_3  Bob        {→t_6, →t_7}        2    0    0    0
s_4  Carl       {→t_8}              0    0    0    1

where C_v stands for C_v(c, Salary), C_⊥ stands for C_⊥(c, Salary), and so forth.
The counters associated with each cluster in a cluster table determine whether a policy needs the cluster to determine candidate values. For instance, the PIP in Example 64 has to look at those clusters having C_U(c, Salary) + C_N(c, Salary) > 0 (i.e., having some missing salaries) and C_v(c, Salary) > 0 (i.e., having some salaries to be exploited for inferring the missing ones). Different conditions determine whether a given PIP needs to consider a given cluster.
Definition 62. Suppose R is a relation, ρ is a PIP, and c is a cluster in cluster(R, sel(ρ)).

1) If ρ is an aggregate policy ρ_agg(µ, ν, A, X) (ν being any aggregate operator), then
• if µ ∈ {MAX, MIN, AVG, MEDIAN} and C_U(c, A) + C_N(c, A) > 0 ∧ C_v(c, A) > 0, then we say that c is relevant w.r.t. ρ;
• if µ = MODE and [(C_v(c, A) > 0 ∧ C_U(c, A) > 0) ∨ (C_N(c, A) > 0 ∧ C_v(c, A) + C_⊥(c, A) > 0)], then we say that c is relevant w.r.t. ρ.

2) If ρ is a regression oriented policy ρ_reg(ν, A, X, Y) (ν being any aggregate operator) and C_U(c, A) + C_N(c, A) > 0 ∧ C_v(c, A) > 0, then we say that c is relevant w.r.t. ρ.

3) If ρ is a policy based on another attribute ρ_att(µ, ν, λ, A, B, X) (µ, ν, and λ being any aggregate operators) and [(C_v(c, A) > 0 ∧ C_U(c, A) > 0) ∨ (C_N(c, A) > 0 ∧ C_v(c, A) + C_⊥(c, A) > 0)], then c is relevant w.r.t. ρ.

The counters associated with each cluster in a cluster table allow us to determine whether a cluster is relevant without scanning the entire cluster. Furthermore, as we will discuss later, when the database is modified (i.e., tuple insertions, deletions, or updates occur) these counters allow us to determine whether the "relevance" of a cluster changes, again without scanning it.
For a cluster table T we maintain an additional data structure that allows us to retrieve the tuples of T which refer to relevant clusters.

Definition 63. Let T = ct(R, ρ) be the cluster table for a relation R and a PIP ρ. Then,

Relevant(T, ρ) = { ⟨t[X], →s⟩ | s = ⟨t[X], →c, C_v, C_⊥, C_U, C_N⟩ ∈ T ∧ c is relevant w.r.t. ρ }
Example 65. Consider the cluster table T and the PIP of Example 64; Relevant(T, ρ) is as follows:

t[{Name}]  →s
John       →s_1
Alice      →s_2
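The counters and the relevance test can be sketched together in Python. This is our illustration of the cluster-table idea on the salary relation of Example 62, using an encoding of ours ("⊥" for inapplicable, "U*" for unknown, "N*" for no-information nulls) and the relevance condition of Definition 62 for µ ∈ {MAX, MIN, AVG, MEDIAN}.

```python
# Sketch of cluster-table counters and the relevance test of Definition 62
# (case mu in {MAX, MIN, AVG, MEDIAN}) over the salary relation of Example 62.

salary = [
    ("John", "⊥"), ("John", "60K"), ("John", "U1"),
    ("Alice", "70K"), ("Alice", "U2"),
    ("Bob", "60K"), ("Bob", "70K"),
    ("Carl", "N1"),
]

def kind(v):
    if v == "⊥":
        return "bot"
    if v.startswith("U"):
        return "unk"
    if v.startswith("N"):
        return "noinfo"
    return "val"

# Build one row of counters per cluster: (C_v, C_bot, C_U, C_N).
table = {}
for name, sal in salary:
    c = table.setdefault(name, {"val": 0, "bot": 0, "unk": 0, "noinfo": 0})
    c[kind(sal)] += 1

def relevant(c):
    # C_U + C_N > 0 and C_v > 0, checked on counters without scanning tuples.
    return (c["unk"] + c["noinfo"] > 0) and c["val"] > 0

print([n for n, c in table.items() if relevant(c)])
```

As in Example 65, only the John and Alice clusters are relevant: Bob's cluster has no missing salary, and Carl's has no known salary to infer from.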
The following proposition states the complexity of building a cluster table T = ct(R, ρ) and Relevant(T, ρ) for a relation R and a PIP ρ.

Proposition 28. Let R and ρ be a relation and a PIP, respectively. Independent of ρ, the worst-case time complexity of building T = ct(R, ρ) and Relevant(T, ρ) is O(|R| log |R|).

A cluster table T = ct(R, ρ) and Relevant(T, ρ) are maintained for each policy ρ. Note that policies having the same sel(ρ) and inc(ρ) can share the same cluster table T; moreover, if the criterion that determines whether a cluster is relevant is the same (see Definition 62), they can also share the same Relevant(T, ρ).
Recall that when a policy determines a value c which has to replace a null ⊥, then
every occurrence of ⊥ in the database has to be replaced with c. Thus, whenever a value
has been determined for a null, we need to retrieve all those tuples containing that null.
To this end, we maintain the data structure defined in the definition below. Given a null
⊥ ∈ 𝒩 ∪ 𝒰 and a database I, I_⊥ denotes the set of tuples in I containing ⊥.
Definition 64. Given a database I, we define
Null(I) = { ⟨⊥, →I_⊥⟩ | ⊥ ∈ 𝒩 ∪ 𝒰 ∧ →I_⊥ ≠ ∅ }
Clearly, Null(I) is shared by all the policies.
Proposition 29. Given a database I, the worst-case time complexity of building Null(I)
is O(|I| (log N_null + log |I_⊥max|)), where |I| is the number of tuples in I, N_null is the
number of distinct nulls in 𝒩 ∪ 𝒰 appearing in I, and I_⊥max is the I_⊥ with maximum
cardinality.
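A direct way to realize Null(I) is a map from each null to the set of pointers to the tuples containing it. The sketch below is illustrative only (tuple ids stand in for tuple pointers, and the function names are hypothetical):

```python
from collections import defaultdict

# Illustrative sketch of the Null(I) index: null label -> set of tuple ids.
null_index = defaultdict(set)

def register(tuple_id, nulls_in_tuple):
    # Called on insertion: record every null appearing in the tuple.
    for n in nulls_in_tuple:
        null_index[n].add(tuple_id)

def tuples_containing(null_label):
    # Retrieval without scanning the database; with a tree-based index the
    # lookup is logarithmic, with hashing it is expected O(1).
    return null_index.get(null_label, set())

register(1, ["U1"])
register(3, ["U1", "N2"])
print(sorted(tuples_containing("U1")))   # [1, 3]
```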
In the rest of this section we show how to update a cluster table T, Relevant(T, ℘)
and Null(I) when tuples are inserted, deleted, or updated. We also show how to apply a
PIP exploiting the data structures presented thus far and introduce further optimizations
that can be applied for specific policies.
6.4.1 Tuple insertions
Figure 6.1 reports an algorithm to update a cluster table T, Relevant(T, ℘) and
Null(I) after a tuple t is inserted. The algorithm first updates Null(I) (lines 1–3). After
that, if there already exists a cluster c for t, then →t is added to →c (line 5), the counters
associated with c are properly updated (line 6), and if c becomes a relevant cluster, then a
tuple for c is added to Relevant(T, ℘) (line 7). Note that in order to determine whether
a cluster is relevant or not, it suffices to check its associated counters instead of scanning
the entire cluster. If there does not exist a cluster for t, then a new tuple for it is added to
T (lines 8–11); in this case the cluster is certainly non-relevant.
Algorithm CT-insert
Input: A relation R ∈ I, a PIP ℘, cluster table T = ct(R, ℘),
       Relevant(T, ℘), Null(I), and a new tuple t
       (X = sel(℘) and A = inc(℘))
1  For each ⊥ ∈ 𝒩 ∪ 𝒰 appearing in t
2     If ∃⟨⊥, →I_⊥⟩ ∈ Null(I) then Add →t to →I_⊥
3     else Add ⟨⊥, {→t}⟩ to Null(I)
4  If ∃s = ⟨t[X], →c, C_v, C_⊥, C_U, C_N⟩ ∈ T then
5     Add →t to →c
6     Update one of C_v, C_⊥, C_U, C_N according to t[A]
7     If c has become relevant then Add ⟨t[X], →s⟩ to Relevant(T, ℘)
8  else If t[A] ∈ dom(A) then Add ⟨t[X], {→t}, 1, 0, 0, 0⟩ to T
9  else If t[A] = ⊥ then Add ⟨t[X], {→t}, 0, 1, 0, 0⟩ to T
10 else If t[A] ∈ 𝒰(A) then Add ⟨t[X], {→t}, 0, 0, 1, 0⟩ to T
11 else Add ⟨t[X], {→t}, 0, 0, 0, 1⟩ to T

Figure 6.1: Updating index structures after a tuple insertion.
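To make the bookkeeping concrete, here is a runnable sketch of the counter maintenance performed on insertion (illustrative Python with simplified structures; the relevance test is a stand-in for the policy-specific conditions, not the dissertation's exact criterion):

```python
# Sketch of CT-insert's counter maintenance. Plain dicts play the role of
# cluster-table tuples; 'kind' classifies t[A] into the four counter classes.

def ct_insert(table, relevant, key, tuple_id, kind):
    # kind is one of: 'value', 'bot', 'unknown', 'noinfo'
    e = table.get(key)
    if e is None:
        e = {'ptrs': set(), 'c_v': 0, 'c_bot': 0, 'c_u': 0, 'c_n': 0}
        table[key] = e
    e['ptrs'].add(tuple_id)
    counter = {'value': 'c_v', 'bot': 'c_bot',
               'unknown': 'c_u', 'noinfo': 'c_n'}[kind]
    e[counter] += 1
    if e['c_v'] > 0:          # stand-in relevance criterion
        relevant.add(key)

table, relevant = {}, set()
ct_insert(table, relevant, ('Bob',), 6, 'value')
ct_insert(table, relevant, ('Bob',), 8, 'unknown')
print(table[('Bob',)]['c_v'], table[('Bob',)]['c_u'], ('Bob',) in relevant)  # 1 1 True
```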
Example 66. Suppose we add tuple t = ⟨Bob, 2011, U_4⟩ to the relation salary of Example 62. First, a new tuple ⟨U_4, {→t}⟩ is added to Null({salary}). As there is already
a cluster for Bob, s_3 is retrieved from T (see the cluster table T in Example 64), →t is
added to the set of pointers of the cluster and C_U is incremented by one, i.e., s_3 becomes
⟨Bob, {→t_6, →t_7, →t}, 2, 0, 1, 0⟩. As the cluster is relevant w.r.t. the policy of Example 64,
⟨Bob, →s_3⟩ is added to Relevant(T, ℘).
The following two propositions state the correctness and the complexity of Algorithm CT-insert. With a slight abuse of notation, we use I ∪ {t} to denote the database
obtained by adding t to R, R being a relation of I.
Proposition 30. Let R be a relation of a database I, ℘ a PIP, and t a tuple. Algorithm
CT-insert computes T′ = ct(R ∪ {t}, ℘), Relevant(T′, ℘) and Null(I ∪ {t}).
Proposition 31. The worst-case time complexity of Algorithm CT-insert is
O(log |Null(I)| + log |I_⊥max| + log |T| + log |c_max|), where c_max is the cluster with maximum cardinality and I_⊥max is the set of tuple pointers in Null(I) with maximum cardinality.
6.4.2 Tuple deletions
Figure 6.2 presents an algorithm to update a cluster table T, Relevant(T, ℘) and
Null(I) when deleting a tuple t.
Algorithm CT-delete
Input: A relation R ∈ I, a PIP ℘, cluster table T = ct(R, ℘),
       Relevant(T, ℘), Null(I), and a tuple t
       (X = sel(℘) and A = inc(℘))
1  For each ⊥ ∈ 𝒩 ∪ 𝒰 appearing in t
2     Get ⟨⊥, →I_⊥⟩ from Null(I)
3     Delete →t from →I_⊥
4     If →I_⊥ = ∅ then Delete ⟨⊥, →I_⊥⟩ from Null(I)
5  Get s = ⟨t[X], →c, C_v, C_⊥, C_U, C_N⟩ from T
6  Delete →t from →c
7  Update one of C_v, C_⊥, C_U, C_N according to t[A]
8  If c has become non-relevant then
9     Delete ⟨t[X], →s⟩ from Relevant(T, ℘)
10 If →c = ∅ then Delete s from T

Figure 6.2: Updating index structures after a tuple deletion.
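The deletion path mirrors the insertion one; the illustrative sketch below (same simplified structures as before, same stand-in relevance test) decrements the counter for t[A], demotes the cluster when the test fails, and drops it when empty:

```python
# Sketch of CT-delete's counter maintenance (simplified, illustrative only).

def ct_delete(table, relevant, key, tuple_id, kind):
    e = table[key]
    e['ptrs'].discard(tuple_id)
    counter = {'value': 'c_v', 'bot': 'c_bot',
               'unknown': 'c_u', 'noinfo': 'c_n'}[kind]
    e[counter] -= 1
    if e['c_v'] == 0:         # stand-in relevance criterion no longer holds
        relevant.discard(key)
    if not e['ptrs']:         # the cluster became empty: drop its table entry
        del table[key]

table = {('Alice',): {'ptrs': {4, 5}, 'c_v': 1, 'c_bot': 0, 'c_u': 1, 'c_n': 0}}
relevant = {('Alice',)}
ct_delete(table, relevant, ('Alice',), 4, 'value')
print(table[('Alice',)]['c_v'], ('Alice',) in relevant)  # 0 False
```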
Example 67. Suppose we delete t_4 from the relation salary of Example 62. Then, no
changes are made to Null({salary}). s_2 is retrieved from T (see the cluster table T in
Example 64), →t_4 is deleted from the set of pointers of the cluster and C_v is decremented by
one, that is, s_2 becomes ⟨Alice, {→t_5}, 0, 0, 1, 0⟩. As the cluster is not relevant anymore,
⟨Alice, →s_2⟩ is deleted from Relevant(T, ℘).
The propositions below state the correctness and the complexity of Algorithm CT-delete. With a slight abuse of notation, we use I − {t} to denote the database obtained
by deleting t from R, R being a relation of I.
Proposition 32. Let R be a relation of a database I, ℘ a PIP, and t a tuple in R. Algorithm
CT-delete computes T′ = ct(R − {t}, ℘), Relevant(T′, ℘), and Null(I − {t}).
Proposition 33. The worst-case time complexity of Algorithm CT-delete is the same as
for Algorithm CT-insert.
6.4.3 Tuple updates
An algorithm for updating a cluster table T, Relevant(T, ℘) and Null(I) after a
tuple t is updated to t′ can be simply defined by first calling CT-delete with t as parameter
and then calling CT-insert with t′ as parameter. We call this algorithm CT-update. The
following two propositions state the correctness and the complexity of Algorithm CT-update. With a slight abuse of notation, we use I − {t} ∪ {t′} to denote the database
obtained by updating t ∈ R into t′, R being a relation of I.
Proposition 34. Let R be a relation of a database I, ℘ a PIP, t a tuple in R, and t′ a
tuple. Algorithm CT-update computes T′ = ct(R − {t} ∪ {t′}, ℘), Relevant(T′, ℘) and
Null(I − {t} ∪ {t′}).
Proposition 35. The worst-case time complexity of Algorithm CT-update is the same as
for Algorithm CT-insert.
Algorithm CT-update can be optimized when t and t′ belong to the same cluster; we
omit the optimized algorithm and illustrate the basic intuition below. Consider the relation salary of Example 62 and suppose we modify t_4 = ⟨Alice, 2009, 70K⟩ to t′_4 =
⟨Alice, 2009, 80K⟩. If we first execute Algorithm CT-delete with t_4, then its cluster
becomes irrelevant and the corresponding tuple is deleted from Relevant(T, ℘). When
we execute CT-insert with t′_4, Alice's cluster becomes relevant again and a tuple for it
is inserted into Relevant(T, ℘). As another example, consider t_8 = ⟨Carl, 2010, N_1⟩
and suppose Carl's salary is modified. By executing CT-delete and CT-insert, s_4 is first
deleted from T (see the cluster table in Example 64) and then it is added again.
Deleting from and inserting into Relevant(T, ℘) or T can be avoided if we first
check whether t and t′ belong to the same cluster; if so, we do not call CT-delete
and CT-insert, but directly update Relevant(T, ℘) and T according to t[A] and t′[A] (in
addition, Null(I) is updated according to the null values in t and t′).
6.4.4 Applying PIPs
Figure 6.3 shows the CT-ApplyPIP algorithm to apply a PIP on top of our data
structures. CT-ApplyPIP first retrieves the relevant clusters so that only a subrelation R′
of R has to be considered in order to determine how to replace null values (lines 1–4).
The policy ℘ then tries to determine a value for each null appearing in R′ on attribute
A (lines 5–6); this depends on the adopted policy and is accomplished as described in
Section 6.3. If a value v for a null ⊥ has been determined, then every occurrence of ⊥ in
the database is replaced with v (lines 7–10). It is worth noting that when a null is replaced
with a value, CT-update is executed.
Algorithm CT-ApplyPIP
Input: A relation R ∈ I, a PIP ℘, T = ct(R, ℘),
       Relevant(T, ℘) and Null(I)
       (X = sel(℘) and A = inc(℘))
1  R′ = ∅
2  For each ⟨t[X], →s⟩ ∈ Relevant(T, ℘)
3     Get s = ⟨t[X], →c, C_v, C_⊥, C_U, C_N⟩ from T
4     R′ = R′ ∪ c
5  For each ⊥ ∈ 𝒩 ∪ 𝒰 appearing in R′ on A
6     Determine a value v for ⊥ according to ℘
7     If v exists then
8        Get ⟨⊥, →I_⊥⟩ from Null(I)
9        For each →t ∈ →I_⊥
10          Replace every occurrence of ⊥ in t with v

Figure 6.3: Applying a PIP
The following two propositions state the correctness and the complexity of Algorithm CT-ApplyPIP.
Proposition 36. Let I be a database, R ∈ I a relation over schema S, and ℘ a PIP.
Algorithm CT-ApplyPIP correctly computes I′ = ℘^S(I), T′ = ct(R′, ℘), Relevant(T′, ℘)
and Null(I′), where R′ ∈ I′ is the relation over schema S.
Proposition 37. The worst-case time complexity of Algorithm CT-ApplyPIP is O(|R′| ·
(cost_℘(R′) + log |Null(I)| + |I_⊥max| · cost_CT-update)), where R′ is the union of the relevant
clusters of R (w.r.t. ℘), cost_℘ is the cost of determining a value for a null according to
policy ℘, I_⊥max is the set of tuple pointers in Null(I) with maximum cardinality, and
cost_CT-update is the cost of updating a tuple (see Proposition 35).
Basically, applying a policy consists of determining how to replace every null appearing for the incomplete attribute and then replacing every occurrence of it in the
database (note that the former step needs the clusters to be identified). When applying
a policy, the data structures introduced in this section have the following benefits. 1) The
relation from which candidate values are determined does not have to be scanned to identify its clusters, as they can be efficiently retrieved from the cluster table. 2) By looking
at Relevant(T, ℘), only those clusters from which candidate values can be determined
(i.e., the relevant clusters) are considered when applying a policy, thus avoiding looking
at those tuples in the relation that can be disregarded. 3) Tuples containing nulls that have
to be replaced can be retrieved from Null(I) without scanning the whole database. 4)
When the database is modified because of tuple insertions, deletions or updates, our data
structures can be efficiently updated.
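The overall flow can be sketched end to end for an AVG-style aggregate policy (illustrative Python only; the structures, names, and the flat representation of clusters and nulls are simplifications, not the dissertation's implementation):

```python
from collections import defaultdict

# Sketch of CT-ApplyPIP for an AVG-like policy: gather values from relevant
# clusters, compute one candidate per null, then rewrite every occurrence of
# that null via the Null(I)-style index.

def apply_avg_pip(rows, cluster_of, relevant, null_index):
    # rows: tuple_id -> value on A (ints, or null labels such as 'U1')
    # cluster_of: tuple_id -> cluster key; relevant: keys worth considering
    values = defaultdict(list)
    for tid, v in rows.items():
        if cluster_of[tid] in relevant and isinstance(v, int):
            values[cluster_of[tid]].append(v)
    for null, tids in null_index.items():
        key = cluster_of[next(iter(tids))]
        if values[key]:
            v = sum(values[key]) / len(values[key])
            for tid in tids:          # replace every occurrence of the null
                rows[tid] = v
    return rows

rows = {6: 50, 7: 70, 8: 'U1'}
out = apply_avg_pip(rows, {6: 'Bob', 7: 'Bob', 8: 'Bob'}, {'Bob'}, {'U1': {8}})
print(out[8])   # 60.0
```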
Experiments (cf. Section 6.6) show that these indexes yield significant performance
gains on large datasets.
6.4.5 Optimizations
The data structures and the algorithms presented above can be optimized as illustrated in the following example.
Example 68. Consider the relation salary from Example 62. Let γ be any aggregate operator; then, assuming a policy ℘_agg(AVG, γ, Salary, {Name}), the corresponding cluster
table T and Relevant(T, ℘) are reported in Examples 64 and 65, respectively. For each
tuple ⟨t[X], →c, C_v, C_⊥, C_U, C_N⟩ in T we might keep the average of the salaries (in general, the average of A-values, A being the incomplete attribute of the PIP) of the tuples
in c. Thus, when the policy is applied, the average salary of each cluster can be obtained
without scanning the entire cluster.
The average salary can be inexpensively computed when the cluster table is first
built. In addition, this value can be easily updated when the relation salary is updated.
For instance, when a new tuple t is inserted, if t[Salary] ∉ dom(Salary), then nothing
has to be done. If t[Salary] ∈ dom(Salary), then the new average salary is computed as
avg_new = (avg_old · C_v + t[Salary]) / (C_v + 1), where avg_old is the old average salary and C_v is the number
of salaries in t's cluster (before inserting t). Likewise, the average salary can be updated
when tuples are deleted or updated.
The optimization in the example above can be applied to other policies as well: the
basic idea is to associate each tuple in a cluster table with a pre-computed value (or a set
of pre-computed values) which turns out to be useful when determining candidate values;
such a value is then incrementally updated when the database is modified.
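The running-average maintenance of Example 68 can be written down directly (a minimal sketch; the deletion variant is the natural inverse update and assumes the cluster keeps at least one value):

```python
# Keep (avg, c_v) per cluster and update them in O(1), instead of rescanning.

def avg_insert(avg_old, c_v, salary):
    # avg_new = (avg_old * C_v + t[Salary]) / (C_v + 1)
    return (avg_old * c_v + salary) / (c_v + 1), c_v + 1

def avg_delete(avg_old, c_v, salary):
    # inverse update; assumes c_v > 1 (otherwise the average is undefined)
    return (avg_old * c_v - salary) / (c_v - 1), c_v - 1

avg, n = 60.0, 2            # cluster currently holds salaries 50K and 70K
avg, n = avg_insert(avg, n, 90.0)
print(avg, n)               # 70.0 3
```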
6.5 Relational Algebra and PIPs
In this section, we study when applying a policy before a relational algebra operator
gives the same result as applying it after. This can be exploited for query optimization
purposes. Note that PIPs cannot be expressed using the relational algebra, because PIPs
modify the database by replacing null values, whereas relational algebra operators cannot
modify the database. Throughout this section, a policy is either an aggregate policy,
a regression policy, or a policy based on another attribute. We adopt the SQL semantics [SQL03] for the evaluation of relational algebra operators.
We now define the database obtained after applying a relational algebra operator.
The database obtained by applying projection to a relation R is defined as the database
obtained by replacing R with its projection. More formally, consider a database I over
schema DS, and a relation R ∈ I over schema S ∈ DS. In addition, let Z ⊆ Att(S). We
define π^S_Z(I) = (I ∪ {π_Z(R)}) − {R}. Thus, the notation π^S_Z(I) means that the relation
R in I over schema S is replaced by π_Z(R). Likewise, the database obtained as the result
of performing the cartesian product of two relations R_1 and R_2 is defined as the database
obtained by replacing R_1 and R_2 with R_1 × R_2. Stated more formally, consider a database
I over schema DS, and two relations R_1, R_2 ∈ I over schemas S_1, S_2 ∈ DS. We define
×_{S_1,S_2}(I) = (I ∪ {R_1 × R_2}) − {R_1, R_2}. The result databases for the other relational
algebra operators are defined similarly and analogous notations will be used for them.
6.5.1 Projection and PIPs
We consider both the case where projection returns a set and the case where a
multiset is returned. For notational convenience, we use π^m_Z to denote the projection
operator which returns a multiset, whereas π_Z denotes the projection operator that returns
a set. In order for a PIP to make sense after projection, we assume that the attributes
on which the projection is performed include the attributes appearing in the policy. The
following proposition considers the projection operator that returns a set and provides
sufficient conditions under which applying a policy before or after projection gives the
same result.
Proposition 38. Suppose I is a database over schema DS, R ∈ I is a relation over
schema S ∈ DS, Z ⊆ Att(S), and ℘ is a policy. Moreover, let S′ denote the schema of
π_Z(R).
1. If ℘ is an aggregate policy ℘_agg(µ, γ, A, X), then
(a) if µ ∈ {MAX, MIN}, then ℘^{S′}(π^S_Z(I)) = π^S_Z(℘^S(I));
(b) if µ ∈ {AVG, MEDIAN, MODE}, then ℘^{S′}(π^S_Z(I)) = π^S_Z(℘^S(I)) if π_Z(C) = π^m_Z(C), where C = ⋃_{c ∈ cluster(R,X) ∧ c is relevant w.r.t. ℘} c.
2. If ℘ is a policy based on another attribute ℘_att(µ, β, γ, A, B, X), then
(a) if µ, β ∈ {MAX, MIN}, then ℘^{S′}(π^S_Z(I)) = π^S_Z(℘^S(I));
(b) otherwise ℘^{S′}(π^S_Z(I)) = π^S_Z(℘^S(I)) if π_Z(C) = π^m_Z(C), where C = ⋃_{c ∈ cluster(R,X) ∧ c is relevant w.r.t. ℘} c.
3. If ℘ is a regression oriented policy, then ℘^{S′}(π^S_Z(I)) = π^S_Z(℘^S(I)).
Thus, applying a PIP before or after projection does not always give the same result
when we consider aggregate policies or policies based on another attribute using one of
the operators AVG, MEDIAN, or MODE. The point is that projection loses duplicates;
while this makes no difference when MAX or MIN is used, such a loss may change the
result of the other aggregate operators. When a regression policy is applied, the loss
of duplicates does not change the set of data used to build the regression model. The
following example shows that the sufficient conditions stated above are not necessary
conditions.
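The duplicate-loss effect is easy to see concretely; the sketch below (plain Python, not tied to any DBMS) compares AVG over a multiset-projection and a set-projection of the same column:

```python
# Why AVG does not commute with set-projection: dropping duplicates changes
# the aggregate, while MAX/MIN are insensitive to multiplicities.

rows = [(2, 'b'), (2, 'b'), (5, 'b')]    # duplicate A-values in one cluster

multiset_A = [a for a, _ in rows]        # pi^m: keeps duplicates -> [2, 2, 5]
set_A = sorted(set(multiset_A))          # pi:   loses duplicates -> [2, 5]

avg = lambda xs: sum(xs) / len(xs)
print(avg(multiset_A), avg(set_A))       # 3.0 3.5
print(max(multiset_A) == max(set_A))     # True
```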
Example 69. Consider the database I consisting of the relation R below and let S denote
the schema of R.

A    B  C  D
U_1  b  1  d_1
2    b  2  d_2
2    b  2  d_3

Let ℘ be an aggregate policy ℘_agg(µ, γ, A, {B}) or a policy based on another attribute ℘_att(µ, β, γ, A, C, {B}), with µ ∈ {AVG, MEDIAN, MODE} and β, γ arbitrary
aggregate operators. Moreover, let S′ denote the schema of π_{ABC}(R). We have that
℘^{S′}(π^S_{ABC}(I)) = π^S_{ABC}(℘^S(I)) even though π_{ABC}(C) ≠ π^m_{ABC}(C); here C is defined as
in Proposition 38.
If the projection operator which returns a multiset instead of a set is used, then the
two orders in which a policy can be applied always give the same result.
Proposition 39. Suppose I is a database over schema DS, R ∈ I is a relation over
schema S ∈ DS, Z ⊆ Att(S), and ℘ is a policy. Moreover, let S′ denote the schema of
π^m_Z(R). If projection returns a multiset, then ℘^{S′}(π^S_Z(I)) = π^S_Z(℘^S(I)).
6.5.2 Selection and PIPs
Applying a PIP before or after selection yields different results even in very simple cases,
as shown in the example below. The intuitive reason is that the two orders give different
results when the selection, applied first, does not keep tuples which affect the application
of the policy.
Example 70. Consider the database I consisting of the relation R below and let S denote
the schema of R.

A    B  C
U_1  b  1
2    b  2

Consider ℘^S(I), where ℘ is one of the following policies:
• ℘_agg(µ, γ, A, {B}). This policy replaces U_1 with 2, for any aggregate operators µ and γ.
• ℘_reg(δ, A, {B}, {C}). By applying linear regression, U_1 is replaced by 1 for any aggregate operator δ.
• ℘_att(µ, β, γ, A, C, {B}). For any aggregate operators µ, β and γ, U_1 is replaced with 2.

For any of the policies above, ℘^S(σ^S_{C=1}(I)) ≠ σ^S_{C=1}(℘^S(I)). In each case, ℘^S(I) is obtained by replacing U_1 with a value which is determined using the second tuple in R (this
happens because the two tuples in R have the same B-value). Clearly, σ^S_{C=1}(℘^S(I)) returns the first tuple in R where U_1 has been replaced with an actual value. On the other
hand, when selection is applied first, the second tuple in R is deleted and then the subsequent application of a policy has no effect because there is no data to infer an actual
value for U_1. Thus, ℘^S(σ^S_{C=1}(I)) gives exactly the first tuple in R, leaving U_1 as is. Note
that neither of the two results contains the other.
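The policy-then-select versus select-then-policy asymmetry can be replayed on Example 70's relation with a simple standalone sketch (illustrative Python; an AVG-style policy grouping on B stands in for the policies above):

```python
# Policy-then-select vs select-then-policy on the relation of Example 70.

rows = [('U1', 'b', 1), (2, 'b', 2)]

def apply_policy(rs):
    # Replace each labelled null on A with the average of the actual A-values
    # sharing the same B-value; leave the null as-is when no such value exists.
    out = []
    for a, b, c in rs:
        if isinstance(a, str):
            vals = [x for x, y, _ in rs if y == b and isinstance(x, int)]
            a = sum(vals) / len(vals) if vals else a
        out.append((a, b, c))
    return out

select = lambda rs: [r for r in rs if r[2] == 1]    # sigma_{C=1}

print(select(apply_policy(rows)))   # [(2.0, 'b', 1)]
print(apply_policy(select(rows)))   # [('U1', 'b', 1)]  -- the null survives
```

Neither result contains the other once more tuples are involved, which is why selection does not commute with policy application in general.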
6.5.3 Cartesian Product and PIPs
In the following proposition we identify different ways in which the cartesian product
and PIPs interact with one another.
Proposition 40. Suppose I is a database over schema DS, R_1, R_2 ∈ I are relations
over schemas S_1, S_2 ∈ DS, and ℘_1, ℘_2 are policies for the former and latter relations,
respectively. Furthermore, let S′ denote the schema of R_1 × R_2, and W_1, W_2 be the
attributes appearing in ℘_1 and ℘_2, respectively. Then,
1) ℘_1^{S′}(×_{S_1,S_2}(I)) = ×_{S_1,S_2}(℘_1^{S_1}(I)).
2) ℘_2^{S′}(×_{S_1,S_2}(I)) = ×_{S_1,S_2}(℘_2^{S_2}(I)).
3) ℘_2^{S′}(℘_1^{S′}(×_{S_1,S_2}(I))) = ×_{S_1,S_2}(℘_2^{S_2}(℘_1^{S_1}(I))).
4) ℘_1^{S′}(℘_2^{S′}(×_{S_1,S_2}(I))) = ×_{S_1,S_2}(℘_1^{S_1}(℘_2^{S_2}(I))).
5) If π_{W_1}(R_1) and π_{W_2}(R_2) do not have nulls in common, then ℘_1^{S′}(℘_2^{S′}(×_{S_1,S_2}(I))) = ℘_2^{S′}(℘_1^{S′}(×_{S_1,S_2}(I))).
The fifth item above provides a sufficient condition to guarantee that the two different orders in which ℘_1 and ℘_2 are applied after performing the cartesian product give the
same result. The following example shows that this is not a necessary condition.
Example 71. Consider the database I consisting of the following two relations R_1 and
R_2 (whose schemas are denoted by S_1 and S_2, respectively):

R_1:               R_2:
A    B  C          D    E  F
U_1  b  1          U_1  e  1
2    b  1          2    e  1

Let ℘_1 be any of the policies ℘_agg(µ_1, γ_1, A, {B}), ℘_reg(δ_1, A, {B}, {C}), or
℘_att(µ_1, β_1, γ_1, A, C, {B}); and let ℘_2 be any of ℘_agg(µ_2, γ_2, D, {E}), ℘_reg(δ_2, D, {E}, {F}),
or ℘_att(µ_2, β_2, γ_2, D, F, {D}). Let S′ denote the schema of R_1 × R_2. For any choice of
the aggregate operators, even though π_{W_1}(R_1) and π_{W_2}(R_2) have nulls in common (here
W_1 and W_2 are defined as in Proposition 40), the following holds: ℘_1^{S′}(℘_2^{S′}(×_{S_1,S_2}(I))) =
℘_2^{S′}(℘_1^{S′}(×_{S_1,S_2}(I))).
6.5.4 Join and PIPs
The join R_1 ⋈_θ R_2 of R_1, R_2 can be rewritten as the expression σ_θ(ρ_{α_1}(R_1) ×
ρ_{α_2}(R_2)), for some θ, α_1, α_2. This equivalence can be effectively exploited.
Corollary 6. Suppose I is a database over schema DS, R_1, R_2 ∈ I are relations over
schemas S_1, S_2 ∈ DS, and ℘_1, ℘_2 are policies for the former and latter relation, respectively. Let R_1 ⋈_θ R_2 = σ_θ(ρ_{α_1}(R_1) × ρ_{α_2}(R_2)) for some θ, α_1, α_2. Furthermore, let S′
denote the schema of ρ_{α_1}(R_1) × ρ_{α_2}(R_2), and W_1, W_2 be the attributes appearing in ℘_1
and ℘_2, respectively. Then,
1. σ^{S′}_θ(℘_1^{S′}(×_{S_1,S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I))))) = σ^{S′}_θ(×_{S_1,S_2}(℘_1^{S_1}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I))))).
2. σ^{S′}_θ(℘_2^{S′}(×_{S_1,S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I))))) = σ^{S′}_θ(×_{S_1,S_2}(℘_2^{S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I))))).
3. σ^{S′}_θ(℘_2^{S′}(℘_1^{S′}(×_{S_1,S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I)))))) = σ^{S′}_θ(×_{S_1,S_2}(℘_2^{S_2}(℘_1^{S_1}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I)))))).
4. σ^{S′}_θ(℘_1^{S′}(℘_2^{S′}(×_{S_1,S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I)))))) = σ^{S′}_θ(×_{S_1,S_2}(℘_1^{S_1}(℘_2^{S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I)))))).
5. If π_{W_1}(R_1) and π_{W_2}(R_2) do not have nulls in common, then
σ^{S′}_θ(℘_1^{S′}(℘_2^{S′}(×_{S_1,S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I)))))) = σ^{S′}_θ(℘_2^{S′}(℘_1^{S′}(×_{S_1,S_2}(ρ^{S_2}_{α_2}(ρ^{S_1}_{α_1}(I)))))).
6.5.5 Union and PIPs
We provide a sufficient condition under which the policy-first and policy-last strategies return the same result.
Proposition 41. Suppose I is a database over schema DS, R_1, R_2 ∈ I are relations over
schemas S_1 = r_1(A_1, . . . , A_n), S_2 = r_2(A_1, . . . , A_n) ∈ DS, and ℘ is a PIP. Furthermore,
let S′ denote the schema of R_1 ∪ R_2, W the attributes appearing in ℘, and X = sel(℘).
If π_X(R_1) ∩ π_X(R_2) = ∅ and π_W(R_1), π_W(R_2) do not have nulls in common, then
℘^{S′}(∪_{S_1,S_2}(I)) = ∪_{S_1,S_2}(℘^{S_2}(℘^{S_1}(I))) = ∪_{S_1,S_2}(℘^{S_1}(℘^{S_2}(I))).
The next example shows that the condition in the previous proposition is not a
necessary condition.
Example 72. Consider the database I = {R_1, R_2} where R_1 and R_2 are shown below.
Let S_1 = r_1(A, B, C, D) and S_2 = r_2(A, B, C, D) denote their schemas, respectively.

R_1:                 R_2:
A    B  C  D         A    B  C  D
U_1  b  1  d_1       U_2  b  1  d_3
2    b  1  d_2       2    b  1  d_4

Let ℘ be either ℘_agg(µ, γ, A, {B}), ℘_reg(δ, A, {B}, {C}), or ℘_att(µ, β, γ, A, C, {B}). For
any aggregate operators µ, β, γ and δ, we have that ℘^{S′}(∪_{S_1,S_2}(I)) = ∪_{S_1,S_2}(℘^{S_2}(℘^{S_1}(I))) =
∪_{S_1,S_2}(℘^{S_1}(℘^{S_2}(I))) even though π_B(R_1) ∩ π_B(R_2) ≠ ∅ (S′ denotes the schema of R_1 ∪ R_2).
6.5.6 Difference and PIPs
As we show in the example below, the different orders in which a policy can be
combined with the difference operator yield different results in very simple cases; the
reason is similar to the one given for selection.
Example 73. Consider the database I consisting of the relations R_1 and R_2 below, and
let S_1 = r_1(A, B, C, D) and S_2 = r_2(A, B, C, D) denote their schemas, respectively.

R_1:                 R_2:
A    B  C  D         A  B  C  D
U_1  b  1  d_1       2  b  1  d_2
2    b  1  d_2

Suppose we compute ℘^{S_1}(I), where ℘ is any of the following policies.
1. ℘_agg(µ, γ, A, {B}). This policy replaces U_1 with 2 for any aggregate operators µ and γ.
2. ℘_reg(δ, A, {B}, {C}). By applying linear regression, U_1 is replaced by 2 for any aggregate operator δ.
3. ℘_att(µ, β, γ, A, C, {B}). U_1 is replaced with 2 for any aggregate operators µ, β, γ.

Thus, for any of the policies above, ℘^{S_1}(I) replaces U_1 with a value determined
using the second tuple in R_1 (this is because the two tuples in R_1 have the same B-value).
Clearly, −_{S_1,S_2}(℘^{S_1}(I)) returns only the first tuple in R_1, where U_1 has been replaced with
an actual value. However, if the difference operator is performed before applying ℘, then
the first tuple in R_1 is returned and the application of a policy afterwards has no effect
because there are no tuples that can be used to determine a value for U_1. Hence, we get
different results depending on whether we apply the policy before or after the difference
operator. Moreover, neither result includes the other.
6.5.7 Intersection and PIPs
As in the case of difference, applying a policy before or after intersection leads to
different results in simple cases.
Example 74. Consider a database I consisting of the relations R_1 and R_2 below and let
S_1 = r_1(A, B, C, D) and S_2 = r_2(A, B, C, D) denote their schemas, respectively.

R_1:                 R_2:
A    B  C  D         A    B  C  D
U_1  b  1  d_1       U_1  b  1  d_1
2    b  1  d_2       2    b  1  d_3

Considering the policies of Example 73, it is easy to check that ∩_{S_1,S_2}(℘^{S_2}(℘^{S_1}(I))) returns the tuple (2, b, 1, d_1). On the other hand, ℘^{S′}(∩_{S_1,S_2}(I)), where S′ denotes the
schema of R_1 ∩ R_2, returns the tuple (U_1, b, 1, d_1) since this is the only tuple which is
in both R_1 and R_2, and the policy has no effect. Hence, the two results are different; note
also that neither of them is included in the other.
6.6 Experimental Results
We now describe several experiments we carried out to assess the effectiveness and
the scalability of the index structures of Section 6.4. We compare our approach with a
℘_agg(AVG, MAX, AirTime, {Origin, Dest, Carrier}):
  Replace a missing flight air time with the average air time of the flights
  operated by the same carrier having the same origin and destination.
℘_att(AVG, MAX, MIN, AirTime, ElapsedTime, {Origin, Dest, Carrier}):
  Replace a missing flight air time with the flight air time corresponding to
  the average elapsed time of the flights operated by the same carrier having
  the same origin and destination.
℘_reg(MAX, AvgFare, {City}, {Year, Quarter}):
  Determine a missing average fare (for a certain city in a certain quarter)
  by linear regression using the historical data for the same city.

Figure 6.4: Some of the PIPs used in the experiments
[Figure omitted: line chart of policy application running time (sec, up to 70,000) vs. DB size (1M–15M tuples), comparing the Index and Naive approaches at 1%, 3%, and 5% incompleteness.]
Figure 6.5: Policy application running time (different degrees of incompleteness)
naive one, the latter being a slight variant of Algorithm CT-ApplyPIP not relying on the
proposed indexes. In order to make the application of a policy ℘ faster with the naive
approach, we defined a B-tree index on sel(℘) (we performed experiments showing that
this speeds up the naive approach). We also compare the two approaches when they are
combined with relational algebra operators and experimentally study the effects of the
propositions in Section 6.5. Finally, we carried out an experimental evaluation of the
quality of query answers with and without PIPs.
All experiments were carried out on a PostgreSQL (v. 7.4.16) DBMS containing 20
years of U.S. flight data. The database schema has 55 attributes including date, origin,
[Figure omitted: line chart of running time (sec, up to 700) of the index-based approach vs. DB size (1M–15M tuples) at 1%, 3%, and 5% incompleteness.]
Figure 6.6: Policy application running time (different degrees of incompleteness)
destination, airborne time, elapsed time, carrier, etc. Experiments were run using multiple multi-core Intel Xeon E5345 processors at 2.33GHz, 8GB of memory, running the
Scientific Linux distribution of the GNU/Linux OS kernel version 2.6.9-55.0.2.ELsmp.
The index structures were implemented using Berkeley DB Java Edition. The algorithms
for managing the index structures and applying policies in both approaches were written
in Java. Some of the policies used in the experiments are reported in Figure 6.4. The
results reported in this section apply only to aggregate policies; for the sake of brevity, we
do not present the results for the other kinds of policies as they show the same trend.
6.6.1 Applying PIPs
We first compared the times taken by the two approaches to apply a policy. We
varied the size of the DB up to 15 million tuples and the "amount of incompleteness"
(percentage of rows with a null value) by randomly selecting tuples and inserting nulls
(of different kinds) in them. For example, for an aggregate policy ℘_agg(µ, γ, A, X), an x%
[Figure omitted: line chart of running time (sec, up to 1,800) vs. DB size (125,000–1,000,000 tuples) for the Index and Naive approaches with 1–10 PIPs defined.]
Figure 6.7: Running times of policy application with multiple policies defined
degree of incompleteness means that x% of the tuples in the database have null values in
A.
Figure 6.5 shows the running times of policy application for different database sizes
and three different amounts of incompleteness (only one policy is defined in this setting).
It is important to note that the execution times for the index approach include both the
time to apply a policy and the time taken to update the indexes. The gap between the two
approaches increases significantly as the DB size increases, with the index-based approach
significantly outperforming the naive one: with 5 million tuples the former is 3 orders of
magnitude faster than the latter. As expected, a higher degree of incompleteness leads to
higher running times for both approaches. Figure 6.6 zooms in on the execution times for
the index-based approach and shows that it scales well, being able to manage databases of
up to 15 million tuples.
Figure 6.7 shows how execution times vary when multiple policies are defined (here
the amount of incompleteness is 1%). The execution times of both approaches increase
with the number of defined policies because additional data structures have to be updated
when applying a policy. Our approach significantly outperforms the naive method.
These results show that our approach scales well when increasing the DB size, the amount
of incompleteness, and the number of policies used; we can manage very large databases in
a reasonable amount of time.
[Figure omitted: line chart of average running time (sec, up to 0.02) of tuple insertions vs. DB size (125,000–1,000,000 tuples) for the Index and Naive approaches.]
Figure 6.8: Tuple insertion running time
6.6.2 Updating the database
We also measured the time to execute tuple insertions, deletions, and updates; the
results are shown in Figures 6.8–6.10. Each execution time is the average over at least
50 runs covering the different kinds of tuples that might be inserted, deleted or updated.
The index-based approach is faster than the naive approach when tuple deletions are performed, but slower for tuple insertions and updates, though the differences are negligible
and do not significantly increase as the database size increases. This small overhead is
due to the management of the different data structures the index-based approach relies on
[Figure omitted: line chart of average running time (sec, up to 2.5) of tuple deletions vs. DB size (125,000–1,000,000 tuples) for the Index and Naive approaches.]
Figure 6.9: Tuple deletion running time
and is paid back by the better performances achieved for policy application, as discussed
earlier. We further analyze this tradeoff in the following subsection.
6.6.3 Execution times under different loads
The results reported in the previous two sections show that policy applications are significantly faster with our index structures, but tuple insertions and updates are slightly slower (though tuple deletions are faster). Thus, the price we pay to maintain the indexes is when tuples are inserted and updated. Clearly, this cost grows as the number of modifications performed on the database increases, but it is paid back when policies are applied. We performed experiments with different loads of database modifications and policy applications to assess when the cost paid to perform the former is paid back by the time saved when the latter are executed. Specifically, we varied the number of modifications from 1,000 to 10,000, combining them with different numbers of policy applications. The experimental results are shown in Figure 6.11 (we used a database with 1 million
Figure 6.10: Tuple update running time
tuples and a 10% degree of incompleteness). The y-axis reports the difference between the running times of the naive and the index-based approaches. If only one (resp. two) policy application is performed, then the index running time exceeds the naive one when more than 5,000 (resp. around 10,000) database modifications are applied. With more than two policy applications the index approach is always faster than the naive one up to 10,000 modifications and, as shown by the trends of the curves, several thousand additional modifications would be necessary to make the index approach slower than the naive one.
6.6.4 Query answer quality
To assess the quality of query answers with and without policies, we performed an experimental evaluation using the World Bank/UNESCO database mentioned in the introduction. Specifically, we asked 5 analysts in our department (non computer scientists) who work with this database and know it well to express 10 queries of interest over the data. As an example, they asked for the years during which the % of female
Figure 6.11: Execution times with different loads
unemployment was above a certain threshold and wanted to know what were the gross
and under-age enrollment ratios in those years (to try to see if the gross and under-age
enrollment ratios are somehow related to and affect the % of female unemployment).
Furthermore, we asked the analysts to express different policies that would have been reasonable over such data and that captured some of their knowledge of the domain. Then, we asked them to rate the quality of query answers when policies are used and when the queries are evaluated on the original database without applying any policy. Specifically, users gave scores as integer numbers between 0 and 10, depending on their subjective evaluation of the quality of the results. The average score for each query is reported in Figure 6.12 and shows that end-users experience a substantial benefit when they can express how missing values should be replaced according to their needs and knowledge of the data. The higher quality of query answers when PIPs are used is generally due to
Figure 6.12: Query answer quality
the fact that more informative query answers are obtained after applying policies; that is, "more complete" tuples, where null values have been filled according to the assumptions made by the user (and expressed in the policy), are returned to the user.
6.6.5 Relational Algebra operators and PIPs
We now compare the two approaches when they are combined with relational alge-
bra operators. It is worth noting that applying a policy to the result of a relational algebra
operator requires building index structures for the result database. We also wanted to
experimentally see if there are substantial differences in execution times when a policy
is applied before or after a relational algebra operator, under conditions which guarantee
the same result (see Section 6.5), since this might be exploited for query optimization
purposes.
Figure 6.13: Running times of projection
We report on experiments for projection, join, and union, as these are the three basic operators for which equivalence theorems exist. All experiments in this section were carried out on databases with a 1% degree of incompleteness.^5
Projection. Figure 6.13 shows that applying the policy before or after projection makes little difference in running times when the indexes are adopted, whereas applying the policy after projection is more efficient when the naive approach is used. The index approach is slightly faster than the naive approach applied after projection, and much faster than the naive approach applied before projection.
Join. Figure 6.14 shows that applying a PIP after a join is more expensive than the other way around, because the policy is applied to a much bigger relation. This difference is more evident for the naive approach. The fastest solution is applying the policy before the join using the index structures.
^5 We performed the same experiments with 3% and 5% degrees of incompleteness and observed the same trends as those reported here, with higher execution times due to the greater amount of incompleteness.
Figure 6.14: Running times of join
Union. Figure 6.15 shows that the index-based approach is faster than the naive approach regardless of the order in which policy and union are applied. Applying the policy before union gives better performance for the naive approach, whereas there is no significant difference for the index-based approach.
To sum up, the index-based approach is faster than the naive one for all the relational algebra operators considered above. The gap between the two approaches grows as the database size increases and thus, as the trends of the execution time curves show, it is expected to grow even larger with bigger datasets.
6.7 Concluding Remarks
In existing work on the management of incomplete databases, the DBMS dictates how incomplete information should be handled. End-users have no say in the matter. However, the stock analyst knows stocks, the market, and his own management's or client's attitude toward risk better than a DB developer who has never seen the stock
Figure 6.15: Running times of union
DB. He should make decisions on what to do with partial information, not the person who
built the DBMS without knowing what applications would be deployed on it.
In this chapter, we propose the concept of a partial information policy (PIP). Using PIPs, end-users can specify the policy they want to use to handle partial information. We have presented examples of three families of PIPs that end-users can apply. We have also presented index structures for efficiently applying PIPs and conducted an experimental study showing that the adoption of such index structures allows us to efficiently manage very large datasets. Moreover, we have shown that PIPs can be combined with relational algebra operators, giving users even more control over how to manage their incomplete data.
Chapter 7
Query Answering under Uncertain Schema Mappings
The work described in this chapter appears in [GMSS09].
7.1 Introduction and Motivating Example
This chapter focuses on the problem of aggregate query processing across multiple
databases in the presence of probabilistic schema mappings. The system may contain a
number of data sources and a mediated schema, as in [DHY07]. Alternatively, a peer database system, with multiple data sources (e.g., DBLife-like information) and no mediated schema, as in [AKK+03, HIM+04], may also be in place.
There are many cases where a precise schema mapping may not be available. For instance, a comparison search "bot" that tracks comparative prices from different web sites has to determine, in real time, which fields at a particular location correspond to which fields in a database at another URL. Likewise, as in the case of [DHY07], in many cases, users querying two databases belonging to different organizations may not know what the right schema mapping is. We model this uncertainty about which schema
ID  price  agentPhone  postedDate  reducedDate
1   100k   215         1/5/2008    1/30/2008
2   150k   342         1/30/2008   2/15/2008
3   200k   215         1/1/2008    1/10/2008
4   100k   337         1/2/2008    2/1/2008
Table 7.1: An instance D_S1
mapping is correct by using probability theory. This robust model allows us to provide, in the case of aggregate queries, not only a ranking of the results, but also the expected value of the aggregate query outcome and the distribution of possible aggregate values. We focus on five types of aggregate queries: COUNT, MIN, MAX, SUM, and AVG. Given a mediated schema, a query Q, and a data source S, Q is reformulated according to the (probabilistic) schema mapping between S's schema and the mediated schema, and posed to S, retrieving the answers according to the appropriate semantics (to be discussed shortly).
We focus on efficient processing of aggregate queries. An orthogonal challenge in this setting involves record linkage and cleansing as it relates to duplicates. We assume the presence of effective tools for solving this problem [GIKS03, IKBS08] and focus on correct and efficient processing of the data. Also, we focus on the analysis of aggregate queries over a single table, to avoid mixing in issues with joins over uncertain schema mappings. Our analysis tests the effect of executing an aggregate query over a single table or a table that is the result of any SPJ query over the non-probabilistic part of the schema.
We define schema mappings between a source schema S and a target schema T in terms of attribute correspondences of the form c_ij = (s_i, t_j), where s_i in S is the source attribute and t_j in T is the target attribute. For illustration purposes, we shall use the following two examples throughout the chapter:
Example 75. Consider a real-estate data source S1, which describes properties for sale,
their list price, an agent’s contact phone, and the posting date. If the price of a property
was reduced, then the date on which the most recent reduction occurred is also posted.
The mediated schema T1 describes property list price, contact phone number, date of
posting, and comments:
S1 = (ID, price, agentPhone, postedDate, reducedDate)
T1 = (propertyID, listPrice, phone, date, comments)
For the sake of simplicity, we assume that the mapping of ID to propertyID, price to listPrice, and agentPhone to phone is known. In addition, there is no mapping to comments. Due to lack of background information, it is not clear whether date should be mapped to postedDate (denoted as mapping m_11) or reducedDate (denoted as mapping m_12). Because of the uncertainty regarding which mapping is correct, we consider both mappings when answering queries. We can assign a probability to each such mapping (e.g., m_11 has probability 0.6 and m_12 has probability 0.4). Such a probability may be computed automatically by algorithms that identify the correct mapping [CSD+08]. Table 7.1 shows an instance D_S1 of data source S1.
Suppose that on February 20, 2008 the system receives a query Q1, composed on schema
T1, asking for the number of “old” properties, those listed for more than a month:
Q1: SELECT COUNT(*) FROM T1
    WHERE date < '2008-1-20'

Using mapping m_11, we can reformulate Q1 into the following query:

Q11: SELECT COUNT(*) FROM S1
     WHERE postedDate < '2008-1-20'
transactionID  auctionID  time  bid     currentPrice
3401           34         0.43  195     195
3402           34         2.75  200     197.5
3403           34         2.8   331.94  202.5
3404           34         2.85  349.99  336.94
3801           38         1.16  330.01  300
3802           38         2.67  429.95  335.01
3803           38         2.68  439.95  336.30
3804           38         2.82  340.5   438.05
Table 7.2: An instance D_S2
Example 76. As another example, consider eBay auctions. These auctions have a strict end date for each auction and use a second-price model. That is, the winner is the one who places the highest bid, but the winning price is (a delta higher than) the second-highest bid. Now consider two (simplified) database schemas, S2 and T2, that keep track of auction prices:
S2 = (transactionID, auction, time, bid, currentPrice)
T2 = (transaction, auctionId, timeUpdate, price)
For simplicity, we again assume that the mappings of transactionID to transaction, auction to auctionId, and time to timeUpdate are known. The attribute price in T2 can be mapped to either the bid attribute (denoted as mapping m_21) or the currentPrice attribute (denoted as mapping m_22) in S2. Here, the source of uncertainty may be attributed to the sometimes confusing semantics of the bid and the current price in eBay auctions. Assume that m_21 is assigned probability 0.3 and m_22 is assigned probability 0.7. Table 7.2 contains data for two auctions (numbers 34 and 38) with four bids each.
The time is measured from the beginning of the auction and therefore 0.43 means that
about 10 hours (less than half a day) have passed since the opening of the auction. Suppose that the system receives a query Q2 w.r.t. schema T2, asking for the average closing
price of all auctions:
Q2: SELECT AVG(R1.price) FROM
    (SELECT MAX(DISTINCT R2.price)
     FROM T2 AS R2
     GROUP BY R2.auctionId) AS R1
The subquery, within the FROM clause, identifies the maximum price for each auction.
Using mapping m_21, we can reformulate Q2 to be:

Q21: SELECT AVG(R1.bid) FROM
     (SELECT MAX(DISTINCT R2.bid)
      FROM S2 AS R2
      GROUP BY R2.auction) AS R1
As mentioned in Section 2.4, two different semantics have been proposed for dealing with query answering using probabilistic schema matchings [DHY07, DSDH08]: a "by-table" semantics and a "by-tuple" semantics. We analyze the aggregates COUNT, MIN, MAX, SUM, and AVG and define three semantics for such aggregate functions that we combine with the by-table and by-tuple semantics. In the first one, an aggregate query returns a set of possible values for the answer, together with a probability distribution over that set. We call this the "distribution" semantics. A second method returns just a range specifying the lowest and highest possible values for the aggregate query. We call this the "range" semantics. The third semantics returns an expected value. In this work, we first propose these three semantics for aggregate computations and then show that they combine with the by-table and by-tuple semantics of [DHY07] in six possible ways, yielding six possible semantics for aggregates in probabilistic schema mapping. We develop algorithms to compute query answers under each of the six semantics and show that the algorithms are correct. We develop a characterization of the computational complexity of the problem of computing these six semantics. For all the above aggregate operators, we show that semantics based on the by-table semantics are PTIME computable. For the COUNT operator, we show that query results for all six semantics can be computed in PTIME. Computing the SUM operator is in PTIME for all but the by-tuple/distribution semantics. Finally, we show that for MIN, MAX, and AVG, the only by-tuple semantics that can be efficiently computed is the range semantics.
We have developed a prototype implementation of our algorithms and tested their efficiency on large data sets, showing that our algorithms work very efficiently in practice. Our experiments show the computational feasibility of the different semantics for each of the aggregate operators mentioned above. We show that, for each aggregate operator considered in this work under the by-tuple semantics, the algorithms for computing the range semantics are very efficient and scalable; this is also the case for COUNT under the other two semantics. Furthermore, the expected value semantics for SUM is also very efficient, since we can take advantage of the fact that it is guaranteed to be equivalent to the by-table semantics, as we show in this work. In summary, for each aggregate operator, there is at least one semantics where our experiments show that it can be computed very efficiently.
To summarize, our contributions are as follows:
1. We show six possible semantics for aggregate queries with uncertain schema mappings.
2. We show several cases under the by-tuple semantics where efficient algorithms exist for aggregate computation.
3. We prove that for the SUM aggregate operator, the by-tuple/expected value and by-table/expected value semantics yield the same answer.
4. Using a thorough empirical setup, we show that the polynomial time algorithms are scalable up to several million tuples (with some even beyond 30 million tuples) and with a large number of mappings.
The rest of the chapter is organized as follows. Section 7.2 provides background on aggregate query answering under uncertain schema mappings. The six semantics for aggregate query processing in the presence of uncertain schema mappings are described in detail in Section 7.3. Section 7.4 provides a set of algorithms for efficient computation of the various aggregates. Our empirical analysis is provided in Section 7.5. We discuss directions for future work in Section 7.6 and offer final remarks in Section 7.7.
7.2 Preliminaries
We base our model of probabilistic schema mappings on the one presented in
[DHY07], extending it to answer aggregate queries. In what follows, given relational
schemas S and T, S a relation in S, and T a relation in T, an attribute correspondence is
a one-to-one mapping from the attribute names in S to the attribute names in T. Also, a
one-to-one relation mapping is a mapping where each source and target attribute occurs
in at most one correspondence.
Definition 65 (Schema Mapping). Let S and T be relational schemas. A relation mapping M is a triple (S, T, m), where S is a relation in S, T is a relation in T, and m is a set of attribute correspondences between S and T.
A schema mapping M is a set of one-to-one relation mappings between relations in S and in T, where every relation in either S or T appears at most once.
The following definition, also from [DHY07], extends the concept of schema mapping with probabilities:
Definition 66 (Probabilistic Mapping). Let S and T be relational schemas. A probabilistic mapping (p-mapping) pM is a triple (S, T, m), where S ∈ S, T ∈ T, and m is a set {(m_1, Pr(m_1)), ..., (m_l, Pr(m_l))} such that:
• for i ∈ [1, l], m_i is a one-to-one relation mapping between S and T, and for every i, j ∈ [1, l], i ≠ j implies m_i ≠ m_j;
• Pr(m_i) ∈ [0, 1] and Σ_{i=1}^{l} Pr(m_i) = 1.
A schema p-mapping pM is a set of p-mappings between relations in S and in T, where every relation in either S or T appears in at most one p-mapping.
7.3 Semantics
We now present the semantics of aggregate queries in the presence of probabilistic schema mappings. We start with a formal presentation of the by-table and by-tuple semantics, as introduced in [DHY07] (Section 7.3.1). Then, we move on to introduce three aggregate semantics and their combination with the by-table and by-tuple semantics (Section 7.3.2).
7.3.1 Semantics of Probabilistic Mappings
The intuitive interpretation of a probabilistic schema mapping is that there is uncer-
tainty about which of the mappings is the right one. Such uncertainty may be rooted in the
fact that “the syntactic representation of schemas and data do not completely convey the
semantics of different databases,” [MHH00] i.e., the description of a concept in a schema
can be semantically misleading. As proposed in [DHY07], there are two ways in which
this uncertainty can be interpreted: either a single mapping should be applied to the entire
set of tuples in the source relation, or a choice of a mapping should be made for each of
these tuples. The former is referred to as the by-table semantics, and the latter as the by-
tuple semantics. The by-tuple semantics represents a situation in which data is gathered
from multiple sources, each with a potentially different interpretation of a schema.
As discussed in [DHY07], the high complexity of query answering under the by-
tuple semantics is due to the fact that all possible sequences of mappings (of length equal
to the number of tuples in the table) must be considered in the general case. The following
examples illustrate the difference between the two semantics when considering aggregate
functions.
Example 77. Consider the scenario presented in Example 75. Assume the content of table D_S1 is as shown in Table 7.1. Using the two possible mappings, we can reformulate Q1 into the following two queries, one for each possible way of mapping attribute date:
Q11: SELECT COUNT(*) FROM S1
     WHERE postedDate < '2008-1-20'
We must now prove that the statement holds for |T| = k + 1. Let T′ be equal to T without its last tuple; since |T′| = k, the algorithm correctly computes a probability distribution pd for the answer to the query.
Now, let occProb and notOccProb be the values calculated by the algorithm during its last iteration of the for loop in line 3, i.e., for the last tuple in T. Since pd is correct, for any 0 ≤ i ≤ k the value pd(i) is equal to the sum of the probabilities of all mapping sequences under which the result is i, i.e., pd(i) = Pr(s_1^i) + ... + Pr(s_y^i), where s_1^i, ..., s_y^i are all the sequences that yield answer i. Now, after updating pd in line 8 of the algorithm, we get pd(i) = (Pr(s_1^i) + ... + Pr(s_y^i)) · notOccProb + (Pr(s_1^{i−1}) + ... + Pr(s_y^{i−1})) · occProb. If we distribute the multiplications with respect to the sums, we get Pr(s_1^i) · notOccProb + ... + Pr(s_y^i) · notOccProb + Pr(s_1^{i−1}) · occProb + ... + Pr(s_y^{i−1}) · occProb. Since notOccProb is the sum of the probabilities of the mappings under which the last tuple is not part of the count, each term Pr(s_j^i) · notOccProb represents the sum of the probabilities of the sequences starting with s_j^i under which the answer is i, and analogously for the terms multiplied by occProb. Since these are all the possible sequences that can yield a value of i, we conclude that the probability is computed correctly.
Since the argument above was built on an arbitrarily chosen value 0 ≤ i ≤ k, we conclude that the probability distribution is correct for all such values. Finally, for pd(k + 1) the reasoning is analogous, except that the summation multiplied by notOccProb is zero, because it can never yield a value of k + 1.
In Section 7.3.1, the probability distribution for this example was computed by looking at the answer of each possible sequence of mappings assigned to individual tuples. If we have m mappings and n tuples, then the number of sequences is m^n. The algorithm presented here is polynomial in the number of mappings and tuples, and the number of computations is in O(m · n²).
Aggregate functions SUM and AVG
We now present efficient (PTIME) algorithms to compute the SUM aggregate under the by-tuple/range and by-tuple/expected value semantics. Computing this aggregate function under the distribution semantics does not scale, simply because the number of newly generated values may be exponential in the size of the original table, as was demonstrated at the beginning of Section 7.4.2.
SUM Under the Range Semantics. For the range semantics, we must compute the tightest interval in which the aggregate lies. The algorithm is presented in Figure 7.4 and illustrated next.
Consider Example 76, but now suppose we are interested in a simple computation of the sum of the prices for transactions whose auctionId is 34; we then use the following query:
Algorithm ByTupleRangeSUM
Input: Table S, T; MapList M; Attribute A; Condition C;
1  Let low = 0, up = 0;
2  For each t_i ∈ S,
3    Let v_i^min be the minimum value obtained by applying a mapping in M that satisfies condition C;
4    Similarly, let v_i^max be the maximum value that satisfies condition C;
5    low = low + v_i^min;
6    up = up + v_i^max;
7  return [low, up];
Figure 7.4: Algorithm to answer SELECT SUM(A) FROM T WHERE C under Range Semantics
tupleID  v_i^min  v_i^max  low     up
                           0       0
1        195      195      195     195
2        197.5    200      392.5   395
3        336.3    439.95   728.8   834.95
4        340.5    438.05   1069.3  1273
Table 7.6: Trace of ByTupleRangeSUM for query Q2'
Q2’: SELECT SUM(price) FROM T2
WHERE auctionID = ’34’
Table 7.6 shows the trace of the algorithm in Figure 7.4 to answer query Q2'. If we look, for instance, at the second row in the table, processing tuple 2 from Table 7.2, v_2^min = 197.5 and v_2^max = 200, thus low = 392.5 and up = 395. The answer to Q2' is thus [1069.3, 1273]. This algorithm is polynomial in the number of mappings and tuples, and the number of computations is in O(m · n), where m is the number of mappings and n is the number of tuples.
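Figure 7.4 can be transcribed as a short function. The sketch below assumes mappings are given as attribute-rename dictionaries; the helper names and the small two-mapping data set are ours, not from the dissertation.

```python
def by_tuple_range_sum(tuples, mappings, attr, condition):
    """Sketch of Figure 7.4: for each tuple, take the min and max of the
    values of `attr` over the mappings under which the remapped tuple
    satisfies condition C, and accumulate them into [low, up]."""
    low = up = 0.0
    for row in tuples:
        vals = []
        for amap in mappings:
            remapped = {amap.get(k, k): v for k, v in row.items()}
            if condition(remapped):
                vals.append(remapped[attr])
        if vals:  # tuples that never satisfy C contribute nothing
            low += min(vals)
            up += max(vals)
    return [low, up]

# Hypothetical example: attribute x may come from either column a or b.
rows = [{"a": 10, "b": 12, "grp": 1}, {"a": 5, "b": 3, "grp": 1}]
maps = [{"a": "x"}, {"b": "x"}]
bounds = by_tuple_range_sum(rows, maps, "x", lambda r: r["grp"] == 1)
# bounds == [13.0, 17.0]: per tuple, min(10, 12) + min(5, 3) and
# max(10, 12) + max(5, 3)
```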
Theorem 23. Algorithm ByTupleRangeSUM correctly computes the result of executing a SUM query under the by-tuple/range semantics.
Proof. Suppose, towards a contradiction, that the algorithm returns the range [ℓ, u] and that there exists a possible answer k such that either k < ℓ or u < k. Since ℓ, u, and k are all sums of values from tuples in T, if k is in fact a possible value for the SUM query, this means that there is at least one tuple in T such that, for some mapping, the value of A is less (respectively, greater) than v_i^min (respectively, v_i^max). This is a contradiction, since the algorithm chooses this value to be the minimum (respectively, maximum) under all possible mappings.
AVG under the Range Semantics. For the AVG aggregate operator, the algorithm we developed is very similar to the one in Figure 7.4, keeping a counter of the number of participating tuples for both the lower bound and the upper bound. The counter for the upper bound is incremented by one at each step only if there exists a maximum value for the tuple that satisfies the condition when some mapping is applied. The counter for the lower bound is incremented only if there is a minimum value for the tuple that satisfies the condition under some mapping. The answer is given by dividing each bound for SUM by the corresponding counter.
Theorem 24. Algorithm ByTupleRangeAVG correctly computes the result of executing an AVG query under the by-tuple/range semantics.
Proof. Since the ByTupleRangeAVG algorithm is a trivial variation of the ByTupleRangeSUM algorithm that also counts the number of tuples satisfying the condition, this result is a direct consequence of Theorem 23.
SUM Under the Expected Value Semantics. We now address an efficient way of computing the by-tuple/expected value semantics. We do so not by giving an algorithm, but rather by
Sequence                        SUM      p       SUM · p
(m_21, m_21, m_21, m_21)        1076.93  0.0081    8.723133
(m_21, m_21, m_21, m_22)        1063.88  0.0189   20.107332
(m_21, m_21, m_22, m_21)         947.49  0.0189   17.907561
(m_21, m_21, m_22, m_22)         934.44  0.0441   41.208804
(m_21, m_22, m_21, m_21)        1074.43  0.0189   20.306727
(m_21, m_22, m_21, m_22)        1061.38  0.0441   46.806858
(m_21, m_22, m_22, m_21)         944.99  0.0441   41.674059
(m_21, m_22, m_22, m_22)         931.94  0.1029   95.896626
(m_22, m_21, m_21, m_21)        1076.93  0.0189   20.353977
(m_22, m_21, m_21, m_22)        1063.88  0.0441   46.917108
(m_22, m_21, m_22, m_21)         947.49  0.0441   41.784309
(m_22, m_21, m_22, m_22)         934.44  0.1029   96.153876
(m_22, m_22, m_21, m_21)        1074.43  0.0441   47.382363
(m_22, m_22, m_21, m_22)        1061.38  0.1029  109.216002
(m_22, m_22, m_22, m_21)         944.99  0.1029   97.239471
(m_22, m_22, m_22, m_22)         931.94  0.2401  223.758794
Expected value                                   975.437
Table 7.7: Computing Q2' under the by-tuple/expected value semantics
equivalent to its by-table counterpart. Before introducing this equivalence formally, we start with an illustrative example:
Example 79. Consider query Q2'. Using the by-table/expected value semantics, we consider two possible cases. Using m_21 we map price to bid, with a query outcome of 195 + 200 + 331.94 + 349.99 = 1076.93 and a probability of 0.3. Using m_22 we map price to currentPrice, with a query outcome of 195 + 197.5 + 202.5 + 336.94 = 931.94 and a probability of 0.7. Therefore, the answer to Q2', under the by-table/expected value semantics, would be 1076.93 · 0.3 + 931.94 · 0.7 = 975.437.
Table 7.7 presents the 16 different sequences and, for each sequence, the query output, its probability, and the product of the two (which is a term in the summation defining the expected value). The outcome of Q2' using the by-tuple/expected value semantics is identical to that of the by-table/expected value semantics. To see why, let us trace a single value, 349.99. This value appears in the fourth tuple and is used in the computation whenever a sequence contains mapping m_21 for the fourth tuple, which is every other row in Table 7.7. Summing up the probabilities of all such worlds yields a probability of 0.3, which is exactly the probability of using m_21 in the by-table semantics. The reason for this phenomenon is that the association of a mapping with one tuple is independent of the association with another tuple.
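The equivalence can also be checked numerically: enumerate all m^n by-tuple sequences and compare the resulting expected value with the by-table computation. The code below is our own illustration, using the auction-34 rows and the mapping probabilities of Example 76.

```python
from itertools import product

# (bid, currentPrice) for the auction-34 transactions of Table 7.2
rows = [(195, 195), (200, 197.5), (331.94, 202.5), (349.99, 336.94)]
mappings = [(0, 0.3), (1, 0.7)]  # m_21 selects bid, m_22 selects currentPrice

# By-table: a single mapping is chosen for the whole table.
by_table = sum(p * sum(r[idx] for r in rows) for idx, p in mappings)

# By-tuple: a mapping is chosen independently per tuple (m^n sequences).
by_tuple = 0.0
for seq in product(mappings, repeat=len(rows)):
    prob = 1.0   # probability of this sequence of mappings
    total = 0.0  # SUM under this sequence
    for (idx, p), r in zip(seq, rows):
        prob *= p
        total += r[idx]
    by_tuple += prob * total
```

Both computations agree with the expected value 975.437 derived in Example 79 and Table 7.7.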
Example 79 explains the intuition underlying Theorem 25 below. It is worth noting
that this solution does not extend to the AVG aggregate because it is a non-monotonic
aggregate.
Theorem 25. Let pM = (S, T, m) be a schema p-mapping and let Q be a SUM query over attribute A ∈ S. The expected value of Q_tuple(D_T), a by-tuple answer to Q with respect to pM, is identical to Q_table(D_T), a by-table answer to Q with respect to pM.
In order to prove this theorem, we will first prove a series of properties that are necessary to do so. The following notation will be used.
Notation. Let Pr(m) be the probability associated with mapping m. We order the |m| mappings and name them m_(1), m_(2), ..., m_(|m|). We denote by A′_{i(k)} the value of the mapping of A for the i-th tuple using m_(k). seq_{i(k)}(p̄M) is the set of all sequences in which the i-th tuple uses the m_(k) mapping.

Lemma 1. Σ_{k=1}^{|m|} Pr(m_(k)) = 1

Lemma 2. Σ_{seq ∈ seq(p̄M)} Pr(seq) = 1
Proof. By induction on the number of mappings in m.
Base: |m| = 1. There is only one sequence, |seq(p̄M)| = 1, and Pr(m_(j)) = 1 from Lemma 1. Thus
  Σ_{seq ∈ seq(p̄M)} Pr(seq) = Π_{j=1}^{|D_T|} Pr(m_(j)) = Π_{j=1}^{|D_T|} 1 = 1.
Step: Suppose that the induction assumption holds for |m| < q. For |m| = q we partition the summation into |m| summations, each with all the sequences that share a common mapping for the first tuple:
  Σ_{seq ∈ seq(p̄M)} Pr(seq)
    = Pr(m_(1)) · Σ_{seq ∈ seq_{1(1)}(p̄M)} Π_{j=2}^{|D_T|} Pr(m_j)
    + Pr(m_(2)) · Σ_{seq ∈ seq_{1(2)}(p̄M)} Π_{j=2}^{|D_T|} Pr(m_j)
    + ...
    + Pr(m_(|m|)) · Σ_{seq ∈ seq_{1(|m|)}(p̄M)} Π_{j=2}^{|D_T|} Pr(m_j)
    = Pr(m_(1)) + Pr(m_(2)) + ... + Pr(m_(|m|)) = 1,
based on the induction assumption and Lemma 1.
Lemma 3. Σ_{seq ∈ seq_{i(k)}(p̄M)} Π_{j=1}^{i−1} Pr(m_j) · Π_{j=i+1}^{|D_T|} Pr(m_j) = 1
Proof. By induction on the number of tuples in T.
261
Base: $|D_T| = 2$. In this case, the number of sequences in which one tuple keeps the same mapping is $|m|$. Therefore,
$$\sum_{seq \in seq_{i(k)}(\bar{pM})} \prod_{j=1}^{i-1} \Pr(m_j) \prod_{j=i+1}^{|D_T|} \Pr(m_j) = \Pr(m^{(1)}) + \Pr(m^{(2)}) + \ldots + \Pr(m^{(|m|)}) = 1$$
from Lemma 1.
Step: Suppose that the induction assumption holds for $|D_T| < q$. For $|D_T| = q$ we choose a tuple different from the $i$-th tuple; without loss of generality, assume we choose the first tuple. We partition the summation into $|m|$ summations, each over all the sequences that share a common mapping for the first tuple:
$$\sum_{seq \in seq_{i(k)}(\bar{pM})} \prod_{j=1}^{i-1} \Pr(m_j) \prod_{j=i+1}^{|D_T|} \Pr(m_j) = \Pr(m^{(1)}) \sum_{seq \in seq_{i(k),1(1)}(\bar{pM})} \prod_{j=2}^{i-1} \Pr(m_j) \prod_{j=i+1}^{|D_T|} \Pr(m_j) + \Pr(m^{(2)}) \sum_{seq \in seq_{i(k),1(2)}(\bar{pM})} \prod_{j=2}^{i-1} \Pr(m_j) \prod_{j=i+1}^{|D_T|} \Pr(m_j) + \ldots + \Pr(m^{(|m|)}) \sum_{seq \in seq_{i(k),1(|m|)}(\bar{pM})} \prod_{j=2}^{i-1} \Pr(m_j) \prod_{j=i+1}^{|D_T|} \Pr(m_j)$$
$$= \Pr(m^{(1)}) + \Pr(m^{(2)}) + \ldots + \Pr(m^{(|m|)}) = 1$$
based on the induction assumption and Lemma 1.
Theorem 26. Let $\bar{pM} = (S, T, m)$ be a schema p-mapping and let Q be a SUM query over attribute $A \in S$. The expected value of $Q^{tuple}(D_S \cup D_T)$, a by-tuple answer to Q with respect to $\bar{pM}$, is identical to $Q^{table}(D_S \cup D_T)$, a by-table answer to Q with respect to $\bar{pM}$.
Proof. Let $\bar{pM} = (S, T, m)$ be a p-mapping. Let Q be a SUM query over attribute $A \in S$ and let $D_S$ be an instance of S. First, let $D_T$ be a by-table consistent instance of T. Therefore, there exists a mapping $m \in m$ such that $D_S$ and $D_T$ satisfy $m$.

Given a mapping $m$, for which $A \in S$ is mapped to $A' \in T$, the outcome of Q is:
$$\sum_{i=1}^{|D_S|} A_i + \sum_{i=1}^{|D_T|} A'_i \qquad (7.3)$$
The expected value of a by-table answer to Q with respect to $\bar{pM}$ is:
$$\sum_{j=1}^{|m|} \Pr(m^{(j)}) \left( \sum_{i=1}^{|D_S|} A_i + \sum_{i=1}^{|D_T|} A'_{i(j)} \right)$$
$$= \sum_{j=1}^{|m|} \Pr(m^{(j)}) \sum_{i=1}^{|D_S|} A_i + \sum_{j=1}^{|m|} \Pr(m^{(j)}) \sum_{i=1}^{|D_T|} A'_{i(j)}$$
$$= \sum_{i=1}^{|D_S|} A_i \sum_{j=1}^{|m|} \Pr(m^{(j)}) + \sum_{j=1}^{|m|} \Pr(m^{(j)}) \sum_{i=1}^{|D_T|} A'_{i(j)}$$
$$= \sum_{i=1}^{|D_S|} A_i + \sum_{j=1}^{|m|} \Pr(m^{(j)}) \sum_{i=1}^{|D_T|} A'_{i(j)} \qquad (7.4)$$
The justification for the move from the third to the fourth line is Lemma 1.
Let us now consider a mapping sequence $seq = \langle m_1, m_2, \ldots, m_{|D_T|} \rangle$. Each $m_i$ can take one of $|m|$ values; we can say, for example, that $m_i = m^{(j)}$, which means that the mapping of the $i$-th tuple uses the $m^{(j)}$ interpretation.

The associated probability sequence $\langle \Pr_1, \Pr_2, \ldots, \Pr_{|D_T|} \rangle$ assigns probability $\Pr_i \in \{\Pr(m_1), \Pr(m_2), \ldots, \Pr(m_{|m|})\}$ to each tuple in $D_T$. Due to the independent assignment of interpretations to tuples, $\Pr(seq) = \prod_{i=1}^{|D_T|} \Pr(m_i)$. $seq(\bar{pM})$ is the set of all $|m|^{|D_T|}$ sequences that can be generated from $\bar{pM}$. Given a sequence $seq_j \in seq(\bar{pM})$, we denote by $m_{ij}$ the $i$-th element of the $j$-th sequence.
The expected value of a by-tuple answer to Q with respect to $\bar{pM}$ is:
$$\sum_{seq \in seq(\bar{pM})} \Pr(seq) \sum_{i=1}^{|D_S|} A_i + \sum_{seq \in seq(\bar{pM})} \Pr(seq) \sum_{i=1}^{|D_T|} A'_i$$
$$= \sum_{i=1}^{|D_S|} A_i \sum_{seq \in seq(\bar{pM})} \Pr(seq) + \sum_{k=1}^{|seq(\bar{pM})|} \prod_{i=1}^{|D_T|} \Pr(m_{ik}) \sum_{i=1}^{|D_T|} A'_i$$
$$= \sum_{i=1}^{|D_S|} A_i + \sum_{k=1}^{|seq(\bar{pM})|} \prod_{i=1}^{|D_T|} \Pr(m_{ik}) \sum_{i=1}^{|D_T|} A'_i \qquad (7.5)$$
where the last step uses Lemma 2.
For a given $i$ and $j$, let us now look at all the sequences in which $A'_{i(j)}$ appears. From the construction of the sequence set, exactly $\frac{1}{|m|}$ of the sequences (that is, $|m|^{|D_T|-1}$ of them) are such sequences, and they all share in common that $m_{ik} = m^{(j)}$, since $A'_{i(j)}$ uses mapping $m^{(j)}$. This part of the summation can be rewritten as
$$\Pr(m^{(j)}) \, A'_{i(j)} \sum_{seq \in seq_{i(j)}(\bar{pM})} \prod_{k=1}^{i-1} \Pr(m_k) \prod_{k=i+1}^{|D_T|} \Pr(m_k) = \Pr(m^{(j)}) \, A'_{i(j)}$$
by Lemma 3.
We repeat this computation for all $A'_{i(j)}$, and Eq. 7.5 can now be rewritten as
$$\sum_{i=1}^{|D_T|} \sum_{j=1}^{|m|} A'_{i(j)} \Pr(m^{(j)}) = \sum_{j=1}^{|m|} \Pr(m^{(j)}) \sum_{i=1}^{|D_T|} A'_{i(j)}$$
Adding the sum of the A attribute in the S table, the expected value of a by-tuple answer to Q with respect to $\bar{pM}$ is:
$$\sum_{i=1}^{|D_S|} A_i + \sum_{j=1}^{|m|} \Pr(m^{(j)}) \sum_{i=1}^{|D_T|} A'_{i(j)}$$
which is exactly the by-table expected value in Eq. 7.4.
Corollary 7. The expected value of a by-tuple answer to a query Q with respect to a schema p-mapping $\bar{pM}$ can be computed in PTIME with respect to both data complexity and mapping complexity.
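The equivalence stated in Theorem 26 can likewise be checked by brute force on a toy instance. In the sketch below, the probabilities, the source-side sum, and the per-tuple values $A'_{i(j)}$ are all hypothetical; the code compares the by-table expected value against full enumeration of all mapping sequences.

```python
import itertools

# Hypothetical instance: two candidate mappings for attribute A,
# with probabilities Pr(m^(1)), Pr(m^(2)).
pr = [0.7, 0.3]
# target_vals[i][j] plays the role of A'_{i(j)}: the value of the i-th
# target tuple under mapping m^(j); here |D_T| = 3.
target_vals = [[10.0, 12.0], [5.0, 8.0], [3.0, 4.0]]
source_sum = 100.0  # sum of A over D_S (mapping-independent)

# By-table: one mapping for the whole table, weighted by Pr(m^(j)).
by_table = sum(p * (source_sum + sum(row[j] for row in target_vals))
               for j, p in enumerate(pr))

# By-tuple: enumerate all |m|^{|D_T|} sequences, one mapping per tuple.
by_tuple = 0.0
for seq in itertools.product(range(len(pr)), repeat=len(target_vals)):
    p_seq = 1.0
    for k in seq:
        p_seq *= pr[k]
    total = source_sum + sum(target_vals[i][seq[i]] for i in range(len(seq)))
    by_tuple += p_seq * total

print(round(by_table, 6), round(by_tuple, 6))  # identical expected values
```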
Aggregate functions MAX and MIN

We now present an efficient algorithm to compute the MAX aggregate under the range semantics combined with the by-tuple semantics. The techniques presented here for MAX can be easily adapted to answer queries involving the MIN aggregate.

MAX under the range semantics. To compute MAX under the range semantics, we have to find the minimum and the maximum value of the aggregate under any possible mapping sequence, i.e., the tightest interval that includes all the possible maximum values that can arise. The procedure to find this interval without the need to look at all possible sequences is outlined in Figure 7.5.
Algorithm ByTupleRangeMAX
Input: Tables S, T; MapList M; Attribute A; Condition C;
1. For each $t_i \in S$,
2.   let $v_i^{min}$ be the minimum value obtained by applying a mapping in M that satisfies condition C; similarly, let $v_i^{max}$ be the maximum value.
3. return $[\max_i\{v_i^{min}\}, \max_i\{v_i^{max}\}]$;

Figure 7.5: Algorithm to answer SELECT MAX(A) FROM T under the range semantics
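A minimal Python sketch of ByTupleRangeMAX follows. It abstracts away the tables, mappings, and condition C: each tuple is represented directly by the list of values its A attribute can take under the applicable mappings (a simplification of the algorithm's input, not its actual interface).

```python
def by_tuple_range_max(candidate_values):
    """Range semantics for MAX under by-tuple semantics.

    candidate_values: one entry per tuple; each entry is the list of
    values the tuple's A attribute can take under the mappings that
    satisfy the selection condition.
    Returns [max_i{v_i^min}, max_i{v_i^max}].
    """
    v_min = [min(vals) for vals in candidate_values]
    v_max = [max(vals) for vals in candidate_values]
    return [max(v_min), max(v_max)]

# Tuples 5-8 of the auctionID = 38 group from Example 76:
group_38 = [
    [300.0, 330.01],   # tuple 5
    [335.01, 429.95],  # tuple 6
    [336.3, 439.95],   # tuple 7
    [340.05, 438.05],  # tuple 8
]
print(by_tuple_range_max(group_38))  # [340.05, 439.95]
```

Note that the work per tuple is bounded by the number of mappings, matching the O(m·n) bound discussed below.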
To see how this algorithm works, consider Example 76. We answer the subquery within the FROM clause of query Q2:

SELECT MAX(DISTINCT T2.price) FROM T2 AS R2 GROUP BY R2.auctionID

This subquery contains a GROUP BY on auctionID, which means we will have one answer for each distinct auctionID. In this case, looking at Table 7.2, we see that the answer will consist of two different ranges, one for auctionID = 34 and another for auctionID = 38. We show how to compute the answer for auctionID = 38; the process to obtain the answer for auctionID = 34 is analogous. For tuple 5, with transactionID = 3801, the minimum value obtained by applying a mapping is $v_5^{min} = 300$, while the maximum is $v_5^{max} = 330.01$. For tuple 6, $v_6^{min} = 335.01$ and $v_6^{max} = 429.95$; for tuple 7, $v_7^{min} = 336.3$ and $v_7^{max} = 439.95$. Finally, for tuple 8, $v_8^{min} = 340.05$ and $v_8^{max} = 438.05$.
Now, the range for the aggregate is given by $[\max_i\{v_i^{min}\}, \max_i\{v_i^{max}\}]$, where each bound is computed as:
$$\max_i\{v_i^{min}\} = \max\{300, 335.01, 336.3, 340.05\} = 340.05$$
$$\max_i\{v_i^{max}\} = \max\{330.01, 429.95, 439.95, 438.05\} = 439.95$$
and thus the final answer is [340.05, 439.95]. In general, it is always the case that the range yielded by the by-table semantics is a subset of the range yielded by the by-tuple semantics. This is because by-tuple semantics has the possibility of choosing a different mapping for each tuple, which means that the algorithm has the freedom to choose sequences that are not allowed under the by-table semantics. This holds for all aggregate functions considered in this work. The algorithm also requires only a polynomial number of computations, in $O(m \cdot n)$, where m is the number of mappings and n is the number of tuples.
Theorem 27. Algorithm ByTupleRangeMAX correctly computes the result of executing a MAX query under the by-tuple/range semantics.

Proof. Suppose, towards a contradiction, that the algorithm returns the range $[\ell, u]$ and that there exists a possible answer k such that either $k < \ell$ or $u < k$. If k is in fact a possible value for the MAX query, this means that there is at least one tuple in T such that, for some mapping, the value of A is less (respectively, greater) than $v_i^{min}$ (respectively, $v_i^{max}$), which is a contradiction, since the algorithm chooses these values to be the minimum (respectively, maximum) under all possible mappings.
7.4.3 Summary of Complexity Results

The tables in Figure 7.6 summarize our results for the six different semantics. The algorithms presented in this section correspond to those that require polynomial time to compute the answer.
7.5 Experimental Results

In order to evaluate the difference in the running times of our algorithms (both PTIME and non-PTIME), and how these are affected by changes in both the number of tuples in the database and the number of probabilistic mappings present, we carried out a series of empirical tests whose results we report in this section. The algorithms we gave for problems that were not shown to be PTIME are, as expected, inefficient. However, the algorithms we gave for problems we showed to be in PTIME are quite efficient when we vary both the number of tuples and the number of mappings, though clearly there are limits that vary from one algorithm to another. We discuss these limits below.

COUNT          Range   Distribution              Expected Value
By-Table       PTIME   PTIME                     PTIME
By-Tuple       PTIME   PTIME                     PTIME

SUM            Range   Distribution              Expected Value
By-Table       PTIME   PTIME                     PTIME
By-Tuple       PTIME   ?                         PTIME

MAX, MIN, AVG  Range   Probability Distribution  Expected Value
By-Table       PTIME   PTIME                     PTIME
By-Tuple       PTIME   ?                         ?

Figure 7.6: Summary of complexity for the different aggregates

Figure 7.7: Running times for variation of #tuples using the eBay data; #attributes = 7, #mappings = 2; results are averages over 5 runs on the eBay auction data. Solid line: ByTuplePDMAX. Dotted line: ByTupleExpValAVG, ByTuplePDAVG, ByTuplePDSUM, and ByTupleExpValMAX. Dashed line (touching the x axis): ByTupleRangeMAX, ByTupleRangeCOUNT, ByTuplePDCOUNT, ByTupleExpValCOUNT, ByTupleRangeSUM, ByTupleExpValSUM, and ByTupleRangeAVG.

Figure 7.8: Running times for variation of #mappings; #attributes = 20, #tuples = 6; results are averages over 5 runs on synthetic data. Solid line: ByTupleExpValAVG, ByTuplePDAVG, ByTuplePDSUM, ByTupleExpValMAX, and ByTuplePDMAX. Dashed line (touching the x axis): ByTupleRangeMAX, ByTupleRangeCOUNT, ByTuplePDCOUNT, ByTupleExpValCOUNT, ByTupleRangeSUM, ByTupleExpValSUM, and ByTupleRangeAVG.
The programs to carry out these tests consist of about 3,300 lines of Java code. All
computations were carried out on a quad-processor computer with Intel Xeon 5140 dual
core CPUs at 2.33GHz each, 4GB of RAM, under the CentOS GNU/Linux OS (distribution 2.6.9-55.ELsmp). The database engine we used was PostgreSQL version 7.4.16.
Experimental Setup. We carried out two sets of experiments. The first set used real-world data of 1,129 eBay 3-day auctions with a total of 155,688 bids for Intel, IBM, and Dell laptop computers. The data was obtained from an RSS feed for a search query on eBay (http://search.ebay.com/ws/search/). The database schema is the one presented in Example 76. The sole point of uncertainty lies in the two price attributes, where a reference to Price could mean either the bid price or the current price. We therefore defined two mappings: bid mapped to Price with probability 0.3, and currentPrice mapped to Price with probability 0.7. We applied the inner query of query Q2 and also a set of queries that cover four of the aggregate operators discussed in this work (all except MIN).
The second set of experiments was done on synthetic, randomly generated data in order to evaluate configurations not possible with the eBay data (in particular, larger numbers of attributes, tuples, and mappings). The tables consist of attributes of type real, plus one column of type int used as an id (not included in the number of attributes reported in the results). Mappings were also randomly generated by selecting an attribute at random and then a set of attributes mapped to it, with a randomly chosen probability distribution. Each experiment was repeated several times.
Results. We now present and analyze the experimental results for small, medium, and large instances.

Small instances. We ran a set of experiments on small relations to compare the performance of all possible semantics, including those for which there are no PTIME algorithms. Figures 7.7 and 7.8 show the running times of all algorithms on small instances (#mappings fixed at 2 in the former, #tuples fixed at 6 in the latter). The former corresponds to runs using the eBay auction data (results shown on a scatterplot, since each point corresponds to adding all tuples from an auction), while the latter reports results from runs on synthetic data.

As we can see, running times climb exponentially for the algorithms we did not show to be in PTIME; the sharp increase in Figure 7.7 continues when more auctions are included, with a completion time of more than 10 days for 4 auctions (36 tuples). On the other hand, the running times of the other algorithms are negligible. When we varied #tuples, the by-table algorithms' running times lay between 0.07 and 0.13 seconds. When we varied #mappings, the by-table algorithms took between 0.03 and 0.26 seconds. We also ran experiments varying #tuples using synthetic data, which yielded the same trends in running times as those in Figure 7.7.

Figure 7.9: Running times for variation of #tuples; #attributes = 50, #mappings = 20; results are averages over 5 runs on synthetic data. Solid line: ByTupleRangeAVG, ByTupleRangeSUM, ByTupleRangeCOUNT, and ByTupleRangeMAX. Dashed line: ByTupleExpValSUM. Dotted line: ByTuplePDCOUNT and ByTupleExpValCOUNT.
Medium-size instances. Figure 7.9 shows the running times of all our PTIME algorithms when the number of tuples is increased into the tens and hundreds of thousands (#mappings fixed at 20). As we can see, the performance of the ByTuplePDCOUNT and ByTupleExpValCOUNT algorithms is well differentiated from the rest, as they become intractable at about 50,000 tuples. This is due to the fact that these algorithms must update the probability distribution for the possible values in each iteration, leading to a running time in $O(m \cdot n^2)$, as shown in Section 7.4.2. In this case, the by-table algorithms' running times varied between 0.96 and 5.49 seconds.

Figure 7.10 shows how the running times increase with the number of mappings (#tuples fixed at 50,000, #attributes = 500). Note that ByTupleExpValSUM is more affected by the increase in the number of mappings than the other four algorithms, with its running time climbing to almost 90 seconds for 250 mappings. This is because it is, in effect, a by-table algorithm, and it must issue as many queries as there are mappings and then combine the answers. The other four, on the other hand, only slightly increase their running times at these numbers of mappings. The by-table algorithms' running times in this case lie between 16.49 and 86.49 seconds.

Figure 7.10: Running times for variation of #mappings; #attributes = 500, #tuples = 50,000; results are averages over 2 runs on synthetic data. Solid line: ByTupleExpValSUM. Dashed line: ByTupleRangeMAX, ByTupleRangeCOUNT, ByTupleRangeSUM, and ByTupleRangeAVG.
Large instances. Figure 7.11 shows how our most scalable by-tuple algorithms perform when the number of tuples is increased into the millions: algorithms ByTupleRangeMAX/COUNT/AVG/SUM take about 1,300 seconds (about 21 minutes) to answer queries with 5 million tuples and 20 mappings. The figure also shows the running time of ByTupleExpValSUM, which is much lower than the others because it is actually equivalent to the by-table algorithm, as seen in Section 7.4.2. The corresponding running times for the by-table algorithms varied between 15.73 and 125.63 seconds. We also ran experiments for 15 to 30 million tuples, the results of which are shown in Figure 7.12. For these runs, the by-table algorithms took between 65.17 and 124.76 seconds.

Figure 7.11: Running times for variation of #tuples; #attributes = 50, #mappings = 20; results are averages over 2 runs on synthetic data. Solid line: ByTupleRangeMAX and ByTupleRangeAVG. Dashed line: ByTupleRangeSUM and ByTupleRangeCOUNT. Dotted line: ByTupleExpValSUM.

Figure 7.12: Running times for variation of #tuples; #attributes = 20, #mappings = 5, results are averages over 2 runs on synthetic data. Solid line: ByTupleRangeCOUNT. Dotted line: ByTupleRangeSUM and ByTupleRangeAVG. Dashed line: ByTupleRangeMAX. Dash-dotted line: ByTupleExpValSUM.

It should be noted that the greater scalability of the by-table algorithms with respect to the efficient by-tuple algorithms presented here is in large part due to the fact that the former take advantage of the optimizations implemented by the DBMS when answering queries.
7.6 Schema Mappings, Integrity Constraints, and Partial Information

In this chapter we have defined and analyzed different semantics for aggregate query answering in the setting of data integration using probabilistic schema mappings. An important assumption we have made throughout this chapter is that the data is both complete and consistent, so that problems in data integration arise only from uncertainty at the schema level. In real-world data sets this is often not the case, and it is therefore worthwhile to investigate the problem of query answering under probabilistic schema mappings in the presence of inconsistent and partial information. Other works in the literature have studied the relationship between schema mappings and integrity constraints. The iMAP [DLD+04] system, for instance, exploits integrity constraints in order to define meaningful mappings among disparate schemas and to prune the search space of the mapping generation process. Some constraints may state, for instance, that two attributes are unrelated and therefore should not appear together in a mapping. Other constraints, such as functional dependencies or denial constraints, can be used as well; however, depending on the complexity of a constraint, it might be too expensive to check it against the data in order to avoid considering possibilities that violate it. It is important to note that the use of integrity constraints in generating mappings does not guarantee that the actual integrated data will remain consistent, since posterior updates or insertions to any of the source tables may introduce inconsistency with respect to the set of integrity constraints.
Consider the source and target schemas for the running example in this chapter:

S1 = (ID, price, agentPhone, postedDate, reducedDate)
T1 = (propertyID, listPrice, phone, date, comments)

The following table represents the instance relation $D_{S1}$:

ID  price  agentPhone  postedDate  reducedDate
1   100k   215         1/5/2008    2/5/2008
2   150k   342         1/30/2008   2/15/2008
3   200k   215         1/1/2008    1/10/2008
4   100k   337         1/2/2008    2/1/2008
Now suppose that in addition to $D_{S1}$ we also have an instance relation for T1, i.e., there exists data corresponding to the target schema. Let $D_{T1}$ consist of the following tuples:

propertyID  listPrice  phone  date       comments
1           100k       215    1/5/2008   $c_1$
2           150k       342    1/30/2008  $c_2$
As before, the possible mappings between S1 and T1 are: $m_{11}$, where date maps to postedDate, and $m_{12}$, where date maps to reducedDate. In addition, suppose that we have the integrity constraint fd: propertyID → listPrice over T1.

A tool such as iMAP could rule out either of the two mappings if, when applied to $D_{S1}$ and $D_{T1}$, the result violates fd. However, in these instance relations no conflict can appear w.r.t. fd, and therefore both $m_{11}$ and $m_{12}$ are considered possible mappings.
Now, suppose that T1 is updated with some new information in the following way:

propertyID  listPrice  phone  date       comments
1           120k       215    1/5/2008   $c_1$
2           150k       342    1/30/2008  $c_2$

If we were to issue the query "SELECT propertyID, listPrice FROM T1 WHERE listPrice < 200k AND 1/1/2008 < date < 1/30/2008", then, depending on which semantics is used (by-table or by-tuple), the result might violate fd.

The point of this example is to show that inconsistency can appear in answers to queries as well as in the result of using schema mapping methods, even when each source is consistent in isolation.
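The dependency check underlying this example can be sketched as follows; violates_fd is a hypothetical helper operating on (propertyID, listPrice) pairs drawn from a query answer, not part of any system described in this chapter.

```python
def violates_fd(tuples):
    """Check whether a set of (propertyID, listPrice) pairs violates
    fd: propertyID -> listPrice, i.e., whether some propertyID is
    associated with two distinct listPrice values."""
    seen = {}
    for prop_id, price in tuples:
        if prop_id in seen and seen[prop_id] != price:
            return True
        seen[prop_id] = price
    return False

# Combining the updated T1 tuple (1, "120k") with the tuple obtained by
# mapping D_S1 under m_11, namely (1, "100k"), violates the dependency:
print(violates_fd([(1, "120k"), (1, "100k"), (2, "150k")]))  # True
print(violates_fd([(1, "120k"), (2, "150k")]))               # False
```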
Alternatively, works such as [GN06, FKMP05, FKP05] have analyzed data exchange when tuple generating dependencies (TGDs) and equality generating dependencies (EGDs) are considered. Note that because of the presence of TGDs, null values may appear and therefore must be dealt with. The idea in those approaches is to focus on a class of solutions, called universal solutions, possessing desirable properties that justify selecting them as the semantics of the data exchange problem. Universal solutions have the property that homomorphisms can be defined between them and every possible solution, and any pair of universal solutions is homomorphically equivalent. Universal solutions are the most general among all solutions and represent the entire space of solutions; all universal solutions share a unique (up to isomorphism) common part, called their core. The core of a structure is the smallest substructure that is also a homomorphic image of the structure. Computing the core is a hard problem in general, but when restricted to EGDs only [FKMP05], or to a combination of EGDs and weakly acyclic TGDs (or other cases where the chase procedure is known to terminate) [GN06], it can be computed in polynomial time in the data complexity. The problem of query answering in this setting was studied in [FKP05]. In that work, universal solutions are used to compute the certain answers of queries q that are unions of conjunctive queries over the target schema. The set certain(q, I) of certain answers of a query q over the target schema, with respect to a source instance I, consists of all tuples that are in the intersection of all q(J)'s, as J varies over all solutions for I (q(J) denotes the result of evaluating query q on J). Given I and a universal solution J, certain(q, I) is the set of all "null-free" tuples in q(J). The set certain(q, I) is computable in time polynomial in the cardinality of I if q is a union of conjunctive queries over the target schema: first compute a universal solution J in polynomial time, then evaluate q(J) and remove tuples with nulls. However, if q is a union of conjunctive queries with at most two inequalities per conjunct, then the problem is coNP-complete. Alternatively, [FKP05] defined universal certain answers; ucertain(q, I) consists of all tuples that are in the intersection of all q(J)'s, as J varies over all universal solutions for I. The set ucertain(q, I) can be computed in polynomial time for existential queries whenever the core of I can be computed in polynomial time. This approach was developed for and implemented in the CLIO system [FHH+09].
The approach described above provides a fixed semantics for inconsistency and incompleteness resolution. As clearly stated in the definition of certain answers, the set of tuples in the answer is the set of all tuples that are free of null values across all possible (valid) ways of completing the database instance. Furthermore, if instance I is inconsistent, the chase procedure used to compute the universal solutions for I fails and no universal solution is computed; therefore, the empty set is returned as the answer to the query. Finally, probabilistic schema mappings are not explored in any of these works; schema mappings are represented as high-level constraints using source-to-target tuple generating dependencies (see Chapter 2.4). Alternatively, we propose a personalizable approach by incorporating IMPs, described in Chapter 3, and PIPs, described in Chapter 6, in order to handle inconsistency and partial information, respectively. We consider two different options for solving this problem. The simplest one is to perform a post-query process: IMPs and PIPs can both be specified to be applied after the query is answered. In the presence of uncertain schema mappings, each tuple in the answer will have an associated probability; in these cases, IMPs and PIPs can be built to make use of these probabilities in order to manage the inconsistency or incompleteness following a certain strategy. A second approach would be to embed IMPs and/or PIPs in the query algorithms for both by-table and by-tuple semantics; for the latter we need to investigate the role that the policies play in obtaining the possible answers. We leave this task as future work.
7.7 Concluding Remarks

Probabilistic schema matching is emerging as a paradigm for integrating information from multiple databases. In past work, [DHY07] proposed two semantics, called the by-table and by-tuple semantics, for selection, projection, and join query processing under probabilistic schema mappings.

In this chapter, we studied the problem of answering aggregate queries in such an environment. We presented three semantics for aggregates: a range semantics, in which a range of possible values for an aggregate query is returned; a probability distribution semantics, in which all possible answer values are returned together with their probabilities; and an expected value semantics. These three semantics combine with the semantics of [DHY07] to provide six possible semantics for aggregates. Given this setting, we provided algorithms to answer COUNT, SUM, AVG, MIN, and MAX aggregate queries under each of the six semantics. The good news is that for every aggregate operator, at least one (and sometimes more) of the semantics is PTIME computable.

We also reported on a prototype implementation and experiments with two data sets: a real-world eBay data set and a synthetic data set. We experimentally showed that each aggregate operator listed above can be computed efficiently in at least one of these six semantics, even when the data set is large and there are many different possible schema mappings.
Chapter 8
Conclusions

In this thesis, we have proposed several frameworks for dealing with problems that arise from data or knowledge integration. The approach of our proposals differs from previous attempts, both in artificial intelligence and in databases, in that it focuses on the prior knowledge and expectations about the data and the application that the actual user can bring into the data management process, giving users the power to manage data in the way that best suits their needs.

After introducing real-world scenarios in which data integration issues arise in Chapter 1, and reviewing the literature most closely related to this work in Chapter 2, in Chapter 3 we proposed a policy-based framework for personalizable management of inconsistent information that allows users to bring their application expertise to bear. We formally defined the notion of inconsistency management policies, or IMPs, with respect to functional dependencies, as functions satisfying a minimal set of axioms. We proposed several families of IMPs that satisfy these axioms, and studied the relations between them in the simplified case where only one functional dependency is present. We showed that when multiple functional dependencies are considered, multiple alternative semantics can result. We introduced new versions of the relational algebra that are augmented by inconsistency management policies applied either before or after each operator. We developed theoretical results on the resulting extended relational operators that could, in principle, serve in the future as the basis of query optimization techniques. Finally, we presented an index structure for implementing an IMP-based framework, showing that it is versatile, can be implemented based on the needs and resources of the user, and that, according to our theoretical results, the associated algorithms incur reasonable costs. As a consequence, IMPs are a powerful tool for end users to express what they wish to do with their data, rather than having a system manager or a DB engine that does not understand their domain problem dictate how they should handle inconsistencies in their data.
In Chapter 4 we developed a general and unified framework for reasoning about inconsistency in a wide variety of monotonic logics. The basic idea behind this framework is to construct what we call options, and then to use a preference relation defined by the user to compute the set of preferred options, which are intended to support the conclusions to be drawn from the inconsistent knowledge base. We provided a formal definition of the framework as well as algorithms to compute preferred options. We also showed through examples how this abstract framework can be used in different logics, provided new results on the complexity of reasoning about inconsistency in such logics, and proposed general algorithms for computing preferred options. Furthermore, we showed that our general framework can represent approaches to inconsistency that are well known in the artificial intelligence literature.

Focusing on more specific domains, in Chapter 5 we developed a formalism for identifying inconsistencies in news reports. Besides the complication of having to deal with a very large number of records, collecting and analyzing records such as news reports involves an extra level of complexity: the presence of linguistically modified terms that can be interpreted in different ways and make the notion of inconsistency less clear. We proposed a probabilistic logic programming language called PLINI within which users can write rules specifying what they mean by inconsistency.
In Chapter 6, we proposed the concept of a partial information policy (PIP). Using PIPs, end users can specify the policy they want to use to handle partial information. We presented examples of three families of PIPs that end users can apply; many more are possible as simple combinations of these basic PIPs, and the definition of PIPs allows many more policies to be captured. We also presented index structures for efficiently applying PIPs, and conducted an experimental study showing that the adoption of such index structures allows us to efficiently manage large data sets. Moreover, we showed that PIPs can be combined with relational algebra operators, giving users even more capabilities for managing their incomplete data. In previous work dealing with the management of incomplete databases, the DBMS dictates how incomplete information should be handled.

Finally, in Chapter 7 we analyzed the problem of answering aggregate queries in the presence of uncertain schema mappings. Two semantics had been proposed in the literature for answering SPJ queries in the presence of probabilistic schema mappings [DHY07]. In this chapter we proposed three semantics for aggregate query answering: a range semantics, in which a range of possible values for an aggregate query is returned; a probability distribution semantics, in which all possible answer values are returned together with their probabilities; and an expected value semantics. These three semantics combine with the semantics of [DHY07] to provide six possible semantics for aggregates. Given this setting, we provided algorithms to answer COUNT, SUM, AVG, MIN, and MAX aggregate queries under each of the six semantics. The good news is that for every aggregate operator, at least one (and sometimes more) of the semantics is PTIME computable.

Recently, researchers have begun to understand that uncertainty, in particular inconsistency and partial information, does not always need to be eliminated; it is possible and, more often than not, necessary to reason with an inconsistent and/or partial knowledge base. We need to use inconsistent and partial information; sometimes this kind of information can help us to better understand the data and to improve the quality of decision making processes. Currently, there is a large gap between knowledge management methodologies in artificial intelligence and databases and what is available to real users. There is a need for context-sensitive data management approaches that create synergy with the user in order to be useful. In this thesis we aimed to bridge this gap by proposing several frameworks that provide personalizable approaches to the problems that arise from data integration and management.
Bibliography
[ABC99] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers
in inconsistent databases. In ACM Symposium on Principles of Database
Systems (PODS), pages 68–79, 1999.
[ABC03a] M. Arenas, L. E. Bertossi, and J. Chomicki. Answer sets for consistent
query answering in inconsistent databases. TPLP, 3(4-5):393–424, 2003.
[ABC+03b] M. Arenas, L. E. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad. Scalar aggregation in inconsistent databases. Theoretical Computer Science, 3(296):405–434, 2003.
[AC02] L. Amgoud and C. Cayrol. A reasoning model based on the production of
acceptable arguments. AMAI, 34(1):197–215, 2002.
[AFM06] Periklis Andritsos, Ariel Fuxman, and Renee J. Miller. Clean answers over
dirty databases: A probabilistic approach. In International Conference on
Data Engineering (ICDE), page 30, Washington, DC, USA, 2006. IEEE
Computer Society.
[AG85] Serge Abiteboul and Gösta Grahne. Update semantics for incomplete databases. In International Conference on Very Large Data Bases (VLDB), pages 1–12. VLDB Endowment, 1985.
[AGM85] Carlos E. Alchourrón, Peter Gärdenfors, and David Makinson. On the logic of theory change: Partial meet contraction and revision functions. The Journal of Symbolic Logic, 50(2):510–530, 1985.
[AKG91] Serge Abiteboul, Paris C. Kanellakis, and Gösta Grahne. On the representation and querying of sets of possible worlds. Theoretical Computer Science, 78(1):158–187, 1991.
[AKK+03] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. Miller, and J. Mylopoulos. The Hyperion project: From data integration to data coordination. SIGMOD Record, 32(3), 2003.
[AM86] Paolo Atzeni and Nicola M. Morfuni. Functional dependencies and con-
straints on null values in database relations. Information and Control,
70(1):1–31, 1986.
[AMB+11] Massimiliano Albanese, Maria Vanina Martinez, Matthias Broecheler, John Grant, and V.S. Subrahmanian. PLINI: a probabilistic logic program framework for inconsistent news information. Logic Programming, Knowledge Representation, and Nonmonotonic Reasoning, LNCS, 6565, 2011.
[AP07] L. Amgoud and H. Prade. Formalizing practical reasoning under uncer-
tainty: An argumentation-based approach. In IAT, pages 189–195, 2007.
[Art08] A. Artale. Formal methods: Linear temporal logic, 2008.
[AS07] Massimiliano Albanese and V. S. Subrahmanian. T-REX: A domain-
independent system for automated cultural information extraction. In In-
ternational Conference on Computational Cultural Dynamics (ICCCD),
pages 2–8. AAAI Press, August 2007.
[BB03] P. Barceló and L. E. Bertossi. Logic programs for querying inconsistent databases. In PADL, pages 208–222, 2003.
[BBC04] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine
Learning, 56(1):89–113, 2004.
[BBFL05] L. E. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. Complexity and approximation of fixing numerical attributes in databases under integrity constraints. In DBPL, pages 262–278, 2005.
[BC03] L. E. Bertossi and J. Chomicki. Query answering in inconsistent databases.
In Logics for Emerging Applications of Databases, pages 43–83. Springer,
2003.
[BCD+93] Salem Benferhat, Claudette Cayrol, Didier Dubois, Jérôme Lang, and Henri Prade. Inconsistency management and prioritized syntax-based entailment. In International Joint Conference on Artificial Intelligence (IJCAI), pages 640–647, 1993.
[BD83] Dina Bitton and David J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, 1983.
[BDKT97] Andrei Bondarenko, Phan Minh Dung, Robert A. Kowalski, and Francesca Toni. An abstract, argumentation-theoretic approach to default reasoning. Artificial Intelligence, 93:63–101, 1997.
[BDP97] S. Benferhat, D. Dubois, and H. Prade. Some syntactic approaches to the handling of inconsistent knowledge bases: A comparative study. Part 1: The flat case. Studia Logica, 58(1):17–45, 1997.
[BDSH+08] Omar Benjelloun, Anish Das Sarma, Alon Halevy, Martin Theobald, and Jennifer Widom. Databases with uncertainty and lineage. VLDB Journal, 17(2):243–264, 2008.
[Bel77] N. Belnap. A useful four-valued logic. Modern Uses of Many Valued Logic, pages 8–37, 1977.
[BFFR05] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143–154, 2005.
[BFOS84] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
[BG04] Indrajit Bhattacharya and Lise Getoor. Iterative record linkage for clean-
ing and integration. In Workshop on Research issues in data mining and
knowledge discovery, pages 11–18, New York, NY, USA, 2004. ACM.
[BG07] I. Bhattacharya and L. Getoor. Collective entity resolution in relational
data. ACM Transactions on Knowledge Discovery from Data (TKDD),
1(1), 2007.
[BGMM+09] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. Swoosh: a generic approach to entity resolution. VLDB Journal, 18(1):255–276, 2009.
[BGMP92] D. Barbará, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering (TKDE), 4(5):487–502, 1992.
[BH05] Philippe Besnard and Anthony Hunter. Practical first-order argumentation. In Conference on Artificial Intelligence (AAAI), pages 590–595, 2005.
[Bis79] Joachim Biskup. A formal approach to null values in database relations.
In Advances in Data Base Theory, pages 299–341, 1979.
[BKM91] C. Baral, S. Kraus, and J. Minker. Combining multiple knowledge bases. IEEE Transactions on Knowledge and Data Engineering (TKDE), 3(2):208–220, 1991.
[BKMS91] C. Baral, S. Kraus, J. Minker, and V. S. Subrahmanian. Combining knowledge bases consisting of first order theories. pages 92–101, 1991.
[BM03] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection
using learnable string similarity measures. In International conference on
Knowledge discovery and data mining (KDD), pages 39–48, New York,
NY, USA, 2003. ACM.
[BMP+08] A. Bonifati, G. Mecca, A. Pappalardo, S. Raunich, and G. Summa. Schema mapping verification: The Spicy way. In EDBT, 2008.
[Bob80] D. G. Bobrow. Special issue on non-monotonic reasoning. Artificial Intelligence, 13(1-2), 1980.
[Bre89] G. Brewka. Preferred subtheories: An extended logical framework for default reasoning. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1043–1048, 1989.
[BS89] H. A. Blair and V. S. Subrahmanian. Paraconsistent logic programming.
Theoretical Computer Science, 68(2):135–154, 1989.
[BS98] P. Besnard and T. Schaub. Signed systems for paraconsistent reasoning.
Journal of Automated Reasoning, 20(1):191–213, 1998.
[BV08] N. Bozovic and V. Vassalos. Two-phase schema matching in real world
relational databases. In International Conference on Data Engineering
(ICDE), pages 290–296, 2008.
[CA03] G. Chen and T. Astebro. How to deal with missing categorical data: Test of a simple Bayesian method. Organ. Res. Methods, 6(3):309–327, 2003.
[CGS97a] K. Selçuk Candan, John Grant, and V. S. Subrahmanian. A unified treatment of null values using constraints. Information Sciences, 98(1-4):99–156, 1997.
[CGS97b] K. Selçuk Candan, John Grant, and V. S. Subrahmanian. A unified treatment of null values using constraints. Information Sciences, 98(1-4):99–156, 1997.
[CGZ09] Luciano Caroprese, Sergio Greco, and Ester Zumpano. Active integrity
constraints for database consistency maintenance. IEEE Transactions on
Knowledge Data Engineering (TKDE), 21(7):1042–1058, 2009.
[Cho07] J. Chomicki. Consistent query answering: Five easy pieces. In Interna-
tional Conference on Database Theory (ICDT), pages 1–17, 2007.
[CKP03] Reynold Cheng, Dmitri V. Kalashnikov, and Sunil Prabhakar. Evaluating
probabilistic queries over imprecise data. In SIGMOD, pages 551–562,
New York, NY, USA, 2003. ACM.
[Cla77] K. L. Clark. Negation as failure. In Logic and Data Bases, pages 293–322,
1977.
[CLR03] A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In ACM Symposium on Principles of Database Systems (PODS), pages 260–271, 2003.
[CLS94] Claudette Cayrol and Marie-Christine Lagasquie-Schiex. On the complex-
ity of non-monotonic entailment in syntax-based approaches. In ECAI
workshop on Algorithms, Complexity and Commonsense Reasoning, 1994.
[CLS95] Claudette Cayrol and Marie-Christine Lagasquie-Schiex. Non-monotonic syntax-based entailment: A classification of consequence relations. In Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU), pages 107–114, 1995.
[CM05] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance
using tuple deletions. Information and Computation, 197(1-2):90–121,
2005.
[Cod74] E. F. Codd. Understanding relations. SIGMOD Records, 6(3):40–42, 1974.
[Cod79] E. F. Codd. Extending the database relational model to capture more mean-
ing. ACM Trans. Database Syst., 4(4):397–434, 1979.
[CP87] Roger Cavallo and Michael Pittarelli. The theory of probabilistic
databases. In International Conference on Very Large Data Bases (VLDB),
pages 71–81, San Francisco, CA, USA, 1987. Morgan Kaufmann Publish-
ers Inc.
[CR02] William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 475–480, New York, NY, USA, 2002. ACM.
[CR08] A. G. Cohn and J. Renz. Qualitative spatial representation and reasoning. In F. van Harmelen, V. Lifschitz, and B. Porter, editors, Handbook of Knowledge Representation, pages 551–596. Elsevier, 2008.
[CS86] H. Y. Chiu and J. Sedransk. A Bayesian procedure for imputing missing values in sample surveys. J. Amer. Statist. Assoc., 81(3905):5667–5676, 1986.
[CS00] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[CSD+08] X. Chai, M. Sayyadian, A. Doan, A. Rosenthal, and L. Seligman. Analyzing and revising mediated schemas to improve their matchability. In International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, August 2008.
[dC74] N.C.A. da Costa. On the theory of inconsistent formal systems. Notre
Dame Journal of Formal Logic, 15(4):497–510, 1974.
[DHY07] Xin Luna Dong, Alon Y. Halevy, and Cong Yu. Data integration with un-
certainty. In International Conference on Very Large Data Bases (VLDB),
pages 687–698, 2007.
[DI03] E. D. Demaine and N. Immorlica. Correlation clustering with partial in-
formation. Lecture Notes in Computer Science, pages 1–13, 2003.
[DLD+04] Robin Dhamankar, Yoonkyong Lee, Anhai Doan, Alon Halevy, and Pedro Domingos. iMAP: discovering complex semantic matches between database schemas. In SIGMOD, pages 383–394, New York, NY, USA, 2004. ACM.
[DNH04] A. Doan, N. F. Noy, and A. Y. Halevy. Introduction to the special issue on
semantic integration. SIGMOD Record, 33(4):11–13, 2004.
[DS07] Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523–544, 2007.
[DSDH08] A. Das Sarma, X. Dong, and A.Y. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD, pages 861–874, 2008.
[Dun95] P. M. Dung. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77:321–357, 1995.
[Eme90] E. A. Emerson. Temporal and modal logic. In Handbook of Theoretical Computer Science, pages 995–1072, 1990.
[ESS05] M. Ehrig, S. Staab, and Y. Sure. Bootstrapping ontology alignment methods with APFEL. In International Semantic Web Conference (ISWC), pages 186–200, 2005.
[FFM05] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient management of inconsistent databases. In SIGMOD, pages 155–166, 2005.
[FFP05] S. Flesca, F. Furfaro, and F. Parisi. Consistent query answers on numerical
databases under aggregate constraints. In DBPL, pages 279–294, 2005.
[FFP07] S. Flesca, F. Furfaro, and F. Parisi. Preferred database repairs under aggre-
gate constraints. In SUM, pages 215–229, 2007.
[FH94] Ronald Fagin and Joseph Y. Halpern. Reasoning about knowledge and
probability. Journal of the ACM, 41(2):340–367, 1994.
[FHH+09] Ronald Fagin, Laura M. Haas, Mauricio A. Hernández, Renée J. Miller, Lucian Popa, and Yannis Velegrakis. Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications, pages 198–236, 2009.
[FHM90] R. Fagin, Joseph Y. Halpern, and Nimrod Megiddo. A logic for reasoning
about probabilities. Information and Computation, 87(1-2):78–128, 1990.
[Fit91] M. Fitting. Bilattices and the semantics of logic programming. Journal of
Logic Programming, 11(1-2):91–116, 1991.
[FKMP05] Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data exchange: semantics and query answering. Theoretical Computer Science, 336(1):89–124, 2005.
[FKP05] Ronald Fagin, Phokion G. Kolaitis, and Lucian Popa. Data exchange: get-
ting to the core. ACM Transactions on Database Systems, 30(1):174–210,
2005.
[FKUV86] Ronald Fagin, Gabriel M. Kuper, Jeffrey D. Ullman, and Moshe Y. Vardi.
Updating logical databases. Advances in Computing Research, 3:1–18,
1986.
[FLMC02] W. Fan, H. Lu, S. E. Madnick, and D. Cheung. DIRECT: A system for mining data value conversion rules from disparate data sources. Decision Support Systems, 34:19–39, 2002.
[FLPL+01] E. Franconi, A. Laureti Palma, N. Leone, S. Perri, and F. Scarcello. Census data repair: a challenging application of disjunctive logic programming. In LPAR, pages 561–578, 2001.
[FM05] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent
databases. In International Conference on Database Theory (ICDT), pages
337–351, 2005.
[FR97] Norbert Fuhr and Thomas Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32–66, 1997.
[FUV83] Ronald Fagin, Jeffrey D. Ullman, and Moshe Y. Vardi. On the semantics of updates in databases. In ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS), pages 352–365. ACM, 1983.
[Gab85] D. Gabbay. Theoretical foundations for non-monotonic reasoning in expert
systems. pages 439–457, 1985.
[Gal06] Avigdor Gal. Managing uncertainty in schema matching with top-k
schema mappings. Journal of Data Semantics, 6:90–114, 2006.
[Gal07] Avigdor Gal. Why is schema matching tough and what can we do about it? SIGMOD Record, 35(4):2–5, 2007.
[Gar78] Peter Gärdenfors. Conditionals and changes of belief. Acta Philosophica Fennica, 30:381–404, 1978.
[Gar88a] P. Gärdenfors. The dynamics of belief systems: Foundations vs. coherence. International Journal of Philosophy, 1988.
[Gar88b] Peter Gärdenfors. Knowledge in Flux: Modeling the Dynamics of Epistemic States. MIT Press, Cambridge, Mass., 1988.
[GD05] L. Getoor and C. P. Diehl. Link mining: a survey. ACM SIGKDD Explo-
rations Newsletter, 7(2):3–12, 2005.
[GGZ03] G. Greco, S. Greco, and E. Zumpano. A logical framework for querying
and repairing inconsistent databases. IEEE Transactions on Knowledge
and Data Engineering (TKDE), 15(6):1389–1408, 2003.
[GH92] Dov Gabbay and Anthony Hunter. Making inconsistency respectable 1: A logical framework for inconsistency in reasoning. In Fundamentals of Artificial Intelligence Research. Springer, 1992.
[GH93] Dov Gabbay and Anthony Hunter. Making inconsistency respectable: Part 2 - meta-level handling of inconsistency. In Applied General Systems Research, G. Klir, editor, pages 129–136. Springer-Verlag, 1993.
[GH06] J. Grant and A. Hunter. Measuring inconsistency in knowledgebases. Jour-
nal of Intelligent Information Systems, 27(2):159–184, 2006.
[GH08] J. Grant and A. Hunter. Analysing inconsistent first-order knowledge bases. Artificial Intelligence, 172(8-9):1064–1093, 2008.
[GIKS03] L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for data cleansing and integration in an RDBMS. In International Conference on Data Engineering (ICDE), pages 729–731, 2003.
[GJ86] J. Grant and J. Minker. Answering queries in indefinite databases and the null value problem. Advances in Computing Research - The Theory of Databases, 3:247–267, 1986.
[GL98] M. Gelfond and V. Lifschitz. The stable model semantics for logic pro-
gramming. In International Conference on Logic Programming, pages
1070–1080, 1998.
[GMSS09] Avigdor Gal, Maria Vanina Martinez, Gerardo I. Simari, and V. S. Subrahmanian. Aggregate query answering under uncertain schema mappings. In International Conference on Data Engineering (ICDE), pages 940–951, 2009.
[GN06] Georg Gottlob and Alan Nash. Data exchange: computing cores in polynomial time. In ACM Symposium on Principles of Database Systems (PODS), pages 40–49, New York, NY, USA, 2006. ACM.
[GPSS80] D. M. Gabbay, A. Pnueli, S. Shelah, and J. Stavi. On the temporal basis of
fairness. In Symposium on Principles of Programming Languages (POPL),
pages 163–173, 1980.
[Gra77] John Grant. Null values in a relational data base. Inf. Process. Lett.,
6(5):156–157, 1977.
[Gra78] J. Grant. Classifications for inconsistent theories. Notre Dame Journal of Formal Logic, 19(3):435–444, 1978.
[Gra79] John Grant. Partial values in a tabular database model. Inf. Process. Lett.,
9(2):97–99, 1979.
[Gra80] John Grant. Incomplete information in a relational database. Fundamenta
Informaticae III, 3:363–378, 1980.
[Gra91] G. Grahne. The Problem of Incomplete Information in Relational
Databases. Springer, 1991.
[GS95] John Grant and V. S. Subrahmanian. Reasoning in inconsistent knowledge
bases. IEEE Transactions on Knowledge and Data Engineering (TKDE),
7(1):177–189, 1995.
[GS04] Alejandro Javier García and Guillermo Ricardo Simari. Defeasible logic programming: An argumentative approach. TPLP, 4(1-2):95–138, 2004.
[Hai84] T. Hailperin. Probability logic. Notre Dame Journal of Formal Logic, 25(3):198–212, 1984.
[Hal90] Joseph Y. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46(3):311–350, 1990.
[Han93] Sven Ove Hansson. Reversing the levi identity. Journal of Philosophical
Logic, 22(6), 1993.
[Han94] Sven Ove Hansson. Kernel contraction. Journal of Symbolic Logic,
59(3):845–859, 1994.
[Han97] Sven Ove Hansson. Semi-revision. Journal of Applied Non-Classical
Logic, (7):151–175, 1997.
[Hec98] D. Heckerman. A tutorial on learning with Bayesian networks. NATO ASI Series D: Behavioural and Social Sciences, 89:301–354, 1998.
[HIM+04] A.Y. Halevy, Z.G. Ives, J. Madhavan, P. Mork, D. Suciu, and I. Tatarinov. The Piazza peer data management system. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(7):787–798, 2004.
[HK05] A. Hunter and S. Konieczny. Approaches to measuring inconsistent infor-
mation. In Inconsistency Tolerance, pages 191–236, 2005.
[HL10] Sven Hartmann and Sebastian Link. When data dependencies over SQL tables meet the logics of paradox and S-3. In ACM Symposium on Principles of Database Systems (PODS), pages 317–326, New York, NY, USA, 2010. ACM.
[HMYW05] H. He, W. Meng, C. T. Yu, and Z. Wu. WISE-Integrator: A system for extracting and integrating complex web search interfaces of the deep web. In International Conference on Very Large Data Bases (VLDB), pages 1314–1317, 2005.
[HS95] Mauricio A. Hernández and Salvatore J. Stolfo. The merge/purge problem for large databases. In SIGMOD, pages 127–138, New York, NY, USA, 1995. ACM.
[IJ81] Tomasz Imielinski and Witold Lipski Jr. On representing incomplete in-
formation in a relational data base. In International Conference on Very
Large Data Bases (VLDB), pages 388–397, 1981.
[IKBS08] A. Inan, M. Kantarcioglu, E. Bertino, and M. Scannapieco. A hybrid ap-
proach to private record linkage. In International Conference on Data
Engineering (ICDE), pages 496–505, 2008.
[IL83] T. Imielinski and W. Lipski. Incomplete information and dependencies in
relational databases. In SIGMOD, pages 178–184, 1983.
[IL84a] Tomasz Imieliński and Witold Lipski. Incomplete information in relational databases. Journal of ACM, 31(4):761–791, 1984.
[IL84b] Tomasz Imieliński and Witold Lipski, Jr. Incomplete information in relational databases. Journal of the ACM, 31(4):761–791, 1984.
[JD05] Wojciech Jamroga and Jürgen Dix. Model checking strategic abilities of agents under incomplete information. In ICTCS, pages 295–308, 2005.
[JD06] Wojciech Jamroga and Jürgen Dix. Model checking abilities under incomplete information is indeed delta2-complete. In EUMAS, 2006.
[JDR99] P. Jermyn, M. Dixon, and B. J. Read. Preparing clean views of data for data
mining. In ERCIM Workshop on Database Research, pages 1–15, 1999.
[JKV07] T.S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 346–355, New Orleans, Louisiana, USA, 2007.
[Jr.79] Witold Lipski Jr. On semantic issues connected with incomplete informa-
tion databases. ACM Transactions on Database Systems, 4(3):262–296,
1979.
[KIL04] Gabriele Kern-Isberner and Thomas Lukasiewicz. Combining probabilistic logic programming with the power of maximum entropy. Artificial Intelligence, 157(1-2):139–202, 2004.
[KL92] M. Kifer and E. L. Lozinskii. A logic for reasoning with inconsistency.
Journal of Automated Reasoning, 9(2):179–215, 1992.
[KLM90] Sarit Kraus, Daniel Lehmann, and Menachem Magidor. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44(1-2):167–207, 1990.
[KS89] Michael Kifer and V. S. Subrahmanian. On the expressive power of anno-
tated logic programs. In NACLP, pages 1069–1089, 1989.
[KS92] M. Kifer and V.S. Subrahmanian. Theory of generalized annotated
logic programming and its applications. Journal of Logic Programming,
12(3&4):335–367, 1992.
[KTG92] Werner Kiessling, Helmut Thöne, and Ulrich Güntzer. Database support for problematic knowledge. In International Conference on Extending Database Technology (EDBT), pages 421–436, London, UK, 1992. Springer-Verlag.
[LC00] W.-S. Li and C. Clifton. SEMINT: A tool for identifying attribute cor-
respondences in heterogeneous databases using neural networks. Data &
Knowledge Engineering, 33(1):49–84, 2000.
[LCcI+04] Chengkai Li, Kevin Chen-Chuan Chang, Ihab F. Ilyas, and Sumin Song. RankSQL: Query algebra and optimization for relational top-k queries. In SIGMOD, pages 131–142. ACM Press, 2004.
[Lev84] Hector J. Levesque. A logic of implicit and explicit belief. In National Conference on Artificial Intelligence (AAAI), pages 198–202, 1984.
[Li09] Xiao Bai Li. A Bayesian approach for estimating and replacing missing categorical data. J. Data and Information Quality, 1(1):1–11, 2009.
[Lie82] Y. Edmund Lien. On the equivalence of database models. J. ACM, 29:333–
362, April 1982.
[Lip79] Witold Lipski. On semantic issues connected with incomplete information
databases. ACM Trans. Database Syst., 4(3):262–296, 1979.
[Lip81] Witold Lipski. On databases with incomplete information. Journal of the
ACM, 28(1):41–70, 1981.
[LL98] Mark Levene and George Loizou. Axiomatisation of functional dependen-
cies in incomplete relations. Theoretical Computer Science, 206(1-2):283–
300, 1998.
[LL99] Mark Levene and George Loizou. Database design for incomplete rela-
tions. ACM Transactions on Database Systems, 24(1):80–125, 1999.
[Llo87] J. W. Lloyd. Foundations of Logic Programming, Second Edition.
Springer-Verlag, 1987.
[LLRS97] Laks V. S. Lakshmanan, Nicola Leone, Robert Ross, and V. S. Subrahmanian. ProbView: a flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419–469, 1997.
[Loz94] E. L. Lozinskii. Resolving contradictions: A plausible semantics for inconsistent systems. Journal of Automated Reasoning, 12(1):1–31, 1994.
[LS94] Laks V. S. Lakshmanan and Fereidoon Sadri. Probabilistic deductive
databases. In International Symposium on Logic programming (ILPS),
pages 254–268, Cambridge, MA, USA, 1994. MIT Press.
[McC87] J. McCarthy. Circumscription – a form of non-monotonic reasoning. pages 145–152, 1987.
[MD80] Drew V. McDermott and Jon Doyle. Non-monotonic logic I. Artificial Intelligence, 13(1-2):41–72, 1980.
[ME97] Alvaro Monge and Charles Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records, 1997.
[MHH00] R.J. Miller, L.M. Haas, and M.A. Hernández. Schema mapping as query discovery. In A. El Abbadi, M.L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, International Conference on Very Large Data Bases (VLDB), pages 77–88. Morgan Kaufmann, 2000.
[MHH+01] R.J. Miller, M.A. Hernández, L.M. Haas, L.-L. Yan, C.T.H. Ho, R. Fagin, and L. Popa. The Clio project: Managing heterogeneity. SIGMOD Record, 30(1):78–83, 2001.
[MKIS00] E. Mena, V. Kashyap, A. Illarramendi, and A. Sheth. Imprecise answers in distributed environments: Estimation of information loss for multi-ontological based query processing. International Journal of Cooperative Information Systems, 9(4):403–425, 2000.
[MM+11] Maria V. Martinez, Cristian Molinaro, V.S. Subrahmanian, and Leila Amgoud. A general framework for reasoning about inconsistency. In preparation, 2011.
[MMGS11] Maria Vanina Martinez, Cristian Molinaro, John Grant, and V.S. Subrah-
manian. Customized policies for handling partial information in relational
databases. Under Review, 2011.
[Moo85] R. C. Moore. Semantical considerations on nonmonotonic logic. Arti?cial
Intelligence, 25(1):75–94, 1985.
[Moo88] R. C. Moore. Autoepistemic Logic. In P. Smets, E. H. Mamdani,
D. Dubois, and H. Prade, editors, Non-Standard Logics for Automated
Reasoning. Academic Press, 1988.
[MP92] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems: Specification. Springer-Verlag, New York, 1992.
[MPP+08] Maria V. Martinez, Francesco Parisi, Andrea Pugliese, Gerardo I. Simari, and V.S. Subrahmanian. Inconsistency management policies. In International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 367–376, 2008.
[MPP+10] Maria Vanina Martinez, Francesco Parisi, Andrea Pugliese, Gerardo I. Simari, and V. S. Subrahmanian. Efficient policy-based inconsistency management in relational knowledge bases. In SUM, pages 264–277, 2010.
[MPS+07] M. V. Martinez, A. Pugliese, G. I. Simari, V. S. Subrahmanian, and H. Prade. How dirty is your relational database? An axiomatic approach. In ECSQARU, pages 103–114, 2007.
[MST94] D. Michie, D. J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural
and Statistical Classi?cation. Ellis Horwood, 1994.
[Mun74] J. Munkres. Topology: A First Course. Prentice Hall, 1974.
[MWJ99] K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence (UAI), pages 467–475. Citeseer, 1999.
[Nil86a] N. J. Nilsson. Probabilistic logic. Artificial Intelligence, 28(1):71–87, 1986.
[Nil86b] Nils Nilsson. Probabilistic logic. Artificial Intelligence, 28:71–87, 1986.
[NJ02] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 2:841–848, 2002.
[NS92] Raymond Ng and V. S. Subrahmanian. Probabilistic logic programming.
Information and Computation, 101(2):150–201, 1992.
[NS94] Raymond Ng and V. S. Subrahmanian. Stable semantics for probabilistic
deductive databases. Information and Computation, 110(1):42–83, 1994.
[OS04] F. Ozcan and V.S. Subrahmanian. Partitioning activities for agents. In In-
ternational Joint Conference on Arti?cial Intelligence (IJCAI), pages 89–
113, 2004.
[Pap94] Christos M. Papadimitriou. Computational complexity. Addison-Wesley,
1994.
[Pea88] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of
plausible inference. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 1988.
[PL92] G. Pinkas and R. P. Loui. Reasoning from inconsistency: a taxonomy of principles for resolving conflicts. In International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 709–719, 1992.
[Pnu77] A. Pnueli. The temporal logic of programs. In Symposium on Foundations
of Computer Science (FOCS), pages 46–57, 1977.
[Poo85] D. Poole. On the comparison of theories: preferring the most specific explanation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 144–147, 1985.
[Poo88] David Poole. A logical framework for default reasoning. Artificial Intelligence, 36(1):27–47, 1988.
[Poo93] David Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64(1):81–129, 1993.
[Poo97] David Poole. The independent choice logic for modelling multiple agents
under uncertainty. Arti?cial Intelligence, 94(1-2):7–56, 1997.
[PS97] Henry Prakken and Giovanni Sartor. Argument-based extended logic pro-
gramming with defeasible priorities, 1997.
[Pyl99] Dorian Pyle. Data Preparation for Data Mining (The Morgan Kaufmann
Series in Data Management Systems). Morgan Kaufmann, 1999.
[Qui93] J. Ross Quinlan. C4.5: Programs for Machine Learning (Morgan Kauf-
mann Series in Machine Learning). Morgan Kaufmann, 1993.
[RCC92] D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions
and connection. In International Conference on Principles of Knowledge
Representation and Reasoning (KR), pages 165–176, 1992.
[RDS07] Christopher Ré, Nilesh Dalvi, and Dan Suciu. Efficient top-k query evaluation on probabilistic data. In International Conference on Data Engineering (ICDE), pages 886–895, 2007.
[Rei78] R. Reiter. On closed world data bases. In Logic and Data Bases, pages
55–76, 1978.
[Rei80a] R. Reiter. A logic for default reasoning. Artificial Intelligence, 13(1-2):81–132, 1980.
[Rei80b] R. Reiter. A logic for default reasoning. Artificial Intelligence, 13(1-2):81–132, 1980.
[Rei86] Raymond Reiter. A sound and sometimes complete query evaluation al-
gorithm for relational databases with null values. Journal of the ACM,
33:349–370, April 1986.
[Res64] N. Rescher. Hypothetical reasoning, 1964.
[RM70] N. Rescher and R. Manor. On inference from inconsistent premises. Theory and Decision, 1:179–219, 1970.
[Roo92] Nico Roos. A logic for reasoning with inconsistent knowledge. Artificial Intelligence, 57(1):69–103, 1992.
[RSG05] R.B. Ross, V.S. Subrahmanian, and J. Grant. Aggregate operators in prob-
abilistic databases. Journal of the ACM, 52(1):54–101, 2005.
[SA07] V. S. Subrahmanian and L. Amgoud. A general framework for reasoning about inconsistency. In International Joint Conference on Artificial Intelligence (IJCAI), pages 599–604, 2007.
[SCM09] Slawomir Staworko, Jan Chomicki, and Jerzy Marcinkowski. Prioritized
repairing and consistent query answering in relational databases. CoRR,
abs/0908.0464, 2009.
[Sho67] J. Shoenfield. Mathematical Logic. Addison-Wesley, 1967.
[SI07] Mohamed A. Soliman and Ihab F. Ilyas. Top-k query processing in uncer-
tain databases. In International Conference on Data Engineering (ICDE),
pages 896–905, 2007.
[SL92] Guillermo R. Simari and Ronald P. Loui. A mathematical treatment of defeasible reasoning and its implementation. Artificial Intelligence, 53(2-3):125–157, 1992.
[SNB+08] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
[SQL03] Information technology - Database languages - SQL, Part 2: Foundation, 2003.
[SS03] E. Schallehn and K. Sattler. Using similarity-based operations for resolving data-level conflicts. In BNCOD, volume 2712, pages 172–189, 2003.
[Tar56] A. Tarski. On Some Fundamental Concepts of Metamathematics. Oxford University Press, 1956.
[TK93] K. Thirunarayan and M. Kifer. A theory of nonmonotonic inheritance based on annotated logic. Artificial Intelligence, 60(1):23–50, 1993.
[Ull88] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Vol-
ume I. Computer Science Press, 1988.
[UW02] Jeffrey D. Ullman and Jennifer Widom. A First Course in Database Systems (2nd ed.). Prentice Hall, 2002.
[Vas79] Yannis Vassiliou. Null values in data base management: A denotational
semantics approach. In SIGMOD, pages 162–169, 1979.
[Vas80] Yannis Vassiliou. Functional dependencies and incomplete information. In
International Conference on Very Large Data Bases (VLDB), pages 260–
269, 1980.
[Wij03] J. Wijsen. Condensed representation of database repairs for consis-
tent query answering. In International Conference on Database Theory
(ICDT), pages 378–393, 2003.
[Wij05] J. Wijsen. Database repairing using updates. ACM TODS, 30(3):722–768,
2005.
[YC88] Li Yan Yuan and Ding-An Chiang. A sound and complete query evaluation
algorithm for relational databases with null values. In SIGMOD, pages 74–
81, New York, NY, USA, 1988. ACM.
[Zan84] C. Zaniolo. Database relations with null values. Journal of Computer and
System Sciences (JCSS), 28(1):142–166, 1984.