From Knowledge Discovery to Implementation:
A Business Intelligence Approach Using Neural
Network Rule Extraction and Decision Tables
Christophe Mues¹,², Bart Baesens¹, Rudy Setiono³, and Jan Vanthienen²

¹ University of Southampton, School of Management,
Southampton, SO17 1BJ, United Kingdom
{c.mues, b.m.m.baesens}@soton.ac.uk
² K.U.Leuven, Dept. of Applied Economic Sciences,
Naamsestraat 69, B-3000 Leuven, Belgium
[email protected]
³ National University of Singapore, Dept. of Information Systems,
Kent Ridge, Singapore 119260, Republic of Singapore
[email protected]
Abstract. The advent of knowledge discovery in data (KDD) technology has created new opportunities to analyze huge amounts of data. However, in order for this knowledge to be deployed, it first needs to be validated by the end-users and then implemented and integrated into the existing business and decision support environment. In this paper, we propose a framework for the development of business intelligence (BI) systems which centers on the use of neural network rule extraction and decision tables. Two different types of neural network rule extraction algorithms, viz. Neurolinear and Neurorule, are compared, and subsequent implementation strategies based on decision tables are discussed.
1 Introduction
Many businesses have eagerly adopted data storing facilities to record information
regarding their daily operations. The advent of knowledge discovery in data
(KDD) technology has created new opportunities to extract powerful knowledge
from the stored data using data mining algorithms. Although many of these
algorithms yield very accurate models, one regularly sees that the extracted
models fail to be successfully integrated into the existing business environment
and supporting information systems infrastructure. In this paper, we address
two possible explanations for this phenomenon.
Firstly, many of the representations applied by these algorithms cannot be
easily interpreted and validated by humans. For example, neural networks are
considered a black box technique, since the reasoning behind how they reach their
conclusions cannot be readily obtained from their structure. Therefore, we have,
in recent work [1], proposed a two-step process to open the neural network black
box which involves: (a) extracting rules from the network; (b) visualizing this rule
set using an intuitive graphical representation, viz. decision tables. In [1], results
K.D. Althoff et al. (Eds.): WM 2005, LNAI 3782, pp. 483–495, 2005.
© Springer-Verlag Berlin Heidelberg 2005

[Figure: data → neural net → rule set → decision table; the decision table supports validation, direct consultation, and the generation of program code (nested if ... then ... / if ... else ... selections).]
Fig. 1. A framework for business intelligence systems development
were reported on the use of Neurorule, a neural network rule extraction algorithm
that requires a pre-processing step in which the data are to be discretized. In this
paper, we also investigate the use of an alternative algorithm called Neurolinear,
which instead works with continuous (normalized) data and produces oblique
(as opposed to propositional) rules. We will empirically assess whether the rules
extracted by Neurolinear offer a higher predictive accuracy than those extracted
by Neurorule and to what extent they are still easily interpretable.
Secondly, once a satisfactory knowledge model has been obtained, it still has
to be implemented and deployed. While, in the KDD literature, much attention has been paid to the preceding stages, relatively few guidelines are supplied with regard to the implementation, integration, and subsequent management and maintenance of business intelligence (BI) systems. For example, while neural networks may typically yield a high predictive accuracy, they do so by simultaneously processing all inputs. Hence, direct implementations would need to query the user or a database for all values describing a given input case, regardless of their relative impact on the output. In contrast, decision tables or trees provide more efficient test strategies that avoid querying for inputs that become irrelevant given the case values already supplied. Clearly, this is an important advantage, especially when these operations are quite costly. Another advantage of decision table based systems is that they are easily maintainable: if, at some point, changes are made to the underlying decision table, these are either automatically reflected in the operational system, or, depending on the chosen implementation strategy, little effort is required to modify the system accordingly.
Therefore, in this paper, we advocate that decision tables can play a central
role in the KDD process, by bridging the gap that exists between an accurate
neural network model and a successful business intelligence system implementation (see Fig. 1) and knowledge management strategy. Our approach will be illustrated in the context of developing credit-scoring systems (which are meant to assist employees of financial institutions in deciding whether or not to grant a loan to an applicant), but is also applicable in various other settings involving
predictive data mining (e.g., customer churn prediction, fraud detection, etc.).
2 Neural Network Rule Extraction: Neurolinear and
Neurorule
As universal approximators, neural networks can achieve significantly better predictive accuracy compared to models that are linear in the input variables. However, a major drawback is their lack of transparency: their internal structure is hard for humans to interpret. It is precisely this black box property that hinders their acceptance by practitioners in several real-life problem settings such as credit-risk evaluation (where, besides having accurate models, explanation of the predictions being made is essential). In the literature, the problem of explaining the neural network predictions has been tackled by techniques that extract symbolic rules or trees from the trained networks. These neural network rule extraction techniques attempt to open up the neural network black box and generate symbolic, comprehensible descriptions with approximately the same predictive power as the neural network itself. An advantage of using neural networks as a starting point for rule extraction is that the neural network considers the contribution of the inputs towards classification as a group, while decision tree algorithms like C4.5 measure the individual contribution of the inputs one at a time as the tree is grown [4].
The expressive power of the extracted rules depends on the language used
to express the rules. Many types of rules have been suggested in the literature.
Propositional rules are simple ‘if-then’ expressions based on conventional propositional logic. An example of a propositional rule is:
If Purpose = second hand car and Savings Account ≤ 50 Euro then Applicant = bad.
An oblique rule is a rule whereby each condition represents a separating
hyperplane given in the form of a linear inequality, e.g.:
If 0.84 Income + 0.32 Savings Account ≤ 1000 Euro then Applicant = bad.
Oblique rules allow for more powerful decision surfaces than propositional rules, since the latter allow only axis-parallel decision boundaries. This is illustrated in Fig. 2, which represents a classification problem involving two classes, represented by ‘+’ and ‘o’ respectively, each described by two inputs x1 and x2. The left hand side illustrates an oblique rule separating both classes, and the right hand side a set of propositional rules inferred by e.g. C4.5. Clearly, the oblique rule provides a better separation than the set of propositional, axis-parallel rules. Augmenting the number of training points will probably increase the number of axis-parallel decision boundaries. Hence, this example illustrates that oblique rules may provide a more powerful, concise separation than a set of propositional rules. However, this advantage has to be offset against a loss of comprehensibility, since oblique rules are harder for the domain expert to interpret.
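To make the contrast concrete, the oblique rule above and a hypothetical axis-parallel approximation can be sketched as follows (only the oblique rule's coefficients come from the example above; the propositional cut-points are invented for illustration):

```python
def oblique_rule(income, savings):
    # Single separating hyperplane (coefficients from the example above).
    return "bad" if 0.84 * income + 0.32 * savings <= 1000 else "good"

def propositional_rules(income, savings):
    # Axis-parallel staircase approximating the same slanted boundary
    # (cut-points invented for illustration).
    if income <= 600:
        return "bad"
    if income <= 1000 and savings <= 1000:
        return "bad"
    return "good"
```

The two classifiers agree on points far from the boundary, but the staircase inevitably misclassifies some points near the slanted hyperplane.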
Neurolinear and Neurorule are algorithms that extract rules from trained three-layered feedforward neural networks. The kinds of rules generated by Neurolinear and Neurorule are oblique rules and propositional rules, respectively.
Both techniques share the following common steps [6,7]:
[Figure with two panels, each plotting classes ‘+’ and ‘o’ against inputs x1 and x2: (a) Oblique rule; (b) Propositional rules.]
Fig. 2. Oblique rules versus propositional rules [4]
1. Train a neural network to meet the prespecified accuracy requirement;
2. Remove the redundant connections in the network by pruning while maintaining its accuracy;
3. Discretize the hidden unit activation values of the pruned network by clustering;
4. Extract rules that describe the network outputs in terms of the discretized hidden unit activation values;
5. Generate rules that describe the discretized hidden unit activation values in terms of the network inputs;
6. Merge the two sets of rules generated in steps 4 and 5 to obtain a set of rules that relates the inputs and outputs of the network.
Both techniques differ in their way of preprocessing the data. Neurolinear works with continuous data which is normalized, e.g. to the interval [−1, 1]. On the other hand, Neurorule assumes the data are discretized and represented as binary inputs, using the thermometer encoding for ordinal variables and dummy encoding for nominal variables. Table 1 illustrates the thermometer encoding procedure for the ordinal Income variable (where the discretization into four categories could have been done either by a discretization algorithm, e.g. the algorithm of Fayyad and Irani [2], or according to the recommendation of a domain expert).
Table 1. The thermometer encoding procedure for ordinal variables

Original input                    Categorical input  Thermometer inputs (I1 I2 I3)
Income ≤ 800 Euro                 1                  0 0 0
800 Euro < Income ≤ 2000 Euro     2                  0 0 1
2000 Euro < Income ≤ 10000 Euro   3                  0 1 1
Income > 10000 Euro               4                  1 1 1
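The two encodings of Table 1 can be sketched as follows (a minimal illustration; the function names are ours):

```python
def thermometer_encode(category, n_categories):
    # Ordinal category k (1-based) -> n_categories - 1 bits, with the
    # rightmost k - 1 bits set to 1, exactly as in Table 1.
    n_bits = n_categories - 1
    return [1 if i >= n_bits - (category - 1) else 0 for i in range(n_bits)]

def dummy_encode(value, levels):
    # One-hot (dummy) encoding for a nominal variable.
    return [1 if value == level else 0 for level in levels]
```

For example, the third income category of Table 1 maps to the bit pattern 0 1 1.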
Both Neurorule and Neurolinear typically start from a one-hidden-layer neural network with hyperbolic tangent hidden neurons and sigmoid or linear output neurons. For a classification problem with C classes, C output neurons are used, and the class is assigned to the output neuron with the highest activation value (winner-take-all learning). The network is then trained to minimize an augmented cross-entropy error function using the BFGS method, a modified quasi-Newton algorithm. This algorithm converges much faster than standard backpropagation, and the total error decreases after each iteration step, which is not necessarily the case for backpropagation.
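The winner-take-all output assignment, for a tiny one-hidden-layer network with tanh hidden units and linear outputs, can be sketched as follows (the weights are arbitrary stand-ins, not a trained network):

```python
import math

def winner_take_all(activations):
    # Assign the class whose output neuron has the highest activation value.
    return max(range(len(activations)), key=activations.__getitem__)

def network_output(x, hidden_weights, output_weights):
    # One-hidden-layer net: tanh hidden units, linear output units.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    return [sum(w * h for w, h in zip(ws, hidden)) for ws in output_weights]
```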
Determining the optimal number of hidden neurons is not a trivial task. In the literature, two approaches have been suggested to tackle this problem. A growing strategy starts from an empty network and gradually adds hidden neurons to improve the classification accuracy. On the other hand, a pruning strategy starts from an oversized network and removes the irrelevant connections. When all connections to a hidden neuron have been removed, it can be pruned. The latter strategy is followed by Neurolinear and Neurorule. Note that this pruning step plays an important role in both rule extraction algorithms, since it facilitates the extraction of a compact, parsimonious rule set. After having removed one or more connections, the network is retrained and inspected for further pruning.
Once a trained and pruned network has been obtained, the activation values of all hidden neurons are clustered. In the case of hyperbolic tangent hidden neurons, the activation values lie in the interval [−1, 1]. A simple greedy clustering algorithm then starts by sorting all these hidden activation values in increasing order. Adjacent values are then merged into a unique discretized value as long as the class labels of the corresponding observations do not conflict. The merging process hereby first considers the pair of hidden activation values with the shortest distance in between. Another discretization algorithm is the Chi2 algorithm, an improved and automated version of the ChiMerge algorithm, which makes use of the χ² test statistic to merge the hidden activation values [3].
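A simplified sketch of this greedy clustering step (it merges runs of adjacent sorted activation values whose class labels agree, rather than performing the full shortest-distance-first merging described above):

```python
def cluster_activations(pairs):
    # pairs: (hidden activation value, class label) per training observation.
    # Sort by activation and merge runs whose class labels agree; each
    # cluster becomes one discretized value (here: the run's mean).
    clusters = []
    for value, label in sorted(pairs):
        if clusters and clusters[-1][1] == label:
            clusters[-1][0].append(value)
        else:
            clusters.append(([value], label))
    return [(sum(vs) / len(vs), label) for vs, label in clusters]
```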
In step 4 of Neurolinear and Neurorule, a new data set is composed, consisting of the discretized hidden unit activation values and the class labels of the corresponding observations. Duplicate observations are removed, and rules are inferred relating the class labels to the clustered hidden unit activation values. This can be done using an automated rule induction algorithm, or manually when the pruned network has only a few hidden neurons and inputs. Note that steps 3 and 4 can be performed simultaneously by C4.5rules, since the latter can work with both discretized and continuous data [4].
In the last two steps of both rule extraction algorithms, the rules of step
4 are translated in terms of the original inputs. First, the rules are generated
describing the discretized hidden unit activation values in terms of the original
inputs. This rule set is then merged with that of step 4 by replacing the conditions
of the latter with those of the former. For Neurolinear, this process is fairly
straightforward. In the case of Neurorule, one might again use an automated
rule induction algorithm to relate the discretized hidden unit activation values
to the inputs.
3 Neural Network Rule Extraction Results
3.1 Experimental Setup
The experiments were conducted on three real-life credit-risk evaluation data sets: German credit, Bene1 and Bene2. The Bene1 and Bene2 data sets were obtained from two major Benelux (Belgium, The Netherlands, Luxembourg) financial institutions. The German credit data set is publicly available at the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
Each data set is randomly split into a two-thirds training set and a one-third test set. The neural networks are trained and rules are extracted using the training set. The test set is then used to assess the predictive power of the trained networks and the extracted rule sets or trees. The continuous and discretized data sets are analyzed using Neurolinear and Neurorule, respectively. We also include C4.5 and C4.5rules as a benchmark to compare the results of the rule extraction algorithms. We set the confidence level for the pruning strategy to 25%, which is the value commonly used in the literature.
All algorithms are evaluated by their classification accuracy, as measured by the percentage of correctly classified (PCC) observations, and by their complexity. The complexity is quantified by the number of generated rules, or, for C4.5, by the number of leaf nodes and the total number of nodes (including leaf nodes).
Since the primary goal of neural network rule extraction is to mimic the decision process of the trained neural network, we also measure how well the extracted rule set models the behavior of the network. For this purpose, we report the fidelity of the extraction techniques, which is defined as the percentage of observations that the extraction algorithm classifies in the same way as the neural network.
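These two evaluation measures can be sketched as:

```python
def pcc(predicted, actual):
    # Percentage of correctly classified observations.
    return 100.0 * sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def fidelity(rule_predictions, network_predictions):
    # Percentage of observations that the extracted rule set classifies
    # in the same way as the neural network.
    agree = sum(r == n for r, n in zip(rule_predictions, network_predictions))
    return 100.0 * agree / len(network_predictions)
```

Note that fidelity is computed against the network's predictions, not the true labels, so a rule set can have high fidelity even where the network itself errs.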
For the Neurolinear and Neurorule analyses, we use two output units with
linear or logistic activation functions and the class is assigned to the output
neuron with the highest activation value (winner-takes-all). A hyperbolic tangent
activation function is used in the hidden layer.
3.2 Results for the Continuous Data Sets
Table 2 presents the results of applying the rule extraction methods to the continuous data sets. Before the neural networks are trained for rule extraction using Neurolinear, all inputs xi, i = 1, ..., n are scaled to the interval [−1, 1] in the following way:

xi_new = 2 (xi_old − min(xi)) / (max(xi) − min(xi)) − 1.
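This min-max scaling can be sketched as (assuming a non-constant input column, so the denominator is nonzero):

```python
def scale_to_unit_interval(xs):
    # x_new = 2 * (x - min) / (max - min) - 1, mapping the column to [-1, 1].
    lo, hi = min(xs), max(xs)
    return [2 * (x - lo) / (hi - lo) - 1 for x in xs]
```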
As explained, Neurolinear typically starts from a large, oversized network and then prunes the irrelevant connections. The pruned neural network has 1 hidden unit for the German credit and Bene2 data sets and 2 hidden units for the Bene1 data set, indicating that there is no need to model more complex non-linearities by using more hidden neurons. The pruned networks had 16 inputs for the German credit data set, 17 inputs for the Bene1 data set and 23 inputs for the Bene2 data set.
Neurolinear obtained 100% test set fidelity for the German credit and Bene2 data sets, and 99.9% test set fidelity for the Bene1 data set (cf. Table 3). This clearly indicates that Neurolinear was able to extract rule sets which closely reflect the decision process of the trained neural networks.
It can be observed from Table 2 that the rules extracted by Neurolinear are
both powerful and very concise when compared to the rules and trees inferred
Table 2. Neural network rule extraction results for the continuous data sets

Data set  Method       PCC_train  PCC_test  Complexity
German    C4.5         82.58      70.96     37 leaves, 59 nodes
credit    C4.5rules    81.53      70.66     13 propositional rules
          Pruned NN    80.78      77.25     16 inputs
          Neurolinear  80.93      77.25     2 oblique rules
Bene1     C4.5         89.91      68.68     168 leaves, 335 nodes
          C4.5rules    78.63      70.80     21 propositional rules
          Pruned NN    77.33      72.62     17 inputs
          Neurolinear  77.43      72.72     3 oblique rules
Bene2     C4.5         90.24      70.09     849 leaves, 1161 nodes
          C4.5rules    77.61      73.00     30 propositional rules
          Pruned NN    76.05      73.51     23 inputs
          Neurolinear  76.05      73.51     2 oblique rules
Table 3. Fidelity rates of Neurolinear

Data set       Fid_train  Fid_test
German credit  100        100
Bene1          99.81      99.90
Bene2          100        100
If [−12.83(Amount on purchase invoice) + 13.36(Percentage of financial burden) + 31.33(Term) − 0.93(Private or professional loan) − 35.40(Savings account) − 5.86(Other loan expenses) + 10.69(Profession) + 10.84(Number of years since last house move) + 3.03(Code of regular saver) + 6.68(Property) − 6.02(Existing credit info) − 13.78(Number of years client) − 2.12(Number of years since last loan) − 10.38(Number of mortgages) + 68.45(Pawn) − 5.23(Employment status) − 5.50(Title/salutation)] ≤ 0.31
then Applicant = good

If [19.39(Amount on purchase invoice) + 32.57(Percentage of financial burden) − 5.19(Term) − 16.75(Private or professional loan) − 27.96(Savings account) + 7.58(Other loan expenses) − 13.98(Profession) − 8.57(Number of years since last house move) + 6.30(Code of regular saver) + 3.96(Property) − 9.07(Existing credit info) − 0.51(Number of years client) − 5.76(Number of years since last loan) + 0.14(Number of mortgages) + 0.15(Pawn) + 1.14(Employment status) + 15.03(Title/salutation)] ≤ −0.25
then Applicant = good

Default Class: Applicant = bad
Fig. 3. Oblique rules extracted by Neurolinear for Bene1
by C4.5rules and C4.5. Neurolinear yields the best absolute test set performance for all three data sets, with a maximum of three oblique rules for the Bene1 data set. For the German credit data set, Neurolinear performed significantly better than C4.5rules according to McNemar's test at the 1% level. For the Bene1 and Bene2 data sets, the performance of Neurolinear was not significantly different from C4.5rules at the 5% level.
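McNemar's test compares two classifiers on the same test set via their off-diagonal disagreement counts; a minimal sketch, using the standard continuity correction (the chi-square critical values with 1 degree of freedom are 3.841 at the 5% level and 6.635 at the 1% level):

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    # b: cases classifier A gets right and B gets wrong; c: the reverse.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0
    # Chi-square statistic with continuity correction, 1 degree of freedom.
    return (abs(b - c) - 1) ** 2 / (b + c)
```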
Fig. 3 depicts the oblique rules that were extracted by Neurolinear for the Bene1 data set. Arguably, although the rules perfectly mimic the decision process of the corresponding neural networks, their interpretability is still rather limited. They are basically mathematical expressions representing piece-wise linear discriminant functions. Hence, their usefulness for building intelligent, user-friendly and comprehensible credit-scoring systems can be questioned.
3.3 Results for the Discretized Data Sets
After discretization using the method of Fayyad and Irani [2], 15 inputs remained
for the German credit data set, 21 inputs for the Bene1 data set and 29 inputs for
the Bene2 data set. When representing these inputs using the thermometer and
dummy encoding, we ended up with 45 binary inputs for the German credit data
set, 45 binary inputs for the Bene1 data set and 105 inputs for the Bene2 data
set. We then trained and pruned the neural networks for rule extraction using Neurorule. Only 1 hidden unit, with a hyperbolic tangent transfer function, was needed. All inputs are binary (e.g. the first input is 1 if Term > 12 Months and 0 otherwise). Note that, according to the pruning algorithm, no bias was needed for the hidden neuron for the Bene1 data set. Of the 45 binary inputs, 37 were pruned, leaving only 8 binary inputs in the neural network. This corresponds to 7 of the original inputs, because the nominal input ‘Purpose’ has two corresponding binary inputs in the pruned network (Purpose = cash provisioning and Purpose = second hand car).
The network trained and pruned for Bene2 had 1 hidden neuron with again
no bias input. Starting from 105 binary inputs, the pruning procedure removed
97 of them and the remaining 8 corresponded to 7 of the original inputs. The
network for German credit had also only 1 hidden neuron but with a bias input.
The binarized German credit data set consists of 45 inputs of which 13 are
retained, corresponding to 6 of the original inputs.
Three things are worth mentioning here. First of all, note how the pruned networks for all three data sets have only 1 hidden neuron. These networks are thus only marginally different from an ordinary logistic regression model. This clearly confirms that, also for the discretized data sets, simple classification models seem to yield good performance for credit scoring. Furthermore, since all networks have only 1 hidden neuron, the rule extraction process of Neurorule can be simplified. If we were to cluster the hidden unit activation values by sorting them, we would find two clusters corresponding to the two output classes. Hence, instead of generating the rules relating the outputs to the clustered hidden unit activation values and merging them with the rules expressing the clustered hidden unit activation values in terms of the inputs, we can
Table 4. Neural network rule extraction results for the discretized data sets

Data set  Method     PCC_train  PCC_test  Complexity
German    C4.5       80.63      71.56     38 leaves, 54 nodes
credit    C4.5rules  81.38      74.25     17 propositional rules
          Pruned NN  75.53      77.84     6 inputs
          Neurorule  75.83      77.25     4 propositional rules
Bene1     C4.5       77.76      70.03     77 leaves, 114 nodes
          C4.5rules  76.70      70.12     17 propositional rules
          Pruned NN  73.05      71.85     7 inputs
          Neurorule  73.05      71.85     6 propositional rules
Bene2     C4.5       82.80      73.09     438 leaves, 578 nodes
          C4.5rules  77.76      73.51     27 propositional rules
          Pruned NN  74.15      74.09     7 inputs
          Neurorule  74.27      74.13     7 propositional rules
Table 5. Fidelity rates of Neurorule

Data set  Fid_train  Fid_test
German    99.70      98.80
Bene1     100        100
Bene2     99.71      99.79
generate the rules relating the outputs to the inputs directly by using C4.5rules. Finally, also notice how the binary representation allows more inputs to be pruned than with the continuous data sets. This will of course facilitate the generation of a compact set of rules. Table 4 presents the performance and complexity of C4.5, C4.5rules, the pruned NN and Neurorule on the discretized credit scoring data sets. It is important to remark here that the discretization process introduces non-linear effects.
When comparing Table 4 with Table 2, it can be seen that the test set performance in many instances actually improves, and that the discretization process did not appear to cause any loss of predictive power of the inputs. For the German credit data set, Neurorule did not perform significantly better than C4.5rules at the 5% level according to McNemar's test. However, Neurorule extracted only 4 propositional rules, which is very compact when compared to the 17 propositional rules inferred by C4.5rules. The test set fidelity of Neurorule is 98.80% (cf. Table 5). For the Bene1 data set, Neurorule performed significantly better than C4.5rules at the 5% level. Besides the gain in performance, Neurorule also uses only 6 propositional rules, whereas C4.5rules uses 17 propositional rules. The rule set inferred by Neurorule obtained 100% test set fidelity with respect to the pruned neural network from which it was derived. This high fidelity rate indicates that Neurorule was able to accurately approximate the decision process of the trained and pruned neural network. For the Bene2 data set, the performance difference between Neurorule and C4.5rules is not statistically significant at the 5% level using McNemar's test. However, the rule set
If Term > 12 months and Purpose = cash provisioning and Savings Account ≤ 12.40 € and Years Client ≤ 3 then Applicant = bad
If Term > 12 months and Purpose = cash provisioning and Owns Property = no and Savings Account ≤ 12.40 € then Applicant = bad
If Purpose = cash provisioning and Income > 719 € and Owns Property = no and Savings Account ≤ 12.40 € and Years Client ≤ 3 then Applicant = bad
If Purpose = second-hand car and Income > 719 € and Owns Property = no and Savings Account ≤ 12.40 € and Years Client ≤ 3 then Applicant = bad
If Savings Account ≤ 12.40 € and Economical sector = Sector C then Applicant = bad
Default class: Applicant = good
Fig. 4. Rules extracted by Neurorule for Bene1
extracted by Neurorule consists of only 7 propositional rules, which again is a lot more compact than the 27 propositional rules induced by C4.5rules. Note that the rules extracted by Neurorule actually yielded a better classification accuracy than the original network, resulting in a test set fidelity of 99.79%.
Fig. 4 represents the rules extracted by Neurorule for the Bene1 data set. When looking at these rules, it becomes clear that, while the propositional rules extracted by Neurorule are similarly powerful as the oblique rules extracted by Neurolinear, they are far easier to interpret and understand.
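Transcribed directly into code, the rule set of Fig. 4 reads as follows (a sketch; the parameter names are ours, amounts are in euros):

```python
def classify_applicant(term, purpose, savings, years_client,
                       owns_property, income, sector):
    # Fig. 4 rule set as Python conditions; any applicant matching
    # no 'bad' rule falls through to the default class 'good'.
    bad = (
        (term > 12 and purpose == "cash provisioning"
         and savings <= 12.40 and years_client <= 3)
        or (term > 12 and purpose == "cash provisioning"
            and not owns_property and savings <= 12.40)
        or (purpose == "cash provisioning" and income > 719
            and not owns_property and savings <= 12.40 and years_client <= 3)
        or (purpose == "second-hand car" and income > 719
            and not owns_property and savings <= 12.40 and years_client <= 3)
        or (savings <= 12.40 and sector == "Sector C")
    )
    return "bad" if bad else "good"
```

Evaluated rule by rule like this, every condition of every rule may be inspected; the decision tables of the next section avoid exactly that inefficiency.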
However, while propositional rules are an intuitive and well-known formalism to represent knowledge, they are not necessarily the most suitable representation in terms of structure and efficiency of use in everyday business practice and decision-making. Research in knowledge representation suggests that graphical representation formalisms can be more readily interpreted and consulted by humans than a set of symbolic propositional if-then rules [5]. Next, we will discuss how the sets of rules extracted by Neurorule may be further transformed into decision tables, which facilitate the efficient classification of applicants by the credit-risk manager.
4 Visualizing the Extracted Rule Sets Using Decision
Tables
A decision table (DT) consists of four quadrants, separated by double-lines, both
horizontally and vertically. The horizontal line divides the table into a condition
part (above) and an action part (below). The vertical line separates subjects
(left) from entries (right). The condition subjects are the problem criteria that
are relevant to the decision-making process. The action subjects describe the
possible outcomes of the decision-making process. Each condition entry describes a relevant subset of values (called a state) for a given condition subject (attribute), or contains a dash symbol (‘–’) if its value is irrelevant within the context of that column (‘don't care’ entry). Subsequently, every action entry holds a value assigned to the corresponding action subject (class). True, false and unknown action values are typically abbreviated by ‘×’, ‘–’ and ‘·’, respectively.
If each column contains only simple states, the table is called an expanded
DT (cf. Fig. 5a), whereas otherwise the table is called a contracted DT (cf.
Fig. 5b). For ease of legibility, we will allow only contractions that maintain a
lexicographical column ordering, i.e., in which the entries at lower rows alternate
before the entries above them. As a result of this ordering restriction, a decision
tree structure emerges in the condition entry part of the DT. Importantly, the
number of columns in the contracted table can often be further minimized by
changing the order of the condition rows (cf. Fig. 5c). It is obvious that a DT with a minimal number of columns is to be preferred, since it provides a more efficient representation of the underlying knowledge.
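A single contraction pass of the kind shown in Fig. 5 can be sketched as follows (simplified: it merges adjacent columns that share an action and differ only in the last condition row; full minimization additionally reorders the condition rows):

```python
def contract(columns):
    # One pass over DT columns (each a (condition_entries, action) pair):
    # merge adjacent columns that share the same action and differ only in
    # the last condition row, replacing that entry by a don't-care '-'.
    out = []
    for conds, action in columns:
        if out:
            prev_conds, prev_action = out[-1]
            if prev_action == action and prev_conds[:-1] == conds[:-1]:
                out[-1] = (prev_conds[:-1] + ["-"], action)
                continue
        out.append((list(conds), action))
    return out
```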
For each of the three rule sets extracted by Neurorule, we used the Prologa software (http://www.econ.kuleuven.ac.be/prologa/) to construct an equivalent DT. Each expanded DT was converted into a more compact DT, by joining nominal attribute values that do not appear in any rule antecedent into a common ‘other’ state, and then performing optimal table contraction. As a result of this reduction process, we ended up with three minimum-size contracted DTs, consisting of 11, 14 and 26 columns for the German credit, Bene1 and Bene2
1. Owns property?    yes                      no
2. Years client      ≤3        >3            ≤3        >3
3. Savings amount    low  high low  high     low  high low  high
1. Applicant=good    –    ×    ×    ×        –    ×    –    ×
2. Applicant=bad     ×    –    –    –        ×    –    ×    –
(a) Expanded DT

1. Owns property?    yes              no
2. Years client      ≤3        >3    –
3. Savings amount    low  high –     low  high
1. Applicant=good    –    ×    ×     –    ×
2. Applicant=bad     ×    –    –     ×    –
(b) Contracted DT

1. Savings amount    low             high
2. Owns property?    yes       no    –
3. Years client      ≤3   >3   –     –
1. Applicant=good    –    ×    –     ×
2. Applicant=bad     ×    –    ×     –
(c) Minimum-size contracted DT

Fig. 5. Minimizing the number of columns of a lexicographically ordered DT
1. Savings Account   ≤12.40 €   |   >12.40 €
2. Economical sector   Sector C   other   –
3. Purpose   –   cash provisioning   second-hand car   other   –
4. Term   –   ≤12 months   >12 months   –
5. Years Client   –   ≤3   >3   ≤3   >3   ≤3   >3
6. Owns Property   –   yes   no   –   –   yes   no   yes   no   –   –   –
7. Income   –   –   ≤719 €   >719 €   –   –   –   –   –   ≤719 €   >719 €   –   –   –
1. Applicant=good   –   ×   ×   –   ×   –   ×   –   ×   ×   –   ×   ×   ×
2. Applicant=bad   ×   –   –   ×   –   ×   –   ×   –   –   ×   –   –   –
(columns 1–14)
Fig. 6. Decision table for the Bene1 rule set
data sets, respectively. In all cases, the contracted tables were satisfactorily concise and did not reveal any anomalies [8]. Fig. 6 depicts the resulting decision table for the Bene1 data set. Clearly, the top-down readability of such a DT, combined with its conciseness, makes it a very attractive visual representation.
5 Knowledge Implementation: From Decision Tables to
Decision-Support Systems
5.1 DT Consultation
DTs can be consulted not only visually, but also in an automated way. In Prologa, a generic environment is offered that allows the user to actively apply the decision-making knowledge to a given problem case. During a consultation session, the provided engine will navigate through the DT system structure and ask the user about the condition states of every relevant condition in the DTs thus visited. Only questions for which relevant condition entries remain are asked (unlike, e.g., in the rule description shown in Fig. 4, where conditions would typically have to be evaluated on a rule-by-rule basis).
In addition to the built-in engine, a consultation web service has recently
been developed (http://prologaws.econ.kuleuven.ac.be). Web services may
be called by client applications written in various languages and distributed over
a network. This allows one to separately manage and update the decision table
knowledge (logically as well as physically), while enforcing its consistent use
throughout various types of application settings (cf. Fig. 7).
5.2 DT Export
To allow the developer to apply a different tool for the actual implementation, or to integrate the system into a given ICT setting, Prologa offers a series of export options. A straightforward transformation of a DT into a decision tree, and from there to Pascal, C, COBOL, Java, Eiffel or Visual Basic program code, is provided. The code takes the form of a nested if-else selection (or, in the case of COBOL, an EVALUATE statement can be generated as well). It should be noted that the generated code is not intended to be directly executable: variable
[Figure: a client application calls a consultation web service (“get first question”, “answer question / get next question”).]
Fig. 7. A simple web service architecture for decision table consultation
declarations must still be added, and the conditions and actions of the DT
should constitute valid expressions within the target language. The main idea
is to be able to (re-)generate the ‘hard’ part of the code, quickly and without
errors, after which the (updated) piece of code can be plugged (back) into the
implementation. In addition to the former standard conversions, one can also
generate optimized code for either of the aforementioned languages.
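The flavor of this export can be sketched by a small generator that walks a decision-tree-shaped DT and emits nested if-else pseudocode (the tree below encodes the minimum-size DT of Fig. 5c; the condition strings are placeholders that, as noted above, must still be made valid expressions in the target language):

```python
def dt_to_code(node, indent=0):
    # Leaves are action strings; internal nodes are
    # (condition, then_branch, else_branch) triples.
    pad = "    " * indent
    if isinstance(node, str):
        return pad + "applicant = " + node + ";\n"
    cond, then_branch, else_branch = node
    return (pad + "if (" + cond + ") {\n"
            + dt_to_code(then_branch, indent + 1)
            + pad + "} else {\n"
            + dt_to_code(else_branch, indent + 1)
            + pad + "}\n")

# Minimum-size DT of Fig. 5c: test savings first, then property, then years.
fig5c = ("savings == low",
         ("owns_property == yes",
          ("years_client <= 3", "bad", "good"),
          "bad"),
         "good")
```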
6 Conclusions
In this paper, a framework for the development of business intelligence systems was described, which combines neural network rule extraction and decision table techniques. Using a series of real-life credit scoring cases, it was shown how Neurolinear and Neurorule, two neural network rule extraction algorithms, produce both powerful and concise rule sets, with the rule sets extracted by Neurorule being more intuitive. Finally, it was shown how decision tables offer a highly intuitive visualization of the extracted knowledge and can then serve as a basis for an efficient and maintainable system implementation, either by direct or web-based consultation, or through the (partial) generation of program code.
References
1. B. Baesens, R. Setiono, C. Mues, and J. Vanthienen. Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49(3):312–329, 2003.
2. U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1022–1029, Chambéry, France, 1993. Morgan Kaufmann.
3. H. Liu and R. Setiono. Chi2: feature selection and discretization of numeric attributes. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pages 388–391, 1995.
4. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
5. L. Santos-Gomez and M.J. Darnel. Empirical evaluation of decision tables for constructing and comprehending expert system rules. Knowledge Acquisition, 4:427–444, 1992.
6. R. Setiono and H. Liu. Symbolic representation of neural networks. IEEE Computer, 29(3):71–77, 1996.
7. R. Setiono and H. Liu. NeuroLinear: from neural networks to oblique decision rules. Neurocomputing, 17(1):1–24, 1997.
8. J. Vanthienen, C. Mues, and A. Aerts. An illustration of verification and validation in the modelling phase of KBS development. Data and Knowledge Engineering, 27(3):337–352, 1998.
