Reviewed July 23, 1994
Presented
at INRIA (projet CLOREC),
Paris,
June 23, 1994
INDUCTION OF DECISION TREES WHEN
EXAMPLES ARE DESCRIBED WITH NOISY MEASUREMENTS AND WITH FUZZY CLASS MEMBERSHIP
Roberto Araya
Abstract: An extension of the induction of decision trees based on the
Kolmogorov Smirnov distance is proposed to handle noisy observations and fuzzy
class membership. This extension is based on a new generalization of the
Kolmogorov Smirov measure. It is shown that in the case of noisy measurements
the KS measure is a differentiable function of the classification region. Thus,
a gradient algorithm can be implemented in order to search for good discriminating
regions. This algorithm is extended to the case when the training examples
contain signals whose historic evolution has to be taken into account.
1. INTRODUCTION
Induction
of decision trees
Induction
of decision trees is one of the more popular attempts to do supervised machine
learning. The idea is to analyse data bases and distill knowledge from them.
The acquired knowledge is represented as decision trees. Research on this area
has been done by Breiman et Al. [4] who proposed the CART algorithm, Quinlan
[11] who proposed the ID3 algoritm, and Friedman [8] and Celeux and Lechevalier
[5] with the tree segmentation algorithm. Several other authors have proposed
generalizations and improvements of these original algorithms. In this work we deal
with a generalization to the segmentation tree algorithm as originally proposed
by Friedman.
The
discriminating power of a variable
The
induction algorithm based on the segmentation technique attempts at each stage
to measure the discriminating power of each variable to discriminate one class
from the other. Friedman proposes the Kolmogorov Smironov (KS) distance as a
measure of discriminating power.
The
Kolmogorov Smirnov measure is the maximum distance between two distribution
functions. This measure was first used by Friedman [8] as a discriminating
measure in an algorithm for inducing decision trees from data. For each
variable the empirical distribution functions for each one of the two classes
are built. If for a given variable, the maximum distance between the two
distributions is equals to 1, then that variable segments perfectly the two
classes. If the maximum is zero, then that variable is completely useless to
segment the classes. The algorithm was further studied by Celeux and Lechevallier
[5], and implemented computationally by Lechevallier in a software called SICLA
. Several extensions of this algorithm were made by the author [1] (see also
[13]) to include categorical variables, change of variables and to incorporate
different types of background knowledge. These extensions were implemented in
an interactive software called AutoExpert .
In
this paper two further extensions are proposed: the case when examples are
described through noisy measurements and the case when class membership is
fuzzy.
A
priori knowledge: measurement noise
A
priori or background knowledge offers the opportunity to improve the induction
process. One type of knowledge that is usually encountered is the probability
distribution of the measurements. Let us denote x the vector of decision
variables and N the measurement noise. Then the observations encountered in the
data base will be y = x + N. In a more general case we won't have additive
noise. However, we assume that the measurement noise effect on the decision
factors is specified in a precise way.
Supervised
learning: Fuzzy class membership
Supervised
learning assumes that all the examples in the data base can be exactly
determined to which class ( or classes ) they belong to. This assumption is
very strong. In several real world problems there are examples where it is only
possible to specify a degree of membership to the classes.
The
gradient algorithm
The
proposed extension of the Kolmogorov Smirnov measure is shown to be a
differentiable function of the discriminating region. Furthermore, the gradient
is easy to compute. It proposes a way to adjust decision regions that is very
easy to interpret. This fact leads to the implementation of gradient algorithms
to fine tune discriminating regions and, therefore, to improve the performance
of the induced trees.
Organization
of the paper
Section
two reviews the definition of the Kolmogorov Smirnov measure for regions as
described in [1]. Section 3 extend the KS measure when noisy measurements are
present. Section 4 extend the KS measure when class membership is fuzzy.
Section 5 consider the case when both noisy measurements and fuzzy class
membership are present at the same time. Section 6 studies the data base
segmentation process when noisy observations are present. Section 7 proposes strategies to find good
discriminating regions. They are based in change of variable schemes, either
for numeric or categorical variables. Section 8 shows with an example a KS
based induction of a decision tree. Section 9 shows the differentiability of
the function that assign to any region R in
its KS measure. Then the particular case of a
parametrized regions is studied and the gradient computed. The case
corresponding to regions defined by hyperplanes is also studied and the
gradient computed. This latter case is very similar to the perceptron model of
neural networks. The proposed gradient produces then an alternative algorithm
to modifiy the defining hyperplanes. Finally, section 10 is dedicated to the
case when the training examples are signals.
2. DECISION REGIONS AND ITS
DISCRIMINATING POWER
Let
C1 and C2 be two classes not necessarily disjoints, and let
( N being the set of natural numbers), be
a set of training examples in the data base. Each example j in T has associated a two dimensional vector
that denotes its
membership to classes C1 and C2. For example
means that the 5 th
training example of the data base belongs to class C1.
means that the 8 th
training example belongs to classes C1 and C2 at the same time.
All
examples are specified with descriptions corresponding to a list of decision
factors x1,...,xn. Each example j has an associated description x(j)=(x1(j),...,xn(j)).
For example, if x1 is color, then x1(3)=red means that the third example has
red color. Let X be the cartesian product of the range spaces of the decision
factors xk. Let R be a region in the description space X. Then the KS discriminating
power of the region R to segment between classes C1 and C2, given the set T of
examples, is given by the absolute value of the difference between two ratios.
These ratios are the proportion of examples in each class belonging to the
region R. This can be written as:
![]()
That is:
![]()
Example 1: Consider the following data base
color * class1 class2
blue * 1
0
green * 0
1
blue * 1
0
yellow *
0 1
green *
0 1
blue * 1 1
green *
0 1
This data base has to be read as follows: in the first
example, color is blue, and the example belongs to class1. In the second
example, color is green, and it belongs to class2, etc.
Here T is the set of training examples, that is T={1,2,3,4,5,6,7}, x1 is color, and X
={blue,green,yellow}. Let R be the
decision region R={blue}. Then the proportion of examples in class1 that belong
to the region R is 3/3=1. The proportion of examples in class2 that belong to
region R is 1/5. Therefore the KS measure is |1-1/5 | =4/5 = 0.8
In
mathematical terms, the KS measure of the region R to segment class C1 from
class C2, given the examples T, is given by:
![]()
where
c1 = is the set of points of R2 such that its first
component is 1 , and
c2 = is the set of points of R2
such that its second component is 1
and
is the
characteristic function (
).
In the previous example this would be:
KS(R/T)= | 3/ card( {1,3,6} ) - 1/ card( {2,4,5,6,7} ) |
= | 3/3 - 1/5 |
= 0.8
3. EXTENSION OF THE KS MEASURE FOR
NOISY MEASUREMENTS
Let
C1 and C2 be two classes not necessarily disjoint, and let T be a set of
training examples in the data base. Let R be a region. Then the discriminating
power of the region R to segment between classes C1 and C2 given the set T of
examples is given by the absolute value of the difference of two ratios. These
ratios are the proportions of examples in each class that "belong" to
the region. However, since we have noisy observations, we cannot count the
number of examples in the region. We can only compute for each example its
probability of belonging to the region. More precisely:
![]()
where
P is the probability distribution of the measurements in the description space
X.
Example 2: Consider the previous data base with the training
examples T and the same decision region R, but now assume that the
"color" variable has measurement noise. Therefore a probability
distribution is given for its measurement. In this example it is given the
following interpretation for the "color" column:
"color=blue" means
p(color=blue)=0.8 , p(color=green)=0.1,
p(color=yellow)=0.1
"color=green" means
p(color=blue)=0.2 , p(color=green)=0.6, p(color=yellow)=0.2
"color=yellow" means
p(color=blue)=0
, p(color=green)=0.1, p(color=yellow)=0.9
Then the data base can be described in the following
form:
|
Color: Blue |
Color: Green |
Color: Yellow |
* |
Class 1 |
Class 2 |
|
0.8 |
0.1 |
0.1 |
* |
1 |
0 |
|
0.2 |
0.6 |
0.2 |
* |
0 |
1 |
|
0.8 |
0.1 |
0.1 |
* |
1 |
0 |
|
0 |
0.1 |
0.9 |
* |
0 |
1 |
|
0.2 |
0.6 |
0.2 |
* |
0 |
1 |
|
0.8 |
0.1 |
0.1 |
* |
1 |
1 |
|
0.2 |
0.6 |
0.2 |
* |
0 |
1 |
Let the decision region be R={blue}, then

= | ( 0.8 + 0.8 + 0.8 )/3 - (
0.2 + 0 + 0.2 + 0.8 + 0.2 )/5 |
= 0.52
Comparing
example 1 and example 2, we can see the degradation of the discriminating power
of the variable color due to the measurement noise.
Missing
information
Assume
that for a given example we do not have information about the value of the
atribute x. We can consider that this is an example with measurement noise
whose distribution is equal to the empirical distribution obtained from the
rest of the examples that belong to the same class, and whose value for x is
known.
4. EXTENSION OF THE KOLMOGOROV
SMIRNOV MEASURE FOR DECISION REGIONS WHEN CLASS MEMBERSHIP IS FUZZY
Assume
now that we have a data base where class membership is fuzzy. This means that
there is a degree of class membership. Consider for example the following data
base:
Example 3: consider now
the following data base:
Color * Class1
Class2
blue
* 0.9 0.1
green * 0.1 0.9
blue
* 0.9 0.1
yellow * 0 1
green * 0.2 0.8
blue
* 1 1
green * 0.2 1
Note that in this data base the membership numbers are
not necessarily 0 and 1 as before. The first example where color is blue,
belongs to class1 with a memberhip degree of 0.9 and belongs to class2 with a
membership degree of 0.1.
The
membership degrees are numbers between 0 and 1 inclusive. Note that for each
example the sum of its membership degrees need not to be 1, since the classes
need not to be exclusives.
In
this section we assume that there is no meaurement noise. In this case the
extended Kolmogorov Smirnov measure for a region R given the examples T is
given by the absolute value of the difference of two ratios. These ratios are
the proportion of examples "of" each class that belong to the region.
Since class membership is fuzzy we cannot count the number of examples in each
class. We can only add the membership degree of the examples to a given class.
Also each example in the region have to be weighted with the membership degree
to the given class to obtain the total amount of examples in that class that
belong to the region. More precisely:
![]()
where
,
for each class Ci,
and
Note
that:
for i=1,2 and all j , and
for all j
Example 4: If T = {
1,2,3,4,5,6,7} in the data base of example 3 shown above, and R = {blue}, then


= | ( 0.9 + 0.9 +1 )/3.3 - (0.1 + 0.1 + 1)/4.9 |
= 0.6
5. EXTENSION OF THE KOLMOGOROV
SMIRNOV MEASURE FOR NOISY MEASUREMENTS AND FUZZY CLASS MEMBERSHIP AT THE SAME
TIME
Let
T be a set of examples, R a decision region, P a measurement probabilty
distribution, and
a class membership measure, then
![]()
or equivalently,
![]()
Example 5: Consider the data base 2 defined in example 3, but
with noisy measurements as defined in example 1. Then the data base is:
|
Color: Blue |
Color: Green |
Color: Yellow |
* |
Class 1 |
Class 2 |
|
0.8 |
0.1 |
0.1 |
* |
0.9 |
0.1 |
|
0.2 |
0.6 |
0.2 |
* |
0.1 |
0.9 |
|
0.8 |
0.1 |
0.1 |
* |
0.9 |
0.1 |
|
0 |
0.1 |
0.9 |
* |
0 |
1 |
|
0.2 |
0.6 |
0.2 |
* |
0.2 |
0.8 |
|
0.8 |
0.1 |
0.1 |
* |
1 |
1 |
|
0.2 |
0.6 |
0.2 |
* |
0.2 |
1 |
and therefore the KS distance is:

= | (
0.9x0.8 + 0.1x0.2 + 0.9x0.8 + 0x0 + 0.2x0.2 + 1x0.8 + 0.2x0.2 )/3.3 -
(
0.1x0.8 + 0.9x0.2 + 0.1x0.8 + 1x0 + 0.8x0.2 + 1x0.8 + 1x0.2 )/4.9 |
=
0.403
Properties:
For all regions R and non empty training set
T, we have:
i)
, where
is the
complement of
.
ii)
iii)
implies that it does not exist an example
in T that belongs to two classes at the same time with a non zero degree of
membership, and that all examples of one class belong to R and all examples of
the other class do not belong to R.
iv) KS is a true extension
v) KS does take into account very low
frequent classes. For example if there are 1000 examples of class 2 and only 1
example of class 1, and region R contains the example of class 1 and 5 examples
of class 2, then KS(R) = 0.995 .

This
means that KS identifies R as a good
region to predict class 1 even though there are more examples belonging to
class 2 in R than to class 1.
Consider
now the cases where there are N classes. Then we try all combinations in two
super classes and for each one we compute the KS measure. We choose the
combination that gives the maximum KS. This combination defines the KS.
This
means that:
![]()
![]()
6. SEGMENTATION OF A DATA BASE
WITH NOISY OBSERVATIONS
Let's
consider a data base with noisy observations, and assume that a region R0
is being chosen for segmentation. This is a region in the description space
that segment the data base in two: the set of examples belonging to R0 , i.e. the examples that satisfy the
defining condition of R0 ,
and the set of examples that belong to R0' the complement of R0
, i.e. the examples that do not satisfy the defining condition of R0
. If observations were not noisy, the data base T is segmented in two disjoint
subsets:
and
.
In
the case of noisy observations, the data base T is not segmented in two
disjoint subset. All we have is a set of probabilties
that
each example j of the data base belongs to the region R0 , and a
corresponding set of probabilities
that
each example j belongs to the complement R0' of the region R0
.
Therefore
after segmenting by the region R0 , each example in R0
count less. Its membeship degree is then disminished proportional to the
probabilty that the example belongs to R0. We then define the
membership degree of the example j, given that we are segmenting on the region
R0 as:
![]()
and consequently
![]()
Then
the KS discriminating measure of a region R given the data base T and that the
examples are in the region R0 is defined as:
![]()
There
are two important cases that we will use later on. One case is when
.Then
![]()
and therefore
![]()
The
other case is when satisfying the defining condition of
does not imply anything about belonging to R.
In this case
![]()
and therefore
![]()
In
other cases a priori knowledge can be introduced to compute the conditional
probabilities
.
Example 6: Consider the data base
|
Color: Blue |
Color: Green |
Color: Yellow |
Model a |
Model b |
* |
Class 1 |
Class 2 |
|
0.8 |
0.1 |
0.1 |
0.6 |
0.4 |
* |
0.9 |
0.1 |
|
0.2 |
0.6 |
0.2 |
0 |
1 |
* |
0.1 |
0.9 |
|
0.8 |
0.1 |
0.1 |
0.7 |
0.3 |
* |
0.9 |
0.1 |
|
0 |
0.1 |
0.9 |
0.2 |
0.8 |
* |
0 |
1 |
|
0.2 |
0.6 |
0.2 |
0.5 |
0.5 |
* |
0.2 |
0.8 |
|
0.8 |
0.1 |
0.1 |
0.4 |
0.6 |
* |
1 |
1 |
|
0.2 |
0.6 |
0.2 |
0.4 |
0.6 |
* |
0.2 |
1 |
If we segment using color and R={Green, Yellow} then
the segmented data base will be:
|
|
Color: Green |
Color: Yellow |
Model a |
Model b |
* |
C1 |
|
C2 |
|
|
0.2 |
0.1 |
0.1 |
0.6 |
0.4 |
* |
0.9 |
0.18 |
0.1 |
0.02 |
|
0.8 |
0.6 |
0.2 |
0 |
1 |
* |
0.1 |
0.08 |
0.9 |
0.72 |
|
0.2 |
0.1 |
0.1 |
0.7 |
0.3 |
* |
0.9 |
0.18 |
0.1 |
0.02 |
|
1 |
0.1 |
0.9 |
0.2 |
0.8 |
* |
0 |
0 |
1 |
1 |
|
0.8 |
0.6 |
0.2 |
0.5 |
0.5 |
* |
0.2 |
0.16 |
0.8 |
0.64 |
|
0.2 |
0.1 |
0.1 |
0.4 |
0.6 |
* |
1 |
0.20 |
1 |
0.20 |
|
0.8 |
0.6 |
0.2 |
0.4 |
0.6 |
* |
0.2 |
0.16 |
1 |
0.80 |
Therefore
![]()
![]()
This means that there are 0.96 examples in C1 and 3.4
examples in C2, that have color green or yellow.
Let R={color = green} and let's compute its KS measure
given that the examples are green or yellow. This means that R0 = {
color = green or yellow }. In this case
. The number
of examples in class 1 that are green is
0.1x0.9 + 0.6x0.1 + 0.1x0.9 + 0.1x0 + 0.6x0.2 + 0.1x1
+ 0.6x0.2 = 0.58 . Therefore the
proportion of examples in class 1 that are green is 0.56/0.96 = 0.604 . Similarly, the number of examples in class
2 that are yellow is 1.84 , and therefore the proportion of examples of class 2
that are green is 0.541/3.4 = 0.541 .
Then the distribution histograms is the following:

and therefore the KS measure is KS(R/T,R0
)=0.604-0.541 = 0.063
Let now R be the set R={model = a }. In this case R
and R0 are competely independent. The number of examples in class 1
that have model = a , is then
0.6x0.18 + 0x0.08 + 0.7x0.018 + 0.2x0 + 0.5x0.16 +
0.4x0.20 + 0.4x0.16 =0.458 , and therefore the proportion of examples in class 1
that have model a given that have color
green or yellow is 0.458/0.96 = 0.477 . Similarly the number of examples in
class 2 that have model a given that have color green or yellow is 0.946, and
the corresponding proportion is 0.946/3.4 = 0.278 . Then the distribution
histogram is:

and therefore tthe KS measure is KS(R/T,R0
) = 0.477 - 0.278 = 0.199 .
7. HOW TO FIND A GOOD
DISCRIMINATING REGION
Finding
good decision regions is a basic step for the tree induction process. The
original algorithm as proposed by Friedman, examined each variable and
determined for each variable a threshold. Therefore, if there were N variables
( x1, ..., xN), then N decision regions were examined Ri= { x / xi <= ci }, i=1,...,N,
where ci is the threshold computed as the point where the maximum distance
between the distribution function was reached.
Better
decision trees can be obtained if other decision regions are also examined
besides the ones mentioned above. These regions can be proposed as a-priori
knowledge or it can be regions proposed by an algorithm.
In
the following we propose a method to search for regions that is decomposed in
two steps:
Step 1:
Create new one dimensional atributes ( variables).
Step 2: Look for
the best threshold or set of thresholds for each of the previously created
atributes.
Step1
and Step2 lead to new regions that mix the original atributes. These regions
can compete favorably with the original ones that are based in one atribute at
a time.
Step 1
Numerical
atributes
If
in the data base we have numerical variables, new atributes can be created as
linear or non-linear combinations of the original ones. Linear combinations
corresponds to new axis like the axis z as indicated in the following figure:

A
simple new variable to try is the one obtained from the projection to the
vector computed as the difference
of the center of masses of the two set of
points ( those belonging to class1 and those belonging to class2 ). Then the
new variable is
![]()

If
the covariances matrices
of the points belonging to class1 and class2
respectively, are also computed, then the previous new axis can be adjusted to
incorporate knowledge from the dispersion of the two classes. Two new variables
that can be tried are:
and
.
These new variables are the one used in discriminant analysis. If both
populations ( class1 and class2) are gaussian with the same covariance ( that
is
), then the new defined variable is optimal.
If
one of the covariance matrices is singular, then all examples belonging to the
corresponding class are concentrated in an affine subspace. Then directions
orthogonal to that subspace can be very discriminant. This can be true if both
populations are not included in the same afin subspace. For example if the
center of mass difference vector
, does not lie on the subspace. Let
be an eigenvector of say
with null eigenvalue. That is:
![]()
Then if
![]()
a
new variable to try would be
![]()
If
we plot the distribution functions of both classes using this new atributre z,
we will see that class1 will be concentrated in one point. If the center of
masses difference vector
lies in the same subspace, then we restrict
the data to that subspace. Then the restriction of the covariance to that
subspace will be non-singular and we can use the variable
![]()
where
is the projection operator over the null
space of
.
Non
linear combinations can also be tried. We can define a parametric region R
through a non linear function f(,
. ) from RN to
R, in the following way:
![]()
Categorical
atributes
One
of the more interesting challenges is to search for interesting new variables (
or axis ) for categorical variables. A simple idea is to define a new axis
jointing two categorical variables as shown in the previuos section: if we have
two categorical variables x1 and x2 with values a11,...,a1n and a21,...,a2p
respectively, then a new z variable can be defined as:
z
= 1 means x1=a11 and x2=a21 ,
z
= 2 means x1=a11 and x2=a22 ,.
...
z
= np means x1=a1n and x2=a2p .
The
combinations that do not have at least one example in the data base T can be
excluded.
Example: Consider the following data base
|
Color |
Model |
* |
Class 1 |
Class 2 |
|
green |
a |
* |
1 |
0 |
|
blue |
a |
* |
0 |
1 |
|
blue |
a |
* |
0 |
1 |
|
blue |
b |
* |
1 |
0 |
|
blue |
c |
* |
1 |
0 |
|
blue |
c |
* |
0 |
1 |
|
blue |
d |
* |
1 |
0 |
|
blue |
d |
* |
1 |
0 |
|
blue |
d |
* |
1 |
0 |
|
red |
a |
* |
0 |
1 |
|
red |
b |
* |
0 |
1 |
|
red |
b |
* |
0 |
1 |
|
red |
c |
* |
1 |
0 |
|
red |
c |
* |
1 |
0 |
|
red |
d |
* |
1 |
0 |
|
red |
d |
* |
1 |
0 |
This
data base can be plotted in the following form:

A
new axis z would be the following:

Different
ordering strategies lead to different variables. However, all these axis lead
to the same segmentation region. This is due to the fact that looking at all
the maxium and minimum as explained below and forming the region R accordingly,
is equivalent to ordering forming the region R for class 1 by selecting the
values where the frequency of class 1 is higher than the frequency of class 2.
Step 2
Once
defined the set of variables, step2 has to be applied to each one of them. This
means that a threshold or set of thresholds is searched. We do that by looking
at all the local regional maxima and local regional minima of the difference of
the distribution functions corresponding to each of the two classes. A local
regional maximum ( minimum ) is computed as the center of the biggest interval
containing all the local maximums ( minimums ) that contain a given local
maximum ( minimum ). Furthermore , outside of the interval the function must be
strictly bigger (smaller).
For
example in the following figure the region R is obtained as the union of two
subregions. The first one is the interval between the first maximum and the
first minimum. In this interval the distribution function of class 2 ( the one
whose distribution function is located below the other one ), increases more
rapidly than the distribution function of class 1. The other subregion is the
semi interval that starts at the second maximum. In this semi inteval class 2
also increases more rapidly than class 1.

Similarly,
in the following figure the region R is obtained as a union of two intervals.
In each one class 2 increases more rapidly than class 1. In the complement
region class 1 increases more rapidly.

This
way from the distribution functions we obtain the region R. This region
considers all maxima and minima. This method is an extension of the standard
one that is based only on the global maximum, and allow to have a completely
similar treatment of numeric and categorical variables. This means that if a
numeric variable is discretized in a finite number of intervals, and all
ordering strategies are tried to find a good region then the solution is
similar to using all maximum and minimum. And this strategy is similar to the
used directly on the numeric variable. If only the global maximum were used
then differences are produced between the numeric and discretized aproach.
Example 8: In example 7 the region R obtained from the
distribution functions is:

Then R is the union of three subregions: R = {3,4,5} U
{7} U {9} . And this is equivalent to:
R = { model = d } U { model = c and color = blue } U {
model = a and color = green }.
KS(R/T) = 0.5 + 0.2 - (0.3 - 1/6) + ( 0.4 - 1/6) + 0.1
=
0.9
8. INDUCTION OF DECISION TREES
We
have all the ingredients to do the segmentation process. At each stage
different regions have to be proposed and its segmentating powers have to be
measured. A good region is selected and then utilized to segment the data base.
In each subregion the same selection and segmentation process is performed
again.
We
will show this process with an example. The data base has two categorical
variables. We will use both at the same time to find the best segmentation
region.
Example 9: consider the data base described in example 6.Let's
define the new categorical variable z containing six values {1,2,3,4,5,6} as
defined in the following figure:

That is,
z=1 means color=yellow and model=a
z=2 means color=green and model=a
z=3 means color=blue and model=a
z=4 means color=blue and model=b
z=5 means color=green and model=b
z=3 means color=yellow and model=b
The number of examples in class 1 with z=1 is:
0.1x0.6x0.9 + 0.2x0x0.1 + 0.1x0.7x0.9 + 0.9x0.2x0 +
0.2x0.5x0.2 + 0.1x0.4x1 + 0.2x0.4x0.2 = 0.193 . Since there are 3.3 examples in
class 1, the frequency is 0.193/3.3 = 0.058 . Similarly the frequency of class
2 for z=1 is 0.080. Then the following frequency histogram is obtained:

And the following distribution function is obtained:

Therefore, R = {1,2} U {5,6} . This means that R = {
color = green or yellow } and its KS is given by KS(R/T)=0.353 + 0.050 = 0.403
.
The
induction of a decision tree is a sequential process. At each stage a
segmentation of the data base is performed. However due to measurement noise
fractional parts of the examples can remain in a segmented data base. We can
compute at any leaf the probability of each class. Let R be the region defined
by the leaf. Then the probabilty that a case k belong to class i given that it
belongs to the region R is:

Example 10: consider a leaf defined by the region R as shown in
the figure below:

The total number of examples is 8. Note that there are
two examples that have a 0.5 degree of membership to class1 and class2. There
is another example that belongs to both class1 and class2 at the same time.
Note also that there is one example that belongs completely to class1 but only
half of it is in the region R.
Then
P(class1/R) = 4.5/7.5
P(class2/R)=4/7.5
Example 11: Consider again the induced tree of example 9. Since
and
we have
P(class 1/color=green or yelow)= ( 0.193 + 0.265 +
0.315 + 0.187 )/4 = 0.24
P(class 2 / color= green or yellow )= ( 0.393 + 0.553
+ 1.287 + 1.167 )/4 = 0.85
P( class 1 / color=blue) =( 1.292 + 1.064 )/3 = 0.785
P(class 2 / color = blue ) = ( 0.584 + 0.916 )/3 = 0.5
More
generally, we can compute the probability that the degree of membership to a
class i be bigger than a certain threshold in a region R as:

where
is the characteritic
function of the interval [c,1] .
Stopping
criteria
Without
observation noise and with mutually exclusive classes the segmentation process
can continue up to obtaining leafs that contain examples belonging to only one
class. Then, prunning techniques are used to produce trees capable of
generalization to independent but similar data bases. If observation noise is
included, the knowledge about the noise can be used to stop the tree generation
process at a more apropiate stage, avoiding noise fitting and reducing later
prunning. One stopping criteria is to define a theshold for the KS measure.
That is, stop segmenting a particular leaf when KS(R/T) < c for all
partitions R of the subset of examples belonging to the leaf.
9. KS DIFFERENTIABILITY AS FUNCTION OF THE REGION
We
will consider in this section regions that are defined with numerical
atributes. We will analyse how small changes in the region shape will produce
small changes in the KS measure when all atributes have measurement noise.
Let
R be a region in Rn and
let
be its
boundary . Let's assume that the probability measures in Rn induced
by the measurement noise are absolutely continue with respect to the Lebesgue
measure of Rn , and let fj be the Radon Nikodym derivative
of the induced probability measure by the j th training example with respect to
the Lebesgue measure. Let's perturbe the region R to a new region R', defining
a perturbation function
defined on the boundary
as:
![]()
where
s is in the boundary
and s' is its corresponding point in the
perturbed boundary
through the normal n(s) to
in s.

Then the function
![]()
is
Gateaux differentiable. Its derivative is defined as:

where
is the region R perturbed by
in each point s of its boundary
.
It is known that the following result holds:
![]()
Then
the function
is
differentiable for all R such that
, and

where

We
can use the above computed gradient to implement the gradient algorithm to
iteratively improve the region R. This means to perturb the boundary
with the function:

We
can have a very simple interpretation of this perturbation.
Let's
assume that the region R is such that K(R) > 0. This means that we can think
of R as a region where examples of class C1 are concentrated. Note
that this fact could happen even when there are more examples belonging to C2
in R than examples belonging to C1 .In this particular situation,
the complement of R has most of his examples belonging to class C2 ,
and even much less examples belonging to class C1 . To see this,
consider the term

This
term is the proportion of examples in Ci that belong to R. Then if K(R) > 0 , it implies that the
proportion of examples of class 1 in the region R is higher than the proportion
of examples of class2 in region R. That is, examples of class 1 are
concentrated on R even though there could be very little amount of them in R
compared to examples of class 2. The use of this relative ratio instead of
using directly the number of examples of each class, makes possible to handle
classes with very low frequency, and therefore allows to detect particular
conditions ( expressed througout R ) where they relative occurence is higher.
Therefore we distinguish this region as a singular one for class 1 . Then we
are interested in perturbing the region R in such a way to include examples in
the neighborhood or R that belong to class 1 and exclude examples of class 2.
Now,
if for the j th training example we have
>
0
then
we have that the j th training example belongs "more" to class C1
than to class C2 . Therefore, if at the point s in the
boundary the j th training example is
close enough to s ( that is, fj(s) > 0 ), then we should move the
boundary outward at s to completely include the j th training example in the
new region R. By the contrary, if
<
0
we
should move the boundary inward at s in order to exclude the j th training
example from the region R.
However,
the point s is influenced by several training examples. We have to weight the
influence of the different training examples that are close to s, in order to
decide in which direction pays more to move the boundary at s.
The
following figure shows the region R and the new modified R' as proposed by the
gradient:

The
gradient algorithm can be used to improve an allready existing region or to
create a new one from the start.
In
the first case an initial region can be created with the tree segmentation
algorithm, which can include linear combinations of the numerical variables and
nonlinear combinations defined as a priori knowledge. Then the gradient
algorithm is used to perform small tunnings of the classification region.
In
the other case, the gradient algorithm can be used assuming first a big noise
covariance in order to find disciminating regions without getting stack in a
local minimum, and then as the noise covariance is reduced a sequence of
discriminating regions is computed.
Parametrized
regions
Consider
now the case when the region R depends on a parameter
. Let
be a region in Rn that depends on the
parameter
in Rp, and let
be the
boundary of
. Let's assume as
before that the probability measures in Rn
induced by the measurement noise are absolutely continue with respect to the
Lebesgue measure of Rn ,
and let fj be the Radon Nikodym derivative of the induced
probability measure by the j th training example with respect to the Lebesgue
measure. For every
in Rp and any perturbation
, we define
as
![]()
where
s is in the boundary
and s' is
its corresponding point in the perturbed boundary
through
the normal n(s) to
in s.

Then the function

is
Gateaux differentiable, and its derivative is:

Then
is differentiable
for all
such that
, and

where

Thus,
the gradient direction is:

As
before, we can have a very simple interpretation of the proposed gradient
direction. Let's assume that the region
is such that
>
0. This means that examples of class C1 are concentrated in
.
Then we are interested in perturbing the region
in such a way to include examples in the
neighborhood of
that belong to class 1 and exclude examples
that belong to class 2. Now, if for the j th training example we have
>
0
then
we have that the j th training example belongs "more" to class C1
than to class C2 . Therefore, if at the point s in the
boundary the j th training example is
close enough to s ( that is, fj(s) > 0 ), then we should move the
boundary outward at s as much as possible. The direction
is exactly the one that moves
in such a way that increases
the most.
However, we have to weight this direction with the ones suggested by
other points s in the boundary where the j th example also influences.
Therefore, the weighted direction is:

Thus
the direction to move
, as suggested by example j, is influenced
principally by the directions obtained by the points s in the boundary with
higher density. If the example j is far away from the boundary, then
is 0.
The
above term takes into account the closeness of the example to the boundary.
However we have to consider also a second factor: the opportunity that
represent the example to improve the KS distance. This gain potential is bigger
when more clearly an example belong to one of the two classes. This factor is
measured for the j th example by the following term:
This
term varies between -1 to 1. When this term is close to the extremes -1 or 1,
it means that more clearly the j th example belongs to C2 or C1
respectively. Therefore, the gain in KS is bigger when the example is excluded
( respectively included) from the discriminaiting region than the gain obtained
from an example that belongs as much to one class than to the other class.
Thus, this example has to have a bigger influence to compute the final
direction
than another example whose respective term is
closer to 0. However this term has to
be multiplied by
to indicate an outward or inward move.
Therefore
has to be moved in a weighted direction that
takes into account the influence of the two described factors for each example.
This weighted direction is:

That is,

Regions
parametrized by a set of hyperplanes
We
will consider now the special case when the region R is specified by a set of
hyperplanes. These hyperplanes define the boundary of a convex clasification
region R. This is analogous to the neural network aproach where the number of
hyperplanes has to be specified beforehand ( neurons in the first hidden
layer). We will write the gradient algorithm. The algorithm specifies how to
make small changes in the hyperplanes in order to arrive after several
iterations to a good classification region R.
Let
be defined as:
![]()
In
this case
![]()
then
the normal at the point s in the boundary defined by the i th hyperplane is
![]()
as
shown in the figure
If
we perturb the boundary at s by moving i
to i
+hi , then the point s' is
![]()
and
can be computed easily using
![]()
then

and therefore

and then

Therefore
the gradient direction is:

where
![]()
This
gradient direction has a very simple interprettion. Let´s assume that s is a
point in the boundary that corresponds to an example thatbelongs more to class
1, and that the region corresponds to class 2. Then the yperplane should be
moved in such a way as to completely exclude that example in the new region.
This direction is proportional to
.

10. TRAINING WITH SIGNALS
In
several applications monitoring of time varying features can be used for
classification. Therefore the training data bases contain signals. For example
temperature or acceleration measurements over a period of time. Here we are
interested on the time evolution of these observations, since we suspect that
the evolution pattern can make a significant difference to discriminate the
classes. In this cases the j th training example has descriptors that are functions.
For example, the k descriptor is not a number but a function: xk(j) L2([0,t1],R). The following figure shows a
typical situation with examples that are signals.

We
propose two aproaches to induce a decision tree. The first is trying to find a
linear functional that discriminates both classes. The other one is designing a
filter that help to discriminate both classes.
Linear
functional
One
way to handle the problem is to find a set ot instants tk and
thresholds ck that define a decision region of the following type:
![]()
where
![]()
or
using also the time derivatives
![]()
where
![]()
In
our experience this last type of region is used frequently by human experts.
However the determination of the best times tk and the alarm
thresholds ck and dk are usually a difficult problem.
Another
aproach is to use the average signals:
![]()
and
define a new variable z as the projection over the difference of averages
.
That is
![]()
and
therefore the searched region will be
![]()
where
.
If
we assume that the signal y(t) is a noisy observation of a deterministic signal
x(t), i.e.,
y(t) = x(t) + n(t)
where
n(t) is white noise with covariance
, then
is differentiable. Therefore KS is a
differentiable function of the parameter
, and a gradient algorithm can be used.
Preprocessing
through a filter
Tipically
signals are filtered before being used for training. ARMA filters to filter
noise or extract relevant features are used. Here we are interested in
designing a "good" filter whose designing goal is to discriminate
between two classes. This aproach is different to the usual one where the prefiltering
is done without any direct consideration to the learning goal.
Let
= (a0,...,ap,b0,...,bq,c)
, be the parameters of an ARMA filter with transfer function
![]()

That is
![]()
where
is the filter impulse response function, and
therefore
is its Laplace transform. We have to design
the filter ( find
) such that the
signals from class1 and class2 are as different as possible.

Let's define
![]()
Other
similar regions can also be used such as:
![]()
or

If
the observations y(t) are noisy and come from a deterministic signal x(t)
corrupted by additive white noise:
y(t) = x(t) + n(t)
where
n(t) is white noise with variance
, then

where
![]()
and
![]()
therefore
the KS distance is

which
is a differentiable function of
.
ACKNOWLEDGEMENT
I
want to thank to Yves Lechevallier from INRIA and Raúl Gormaz for the fruitfull
discussions about the concept proposed in this paper. However, eventual
mistakes and missconceptions are my responsability. I also appreciate the
support from the Ministere d'Affaire Etrangeres of the french goverment that
makes possible my visits to work with INRIA researchers.
REFERENCES