Reviewed July 23,  1994

Presented at INRIA (projet CLOREC),

Paris, June 23, 1994

 

INDUCTION OF DECISION TREES WHEN EXAMPLES ARE DESCRIBED WITH NOISY MEASUREMENTS AND WITH FUZZY CLASS MEMBERSHIP

 

[1]

 Roberto Araya

AutoMind

www.automind.cl

 

 

Abstract: An extension of the induction of decision trees based on the Kolmogorov Smirnov distance is proposed to handle noisy observations and fuzzy class membership. This extension is based on a new generalization of the Kolmogorov Smirov measure. It is shown that in the case of noisy measurements the KS measure is a differentiable function of the classification region. Thus, a gradient algorithm can be implemented in order to search for good discriminating regions. This algorithm is extended to the case when the training examples contain signals whose historic evolution has to be taken into account.

 

 

1. INTRODUCTION

 

Induction of decision trees

 

Induction of decision trees is one of the more popular attempts to do supervised machine learning. The idea is to analyse data bases and distill knowledge from them. The acquired knowledge is represented as decision trees. Research on this area has been done by Breiman et Al. [4] who proposed the CART algorithm, Quinlan [11] who proposed the ID3 algoritm, and Friedman [8] and Celeux and Lechevalier [5] with the tree segmentation algorithm. Several other authors have proposed generalizations and improvements of these original algorithms. In this work we deal with a generalization to the segmentation tree algorithm as originally proposed by Friedman.

 

The discriminating power of a variable

 

The induction algorithm based on the segmentation technique attempts at each stage to measure the discriminating power of each variable to discriminate one class from the other. Friedman proposes the Kolmogorov Smironov (KS) distance as a measure of discriminating power. 

 

The Kolmogorov Smirnov measure is the maximum distance between two distribution functions. This measure was first used by Friedman [8] as a discriminating measure in an algorithm for inducing decision trees from data. For each variable the empirical distribution functions for each one of the two classes are built. If for a given variable, the maximum distance between the two distributions is equals to 1, then that variable segments perfectly the two classes. If the maximum is zero, then that variable is completely useless to segment the classes. The algorithm was further studied by Celeux and Lechevallier [5], and implemented computationally by Lechevallier in a software called SICLA . Several extensions of this algorithm were made by the author [1] (see also [13]) to include categorical variables, change of variables and to incorporate different types of background knowledge. These extensions were implemented in an interactive software called AutoExpert .

 

In this paper two further extensions are proposed: the case when examples are described through noisy measurements and the case when class membership is fuzzy.

 

A priori knowledge: measurement noise

 

A priori or background knowledge offers the opportunity to improve the induction process. One type of knowledge that is usually encountered is the probability distribution of the measurements. Let us denote x the vector of decision variables and N the measurement noise. Then the observations encountered in the data base will be y = x + N. In a more general case we won't have additive noise. However, we assume that the measurement noise effect on the decision factors is specified in a precise way.

 

Supervised learning: Fuzzy class membership  

 

Supervised learning assumes that all the examples in the data base can be exactly determined to which class ( or classes ) they belong to. This assumption is very strong. In several real world problems there are examples where it is only possible to specify a degree of membership to the classes.

 

The gradient algorithm

 

The proposed extension of the Kolmogorov Smirnov measure is shown to be a differentiable function of the discriminating region. Furthermore, the gradient is easy to compute. It proposes a way to adjust decision regions that is very easy to interpret. This fact leads to the implementation of gradient algorithms to fine tune discriminating regions and, therefore, to improve the performance of the induced trees.

 

Organization of the paper

 

Section two reviews the definition of the Kolmogorov Smirnov measure for regions as described in [1]. Section 3 extend the KS measure when noisy measurements are present. Section 4 extend the KS measure when class membership is fuzzy. Section 5 consider the case when both noisy measurements and fuzzy class membership are present at the same time. Section 6 studies the data base segmentation process when noisy observations are present.  Section 7 proposes strategies to find good discriminating regions. They are based in change of variable schemes, either for numeric or categorical variables. Section 8 shows with an example a KS based induction of a decision tree. Section 9 shows the differentiability of the function that assign to any region R in   its KS measure. Then the particular case of a parametrized regions is studied and the gradient computed. The case corresponding to regions defined by hyperplanes is also studied and the gradient computed. This latter case is very similar to the perceptron model of neural networks. The proposed gradient produces then an alternative algorithm to modifiy the defining hyperplanes. Finally, section 10 is dedicated to the case when the training examples are signals.

 

 

2. DECISION REGIONS AND ITS DISCRIMINATING POWER

 

Let C1 and C2 be two classes not necessarily disjoints, and let   ( N being the set of natural numbers), be a set of training examples in the data base. Each example j in  T has associated a two dimensional vector  that denotes its membership to classes C1 and C2. For example   means that the 5 th training example of the data base belongs to class C1.   means that the 8 th training example belongs to classes C1 and C2 at the same time.

 

All examples are specified with descriptions corresponding to a list of decision factors x1,...,xn. Each example j has an associated description x(j)=(x1(j),...,xn(j)). For example, if x1 is color, then x1(3)=red means that the third example has red color. Let X be the cartesian product of the range spaces of the decision factors xk. Let R be a region in the description space X. Then the KS discriminating power of the region R to segment between classes C1 and C2, given the set T of examples, is given by the absolute value of the difference between two ratios. These ratios are the proportion of examples in each class belonging to the region R. This can be written as:

 

 

That is:

 

 

Example 1: Consider the following data base

 

color   * class1 class2

blue    *    1        0

green  *    0        1

blue    *    1        0

yellow *    0        1

green *    0         1

blue    *    1        1

green *    0        1

 

This data base has to be read as follows: in the first example, color is blue, and the example belongs to class1. In the second example, color is green, and it belongs to class2, etc.

 

Here T is the set of training examples, that is  T={1,2,3,4,5,6,7}, x1 is color, and X ={blue,green,yellow}.  Let R be the decision region R={blue}. Then the proportion of examples in class1 that belong to the region R is 3/3=1. The proportion of examples in class2 that belong to region R is 1/5. Therefore the KS measure is |1-1/5 | =4/5 = 0.8

 

In mathematical terms, the KS measure of the region R to segment class C1 from class C2, given the examples T, is given by:

 

where

            c1 = is the set of points of R2 such that its first component is 1 , and

         c2 = is the set of points of  R2 such that its second component is 1

and

             is the characteristic function  ().

 

In the previous example this would be:

 

KS(R/T)= | 3/ card( {1,3,6} )  - 1/ card( {2,4,5,6,7} ) |

               = | 3/3 - 1/5 |

               = 0.8

 

 

3. EXTENSION OF THE KS MEASURE FOR NOISY MEASUREMENTS

 

Let C1 and C2 be two classes not necessarily disjoint, and let T be a set of training examples in the data base. Let R be a region. Then the discriminating power of the region R to segment between classes C1 and C2 given the set T of examples is given by the absolute value of the difference of two ratios. These ratios are the proportions of examples in each class that "belong" to the region. However, since we have noisy observations, we cannot count the number of examples in the region. We can only compute for each example its probability of belonging to the region. More precisely:

 

 

where P is the probability distribution of the measurements in the description space X.

 

Example 2: Consider the previous data base with the training examples T and the same decision region R, but now assume that the "color" variable has measurement noise. Therefore a probability distribution is given for its measurement. In this example it is given the following interpretation for the "color" column:

 

"color=blue" means

p(color=blue)=0.8 , p(color=green)=0.1, p(color=yellow)=0.1

 

"color=green" means

p(color=blue)=0.2 , p(color=green)=0.6, p(color=yellow)=0.2

 

"color=yellow" means

p(color=blue)=0   , p(color=green)=0.1, p(color=yellow)=0.9

 

Then the data base can be described in the following form:

 

Color:     Blue

Color:   Green

Color: Yellow

*

Class 1

Class 2

0.8

0.1

0.1

*

1

0

0.2

0.6

0.2

*

0

1

0.8

0.1

0.1

*

1

0

0

0.1

0.9

*

0

1

0.2

0.6

0.2

*

0

1

0.8

0.1

0.1

*

1

1

0.2

0.6

0.2

*

0

1

 

 

Let the decision region be R={blue}, then

 

 

                = | ( 0.8 + 0.8 + 0.8 )/3   -  ( 0.2 + 0 + 0.2 + 0.8 + 0.2 )/5 |

 

                = 0.52

 

Comparing example 1 and example 2, we can see the degradation of the discriminating power of the variable color due to the measurement noise.

 

Missing information

 

Assume that for a given example we do not have information about the value of the atribute x. We can consider that this is an example with measurement noise whose distribution is equal to the empirical distribution obtained from the rest of the examples that belong to the same class, and whose value for x is known.

 

 

4. EXTENSION OF THE KOLMOGOROV SMIRNOV MEASURE FOR DECISION REGIONS WHEN CLASS MEMBERSHIP IS FUZZY

 

Assume now that we have a data base where class membership is fuzzy. This means that there is a degree of class membership. Consider for example the following data base:

 

Example 3:  consider now the following data base:

 

Color    *  Class1  Class2

blue      *    0.9         0.1

green   *    0.1         0.9

blue      *    0.9         0.1

yellow   *    0            1

green   *    0.2         0.8

blue      *    1            1

green   *    0.2         1

 

Note that in this data base the membership numbers are not necessarily 0 and 1 as before. The first example where color is blue, belongs to class1 with a memberhip degree of 0.9 and belongs to class2 with a membership degree of 0.1.

 

The membership degrees are numbers between 0 and 1 inclusive. Note that for each example the sum of its membership degrees need not to be 1, since the classes need not to be exclusives.

 

In this section we assume that there is no meaurement noise. In this case the extended Kolmogorov Smirnov measure for a region R given the examples T is given by the absolute value of the difference of two ratios. These ratios are the proportion of examples "of" each class that belong to the region. Since class membership is fuzzy we cannot count the number of examples in each class. We can only add the membership degree of the examples to a given class. Also each example in the region have to be weighted with the membership degree to the given class to obtain the total amount of examples in that class that belong to the region. More precisely:

 

 

 

where          ,  for each class Ci,

 

and                    

 

Note that:

 

   for i=1,2 and all j , and

 

     for all j

 

 

Example 4:  If T = { 1,2,3,4,5,6,7} in the data base of example 3 shown above, and R = {blue}, then

 

 

 

               = | ( 0.9 + 0.9 +1 )/3.3 - (0.1 + 0.1 + 1)/4.9 |

 

               = 0.6

 

 

5. EXTENSION OF THE KOLMOGOROV SMIRNOV MEASURE FOR NOISY MEASUREMENTS AND FUZZY CLASS MEMBERSHIP AT THE SAME TIME

 

Let T be a set of examples, R a decision region, P a measurement probabilty distribution, and a class membership measure, then

 

 

or equivalently,

 

 

 

Example 5: Consider the data base 2 defined in example 3, but with noisy measurements as defined in example 1. Then the data base is:

 

Color:     Blue

Color:   Green

Color: Yellow

*

Class 1

Class 2

0.8

0.1

0.1

*

0.9

0.1

0.2

0.6

0.2

*

0.1

0.9

0.8

0.1

0.1

*

0.9

0.1

0

0.1

0.9

*

0

1

0.2

0.6

0.2

*

0.2

0.8

0.8

0.1

0.1

*

1

1

0.2

0.6

0.2

*

0.2

1

 

 

and therefore the KS distance is:

 

 

        = | ( 0.9x0.8 + 0.1x0.2 + 0.9x0.8 + 0x0 + 0.2x0.2 + 1x0.8 + 0.2x0.2 )/3.3 -

             ( 0.1x0.8 + 0.9x0.2 + 0.1x0.8 + 1x0 + 0.8x0.2 + 1x0.8 +    1x0.2 )/4.9  |

 

        = 0.403

 

 

Properties:

 For all regions R and non empty training set T, we have:

i)      , where  is the complement of.

ii)       

iii)       implies that it does not exist an example in T that belongs to two classes at the same time with a non zero degree of membership, and that all examples of one class belong to R and all examples of the other class do not belong to R.

iv)     KS is a true extension

v)     KS does take into account very low frequent classes. For example if there are 1000 examples of class 2 and only 1 example of class 1, and region R contains the example of class 1 and 5 examples of class 2, then KS(R) = 0.995 .

 

            

 

This means that KS  identifies R as a good region to predict class 1 even though there are more examples belonging to class 2 in R than to class 1.

 

Consider now the cases where there are N classes. Then we try all combinations in two super classes and for each one we compute the KS measure. We choose the combination that gives the maximum KS. This combination defines the KS.

 

This means that:

 

 

6. SEGMENTATION OF A DATA BASE WITH NOISY OBSERVATIONS

 

Let's consider a data base with noisy observations, and assume that a region R0 is being chosen for segmentation. This is a region in the description space that segment the data base in two: the set of examples belonging to R0  , i.e. the examples that satisfy the defining condition of R0  , and the set of examples that belong to R0' the complement of R0 , i.e. the examples that do not satisfy the defining condition of R0 . If observations were not noisy, the data base T is segmented in two disjoint subsets:     and    .

 

In the case of noisy observations, the data base T is not segmented in two disjoint subset. All we have is a set of probabilties  that each example j of the data base belongs to the region R0 , and a corresponding set of probabilities   that each example j belongs to the complement R0' of the region R0 .

 

Therefore after segmenting by the region R0 , each example in R0 count less. Its membeship degree is then disminished proportional to the probabilty that the example belongs to R0. We then define the membership degree of the example j, given that we are segmenting on the region R0 as:

 

 

and consequently

 

 

Then the KS discriminating measure of a region R given the data base T and that the examples are in the region R0 is defined as:

 

 

There are two important cases that we will use later on. One case is when  .Then

 

 

and therefore

 

 

The other case is when satisfying the defining condition of  does not imply anything about belonging to R. In this case

 

 

and therefore

 

 

In other cases a priori knowledge can be introduced to compute the conditional probabilities .

 

Example 6: Consider the data base

 

Color:     Blue

Color:   Green

Color: Yellow

Model            a

Model                   b

*

Class 1

Class 2

0.8

0.1

0.1

0.6

0.4

*

0.9

0.1

0.2

0.6

0.2

0

1

*

0.1

0.9

0.8

0.1

0.1

0.7

0.3

*

0.9

0.1

0

0.1

0.9

0.2

0.8

*

0

1

0.2

0.6

0.2

0.5

0.5

*

0.2

0.8

0.8

0.1

0.1

0.4

0.6

*

1

1

0.2

0.6

0.2

0.4

0.6

*

0.2

1

 

If we segment using color and R={Green, Yellow} then the segmented data base will be:

 

  

Color:   Green

Color: Yellow

Model            a

Model                   b

*

C1

C2

0.2

0.1

0.1

0.6

0.4

*

0.9

0.18

0.1

0.02

0.8

0.6

0.2

0

1

*

0.1

0.08

0.9

0.72

0.2

0.1

0.1

0.7

0.3

*

0.9

0.18

0.1

0.02

1

0.1

0.9

0.2

0.8

*

0

0

1

1

0.8

0.6

0.2

0.5

0.5

*

0.2

0.16

0.8

0.64

0.2

0.1

0.1

0.4

0.6

*

1

0.20

1

0.20

0.8

0.6

0.2

0.4

0.6

*

0.2

0.16

1

0.80

 

Therefore

 

This means that there are 0.96 examples in C1 and 3.4 examples in C2, that have color green or yellow.

 

Let R={color = green} and let's compute its KS measure given that the examples are green or yellow. This means that R0 = { color = green or yellow }. In this case  . The number of examples in class 1 that are green is

0.1x0.9 + 0.6x0.1 + 0.1x0.9 + 0.1x0 + 0.6x0.2 + 0.1x1 + 0.6x0.2 = 0.58  . Therefore the proportion of examples in class 1 that are green is 0.56/0.96 = 0.604  . Similarly, the number of examples in class 2 that are yellow is 1.84 , and therefore the proportion of examples of class 2 that are green is 0.541/3.4 = 0.541 .  Then the distribution histograms is the following:

 

                

 

and therefore the KS measure is KS(R/T,R0 )=0.604-0.541 = 0.063

 

Let now R be the set R={model = a }. In this case R and R0 are competely independent. The number of examples in class 1 that have model = a , is then

0.6x0.18 + 0x0.08 + 0.7x0.018 + 0.2x0 + 0.5x0.16 + 0.4x0.20 + 0.4x0.16 =0.458 , and therefore the proportion of examples in class 1 that have model a given that  have color green or yellow is 0.458/0.96 = 0.477 . Similarly the number of examples in class 2 that have model a given that have color green or yellow is 0.946, and the corresponding proportion is 0.946/3.4 = 0.278 . Then the distribution histogram is:

 

                

 

and therefore tthe KS measure is KS(R/T,R0 ) = 0.477 - 0.278 = 0.199 .

 

 

7. HOW TO FIND A GOOD DISCRIMINATING REGION

 

Finding good decision regions is a basic step for the tree induction process. The original algorithm as proposed by Friedman, examined each variable and determined for each variable a threshold. Therefore, if there were N variables ( x1, ..., xN), then N decision regions were examined Ri= { x / xi <= ci }, i=1,...,N, where ci is the threshold computed as the point where the maximum distance between the distribution function was reached.

 

Better decision trees can be obtained if other decision regions are also examined besides the ones mentioned above. These regions can be proposed as a-priori knowledge or it can be regions proposed by an algorithm.

 

In the following we propose a method to search for regions that is decomposed in two steps:

 

Step 1: Create new one dimensional atributes ( variables).

 

Step 2: Look for the best threshold or set of thresholds for each of the previously created atributes.  

 

Step1 and Step2 lead to new regions that mix the original atributes. These regions can compete favorably with the original ones that are based in one atribute at a time.

 

Step 1

 

Numerical atributes

 

If in the data base we have numerical variables, new atributes can be created as linear or non-linear combinations of the original ones. Linear combinations corresponds to new axis like the axis z as indicated in the following figure:

 

                   

 

A simple new variable to try is the one obtained from the projection to the vector computed as the difference  of the center of masses of the two set of points ( those belonging to class1 and those belonging to class2 ). Then the new variable is

 

 

                               

 

 

If the covariances matrices  of the points belonging to class1 and class2 respectively, are also computed, then the previous new axis can be adjusted to incorporate knowledge from the dispersion of the two classes. Two new variables that can be tried are:  and . These new variables are the one used in discriminant analysis. If both populations ( class1 and class2) are gaussian with the same covariance ( that is  ), then the new defined variable is optimal.

 

If one of the covariance matrices is singular, then all examples belonging to the corresponding class are concentrated in an affine subspace. Then directions orthogonal to that subspace can be very discriminant. This can be true if both populations are not included in the same afin subspace. For example if the center of mass difference vector  , does not lie on the subspace.  Let  be an eigenvector of say  with null eigenvalue. That is:

 

 

Then if

 

 

a new variable to try would be

 

 

If we plot the distribution functions of both classes using this new atributre z, we will see that class1 will be concentrated in one point. If the center of masses difference vector  lies in the same subspace, then we restrict the data to that subspace. Then the restriction of the covariance to that subspace will be non-singular and we can use the variable

 

 

where  is the projection operator over the null space of  .

 

Non linear combinations can also be tried. We can define a parametric region R through a non linear function f(, . ) from RN to R, in the following way:

 

 

Categorical atributes

 

One of the more interesting challenges is to search for interesting new variables ( or axis ) for categorical variables. A simple idea is to define a new axis jointing two categorical variables as shown in the previuos section: if we have two categorical variables x1 and x2 with values a11,...,a1n and a21,...,a2p respectively, then a new z variable can be defined as:

 

z = 1  means x1=a11 and x2=a21 ,

z = 2  means x1=a11 and x2=a22 ,.

...

z = np  means x1=a1n and x2=a2p .

 

The combinations that do not have at least one example in the data base T can be excluded.

 

Example: Consider the following data base

 

Color

Model

*

Class 1

Class 2

green

a

*

1

0

blue

a

*

0

1

blue

a

*

0

1

blue

b

*

1

0

blue

c

*

1

0

blue

c

*

0

1

blue

d

*

1

0

blue

d

*

1

0

blue

d

*

1

0

red

a

*

0

1

red

b

*

0

1

red

b

*

0

1

red

c

*

1

0

red

c

*

1

0

red

d

*

1

0

red

d

*

1

0

 

This data base can be plotted in the following form:

 

        

 

A new axis z would be the following:

 

 

       

 

 

Different ordering strategies lead to different variables. However, all these axis lead to the same segmentation region. This is due to the fact that looking at all the maxium and minimum as explained below and forming the region R accordingly, is equivalent to ordering forming the region R for class 1 by selecting the values where the frequency of class 1 is higher than the frequency of class 2.

 

Step 2

 

Once defined the set of variables, step2 has to be applied to each one of them. This means that a threshold or set of thresholds is searched. We do that by looking at all the local regional maxima and local regional minima of the difference of the distribution functions corresponding to each of the two classes. A local regional maximum ( minimum ) is computed as the center of the biggest interval containing all the local maximums ( minimums ) that contain a given local maximum ( minimum ). Furthermore , outside of the interval the function must be strictly bigger (smaller). 

 

For example in the following figure the region R is obtained as the union of two subregions. The first one is the interval between the first maximum and the first minimum. In this interval the distribution function of class 2 ( the one whose distribution function is located below the other one ), increases more rapidly than the distribution function of class 1. The other subregion is the semi interval that starts at the second maximum. In this semi inteval class 2 also increases more rapidly than class 1.

 

                 

 

Similarly, in the following figure the region R is obtained as a union of two intervals. In each one class 2 increases more rapidly than class 1. In the complement region class 1 increases more rapidly.

 

            

 

This way from the distribution functions we obtain the region R. This region considers all maxima and minima. This method is an extension of the standard one that is based only on the global maximum, and allow to have a completely similar treatment of numeric and categorical variables. This means that if a numeric variable is discretized in a finite number of intervals, and all ordering strategies are tried to find a good region then the solution is similar to using all maximum and minimum. And this strategy is similar to the used directly on the numeric variable. If only the global maximum were used then differences are produced between the numeric and discretized aproach.

 

Example 8: In example 7 the region R obtained from the distribution functions is:

 

             

 

Then R is the union of three subregions: R = {3,4,5} U {7} U {9}  . And this is equivalent to:

 

R = { model = d } U { model = c and color = blue } U { model = a and color = green }.

 

KS(R/T) = 0.5 + 0.2 - (0.3 - 1/6) + ( 0.4 - 1/6) + 0.1

              = 0.9

 

 

8. INDUCTION OF DECISION TREES

 

We have all the ingredients to do the segmentation process. At each stage different regions have to be proposed and its segmentating powers have to be measured. A good region is selected and then utilized to segment the data base. In each subregion the same selection and segmentation process is performed again.

 

We will show this process with an example. The data base has two categorical variables. We will use both at the same time to find the best segmentation region.

 

Example 9: consider the data base described in example 6.Let's define the new categorical variable z containing six values {1,2,3,4,5,6} as defined in the following figure:

 

             

 

That is,

z=1 means color=yellow and model=a

z=2 means color=green and model=a

z=3 means color=blue and model=a

z=4 means color=blue and model=b

z=5 means color=green and model=b

z=3 means color=yellow and model=b

 

The number of examples in class 1 with z=1 is:

0.1x0.6x0.9 + 0.2x0x0.1 + 0.1x0.7x0.9 + 0.9x0.2x0 + 0.2x0.5x0.2 + 0.1x0.4x1 + 0.2x0.4x0.2 = 0.193 . Since there are 3.3 examples in class 1, the frequency is 0.193/3.3 = 0.058 . Similarly the frequency of class 2 for z=1 is 0.080. Then the following frequency histogram is obtained:

 

            

 

And the following distribution function is obtained:

 

            

 

Therefore, R = {1,2} U {5,6} . This means that R = { color = green or yellow } and its KS is given by KS(R/T)=0.353 + 0.050 = 0.403 .

 

 

The induction of a decision tree is a sequential process. At each stage a segmentation of the data base is performed. However due to measurement noise fractional parts of the examples can remain in a segmented data base. We can compute at any leaf the probability of each class. Let R be the region defined by the leaf. Then the probabilty that a case k belong to class i given that it belongs to the region R is:

 

 

Example 10: consider a leaf defined by the region R as shown in the figure below:

 

 

The total number of examples is 8. Note that there are two examples that have a 0.5 degree of membership to class1 and class2. There is another example that belongs to both class1 and class2 at the same time. Note also that there is one example that belongs completely to class1 but only half of it is in the region R.

 

Then

 

P(class1/R) = 4.5/7.5

P(class2/R)=4/7.5

 

Example 11: Consider again the induced tree of example 9. Since

      and       

 

we have

P(class 1/color=green or yelow)= ( 0.193 + 0.265 + 0.315 + 0.187 )/4 = 0.24

P(class 2 / color= green or yellow )= ( 0.393 + 0.553 + 1.287 + 1.167 )/4 = 0.85

P( class 1 / color=blue) =( 1.292 + 1.064 )/3 = 0.785

P(class 2 / color = blue ) = ( 0.584 + 0.916 )/3 = 0.5

 

 

More generally, we can compute the probability that the degree of membership to a class i be bigger than a certain threshold in a region R as:

 

 

where    is the characteritic function of the interval [c,1] .

 

Stopping criteria

 

Without observation noise and with mutually exclusive classes the segmentation process can continue up to obtaining leafs that contain examples belonging to only one class. Then, prunning techniques are used to produce trees capable of generalization to independent but similar data bases. If observation noise is included, the knowledge about the noise can be used to stop the tree generation process at a more apropiate stage, avoiding noise fitting and reducing later prunning. One stopping criteria is to define a theshold for the KS measure. That is, stop segmenting a particular leaf when KS(R/T) < c for all partitions R of the subset of examples belonging to the leaf.

 

 

9.  KS DIFFERENTIABILITY AS FUNCTION OF THE REGION

 

We will consider in this section regions that are defined with numerical atributes. We will analyse how small changes in the region shape will produce small changes in the KS measure when all atributes have measurement noise.

 

Let R be a region in Rn and let  be its boundary . Let's assume that the probability measures in Rn induced by the measurement noise are absolutely continue with respect to the Lebesgue measure of Rn , and let fj be the Radon Nikodym derivative of the induced probability measure by the j th training example with respect to the Lebesgue measure. Let's perturbe the region R to a new region R', defining a perturbation function  defined on the boundary  as:

 

 

where s is in the boundary  and s' is its corresponding point in the perturbed boundary   through the normal n(s) to    in s.

 

             

 

 

 Then the function

 

is Gateaux differentiable. Its derivative is defined as:

 

 

where  is the region R perturbed by  in each point s of its boundary . It is known that the following result holds:

 

 

Then the function   is differentiable for all R such that  , and

 

 

where

 

 

We can use the above computed gradient to implement the gradient algorithm to iteratively improve the region R. This means to perturb the boundary  with the function:

 

 

We can have a very simple interpretation of this perturbation.

 

Let's assume that the region R is such that K(R) > 0. This means that we can think of R as a region where examples of class C1 are concentrated. Note that this fact could happen even when there are more examples belonging to C2 in R than examples belonging to C1 .In this particular situation, the complement of R has most of his examples belonging to class C2 , and even much less examples belonging to class C1 . To see this, consider the term

 

 

This term is the proportion of examples in Ci that belong to R.  Then if K(R) > 0 , it implies that the proportion of examples of class 1 in the region R is higher than the proportion of examples of class2 in region R. That is, examples of class 1 are concentrated on R even though there could be very little amount of them in R compared to examples of class 2. The use of this relative ratio instead of using directly the number of examples of each class, makes possible to handle classes with very low frequency, and therefore allows to detect particular conditions ( expressed througout R ) where they relative occurence is higher. Therefore we distinguish this region as a singular one for class 1 . Then we are interested in perturbing the region R in such a way to include examples in the neighborhood or R that belong to class 1 and exclude examples of class 2.

 

Now, if for the j th training example we have

> 0

 

then we have that the j th training example belongs "more" to class C1 than to class C2 . Therefore, if at the point s in the boundary  the j th training example is close enough to s ( that is, fj(s) > 0 ), then we should move the boundary outward at s to completely include the j th training example in the new region R. By the contrary, if

 

< 0

 

we should move the boundary inward at s in order to exclude the j th training example from the region R.

 

However, the point s is influenced by several training examples. We have to weight the influence of the different training examples that are close to s, in order to decide in which direction pays more to move the boundary at s.

 

The following figure shows the region R and the new modified R' as proposed by the gradient:

 

                       

 

 

The gradient algorithm can be used to improve an allready existing region or to create a new one from the start.

 

In the first case an initial region can be created with the tree segmentation algorithm, which can include linear combinations of the numerical variables and nonlinear combinations defined as a priori knowledge. Then the gradient algorithm is used to perform small tunnings of the classification region.

 

In the other case, the gradient algorithm can be used assuming first a big noise covariance in order to find disciminating regions without getting stack in a local minimum, and then as the noise covariance is reduced a sequence of discriminating regions is computed.

 

 

Parametrized regions

 

Consider now the case when the region R depends on a parameter  . Let  be a region in Rn that depends on the parameter  in Rp, and let  be the boundary of  . Let's assume as before that the probability measures in Rn induced by the measurement noise are absolutely continue with respect to the Lebesgue measure of Rn , and let fj be the Radon Nikodym derivative of the induced probability measure by the j th training example with respect to the Lebesgue measure. For every  in Rp and any perturbation , we define   as

 

 

where s is in the boundary  and s' is its corresponding point in the perturbed boundary   through the normal n(s) to    in s.

 

                   

 

 

 Then the function

 

is Gateaux differentiable, and its derivative is:

 

Then   is differentiable for all  such that  , and

 

 

where

 

 

Thus, the gradient direction is:

 

 

As before, we can have a very simple interpretation of the proposed gradient direction. Let's assume that the region  is such that > 0. This means that examples of class C1 are concentrated in . Then we are interested in perturbing the region  in such a way to include examples in the neighborhood of  that belong to class 1 and exclude examples that belong to class 2. Now, if for the j th training example we have

 

> 0

 

then we have that the j th training example belongs "more" to class C1 than to class C2 . Therefore, if at the point s in the boundary  the j th training example is close enough to s ( that is, fj(s) > 0 ), then we should move the boundary outward at s as much as possible. The direction  is exactly the one that moves  in such a way that increases  the most.  However, we have to weight this direction with the ones suggested by other points s in the boundary where the j th example also influences. Therefore, the weighted direction is:

 

 

Thus the direction to move  , as suggested by example j, is influenced principally by the directions obtained by the points s in the boundary with higher density. If the example j is far away from the boundary, then is 0.

 

The above term takes into account the closeness of the example to the boundary. However we have to consider also a second factor: the opportunity that represent the example to improve the KS distance. This gain potential is bigger when more clearly an example belong to one of the two classes. This factor is measured for the j th example by the following term:

 

 

This term varies between -1 to 1. When this term is close to the extremes -1 or 1, it means that more clearly the j th example belongs to C2 or C1 respectively. Therefore, the gain in KS is bigger when the example is excluded ( respectively included) from the discriminaiting region than the gain obtained from an example that belongs as much to one class than to the other class. Thus, this example has to have a bigger influence to compute the final direction  than another example whose respective term is closer to 0.  However this term has to be multiplied by  to indicate an outward or inward move. Therefore  has to be moved in a weighted direction that takes into account the influence of the two described factors for each example. This weighted direction is:

 

 

That is,

 

    

Regions parametrized by a set of hyperplanes

 

We will consider now the special case when the region R is specified by a set of hyperplanes. These hyperplanes define the boundary of a convex clasification region R. This is analogous to the neural network aproach where the number of hyperplanes has to be specified beforehand ( neurons in the first hidden layer). We will write the gradient algorithm. The algorithm specifies how to make small changes in the hyperplanes in order to arrive after several iterations to a good classification region R.

 

Let  be defined as:

 

 

In this case

 

 

then the normal at the point s in the boundary defined by the i th hyperplane is

 

 

as shown in the figure

 

                         

If we perturb the boundary at s by moving i to i +hi , then the point s' is

 

and can be computed easily using

 

 

then

 

 

and therefore

 

and then

 

 

Therefore the gradient direction is:

 

where

 

 

This gradient direction has a very simple interprettion. Let´s assume that s is a point in the boundary that corresponds to an example thatbelongs more to class 1, and that the region corresponds to class 2. Then the yperplane should be moved in such a way as to completely exclude that example in the new region. This direction is proportional to .

 

 

10. TRAINING WITH SIGNALS

 

In several applications monitoring of time varying features can be used for classification. Therefore the training data bases contain signals. For example temperature or acceleration measurements over a period of time. Here we are interested on the time evolution of these observations, since we suspect that the evolution pattern can make a significant difference to discriminate the classes. In this cases the j th training example has descriptors that are functions. For example, the k descriptor is not a number but a function:  xk(j) L2([0,t1],R). The following figure shows a typical situation with examples that are signals.

 

 

We propose two aproaches to induce a decision tree. The first is trying to find a linear functional that discriminates both classes. The other one is designing a filter that help to discriminate both classes.

 

Linear functional

 

One way to handle the problem is to find a set ot instants tk and thresholds ck that define a decision region of the following type:

 

 

where

 

or using also the time derivatives

 

 

where

 

In our experience this last type of region is used frequently by human experts. However the determination of the best times tk and the alarm thresholds ck  and dk  are usually a difficult problem.

 

Another aproach is to use the average signals:

 

 

and define a new variable z as the projection over the difference of averages . That is

 

 

and therefore the searched region will be

 

 

where .

 

If we assume that the signal y(t) is a noisy observation of a deterministic signal x(t), i.e.,

 

y(t) = x(t) + n(t)

 

where n(t) is white noise with covariance  , then  is differentiable. Therefore KS is a differentiable function of the parameter, and a gradient algorithm can be used.

 

 

Preprocessing through a filter

 

Tipically signals are filtered before being used for training. ARMA filters to filter noise or extract relevant features are used. Here we are interested in designing a "good" filter whose designing goal is to discriminate between two classes. This aproach is different to the usual one where the prefiltering is done without any direct consideration to the learning goal.

 

Let  = (a0,...,ap,b0,...,bq,c) , be the parameters of an ARMA filter with transfer function

 

 

 

           

 

That is

 

 

where  is the filter impulse response function, and therefore  is its Laplace transform. We have to design the filter ( find  ) such that the signals from class1 and class2 are as different as possible.

 

      

 

Let's define

 

 

Other similar regions can also be used such as:

 

 

or

 

 

If the observations y(t) are noisy and come from a deterministic signal x(t) corrupted by additive white noise:

 

y(t) = x(t) + n(t)

 

where n(t) is white noise with variance  , then

 

 

where

 

 

and

 

 

therefore the KS distance is

 

 

which is a differentiable function of .

 

ACKNOWLEDGEMENT

 

I want to thank to Yves Lechevallier from INRIA and Raúl Gormaz for the fruitfull discussions about the concept proposed in this paper. However, eventual mistakes and missconceptions are my responsability. I also appreciate the support from the Ministere d'Affaire Etrangeres of the french goverment that makes possible my visits to work with INRIA researchers.

 

 

REFERENCES