PRTools contents |
DTC
W = DTC(A, CRIT, CHISQSTOPVAL, PRUNE, T, NFEAT)
Input | |
A | Training dataset. Object weights and class priors: It is possible to assign individual weights to dataset objects by using 'weights' identifier: |
If weights are not defined they assumed to be equal to 1 for
all objects. The actual weights used by the training routine
are computed by using supplied weights along with the class
priors (if priors are not set then the apparent priors are used):
ACTUAL_WEIGHTS(I) = M*(WEIGHTS(I)/CW(C))*(PRIOR(C)/SUM(PRIOR))
where M is the total amount of (labelled) objects in A,
CW is the vector weights sums for each class, and C is
the class of the object I. The sum of all actual weights is M.
Feature types:
Features are treated diffrently based on their domain information.
If the feature domain is empty or is the interval/collection of intervals
then this feature is considered to be the continuous one and
branches are created by splitting at the threshold value.
Otherwise (feature domain specifies a set of values or set of
names), the feature is considered to be the nominal one and
branches corresponding to all present feature values are
created.
Unknown and 'non-applicable' values:
If the feature value is NaN then it is assumed that its value
is unknown. In such a situation the object with unknown value
is split into fractions and sent down all branches.
The concept of the 'non-applicable' feature value is different
from the concept of the unknown (missing) value.
...
The 'non-applicable' values for the continues features are
encoded as INF. If feature domain is the (set of) interval(s),
then INF value has to be explicitly added to the domain
defintion. The 'non-applicable' value of nominal features
does not have predefined encoding. If it is necessary user
have to include such value into domain definition.
CRIT Splitting citerion name.
'igr' Information Gain Ratio (default):
As defined by Quinlan. The penalty on the number of the distinct
values of the continues feature is used. If the gain is zero or
negative due to such penalization, the split is not performed.
This leads to smaller trees and may give non-zero training error.
This criterion does not use costs. (Costs are used only at the classification step).
'gini' Gini impurity index. More precisely, the change in this
index. GINI index can be interpreted as a misclassification rate
for the stochastic prior based classifier, so costs are
naturally embedded. If the change in the (absolute) error less
or equal to 0.1 (change in the cost less or equal to 0.1 of minimal
absolute value of non-zero costs) the split is not performed.
This leads to smaller trees and may give non-zero training error.
'miscls' Classification error criterion.
To be used only for educational puposes because
it gives rather inferior results. Costs are naturally embedded.
CHISQSTOPVAL Early stopping crtitical value for the chi-squared test
on the difference between the original node class distribution and
branches class distributions. CHISQSTOPVAL is 0 by default.
Which means that branches will be discarded only if they bring no
change in class distribution.
PRUNE Pruning type name
'prunep' - pessimistic (top-down) pruning as defined by Quinlan.
Pessimistic pruning be perfromed if cost matrix is defined.
'prunet' - test set (bottom-up) pruning using the (required)
dataset T.
These implementations of both pruning algorithms may be not
exactly correct if there are unknown values in datastes.
T Test test for the test set pruning
Output | |
W | Classifier mapping |
If (full grown) branches of (sub)tree do not improve classification error (misclassification cost) they are immediatley discarded. This may happen because we use regularized posteriors. As a result the algorithm is more stable, trees are smaller, but split on the training set may be not perfect.
[1] J.R. Quinlan, Simplifying Decision Trees, International Journal of Man-Machine Studies, 27(3), pp. 221-234, 1987. [2] J.R. Quinlan, Improved use of continuous attributes in C4.5. Journal of AI Research, 4(96), pp. 77-90, 1996.
PRTools contents |