PRTools contents

PRTools manual

datasets

DATASETS

Info on the dataset class construction for PRTools

This is not a command, just an information file.

Datasets in PRTools are in the MATLAB language defined as objects of the  class DATASET. Below, the words 'object' and 'class' are used in the pattern  recognition sense.

A dataset is a set consisting of M objects, each described by K features.  In PRTools, such a dataset is represented by a M x K matrix: M rows, each  containing an object vector of K elements. Usually, a dataset is labeled.  An example of a definition is

    DATA = [RAND(3,2) ; RAND(3,2)+0.5];
    LABS = ['A';'A';'A';'B';'B';'B'];
    A = DATASET(DATA,LABS)

which defines a [6 x 2] dataset with 2 classes.

The [6 x 2] data matrix (6 objects given by 2 features) is accompanied by  labels, assigning each of the objects to one of the two classes A and B.  Class labels can be numbers or strings and should always be given as rows  in the label list. A lable may also have the value NaN or may be an empty  string, indicating an ulabeled object. If the label list is not given,  all objects are marked as unlabeled.

Various other types of information can be stored in a dataset. The most  simple way to get an overview is by typing

    STRUCT(A)

which for the above example displays the following

           DATA: [6x2 double]
        LABLIST: [2x1 double]
           NLAB: [6x1 double]
        LABTYPE: 'crisp'
        TARGETS: []
        FEATLAB: [2x1 double]
        FEATDOM: {1x2 cell}
          PRIOR: []
           COST: []
        OBJSIZE: 6
       FEATSIZE: 2
          IDENT: {6x1 cell}
        VERSION: {1x2 cell}
           NAME: []
           USER: []

These fields have the following meaning

 DATA an array containing the objects (the rows) represented by  features (the columns). In the software and help-files, the number  of objects is usually denoted by M and the number of features is  denoted by K. So, DATA has the size of [M,K]. This is also defined  as the size of the entire dataset.
 LABLIST The names of the classes, stored row-wise. These class names  should be integers, strings or cells of strings. Mixtures of  these are not supported. LABLIST has as many rows as there are  classes. This number is usually denoted by C. LABLIST is  constructed from the set of LABELS given in the DATASET command  by determining the unique names while ordering them alphabetically.
 NLAB an [M x 1] vector of integers between 1 and C, defining for each  of the M objects its class. They are indexing LABLIST.
 LABTYPE 'CRISP', 'SOFT' or 'TARGETS' are the three possible label types.  In case of 'CRISP' labels, a unique class, defined by NLAB, is  assigned to each object, pointing to the class names given in  LABLIST.  For 'SOFT' labels, each object has a corresponding vector of C numbers between 0 and 1 indicating its membership (or confidence  or posterior probability) of each of the C classes. These numbers  are stored in the array TARGETS of the size M x C. They don't  necessarily sum to one for individual row vectors.  Labels of type 'TARGETS' are in fact no labels, but merely target  vectors of length C. The values are again stored in TARGETS and  are not restricted in value.
 TARGETS [M,C] array storing the values of the soft labels or targets.
 FEATLAB A label list (like LABLIST) of K rows storing the names of the  features.
 FEATDOM A cell array describing for each feature its domain.
 PRIOR Vector of length C storing the class prior probabilities. They  should sum to one. If PRIOR is empty ([]) it is assumed that the  class prior probabilities correspond to the class frequencies.
 COST Classification cost matrix. COST(I,J) are the costs  of classifying an object from class I as class J. Column C+1 generates an alternative reject class and may be omitted,  yielding a size of [C,C]. An empty cost matrix, COST = [] (default) is interpreted as COST = ONES(C) - EYE(C) (identical  costs of misclassification).
 OBJSIZE The number of objects, M. In case the objects are related to a  n-dimensional structure, OBJSIZE is a vector of length n, storing  the size of this structure. For instance, if the objects are pixels  in a [20 x 16] image, then OBJSIZE = [20,16] and M = 320.
 FEATSIZE The number of features, K. In case the features are related to  an n-dimensional structure, FEATSIZE is a vector of length n,  storing the size of this structure. For instance, if the features  are pixels in a [20 x 16] image, then FEATSIZE = [20,16] and  K = 320.
 IDENT A cell array of M elements storing indicators of the M objects.  They are initialized by integers 1:M.
 VERSION Some information related to the version of PRTools used for  defining the dataset.
 NAME A character string naming the dataset, possibly used to annotate  related graphics.
 USER Free field for the user, not used by PRTools.

The fields can be set by commands like SETDATA, SETFEATLAB, SETLABELS,  see below for a complete list.  Note that there is no field LABELS in the DATASET definition. Labels are  converted to NLAB and LABLIST. The command SETLABELS however exists and  takes care of the conversion.

The data and information stored in a dataset can be retrieved as follows

Many standard MATLAB operations and a number of general MATLAB commands have  been overloaded for variables of the DATASET type.

See also

dataset, data2im, obj2feat, feat2obj, im2feat, im2obj, dataim, setdata, setfeatlab, setfeatdom, setfeatsize, setident, setlabels, setlablist, setlabtype, setname, setnlab, setobjsize, setprior, setcost, settargets, setuser, setlablistnames, setversion, getdata, getfeatlab, getfeatdom, getfeatsize, getident, getlabels, getlablist, getlabtype, getname, getnlab, getobjsize, getprior, getcost, getsize, gettargets, getuser, getversion, getclassi, getlablistnames, findident, findlabels, findnlab,

PRTools contents

PRTools manual