Dataset definition

Constructor

The dataset constructor looks like

    a = dataset(data,labels)

The two items data and labels are essential for the operation of PRTools. If data is omitted (data = []), an empty dataset is defined. If labels is not supplied, the objects remain unlabeled. If only some of the objects have no labels, the corresponding entries of labels should contain NaN for numeric labels or an empty string ('') for string labels.
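As a minimal sketch of the partially labeled case, assuming PRTools is on the MATLAB path, the fragment below marks one object as unlabeled with NaN. The variable names are illustrative only.

```matlab
data   = rand(5,2);
labels = [1 1 2 NaN 2]';        % NaN marks object 4 as unlabeled
a      = dataset(data,labels);
nlab   = getnlab(a);            % class indices; unlabeled objects get 0
```

Here nlab is expected to hold a zero in the fourth position, since unlabeled objects do not point to any class.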

Label types

There are three types of labels supported for datasets: crisp, soft and targets.

More information on label types can be found elsewhere.

Dataset structure

Various items can be stored in a dataset. A full list can be found by converting a dataset variable into a structure.

    data = rand(6,2);
    labels = [1 1 1, 2 2 2]';
    a = dataset(data,labels)
%       6 by 2 dataset with 2 classes: [3  3]
    struct(a)
%
%        data: [6x2 double]
%     lablist: {2x4 cell}
%        nlab: [6x1 double]
%     labtype: 'crisp'
%     targets: []
%     featlab: []
%     featdom: {}
%       prior: []
%        cost: []
%     objsize: 6
%    featsize: 2
%       ident: [1x1 struct]
%     version: {[1x1 struct]  '03-Jan-2011 15:24:04'}
%        name: []
%        user: []

All fields have a corresponding set-command (e.g. setdata) to store them and a get-command (e.g. getdata) to retrieve them. Users are discouraged from using the '.'-constructs (e.g. a.data) to change fields, as this does not guarantee consistency with other fields. In some cases the get-commands do not return the exact fields but some derived data. The table below gives more information.
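The set/get pattern can be sketched as follows, assuming PRTools is on the MATLAB path. Only the name and data fields are shown; the other fields follow the same naming scheme.

```matlab
a = dataset(rand(6,2),[1 1 1 2 2 2]');
a = setname(a,'demo');          % store a field through its set-command
x = getdata(a);                 % retrieve the 6-by-2 data matrix
a = setdata(a,2*x);             % replace the data through its set-command
n = getname(a);                 % retrieve the name stored above
```

Note that the set-commands return the modified dataset, so the result has to be assigned back to the variable.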

> The fields of the dataset structure

data This is the main field, storing the data as it is supplied by calling the dataset constructor or by setdata. The size of the dataset, the number of objects (m) by the number of features (k), is derived from data.
lablist This is a cell array that encodes the class names derived from the object labels. Datasets can have multiple sets of labels for their objects, of which just one is active (multi-labeling). The lablist field stores the necessary administration. The active set of labels can be retrieved by the commands classnames and getlablist. See also the following nlab item.
nlab The labels supplied in the dataset definition are summarized by lablist and nlab. lablist contains the unique labels (class names) and nlab is an index vector for lablist. Its values range between 1 and the total number of classes (size of lablist). Entries in nlab for objects that are unlabeled are set to 0 (zero). The setnlab command should be treated with care as it changes the labeling of the dataset.
labtype The labeling type (crisp, soft or targets, see above) is stored here. setlabtype changes the label type, but may also change nlab and lablist fields. The conversion rules are described elsewhere.
featlab The feature labels are strings or numbers and are used by PRTools (if given) to annotate plots.
featdom Here feature domains are stored. If they are set, tests are performed whenever the values in the data field change, to check whether the new data is within the supplied domains.
prior Classes in a dataset may have prior probabilities. These are used in density-based classifiers, in error evaluation by testc and in some other places. If not set, the prior field is empty, and when prior probabilities are needed the class frequencies in the dataset are taken.
cost In this field a cost matrix can be stored for performance evaluation and for procedures that explicitly minimize classification costs. Unless explicitly mentioned, PRTools ignores this field.
objsize The object size is the number of rows (objects) in the data field. It is retrieved by the size and getsize commands (getsize may return the number of classes as well). Although the routines getobjsize and setobjsize exist, users are discouraged from using them except in relation to image handling.
featsize The feature size is the number of columns (features) in the data field. It is retrieved by the size and getsize commands (getsize may return the number of classes as well). Although the routines getfeatsize and setfeatsize exist, users are discouraged from using them except in relation to image handling.
ident In subfields of the ident field various object identifiers can be stored. One subfield is always available: ident. Unless changed by the user, it contains the object indices at creation of the dataset. Every ident subfield stores vectors or arrays of doubles, strings or cells with as many rows as there are objects in the dataset.
version At creation of the dataset, PRTools stores its version and the date here.
name The user may supply a name here. It is displayed in the command window when a command returning a dataset is executed without a semicolon. The dataset name may also be used for annotating plots.
user In this field the user can add and retrieve any additional annotation for the dataset in its entirety.
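
The relation between the lablist, nlab, prior and size items in the table can be illustrated with a short sketch. It assumes PRTools is on the MATLAB path and that numeric labels are returned by getlablist as a numeric column; the variable names are illustrative only.

```matlab
labels  = [1 1 1 2 2 2]';
a       = dataset(rand(6,2),labels);
lablist = getlablist(a);        % unique labels of the active label set
nlab    = getnlab(a);           % index vector into lablist
recovered = lablist(nlab,:);    % should reproduce the original labels
[m,k,c] = getsize(a);           % objects, features and classes
a = setprior(a,[0.3 0.7]);      % explicit class priors replace class frequencies
p = getprior(a);
```

As long as the prior field is left empty, the class frequencies (here 0.5 and 0.5) are used whenever priors are needed.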

When datasets are changed, e.g. by a transformation of the data or by taking a subset of features or objects, all relevant information is copied, including the name and user fields.
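
A minimal sketch of this behavior, assuming PRTools is on the MATLAB path: a subset of objects and features is selected by indexing, and the name and user fields are read back from the result. The stored strings are illustrative only.

```matlab
a = dataset(rand(6,2),[1 1 1 2 2 2]');
a = setname(a,'demo');
a = setuser(a,'collected for illustration');
b = a(1:4,1);                   % first four objects, first feature only
bname = getname(b);             % name carried over from a
buser = getuser(b);             % user annotation carried over from a
[m,k] = getsize(b);             % size of the subset
```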

The dataset fields can also be accessed by structure indexing.
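
For reading a field, a sketch of such indexing could look as follows. It assumes PRTools is on the MATLAB path and that the '.'-construct returns the stored field; for changing fields the set-commands remain the safer route.

```matlab
a = dataset(rand(6,2),[1 1 1 2 2 2]');
x = a.data;                     % read the data field by structure indexing
y = getdata(a);                 % the same data through the get-command
```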


R.P.W. Duin, January 23, 2013