Dataset definition
Constructor
The dataset constructor looks like
    a = dataset(data,labels)
data is an array of size [m,k] storing the k-dimensional vector representations of m objects.
labels contains the m object labels. The following types are supported: integers, strings, or cells each containing a string.
The two items data and labels are essential for the operation of PRTools. If data is omitted (data = []) an empty dataset is defined. If labels is not supplied, the objects remain unlabeled. If only some of the objects have no labels, the corresponding entries of labels should contain NaN for numeric labels or an empty string ('') for string labels.
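The labeling options described above can be sketched as follows. This is a minimal illustration, assuming PRTools is on the MATLAB path; the variable names and label values are made up for the example.

```matlab
data = rand(5,2);    % 5 objects, 2 features

% Numeric labels: NaN marks the unlabeled fifth object.
a = dataset(data,[1 1 2 2 NaN]');

% String labels in a cell array: '' marks the unlabeled fourth object.
b = dataset(data,{'apple';'apple';'pear';'';'pear'});

% No labels supplied: all objects remain unlabeled.
c = dataset(data);
```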
Label types
There are three types of labels supported for datasets: crisp, soft and targets.
... PRTools framework. There are just a few routines that use this facility.
More information on label types can be found elsewhere.
Various items can be stored in a dataset. A full list can be found by converting a dataset variable into a structure.
    data = rand(6,2);
    labels = [1 1 1, 2 2 2]';
    a = dataset(data,labels)
    % 6 by 2 dataset with 2 classes: [3 3]
    struct(a)
    %
    %      data: [6x2 double]
    %   lablist: {2x4 cell}
    %      nlab: [6x1 double]
    %   labtype: 'crisp'
    %   targets: []
    %   featlab: []
    %   featdom: {}
    %     prior: []
    %      cost: []
    %   objsize: 6
    %  featsize: 2
    %     ident: [1x1 struct]
    %   version: {[1x1 struct]  '03-Jan-2011 15:24:04'}
    %      name: []
    %      user: []
Each field has a corresponding set-command (e.g. setdata) to store it and a get-command (e.g. getdata) to retrieve it. Users are discouraged from using the '.'-constructs (e.g. a.files), as these do not guarantee consistency with other fields. In some cases the get-command does not return the exact field but some derived data. The table below gives more information.
The fields of the dataset structure
data | This is the main field, storing the data as supplied to the dataset constructor or to setdata. The size of the dataset, the number of objects (m) by the number of features (k), is derived from data. |
lablist | This is a cell array that encodes the class names derived from the objects' labels. Datasets can have multiple sets of labels for their objects (multi-labeling), of which just one is active. The lablist field stores the necessary bookkeeping. The active set of labels can be retrieved by the commands classnames and getlablist. See also the following nlab item. |
nlab | The labels supplied in the dataset definition are summarized by lablist and nlab. lablist contains the unique labels (class names) and nlab is an index vector into lablist. Its values range between 1 and the total number of classes (the size of lablist). Entries in nlab for unlabeled objects are set to 0 (zero). The setnlab command should be used with care, as it changes the labeling of the dataset. |
labtype | The labeling type (crisp, soft or targets, see above) is stored here. setlabtype changes the label type, but may also change the nlab and lablist fields. The conversion rules are described elsewhere. |
featlab | The feature labels are strings or numbers and are used by PRTools (if given) to annotate plots. |
featdom | Feature domains are stored here. If this field is set, tests are performed whenever the values in the data field change, to check whether the new data are within the supplied domains. |
prior | Classes in a dataset may have prior probabilities. These are used in density-based classifiers, in error evaluation by testc and in some other places. If not set, the prior field is empty, and when prior probabilities are needed the class frequencies in the dataset are used instead. |
cost | A cost matrix can be stored in this field for performance evaluation and for procedures that explicitly minimize classification costs. Unless explicitly mentioned, PRTools ignores this field. |
objsize | The object size is the number of rows (objects) in the data field. It is retrieved by the size and getsize commands (getsize may return the number of classes as well). Although the routines getobjsize and setobjsize exist, users are discouraged from using them except in connection with image handling. |
featsize | The feature size is the number of columns (features) in the data field. It is retrieved by the size and getsize commands (getsize may return the number of classes as well). Although the routines getfeatsize and setfeatsize exist, users are discouraged from using them except in connection with image handling. |
ident | Various object identifiers can be stored in subfields of the ident field. One subfield is always available: ident. Unless changed by the user, it contains the object indices at creation of the dataset. Every ident subfield stores vectors or arrays of doubles, strings or cells with as many rows as there are objects in the dataset. |
version | When the dataset is created, PRTools stores its version and the date here. |
name | The user may supply a name here. It is displayed in the command window when a command returning a dataset is executed without a semicolon. The dataset name may also be used for annotating plots. |
user | In this field the user can add and retrieve any additional annotation for the dataset in its entirety. |
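A short sketch of how some of these fields can be stored and retrieved through their set- and get-commands. It assumes PRTools is on the MATLAB path; the dataset, its name and the prior values are made up for the example.

```matlab
data   = rand(6,2);
labels = [1 1 1 2 2 2]';
a = dataset(data,labels);

% Store fields through their set-commands.
a = setname(a,'toy set');       % name field
a = setprior(a,[0.5 0.5]);      % one prior probability per class

% Retrieve fields, or data derived from them, through get-commands.
x       = getdata(a);           % the 6 x 2 data array
lablist = getlablist(a);        % unique labels (class names)
nlab    = getnlab(a);           % index into lablist for each object
[m,k,c] = getsize(a);           % objects, features, classes
name    = getname(a);
prior   = getprior(a);
```

Going through the set-commands rather than assigning to fields directly lets PRTools keep related fields, such as lablist and nlab, consistent with each other.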
When datasets are changed, e.g. by a transformation of the data or by taking a subset of features or objects, all relevant information is copied, including the name and user fields.
The dataset fields might also be accessed by structural indexing.
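As a small illustration of structural indexing, the sketch below reads two fields directly and next to the corresponding get-commands. It assumes PRTools is on the MATLAB path and that these fields can be read with '.'-indexing, as the sentence above suggests. Only reading is shown, since the set-commands are the recommended way to change a field.

```matlab
a = dataset(rand(6,2),[1 1 1 2 2 2]');
a = setname(a,'toy set');

x1 = a.data;        % direct read of the data field
x2 = getdata(a);    % the same field through its get-command

n1 = a.name;        % direct read of the name field
n2 = getname(a);    % the same field through its get-command
```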
R.P.W. Duin, January 28, 2013