Converting multiclass vector datasets¶
Abstract: This tutorial extends ideas from the "Converting machine learning benchmark datasets" tutorial to demonstrate how multiclass datasets work.
Examples in this notebook are provided as documentation, and are available under the terms of the Apache 2.0 License.
!pip install relational-datasets
Refresher on Binary Classification¶
When y is a vector containing 0 and 1, examples are automatically split into positive and negative examples:
from relational_datasets.convert import from_numpy
import numpy as np
binary_data, binary_modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 0, 1]),
)
binary_data.pos
['v4(id3).']
binary_data.neg
['v4(id1).', 'v4(id2).']
When y is a vector containing values other than 0 and 1, we are in a multiclass setting.
multiclass_data, multiclass_modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 1, 2]),
)
multiclass_data.pos
['v4(id1,v4_0).', 'v4(id2,v4_1).', 'v4(id3,v4_2).']
In this case, all of the examples are placed into the positive examples, and the negative examples are left empty. For classification, data should be further split into $K$ one-versus-rest datasets.
multiclass_data.neg
[]
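Since the converter leaves the negative examples empty, the one-versus-rest split is up to the user. One way to do it is to match on the class-label constants (v4_0, v4_1, ...) in the example strings shown above. The helper below is an illustrative sketch, not part of the relational-datasets API:

```python
# Sketch: split multiclass positive examples into K one-vs-rest
# (pos, neg) pairs by matching the trailing class-label constant.
# Assumes examples follow the 'relation(idN,relation_K).' pattern
# produced above; this helper is illustrative only.

def one_vs_rest(pos_examples, num_classes, relation="v4"):
    splits = []
    for k in range(num_classes):
        label = f"{relation}_{k}"
        pos = [ex for ex in pos_examples if ex.endswith(f",{label}).")]
        neg = [ex for ex in pos_examples if not ex.endswith(f",{label}).")]
        splits.append((pos, neg))
    return splits

examples = ['v4(id1,v4_0).', 'v4(id2,v4_1).', 'v4(id3,v4_2).']
splits = one_vs_rest(examples, num_classes=3)
# splits[0] treats class 0 as positive and classes 1 and 2 as negative:
# (['v4(id1,v4_0).'], ['v4(id2,v4_1).', 'v4(id3,v4_2).'])
```

Each of the $K$ resulting (pos, neg) pairs can then be paired with the shared facts to train a binary model for that class.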
The modes should reflect this difference:
binary_modes[-1]
'v4(+id).'
multiclass_modes[-1]
'v4(+id,#classlabel).'
Worked example with scikit-learn's load_iris¶
Here we: (1) load the data and class labels, (2) split into training and test sets, (3) bin the continuous features to discrete, and (4) convert to the relational format.
!pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
(1) Load the data, target, and variable names¶
Invoking load_iris returns a dictionary-like object with keys for .data, .target, .feature_names, and .target_names. We'll use these to pull out our X matrix, y array, and variable names.
iris = load_iris()
X = iris.data
y = iris.target
variable_names = [name.replace("(cm)", "").replace(" ", "") for name in iris.feature_names] + [iris.target_names[1]]
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
variable_names
['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'versicolor']
(2) Split out training and test sets¶
X_train, X_test, y_train, y_test = train_test_split(X, y)
(3) Discretize the continuous features¶
scikit-learn's KBinsDiscretizer will help us here, but we'll want an ordinal (0, 1, 2, 3, 4) encoding for our discrete features rather than the default one-hot encoding, and we need to ensure that the resulting matrices are converted back to integers.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
X_train = disc.fit_transform(X_train).astype(int)
X_test = disc.transform(X_test).astype(int)
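To see what the ordinal encoding produces on its own, here is a small standalone sketch. It pins strategy="uniform" so the bin edges are deterministic (the snippet above relies on scikit-learn's default quantile strategy instead):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Standalone demo: bin five evenly spaced values into 5 ordinal bins.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
disc_demo = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_binned = disc_demo.fit_transform(X_demo).astype(int)

# With uniform bins over [0, 4], each value lands in its own bin:
# [[0], [1], [2], [3], [4]]
```

Note the .astype(int) at the end: fit_transform returns floats, but the relational conversion expects integer-coded features.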
iris_train, iris_modes = from_numpy(X_train, y_train, names=variable_names)
iris_modes
['sepallength(+id,#varsepallength).', 'sepalwidth(+id,#varsepalwidth).', 'petallength(+id,#varpetallength).', 'petalwidth(+id,#varpetalwidth).', 'versicolor(+id,#classlabel).']
(4) Convert arrays to RelationalDataset¶
Finally, let's convert our training and test folds into RelationalDatasets and modes:
iris_test, _ = from_numpy(X_test, y_test, names=variable_names)
len(iris_train.pos), len(iris_train.neg), len(iris_train.facts)
(112, 0, 448)
As in the multiclass example above, all 112 training examples are positive, and each contributes one fact per feature (112 × 4 = 448).
len(iris_test.pos), len(iris_test.neg), len(iris_test.facts)
(38, 0, 152)