Converting multiclass vector datasets¶
Abstract: This tutorial extends ideas from the "Converting machine learning benchmark datasets" tutorial to demonstrate how multiclass datasets work.
Examples in this notebook are provided as documentation, and are available under the terms of the Apache 2.0 License.
!pip install relational-datasets
Refresher on Binary Classification¶
When y is a vector containing 0 and 1, examples are automatically split into positive and negative examples:
from relational_datasets.convert import from_numpy
import numpy as np
binary_data, binary_modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 0, 1]),
)
binary_data.pos
['v4(id3).']
binary_data.neg
['v4(id1).', 'v4(id2).']
When y is a vector containing values other than 0 and 1, we are in a multiclass setting.
multiclass_data, multiclass_modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 1, 2]),
)
multiclass_data.pos
['v4(id1,v4_0).', 'v4(id2,v4_1).', 'v4(id3,v4_2).']
In this case, all of the examples are placed into the positive examples, and the negative examples are left empty. For classification, data should be further split into $K$ one-versus-rest datasets.
multiclass_data.neg
[]
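Since the converter leaves the negative examples empty, the one-versus-rest split is up to the user. One way to do it is to match on the class-label constants (v4_0, v4_1, ...) in the example strings shown above. The helper below is an illustrative sketch, not part of the relational-datasets API:

```python
# Sketch: split multiclass positive examples into K one-vs-rest
# (pos, neg) pairs by matching the trailing class-label constant.
# Assumes examples follow the 'relation(idN,relation_K).' pattern
# produced above; this helper is illustrative only.

def one_vs_rest(pos_examples, num_classes, relation="v4"):
    splits = []
    for k in range(num_classes):
        label = f"{relation}_{k}"
        pos = [ex for ex in pos_examples if ex.endswith(f",{label}).")]
        neg = [ex for ex in pos_examples if not ex.endswith(f",{label}).")]
        splits.append((pos, neg))
    return splits

examples = ['v4(id1,v4_0).', 'v4(id2,v4_1).', 'v4(id3,v4_2).']
splits = one_vs_rest(examples, num_classes=3)
# splits[0] treats class 0 as positive and classes 1 and 2 as negative:
# (['v4(id1,v4_0).'], ['v4(id2,v4_1).', 'v4(id3,v4_2).'])
```

Each of the $K$ resulting (pos, neg) pairs can then be paired with the shared facts to train a binary model for that class.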
The modes should reflect this difference:
binary_modes[-1]
'v4(+id).'
multiclass_modes[-1]
'v4(+id,#classlabel).'
Worked example with scikit-learn's load_iris¶
Here we: (1) load the data and class labels, (2) split into training and test sets, (3) bin the continuous features to discrete, and (4) convert to the relational format.
!pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
(1) Load the data, target, and variable names¶
Invoking load_iris returns a dictionary-like object with keys for .data, .target, .feature_names, and .target_names. We'll use these to pull out our X matrix, y array, and variable names.
iris = load_iris()
X = iris.data
y = iris.target
variable_names = [name.replace("(cm)", "").replace(" ", "") for name in iris.feature_names] + [iris.target_names[1]]
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
variable_names
['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'versicolor']
(2) Split out training and test sets¶
X_train, X_test, y_train, y_test = train_test_split(X, y)
(3) Discretize the continuous features¶
scikit-learn's KBinsDiscretizer will help us here, but we'll want an ordinal (0, 1, 2, 3, 4) encoding for our discrete features rather than the default one-hot encoding, and we need to ensure that the resulting matrices are converted back to integers.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
X_train = disc.fit_transform(X_train).astype(int)
X_test = disc.transform(X_test).astype(int)
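To see what the ordinal encoding produces on its own, here is a small standalone sketch. It pins strategy="uniform" so the bin edges are deterministic (the snippet above relies on scikit-learn's default quantile strategy instead):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Standalone demo: bin five evenly spaced values into 5 ordinal bins.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
disc_demo = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_binned = disc_demo.fit_transform(X_demo).astype(int)

# With uniform bins over [0, 4], each value lands in its own bin:
# [[0], [1], [2], [3], [4]]
```

Note the .astype(int) at the end: fit_transform returns floats, but the relational conversion expects integer-coded features.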
iris_train, iris_modes = from_numpy(X_train, y_train, names=variable_names)
iris_modes
['sepallength(+id,#varsepallength).', 'sepalwidth(+id,#varsepalwidth).', 'petallength(+id,#varpetallength).', 'petalwidth(+id,#varpetalwidth).', 'versicolor(+id,#classlabel).']
(4) Convert arrays to RelationalDataset¶
Finally, let's convert our training and test folds into RelationalDatasets and modes:
iris_test, _ = from_numpy(X_test, y_test, names=variable_names)
len(iris_train.pos), len(iris_train.neg), len(iris_train.facts)
(112, 0, 448)
As in the multiclass example above, all 112 training examples are positive, and each contributes one fact per feature (112 × 4 = 448).
len(iris_test.pos), len(iris_test.neg), len(iris_test.facts)
(38, 0, 152)