Converting machine learning benchmark datasets¶
Alexander L. Hayes: Ph.D. Student, Indiana University.
Abstract: Most benchmark machine learning datasets have a vector-based representation, where we have a single type of object (people, images, houses) and we learn an attribute of those objects (disease risk, cat/dog, median price). This tutorial bridges the gap between vector-based machine learning and relational machine learning, and shows how to view the former in terms of the latter.
Examples in this notebook are provided as documentation, and are available under the terms of the Apache 2.0 License.
!pip install numpy relational-datasets
from relational_datasets.convert import from_numpy
import numpy as np
Binary Classification¶
We're in a binary classification setting when the target array y contains 0/1 integers.
train, modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 0, 1]),
)
train.pos
['v4(id3).']
train.neg
['v4(id1).', 'v4(id2).']
Here we are learning from a collection of one type of object. Since there is only one type of object, we can enumerate them with an id.
The positive examples show that the object with id3 is a positive instance of a class, and the negative examples show that objects id1 and id2 are not instances of this class.
train.facts
['v1(id1,v1_0).', 'v1(id2,v1_0).', 'v1(id3,v1_1).', 'v2(id1,v2_1).', 'v2(id2,v2_1).', 'v2(id3,v2_2).', 'v3(id1,v3_1).', 'v3(id2,v3_2).', 'v3(id3,v3_2).']
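The conversion scheme can be sketched in plain Python. This is an illustrative reconstruction of the output format, not the library's actual implementation: column j of X becomes predicate v{j+1}, row i becomes object id{i+1}, and the target becomes the final predicate.

```python
import numpy as np

X = np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]])
y = np.array([0, 0, 1])

# Column j -> predicate v{j+1}; row i -> object id{i+1}.
facts = [
    f"v{j + 1}(id{i + 1},v{j + 1}_{X[i, j]})."
    for j in range(X.shape[1])
    for i in range(X.shape[0])
]

# The target becomes the final predicate (v4 here): 1s are positive
# examples, 0s are negative examples.
target = f"v{X.shape[1] + 1}"
pos = [f"{target}(id{i + 1})." for i in range(len(y)) if y[i] == 1]
neg = [f"{target}(id{i + 1})." for i in range(len(y)) if y[i] == 0]
```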
Modes are a type of background knowledge that show up in the fields of Inductive Logic Programming and Statistical Relational Learning. A full discussion of them is not feasible here, but briefly: modes provide (1) type information and help (2) constrain the search space during learning.
Alexander wrote a slightly longer discussion about modes to accompany a Knowledge Capture article.
ILP/SRL can also be highly sensitive to this type of background knowledge. Andrew Cropper, Sebastijan Dumančić, and Stephen H. Muggleton include a more general treatment of refining and learning background knowledge in their 2020 IJCAI article.
Modes can be set automatically in the propositional setting. The ones below say: "When learning about a binary attribute v4, we will bind the id of an object to specific instances (id1, id2, id3), and then learn about it with respect to specific values (#) of its attributes v1, v2, and v3."
modes
['v1(+id,#varv1).', 'v2(+id,#varv2).', 'v3(+id,#varv3).', 'v4(+id).']
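The automatically generated modes follow a simple pattern that can be reproduced by hand. This is a sketch under the assumption that every attribute is discrete:

```python
# One mode per attribute predicate, plus one for the target.
n_features = 3
names = [f"v{j + 1}" for j in range(n_features)]
target = f"v{n_features + 1}"

# +id: bind the object's id; #var...: learn with respect to specific
# constant values of that attribute.
modes = [f"{name}(+id,#var{name})." for name in names] + [f"{target}(+id)."]
```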
Regression¶
When y contains floating point numbers, we're in a regression setting.
train, modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([1.1, 0.9, 2.5]),
)
We represent this by marking all objects as "positive examples," each annotated with the continuous value we want to learn.
train.pos
['regressionExample(v4(id1),1.1).', 'regressionExample(v4(id2),0.9).', 'regressionExample(v4(id3),2.5).']
train.neg
[]
train.facts
['v1(id1,v1_0).', 'v1(id2,v1_0).', 'v1(id3,v1_1).', 'v2(id1,v2_1).', 'v2(id2,v2_1).', 'v2(id3,v2_2).', 'v3(id1,v3_1).', 'v3(id2,v3_2).', 'v3(id3,v3_2).']
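The regression representation can likewise be sketched by hand: each object is wrapped in a regressionExample/2 fact pairing it with its continuous label. Again, this is an illustrative reconstruction, not the library's internals:

```python
y = [1.1, 0.9, 2.5]

# Pair each object with its continuous label; there are no negatives.
pos = [f"regressionExample(v4(id{i + 1}),{label})." for i, label in enumerate(y)]
neg = []
```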
Side Note: Naming Variables¶
From the previous examples, we saw that names for the variables and targets were automatically assigned (with the last value v4 being the target).
The from_numpy function returns a tuple containing a RelationalDataset and a list of strings containing the modes. If an additional list of strings is passed, then those are used as the names when converting the arrays.
Here we invent a dataset where each id represents a person, and we want to learn about their risk for a condition based on their age, BMI, and coronary artery calcification (cac) levels.
X = np.array([[1, 1, 2], [1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1, 0])
data, modes = from_numpy(
X,
y,
["age", "bmi", "cac", "highrisk"],
)
data.pos
['highrisk(id3).', 'highrisk(id4).']
data.neg
['highrisk(id1).', 'highrisk(id2).', 'highrisk(id5).']
data.facts
['age(id1,age_1).', 'age(id2,age_1).', 'age(id3,age_0).', 'age(id4,age_1).', 'age(id5,age_0).', 'bmi(id1,bmi_1).', 'bmi(id2,bmi_1).', 'bmi(id3,bmi_1).', 'bmi(id4,bmi_1).', 'bmi(id5,bmi_1).', 'cac(id1,cac_2).', 'cac(id2,cac_0).', 'cac(id3,cac_0).', 'cac(id4,cac_1).', 'cac(id5,cac_1).']
modes
['age(+id,#varage).', 'bmi(+id,#varbmi).', 'cac(+id,#varcac).', 'highrisk(+id).']
Worked example with scikit-learn's load_breast_cancer¶
load_breast_cancer is based on the Breast Cancer Wisconsin dataset.
Here we: (1) load the data and class labels, (2) split into training and test sets, (3) bin the continuous features to discrete, and (4) convert to the relational format.
!pip install scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
(1) Load the data, target, and variable names¶
Invoking load_breast_cancer returns a dictionary-like object with keys for .data, .target, .feature_names, and .target_names. We'll use these to pull out our X matrix, y array, and variable names.
breast_cancer = load_breast_cancer()
bc_X = breast_cancer.data
bc_y = breast_cancer.target
variable_names = [name.replace(" ", "") for name in breast_cancer.feature_names.tolist()] + [breast_cancer.target_names[1]]
bc_X
array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01, 1.189e-01], [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01, 8.902e-02], [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01, 8.758e-02], ..., [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01, 7.820e-02], [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01, 1.240e-01], [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01, 7.039e-02]])
bc_y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
variable_names
['meanradius', 'meantexture', 'meanperimeter', 'meanarea', 'meansmoothness', 'meancompactness', 'meanconcavity', 'meanconcavepoints', 'meansymmetry', 'meanfractaldimension', 'radiuserror', 'textureerror', 'perimetererror', 'areaerror', 'smoothnesserror', 'compactnesserror', 'concavityerror', 'concavepointserror', 'symmetryerror', 'fractaldimensionerror', 'worstradius', 'worsttexture', 'worstperimeter', 'worstarea', 'worstsmoothness', 'worstcompactness', 'worstconcavity', 'worstconcavepoints', 'worstsymmetry', 'worstfractaldimension', 'benign']
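One sanity check worth doing at this point: the names list must have one entry per feature column plus one for the target, or from_numpy has nothing to line the columns up against.

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
variable_names = [
    name.replace(" ", "") for name in breast_cancer.feature_names.tolist()
] + [breast_cancer.target_names[1]]

# 30 feature names plus the target name ("benign").
assert len(variable_names) == breast_cancer.data.shape[1] + 1
```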
(2) Split out training and test sets¶
X_train, X_test, y_train, y_test = train_test_split(bc_X, bc_y)
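train_test_split shuffles randomly by default, so repeated runs produce different folds. For a reproducible split you can pass random_state; the seed value used here is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast_cancer = load_breast_cancer()

# Fixing random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, random_state=42
)
```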
(3) Discretize the continuous features¶
scikit-learn's KBinsDiscretizer will help us here, but we'll want an ordinal (0, 1, 2, 3, 4) encoding for our discrete features rather than the default one-hot encoding, and we need to ensure that the resulting matrices are converted back to integers.
Note: We call .astype(int) on the outputs of the discretizer. Usually scikit-learn returns floats in these cases.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
X_train = disc.fit_transform(X_train).astype(int)
X_test = disc.transform(X_test).astype(int)
X_train
array([[2, 2, 2, ..., 1, 1, 0], [1, 0, 1, ..., 1, 0, 2], [1, 2, 1, ..., 2, 3, 1], ..., [4, 4, 4, ..., 4, 2, 3], [4, 3, 4, ..., 4, 3, 3], [1, 0, 1, ..., 1, 0, 1]])
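To see what the ordinal encoding produces, here is KBinsDiscretizer on a toy single-column array. The default strategy is quantile binning, and exact bin boundaries may vary slightly across scikit-learn versions:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Ten evenly spaced values in one column.
toy = np.linspace(0.1, 1.0, 10).reshape(-1, 1)

# Ordinal encoding yields one integer bin label (0..4) per value,
# rather than the default one-hot columns.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
binned = disc.fit_transform(toy).astype(int)
```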
(4) Convert arrays to RelationalDataset¶
Finally, let's convert our training and test folds into RelationalDatasets and modes:
bc_train, bc_modes = from_numpy(X_train, y_train, names=variable_names)
bc_test, _ = from_numpy(X_test, y_test, names=variable_names)
bc_modes
['meanradius(+id,#varmeanradius).', 'meantexture(+id,#varmeantexture).', 'meanperimeter(+id,#varmeanperimeter).', 'meanarea(+id,#varmeanarea).', 'meansmoothness(+id,#varmeansmoothness).', 'meancompactness(+id,#varmeancompactness).', 'meanconcavity(+id,#varmeanconcavity).', 'meanconcavepoints(+id,#varmeanconcavepoints).', 'meansymmetry(+id,#varmeansymmetry).', 'meanfractaldimension(+id,#varmeanfractaldimension).', 'radiuserror(+id,#varradiuserror).', 'textureerror(+id,#vartextureerror).', 'perimetererror(+id,#varperimetererror).', 'areaerror(+id,#varareaerror).', 'smoothnesserror(+id,#varsmoothnesserror).', 'compactnesserror(+id,#varcompactnesserror).', 'concavityerror(+id,#varconcavityerror).', 'concavepointserror(+id,#varconcavepointserror).', 'symmetryerror(+id,#varsymmetryerror).', 'fractaldimensionerror(+id,#varfractaldimensionerror).', 'worstradius(+id,#varworstradius).', 'worsttexture(+id,#varworsttexture).', 'worstperimeter(+id,#varworstperimeter).', 'worstarea(+id,#varworstarea).', 'worstsmoothness(+id,#varworstsmoothness).', 'worstcompactness(+id,#varworstcompactness).', 'worstconcavity(+id,#varworstconcavity).', 'worstconcavepoints(+id,#varworstconcavepoints).', 'worstsymmetry(+id,#varworstsymmetry).', 'worstfractaldimension(+id,#varworstfractaldimension).', 'benign(+id).']