Converting machine learning benchmark datasets¶
Alexander L. Hayes: Ph.D. Student, Indiana University.
Abstract: Most benchmark machine learning datasets have a vector-based representation, where we have a single type of object (people, images, houses) and we learn an attribute of those objects (disease risk, cat/dog, median price). This tutorial bridges the gap between vector-based machine learning and relational machine learning, and shows how to view the former in terms of the latter.
Examples in this notebook are provided as documentation, and are available under the terms of the Apache 2.0 License.
!pip install numpy relational-datasets
from relational_datasets.convert import from_numpy
import numpy as np
Binary Classification¶
We're in a binary classification setting when the target array y contains 0/1 integers.
train, modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([0, 0, 1]),
)
train.pos
['v4(id3).']
train.neg
['v4(id1).', 'v4(id2).']
Here we are learning from a collection of one type of object. Since there is only one type of object, we can enumerate them with an id.
The positive examples show that the object with id3 is a positive instance of a class, and the negative examples show that objects id1 and id2 are not instances of this class.
train.facts
['v1(id1,v1_0).', 'v1(id2,v1_0).', 'v1(id3,v1_1).', 'v2(id1,v2_1).', 'v2(id2,v2_1).', 'v2(id3,v2_2).', 'v3(id1,v3_1).', 'v3(id2,v3_2).', 'v3(id3,v3_2).']
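The conversion scheme can be sketched in plain Python. This is an illustrative reconstruction of the output format, not the library's actual implementation: column j of X becomes predicate v{j+1}, row i becomes object id{i+1}, and the target becomes the final predicate.

```python
import numpy as np

X = np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]])
y = np.array([0, 0, 1])

# Column j -> predicate v{j+1}; row i -> object id{i+1}.
facts = [
    f"v{j + 1}(id{i + 1},v{j + 1}_{X[i, j]})."
    for j in range(X.shape[1])
    for i in range(X.shape[0])
]

# The target becomes the final predicate (v4 here): 1s are positive
# examples, 0s are negative examples.
target = f"v{X.shape[1] + 1}"
pos = [f"{target}(id{i + 1})." for i in range(len(y)) if y[i] == 1]
neg = [f"{target}(id{i + 1})." for i in range(len(y)) if y[i] == 0]
```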
Modes are a type of background knowledge that show up in the fields of Inductive Logic Programming and Statistical Relational Learning. A full discussion of them is not feasible here, but briefly: modes provide (1) type information and help (2) constrain the search space during learning.
Alexander wrote a slightly longer discussion about modes to accompany a Knowledge Capture article.
ILP/SRL can also be highly sensitive to this type of background knowledge. Andrew Cropper, Sebastijan Dumančić, and Stephen H. Muggleton include a more general treatment of refining and learning background knowledge in their 2020 IJCAI article.
Modes can be set automatically in the propositional setting. The ones below say: "When learning about a binary attribute v4, we will bind the id of an object to specific instances (id1, id2, id3), and then learn about it with respect to specific values (#) of its attributes v1, v2, and v3."
modes
['v1(+id,#varv1).', 'v2(+id,#varv2).', 'v3(+id,#varv3).', 'v4(+id).']
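The automatically generated modes follow a simple pattern that can be reproduced by hand. This is a sketch under the assumption that every attribute is discrete:

```python
# One mode per attribute predicate, plus one for the target.
n_features = 3
names = [f"v{j + 1}" for j in range(n_features)]
target = f"v{n_features + 1}"

# +id: bind the object's id; #var...: learn with respect to specific
# constant values of that attribute.
modes = [f"{name}(+id,#var{name})." for name in names] + [f"{target}(+id)."]
```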
Regression¶
When y contains floating point numbers, we're in a regression setting.
train, modes = from_numpy(
np.array([[0, 1, 1], [0, 1, 2], [1, 2, 2]]),
np.array([1.1, 0.9, 2.5]),
)
We represent this by marking all objects as "positive examples," each annotated with the continuous value we want to learn.
train.pos
['regressionExample(v4(id1),1.1).', 'regressionExample(v4(id2),0.9).', 'regressionExample(v4(id3),2.5).']
train.neg
[]
train.facts
['v1(id1,v1_0).', 'v1(id2,v1_0).', 'v1(id3,v1_1).', 'v2(id1,v2_1).', 'v2(id2,v2_1).', 'v2(id3,v2_2).', 'v3(id1,v3_1).', 'v3(id2,v3_2).', 'v3(id3,v3_2).']
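The regression representation can likewise be sketched by hand: each object is wrapped in a regressionExample/2 fact pairing it with its continuous label. Again, this is an illustrative reconstruction, not the library's internals:

```python
y = [1.1, 0.9, 2.5]

# Pair each object with its continuous label; there are no negatives.
pos = [f"regressionExample(v4(id{i + 1}),{label})." for i, label in enumerate(y)]
neg = []
```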
Side Note: Naming Variables¶
From the previous examples, we saw that names for the variables and targets were automatically assigned (with the last value v4 being the target).
The from_numpy function returns a tuple containing a RelationalDataset and a list of strings containing the modes. If an additional list of strings is passed, then those are used as the names when converting the arrays.
Here we invent a dataset where each id represents a person, and we want to learn about their risk for a condition based on their age, BMI, and coronary artery calcification (cac) levels.
X = np.array([[1, 1, 2], [1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1, 0])
data, modes = from_numpy(
X,
y,
["age", "bmi", "cac", "highrisk"],
)
data.pos
['highrisk(id3).', 'highrisk(id4).']
data.neg
['highrisk(id1).', 'highrisk(id2).', 'highrisk(id5).']
data.facts
['age(id1,age_1).', 'age(id2,age_1).', 'age(id3,age_0).', 'age(id4,age_1).', 'age(id5,age_0).', 'bmi(id1,bmi_1).', 'bmi(id2,bmi_1).', 'bmi(id3,bmi_1).', 'bmi(id4,bmi_1).', 'bmi(id5,bmi_1).', 'cac(id1,cac_2).', 'cac(id2,cac_0).', 'cac(id3,cac_0).', 'cac(id4,cac_1).', 'cac(id5,cac_1).']
modes
['age(+id,#varage).', 'bmi(+id,#varbmi).', 'cac(+id,#varcac).', 'highrisk(+id).']
Worked example with scikit-learn's load_breast_cancer¶
load_breast_cancer is based on the Breast Cancer Wisconsin dataset.
Here we: (1) load the data and class labels, (2) split into training and test sets, (3) bin the continuous features to discrete, and (4) convert to the relational format.
!pip install scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
(1) Load the data, target, and variable names¶
Invoking load_breast_cancer returns a dictionary-like object with keys for .data, .target, .feature_names, and .target_names. We'll use these to pull out our X matrix, y array, and variable names.
breast_cancer = load_breast_cancer()
bc_X = breast_cancer.data
bc_y = breast_cancer.target
variable_names = [name.replace(" ", "") for name in breast_cancer.feature_names.tolist()] + [breast_cancer.target_names[1]]
bc_X
array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01, 1.189e-01], [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01, 8.902e-02], [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01, 8.758e-02], ..., [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01, 7.820e-02], [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01, 1.240e-01], [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01, 7.039e-02]])
bc_y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
variable_names
['meanradius', 'meantexture', 'meanperimeter', 'meanarea', 'meansmoothness', 'meancompactness', 'meanconcavity', 'meanconcavepoints', 'meansymmetry', 'meanfractaldimension', 'radiuserror', 'textureerror', 'perimetererror', 'areaerror', 'smoothnesserror', 'compactnesserror', 'concavityerror', 'concavepointserror', 'symmetryerror', 'fractaldimensionerror', 'worstradius', 'worsttexture', 'worstperimeter', 'worstarea', 'worstsmoothness', 'worstcompactness', 'worstconcavity', 'worstconcavepoints', 'worstsymmetry', 'worstfractaldimension', 'benign']
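One sanity check worth doing at this point: the names list must have one entry per feature column plus one for the target, or from_numpy has nothing to line the columns up against.

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
variable_names = [
    name.replace(" ", "") for name in breast_cancer.feature_names.tolist()
] + [breast_cancer.target_names[1]]

# 30 feature names plus the target name ("benign").
assert len(variable_names) == breast_cancer.data.shape[1] + 1
```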
(2) Split out training and test sets¶
X_train, X_test, y_train, y_test = train_test_split(bc_X, bc_y)
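train_test_split shuffles randomly by default, so repeated runs produce different folds. For a reproducible split you can pass random_state; the seed value used here is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast_cancer = load_breast_cancer()

# Fixing random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, random_state=42
)
```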
(3) Discretize the continuous features¶
scikit-learn's KBinsDiscretizer will help us here, but we'll want an ordinal (0, 1, 2, 3, 4) encoding for our discrete features rather than the default one-hot encoding, and we need to ensure that the resulting matrices are converted back to integers.
Note: We call .astype(int) on the outputs of the discretizer. Usually scikit-learn returns floats in these cases.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
X_train = disc.fit_transform(X_train).astype(int)
X_test = disc.transform(X_test).astype(int)
X_train
array([[2, 2, 2, ..., 1, 1, 0], [1, 0, 1, ..., 1, 0, 2], [1, 2, 1, ..., 2, 3, 1], ..., [4, 4, 4, ..., 4, 2, 3], [4, 3, 4, ..., 4, 3, 3], [1, 0, 1, ..., 1, 0, 1]])
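To see what the ordinal encoding produces, here is KBinsDiscretizer on a toy single-column array. The default strategy is quantile binning, and exact bin boundaries may vary slightly across scikit-learn versions:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Ten evenly spaced values in one column.
toy = np.linspace(0.1, 1.0, 10).reshape(-1, 1)

# Ordinal encoding yields one integer bin label (0..4) per value,
# rather than the default one-hot columns.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal")
binned = disc.fit_transform(toy).astype(int)
```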
(4) Convert arrays to RelationalDataset¶
Finally, let's convert our training and test folds into RelationalDatasets and modes:
bc_train, bc_modes = from_numpy(X_train, y_train, names=variable_names)
bc_test, _ = from_numpy(X_test, y_test, names=variable_names)
bc_modes
['meanradius(+id,#varmeanradius).', 'meantexture(+id,#varmeantexture).', 'meanperimeter(+id,#varmeanperimeter).', 'meanarea(+id,#varmeanarea).', 'meansmoothness(+id,#varmeansmoothness).', 'meancompactness(+id,#varmeancompactness).', 'meanconcavity(+id,#varmeanconcavity).', 'meanconcavepoints(+id,#varmeanconcavepoints).', 'meansymmetry(+id,#varmeansymmetry).', 'meanfractaldimension(+id,#varmeanfractaldimension).', 'radiuserror(+id,#varradiuserror).', 'textureerror(+id,#vartextureerror).', 'perimetererror(+id,#varperimetererror).', 'areaerror(+id,#varareaerror).', 'smoothnesserror(+id,#varsmoothnesserror).', 'compactnesserror(+id,#varcompactnesserror).', 'concavityerror(+id,#varconcavityerror).', 'concavepointserror(+id,#varconcavepointserror).', 'symmetryerror(+id,#varsymmetryerror).', 'fractaldimensionerror(+id,#varfractaldimensionerror).', 'worstradius(+id,#varworstradius).', 'worsttexture(+id,#varworsttexture).', 'worstperimeter(+id,#varworstperimeter).', 'worstarea(+id,#varworstarea).', 'worstsmoothness(+id,#varworstsmoothness).', 'worstcompactness(+id,#varworstcompactness).', 'worstconcavity(+id,#varworstconcavity).', 'worstconcavepoints(+id,#varworstconcavepoints).', 'worstsymmetry(+id,#varworstsymmetry).', 'worstfractaldimension(+id,#varworstfractaldimension).', 'benign(+id).']