Kernels computation

MKLpy contains several functions to generate kernels for vectorial, boolean, and string data. The base syntax for a kernel function is K = k(X, Z=None, **args), where X and Z are two matrices containing examples (rows), and K is the resulting kernel matrix. As previously mentioned, the input data can be an ndarray, a torch.Tensor, or any other iterable castable into a tensor. Note that this is the same syntax used by scikit-learn.

In the following snippets, we assume that Xtr and Xte are the training and test input matrices respectively.

Vectorial kernels

Several kernel functions exist to deal with vectorial data, where each example $x$ is described by a real-valued feature vector $x \in \mathbb{R}^d$. The following table describes the kernel functions for vectorial data provided by MKLpy:

Kernel function	Definition	Parameters
linear_kernel	$\langle x,z \rangle$	-
homogeneous_polynomial_kernel	$\langle x,z \rangle^d$	d: `int`
polynomial_kernel	$(\gamma \langle x,z \rangle + \mathrm{coef0})^d$	d: `int`, gamma: `float`, coef0: `float`
rbf_kernel	$\exp(-\gamma \|x-z\|_2^2)$	gamma: `float`
euclidean_distances	$\|x-z\|_2^2$	-
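To make the definitions in the table concrete, the following is a minimal pure-Python sketch that evaluates each formula on a single pair of vectors. The helper names (`dot`, `linear_pair`, `hpk_pair`, `poly_pair`, `rbf_pair`) and the default parameter values are illustrative assumptions and are not part of MKLpy, whose functions operate on whole matrices.

```python
import math

# Illustrative helpers only (not MKLpy functions): each one evaluates
# a formula from the table on a single pair of feature vectors.

def dot(x, z):
    """Inner product <x, z> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_pair(x, z):
    return dot(x, z)

def hpk_pair(x, z, degree=2):
    """Homogeneous polynomial kernel: <x, z> ** degree."""
    return dot(x, z) ** degree

def poly_pair(x, z, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_pair(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [3.0, 0.5]
print(linear_pair(x, z))                      # <x, z> = 1*3 + 2*0.5
print(hpk_pair(x, z, degree=2))
print(poly_pair(x, z, degree=2, gamma=0.5, coef0=1.0))
print(rbf_pair(x, z, gamma=0.1))
```

With `coef0=0` and `gamma=1` the polynomial formula reduces to the homogeneous one, and the RBF value of any vector with itself is 1.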

These kernels are available in the MKLpy.metrics.pairwise module. An example invocation is shown below:

from MKLpy.metrics.pairwise import homogeneous_polynomial_kernel as hpk
K_train = hpk(Xtr, degree=2)
K_test  = hpk(Xte, Xtr, degree=2)
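The two calls above differ in shape: with a single argument the kernel is computed between the training examples themselves, whereas with two arguments rows correspond to the first matrix and columns to the second. The sketch below illustrates this with a pure-Python stand-in, `hpk_matrix`, which is a hypothetical helper written for this example and not the MKLpy implementation; the toy data is also invented.

```python
def hpk_matrix(X, Z=None, degree=2):
    """Stand-in for a kernel function: K[i][j] = <X[i], Z[j]> ** degree.

    Hypothetical helper for illustration; it mirrors the K = k(X, Z=None)
    calling convention but is not MKLpy code.
    """
    if Z is None:
        Z = X  # kernel between the examples of X and themselves
    return [[sum(a * b for a, b in zip(x, z)) ** degree for z in Z] for x in X]

# Toy data: 3 training examples and 2 test examples with 2 features each.
Xtr = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
Xte = [[2.0, 1.0], [0.0, 1.0]]

K_train = hpk_matrix(Xtr, degree=2)       # 3 x 3, symmetric
K_test = hpk_matrix(Xte, Xtr, degree=2)   # 2 x 3: test rows, training columns

print(len(K_train), len(K_train[0]))
print(len(K_test), len(K_test[0]))
```

Keeping the test examples on the rows and the training examples on the columns matches what kernel machines trained on K_train typically expect at prediction time.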

Alternatively, you can use kernel functions from the scikit-learn package:

from sklearn.metrics.pairwise import rbf_kernel
K_train = rbf_kernel(Xtr, gamma=.1)

See

Scikit-learn provides several kernel functions, which may not accept torch.Tensor as input. For further details, see the scikit-learn documentation of the sklearn.metrics.pairwise module.

Boolean kernels

Boolean kernels are kernel functions specifically designed for binary-valued and categorical (one-hotted) datasets. The implicit feature space of these kernels consists of logical formulae, such as conjunctions, disjunctions, or their combinations.

Letting $n$ be the dimension of the feature vectors, the boolean kernels available in MKLpy are:

Kernel function	Definition	Parameters
monotone_conjunctive_kernel	$\binom{\langle x,z \rangle}{c}$	c: `int` (arity of the conjunctions)
monotone_disjunctive_kernel	$\binom{n}{d}-\binom{n-\langle x,x \rangle}{d}-\binom{n-\langle z,z \rangle}{d} +\binom{n-\langle x,x \rangle-\langle z,z \rangle+\langle x,z \rangle}{d}$	d: `int` (arity of the disjunctions)
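As a worked illustration of the two formulas in the table, the sketch below evaluates them on a single pair of binary vectors using `math.comb` (Python 3.8+). The function names `mc_pair` and `md_pair`, as well as the default arities, are assumptions made for this example and are not MKLpy functions.

```python
from math import comb

def dot(x, z):
    """Inner product; for binary vectors it counts the shared active features."""
    return sum(a * b for a, b in zip(x, z))

def mc_pair(x, z, c=2):
    """Monotone conjunctive kernel: number of conjunctions of c variables
    that are true in both x and z."""
    return comb(dot(x, z), c)

def md_pair(x, z, d=2):
    """Monotone disjunctive kernel: number of disjunctions of d variables
    that are true in both x and z (inclusion-exclusion on the false ones)."""
    n = len(x)
    xx, zz, xz = dot(x, x), dot(z, z), dot(x, z)
    return (comb(n, d)
            - comb(n - xx, d)
            - comb(n - zz, d)
            + comb(n - xx - zz + xz, d))

# Binary examples with n = 4 features (integer 0/1 entries).
x = [1, 1, 0, 1]
z = [1, 0, 0, 1]

print(mc_pair(x, z, c=2))  # <x,z> = 2, so binom(2, 2)
print(md_pair(x, z, d=2))  # binom(4,2) - binom(1,2) - binom(2,2) + binom(1,2)
```

With arity 1 the conjunctive kernel coincides with the linear kernel, since $\binom{\langle x,z \rangle}{1} = \langle x,z \rangle$.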

These kernels work only with binary-valued examples, $x\in\{0,1\}^n$. You may use boolean kernels with vectorial data if you first binarize the features.

A simple binarizer available in MKLpy is MKLpy.preprocessing.binarization.AverageBinarizer, which binarizes features by applying a hard threshold based on the average value of each original feature.

from MKLpy.preprocessing.binarization import AverageBinarizer
X = ... # my original non-binary examples matrix
binarizer = AverageBinarizer().fit(X)
X_bin = binarizer.transform(X)
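The idea of average-based thresholding can be sketched in a few lines of pure Python. This is only an illustration of the technique under stated assumptions: the helper names `fit_thresholds` and `binarize` are hypothetical, and the choice of a strict comparison (a value maps to 1 only when it is greater than the column mean) is an assumption that may differ from what AverageBinarizer actually does.

```python
def fit_thresholds(X):
    """Compute one threshold per feature: the column-wise average of X."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def binarize(X, thresholds):
    """Map each value to 1 if it exceeds its feature threshold, else 0.
    Assumption: strict '>' comparison; the library may handle ties differently."""
    return [[1 if v > t else 0 for v, t in zip(row, thresholds)] for row in X]

# Toy non-binary data: 3 examples, 2 features.
X = [[1.0, 10.0],
     [2.0, 20.0],
     [6.0, 30.0]]

thresholds = fit_thresholds(X)     # column means
X_bin = binarize(X, thresholds)    # values in {0, 1}
print(thresholds)
print(X_bin)
```

Separating the fit and transform steps, as the MKLpy snippet above does, lets the thresholds learned on the training data be reused on test data.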

Paper

The boolean kernels provided in MKLpy have been presented in the following paper:
Mirko Polato, Ivano Lauriola, and Fabio Aiolli: "A novel boolean kernels family for categorical data". Entropy (2018)

If you use these kernels in scientific projects, please cite the aforementioned paper.

@article{polato2018novel,
  title={A novel boolean kernels family for categorical data},
  author={Polato, Mirko and Lauriola, Ivano and Aiolli, Fabio},
  journal={Entropy},
  volume={20},
  number={6},
  pages={444},
  year={2018},
  publisher={Multidisciplinary Digital Publishing Institute}
}

String kernels

Strings are structured objects consisting of ordered sequences of characters or symbols. MKLpy provides multiple string kernels based on sub-structures. In short, each feature describes the frequency (or a related measure) of the occurrence of a certain sub-structure in the input string.

The string kernels provided in the MKLpy.metrics.pairwise module are summarized in the table below:

Kernel function	Parameters
spectrum_kernel	p: `int` the length of sub-structures
fixed_length_subsequences_kernel	p: `int` the length of sub-structures
all_subsequences_kernel	-

The syntax is similar to that of the other kernel functions. The only difference is that these kernels require strings as input instead of matrices.

from MKLpy.metrics.pairwise import spectrum_kernel

X = ['aabb', 'abba', 'baac']
K = spectrum_kernel(X, p=2)

Warning

Note that, due to the nature of strings, tensors cannot be used.

These kernels compute the explicit representation of each string and then perform the dot-product between pairs of representations, i.e. $k(x,z) = \langle\phi(x),\phi(z)\rangle$.

Additionally, we can directly access the explicit embeddings via specialized functions.

from MKLpy.metrics.pairwise import (
    spectrum_embedding,
    fixed_length_subsequences_embedding,
    all_subsequences_embedding,
)

s = spectrum_embedding('aaabc', p=2)    #computes the 2-spectrum embedding

For computational purposes, we encode string embeddings as dictionaries containing non-zero features, i.e.

print(s)
{'aa': 2, 'ab': 1, 'bc': 1}
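The relation between dictionary embeddings and kernel values can be sketched with the standard library alone. The functions below (`spectrum_embedding_sketch`, `sparse_dot`, `spectrum_kernel_sketch`) are hypothetical stand-ins written for this illustration, not the MKLpy implementations: the first counts contiguous substrings of length p, and the second computes the dot-product over the keys shared by two sparse embeddings. Values returned by MKLpy may differ if the library applies additional processing.

```python
from collections import Counter

def spectrum_embedding_sketch(s, p=2):
    """Count every contiguous substring of length p in s (non-zero features only)."""
    return dict(Counter(s[i:i + p] for i in range(len(s) - p + 1)))

def sparse_dot(phi_x, phi_z):
    """Dot-product of two dictionary embeddings; missing keys count as zero."""
    if len(phi_x) > len(phi_z):
        phi_x, phi_z = phi_z, phi_x  # iterate over the smaller dictionary
    return sum(v * phi_z.get(k, 0) for k, v in phi_x.items())

def spectrum_kernel_sketch(X, Z=None, p=2):
    """K[i][j] = <phi(X[i]), phi(Z[j])> with p-spectrum embeddings."""
    Z = X if Z is None else Z
    phi_X = [spectrum_embedding_sketch(s, p) for s in X]
    phi_Z = [spectrum_embedding_sketch(s, p) for s in Z]
    return [[sparse_dot(a, b) for b in phi_Z] for a in phi_X]

s = spectrum_embedding_sketch('aaabc', p=2)
print(s)  # the same non-zero features as the dictionary shown above

K = spectrum_kernel_sketch(['aabb', 'abba', 'baac'], p=2)
print(K)
```

Storing only the non-zero features keeps the cost of each dot-product proportional to the number of distinct substrings in the shorter embedding, rather than to the size of the full feature space.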

Book

If you need further information concerning string kernels, you may refer to:

John Shawe-Taylor and Nello Cristianini. "Kernel Methods for Pattern Analysis". Cambridge University Press (2004).