The mechanism of leukemia, as in other cancers, is closely related to gene expression. For example, chronic myelogenous leukemia (CML) is caused by the constitutive activation of a tyrosine kinase due to a gene mutation. Identifying the genes associated with a disease, the so-called disease-associated genes, provides a clue to the development of effective drugs against it. In the case of CML, this led to the tyrosine kinase inhibitor imatinib, an early example of a molecularly targeted drug.
In general, when genotypes based on gene expression share certain features with the patients' phenotypes, these features are considered to reflect the disease-associated genes. Based on this idea, Golub et al. proposed in 1999 to classify diseases using gene expression levels as features [1].
The process of determining the association between gene expression and phenotypes is essentially unsupervised or self-supervised learning. Here, we use clustering, one of the basic methods of unsupervised learning, to classify the phenotypes of patients based on their gene expression levels.
We use the dataset of acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML) by Golub et al. The dataset is divided into two parts for training and validation; here we use the training set, which consists of 7,129 gene expression levels for each of 27 ALL and 11 AML patients. In the original dataset the gene expression levels are given as integer values, but we treat them as real numbers in the calculations below.
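As an illustration only, assuming the training set has been saved as a CSV file whose rows are the genes and whose numeric columns are the 38 patients (the file name and layout here are hypothetical), the matrix X used throughout this section could be prepared as follows:

import numpy as np
import pandas as pd

# Hypothetical file name and layout: rows are the 7,129 genes and
# the numeric columns are the 38 patients of the training set.
df = pd.read_csv('data_set_ALL_AML_train.csv')
# Transpose so that each row of X is the expression vector of one
# patient, treating the integer levels as real numbers.
X = df.select_dtypes('number').to_numpy().T.astype(float)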
We define a vector $x_i$ for each patient $i$ whose elements are the 7,129 gene expression levels, and we measure the dissimilarity between two patients by the Euclidean distance $d(x_i, x_j) = \lVert x_i - x_j \rVert$.

More generally, with $n$ vectors, agglomerative hierarchical clustering starts from $n$ clusters, each containing a single vector, and repeatedly merges the two closest of them.

For the first cluster, the pair of vectors with the smallest Euclidean distance is merged.
After constructing the first cluster, the distances between the cluster and the vectors that do not belong to it are calculated, and a new cluster is again constructed from the pair in the nearest neighborhood. The distance between a vector and a cluster, or between two clusters, is called the linkage.
We use Ward's definition of the linkage [2].
That is, when we newly construct a cluster $x \cup y$ by merging clusters $x$ and $y$, its linkage to another cluster $i$ is

$$d(x \cup y, i) = \sqrt{\frac{(|x| + |i|)\, d(x, i)^2 + (|y| + |i|)\, d(y, i)^2 - |i|\, d(x, y)^2}{|x| + |y| + |i|}},$$

where $|x|$ denotes the number of vectors contained in cluster $x$. Since this recursion starts from the Euclidean distances between single vectors, it can be seen that Ward's definition is a natural extension of the Euclidean distance.
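As a quick check, for the three one-dimensional points 0, 1, and 3, the first step merges 0 and 1 at distance 1, and the formula gives the linkage of the new cluster to the remaining point as $\sqrt{(2 \cdot 3^2 + 2 \cdot 2^2 - 1^2)/3} \approx 2.887$. A minimal sketch confirming this with scipy, which implements the same definition:

import numpy as np
from scipy.cluster.hierarchy import linkage

Z = linkage(np.array([[0.0], [1.0], [3.0]]), method='ward')
print(Z[0, 2])  # 1.0: the distance at the first merge
print(Z[1, 2])  # ~2.8868: the linkage from the cluster {0, 1} to 3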
After continuing the agglomerative operation of constructing a new cluster from the vectors or clusters in the nearest neighborhood under this linkage, the $n$ vectors are merged into a single cluster after $n - 1$ steps.
The actual results computed on the Golub et al. dataset are shown in Figure 1. Because of its hierarchical structure, the clustering result can be represented as a dendrogram, a tree similar to a phylogenetic tree.
Figure 1. Dendrogram representation of the results of hierarchical clustering by gene expression levels for each patient.
If we take three clusters with a threshold distance of 70.0, shown as a dashed line in Figure 1, we can classify the ALL T-cell, ALL B-cell, and AML phenotypes almost perfectly, even though the clustering is unsupervised learning that does not use the phenotypes as labels.
Figure 2 shows the result of projecting the genotype vectors into two dimensions using principal component analysis (PCA), which, like hierarchical clustering, is an unsupervised learning method.
Figure 2. Projection of the genotype vectors into two dimensions with PCA.
Although PCA, a linear method, can also separate the ALL and AML groups, the distinction between the T-cell and B-cell subgroups of ALL is unclear.
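A minimal sketch of the projection in Figure 2, assuming scikit-learn is available and using the matrix X prepared above:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the 7,129-dimensional genotype vectors onto the first
# two principal components.
P = PCA(n_components=2).fit_transform(X)
plt.scatter(P[:, 0], P[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()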
There are several well-known Python packages for hierarchical clustering, such as scipy. Here, however, we implement the algorithm with numpy in order to understand how it works.
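For reference, a minimal sketch of the same clustering with scipy, including the cut at the threshold distance 70.0 used in Figure 1:

from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

Z = linkage(X, method='ward')        # Ward linkage on Euclidean distances
dendrogram(Z, color_threshold=70.0)  # the dashed threshold in Figure 1
plt.show()
labels = fcluster(Z, t=70.0, criterion='distance')  # the three cluster labels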
As a data structure required for the calculation, we define a dictionary C that maps the index of each cluster to the number of vectors it contains:
# Initially, every vector forms a cluster of size one.
C = dict()
for i in range(X.shape[0]):
    C[i] = 1
Since the rows of the dataset X correspond to the individual patients, each row index initially identifies a cluster containing exactly one vector.
We also define the matrix D that holds the distances between clusters. Since the distance is symmetric, we implement it as a dictionary keyed by ordered index pairs:
class DistanceMatrix(object):
    def __init__(self):
        self.matrix = dict()

    def __setitem__(self, key, value):
        # Store each pair with the smaller index first.
        i, j = key
        if i > j:
            i, j = j, i
        self.matrix[i, j] = value

    def __getitem__(self, key):
        # The distance of a cluster to itself is zero.
        i, j = key
        if i == j:
            return 0
        if i > j:
            i, j = j, i
        return self.matrix[i, j]
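For example, the matrix can be read with the indices in either order:

D = DistanceMatrix()
D[3, 1] = 2.5
print(D[1, 3])  # 2.5: stored symmetrically
print(D[2, 2])  # 0: a cluster has distance zero to itself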
Considering that the Ward linkage starts from the Euclidean distances between single vectors, this matrix D is initialized with the Euclidean distances between all pairs of vectors:
euclidean = lambda x, y: np.sqrt(np.sum((x - y) * (x - y)))

D = DistanceMatrix()
for i in range(X.shape[0]):
    for j in range(X.shape[0]):
        if i < j:
            D[i, j] = euclidean(X[i, :], X[j, :])
We then repeatedly create a new cluster from the pair of vectors or clusters with the smallest distance contained in this matrix D; the new cluster constructed at step k is given the index X.shape[0] + k:
for k in range(X.shape[0] - 1):
    # Find the two clusters in the closest neighborhood.
    minimum = np.inf
    for i in C.keys():
        for j in C.keys():
            if i < j and D[i, j] < minimum:
                minimum = D[i, j]
                x, y = i, j
    # Create the new cluster from x and y.
    C[X.shape[0] + k] = C[x] + C[y]
    # Update the distance matrix with the linkages to the new cluster.
    for i in C.keys():
        if i < X.shape[0] + k:
            D[i, X.shape[0] + k] = ward(x, y, i, D, C)
    # Clusters x and y are now included in the new cluster.
    del C[x], C[y]
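The linkage function ward called in the loop is not shown above; a minimal sketch, to be defined before running the loop, implementing the Ward update formula given earlier:

def ward(x, y, i, D, C):
    # Linkage between cluster i and the new cluster merging x and y,
    # following the update formula of Ward's definition.
    n = C[x] + C[y] + C[i]
    return np.sqrt(((C[x] + C[i]) * D[x, i] ** 2
                    + (C[y] + C[i]) * D[y, i] ** 2
                    - C[i] * D[x, y] ** 2) / n)

To draw the dendrogram in Figure 1, one could additionally record a row [x, y, minimum, C[X.shape[0] + k]] at each step; this matches the linkage-matrix format expected by scipy's dendrogram, since the indexing of new clusters above follows the same convention.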