Data Analytics in Bioinformatics. Группа авторов

Читать онлайн.
Название Data Analytics in Bioinformatics
Автор произведения Группа авторов
Жанр Программы
Серия
Издательство Программы
Год выпуска 0
isbn 9781119785606



Скачать книгу

Sample 3 Sample 4 Sample m Gene1 C11 C12 C13 C13 C1m Gene2 C21 C22 C23 C24 C2m Gene3 C31 C32 C33 C34 C3m Gene4 C41 C42 C43 C44 C4m Gene n Cn1 Cn2 Cn3 Cn4 Cnm

      These gene expression matrices are summed up into a database as shown in Figure 2.2 which acts as a repository for all the genetic information like gene regulation, genetics functionality of diseases, drug discovery based upon gene structures, response to drug projection and so on. These databases with all the genomic information when searched for particular instance of genes, it retrieves all the relevant information with high similarities using these clustering algorithms [32].

      The sequence of actions for Clustering of Gene Expression (GE), include

      Figure 2.2 Example matrix of gene expression (10 genes in a row and 2 samples in columns) [29–31].

       A matrix which hold the data of GE structure which consists of size, dimension, number, origin, etc.

       Features are extracted to identify the similarities (feature extraction). The input data is reduced without loss of information making it more suitable to application is called feature selection or feature reduction. The feature reduction can be achieved using few common algorithms for example Principal Component Analysis [11].

       Based upon similarities or features, all the genes which express similar expression are made into one cluster and which are distinct are grouped into another cluster. The degree of similarity or the proximity levels are computed using certain distance metrics applied on gene expression data provided. Euclidean distance, Manhattan distance, Mahalanobis distance, etc. are few metrics applied to quantify the level of similarity in order to form a cluster.

       2.3.2 Clustering Algorithms

      The genetic data (genes) are closely connected. Based upon the dataset appropriate clustering algorithms need to be chosen. These clustering approaches also reveals about the relationship between the clusters based upon the similarity. This section explains various partition algorithms. Figure 2.3 summarizes various clustering based partitions those can be used in Bioinformatics.

Schematic illustration of the partition clustering algorithms.

      Figure 2.3 Partition clustering algorithms.

       2.3.3.1 k-Means Clustering

      This algorithm partitions the dataset of n objects into k clusters in the feature space. The value of k (number of clusters) is fixed beforehand. Initially the algorithm randomly assumes k number of cluster centers (cluster mean) in the feature space. The distance (computed using Euclidean distance) of each gene sample is computed from all these cluster means. The sample is then grouped under the cluster that has minimum distance from the sample. The same process is followed for all the samples. Once all the samples are grouped under one of these clusters, each cluster mean is computed again considering the entire samples in that cluster and the cluster mean is updated. The above steps are iterated until the algorithm forms k stable clusters.

      The drawback of such algorithm is that determining the clusters (k) prior to implementation reduces the effectiveness of cluster formation and is prone to noisy data in gene expression values [13].

       2.3.3.2 Cluster Center Initialization Algorithm (CCIA)

      In traditional k-means algorithm the centroids of clusters are randomly selected initially, yielding low quality clusters. As an attempt in producing high quality clustering CCIA is introduced. This algorithm can be referred as an enhancement to k-means algorithm in which centroids (k) are accurately selected which are distant from the outliers. This selection process is done by computing mean and standard deviation on the data elements. CCIA now examines most nearest set of elements from the dataset (D) and fabricates into a new dataset (X) after discarding the outliers. Later those set of elements are eliminated from the dataset (D). This process is iterated until the dataset (X) equals with (k).

      Drawback—This algorithm is incapable in dealing with closely connected clusters and cannot deal with high dimensional data [14, 15].

       2.3.3.3 Intelligent Kernel k-Mean (IKKM)

      Generally gene expression data is nonlinearly separable and the gene data is represented in a matrix format using linear kernel function as shown in Figure 2.2. IKKM is applied on the kernel matrix in which clusters need not be predefined.

      Drawback—Cannot deal with high dimensional dataset [16].

       2.3.3.4 Clustering Large Applications (CLARA)

      Rather than considering and calculating medoids for the entire dataset CLARA considers a subset from the dataset with a constant size upon which PAM(partitioning around medoids) algorithm is applied. PAM algorithm aims at reducing the degree of similarity by calculating the distance between the cluster medoids and the elements in the subset space. As a result an optimal set of medoids are obtained which requires less store space when compared to other algorithms. In contrast to other partitioning algorithms this measures the degree of dissimilarity amongst the elements and the medoid of the cluster. This algorithm is applied on huge datasets producing accurate results with less computation time [17].

      Drawback—Cannot deal with higher dimensional data and closely connected clusters which leaves this algorithm a second choice in gene expression data.

       2.3.4 Hierarchical Clustering Algorithms

      Hierarchical clustering