The method
- Correlations: The Pearson correlation coefficient is calculated
between every pair of available metrics in order to quantify how
strongly they are related. The score is in the range [-1,1].
- Perfect correlations: -1 (negative) and 1 (positive)
- Stability: This analysis estimates whether the clustering is
meaningfully affected by small variations in the data [1]. First,
the instances are clustered with the k-means algorithm; the value of
K can be provided by the user. Then, the stability index is computed
as the mean of the Jaccard coefficient [2] values over 1000
bootstrap replicates. The values are in the range [0,1], with the
following meaning:
- Unstable: [0, 0.60[
- Doubtful: [0.60, 0.75]
- Stable: ]0.75, 0.85]
- Highly Stable: ]0.85, 1]
- Goodness of classifications: The goodness of the classifications
is assessed by validating the generated clusters. For this purpose,
we use the Silhouette width as the validity index. This index
computes and compares the quality of the clustering outputs found by
the different metrics, which makes it possible to measure the
goodness of the classification for both instances and metrics. More
precisely, it assesses how similar an instance is to the other
instances in its own cluster and how dissimilar it is to the
instances in all the other clusters. The average over all the
instances quantifies how appropriately the instances are clustered.
Kaufman and Rousseeuw [3] suggested interpreting the global
Silhouette width score as the effectiveness of the clustering
structure. The values are in the range [-1,1], with the following
meaning:
- There is no substantial clustering structure: [-1, 0.25].
- The clustering structure is weak and could be artificial: ]0.25,
0.50].
- There is a reasonable clustering structure: ]0.50, 0.70].
- A strong clustering structure has been found: ]0.70, 1].
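As a rough illustration of the Silhouette width described above, the following Python sketch computes it for a single metric (one-dimensional values) whose cluster labels are already known. It is a minimal sketch and not the service's implementation: the function names `silhouette_widths` and `interpret_silhouette` are hypothetical, the absolute difference is assumed as the distance, singleton clusters are assumed to score 0, and the toy values are not taken from the use case.

```python
def silhouette_widths(values, labels):
    """Return the silhouette width s(i) of every instance (1-D values)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(values[i] - values[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(values[i] - values[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def interpret_silhouette(score):
    """Map an average silhouette width to the categories listed above."""
    if score <= 0.25:
        return "no substantial clustering structure"
    if score <= 0.50:
        return "weak, possibly artificial structure"
    if score <= 0.70:
        return "reasonable clustering structure"
    return "strong clustering structure"


values = [1.0, 2.0, 10.0, 11.0]   # toy measurements, not from the use case
labels = [0, 0, 1, 1]             # cluster assignment, e.g. from k-means
widths = silhouette_widths(values, labels)
average = sum(widths) / len(widths)
print(round(average, 3), interpret_silhouette(average))
```

The two toy clusters are compact and far apart, so the average width falls in the strong band.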
[1] Cheng, R. and Milligan, G. W. (1996). Measuring the influence of
individual data points in a cluster analysis. Journal of Classification,
13, 315–335.
[2] Jaccard, P. (1901). Distribution de la flore alpine dans le Bassin
des Dranses et dans quelques régions voisines. Bulletin de la Société
Vaudoise des Sciences Naturelles, 37, 241–272.
[3] Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An
Introduction to Cluster Analysis. Wiley.
- The input is a CSV file. Format information:
- Column delimiter: ","
- Decimal separator: "."
- The first row is the header. The first column of the header is
the ID or name of the instances of the dataset (e.g., ontology,
pathway, etc.) on which the metrics are measured. The remaining
columns of the header contain the names of the metrics. The
remaining rows contain the measurements of the metrics for each
instance in the dataset.
- Minimum requirements and limitations
- Analysis of correlations: the input file must contain at least
two metrics.
- The user has to select the number of clusters (K) to be created
by the k-means algorithm. Our current implementation uses the
same value of K for all the metrics. Please ensure that the number
of distinct measurements for each metric is larger than K.
If you wish to use different values of K for different metrics,
please split your input file into several files.
- The bootstrapping method applied for the calculation of the
stability may fail if the number of distinct measurements for a
metric is close to K. If this happens, please try a lower
value of K.
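The requirements above can be checked before submitting a file. The following Python sketch is one possible pre-flight check, not part of the service: the function name `check_requirements` and the dictionary layout (metric name mapped to its list of measurements) are assumptions made for illustration.

```python
def check_requirements(metrics, k):
    """Return a list of human-readable problems; an empty list means OK.

    `metrics` maps each metric name to its list of measurements.
    """
    problems = []
    if len(metrics) < 2:
        problems.append("correlation analysis needs at least two metrics")
    for name, measurements in metrics.items():
        distinct = len(set(measurements))
        if distinct <= k:
            problems.append(
                f"{name}: {distinct} distinct measurements, need more than K={k}"
            )
    return problems


example = {
    "RIN": [9, 8.5, 8.3, 8.1, 8.1, 7.9],       # 5 distinct values
    "DegFact": [4.9, 4.8, 5.5, 5.2, 5, 6.9],   # 6 distinct values
}
print(check_requirements(example, k=3))   # no problems reported
print(check_requirements(example, k=5))   # RIN is flagged
```

The check only covers the hard requirement. Values of K close to the number of distinct measurements can pass it and still make the bootstrap fail, as noted in the last item of the list.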
Example of input format (extracted from our use case on RNA
quality metrics):
Aliquot,RIN,DegFact
1,9,4.9
2,8.5,4.8
3,8.3,5.5
4,8.1,5.2
5,8.1,5
6,7.9,6.9
7,7.8,6.1
8,7.3,6.8
9,6.4,13.1
10,6.3,11.6
11,6.2,12.3
12,5.5,16.6
13,4.7,16.1
14,4.6,22.3
15,4,26
16,2.5,30.5
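To make the input format and the correlation analysis concrete, the Python sketch below parses the example above with the standard `csv` module and computes the Pearson coefficient between the two metrics. The helper names `read_metrics` and `pearson` are hypothetical, and the code is only an illustration of the calculation, not the service's own implementation.

```python
import csv
import io
from math import sqrt

EXAMPLE = """Aliquot,RIN,DegFact
1,9,4.9
2,8.5,4.8
3,8.3,5.5
4,8.1,5.2
5,8.1,5
6,7.9,6.9
7,7.8,6.1
8,7.3,6.8
9,6.4,13.1
10,6.3,11.6
11,6.2,12.3
12,5.5,16.6
13,4.7,16.1
14,4.6,22.3
15,4,26
16,2.5,30.5
"""


def read_metrics(text):
    """Parse the input CSV into (instance_ids, {metric_name: [values]})."""
    rows = list(csv.reader(io.StringIO(text), delimiter=","))
    header, body = rows[0], rows[1:]
    ids = [row[0] for row in body]            # first column: instance ID
    metrics = {
        name: [float(row[col]) for row in body]   # "." is the decimal separator
        for col, name in enumerate(header[1:], start=1)
    }
    return ids, metrics


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


ids, metrics = read_metrics(EXAMPLE)
r = pearson(metrics["RIN"], metrics["DegFact"])
print(f"{len(ids)} instances, r(RIN, DegFact) = {r:.4f}")
```

For this dataset the coefficient is close to -1, which agrees with the correlation matrix shown in the Data output section: higher RIN values go with lower DegFact values.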
Data output
The web service provides different kinds of output, depending on how
it is invoked:
CSV file or images with the results of the analysis of correlations
Example of CSV for the correlations, containing the correlation
matrix:
"","DegFact","RIN"
"DegFact",1,-0.974468501362178
"RIN",-0.974468501362178,1
CSV file or images with the results of the analysis of stability
of the metrics
Example of CSV for the stability: the columns are the name of the
metric, the stability scores for the K clusters, and the mean stability
score
Metric,Stability_category_1,Stability_category_2,Stability_category_3,Stability_category_4,Stability_category_5,Mean_stability
DegFact,0.7483333333333333,0.43483333333333335,0.6331666666666667,0.842,0.8033333333333333,0.6923333333333334
RIN,0.5866666666666667,0.8023333333333333,0.8226666666666667,0.632,0.6767380952380952,0.7040809523809524
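The following Python sketch shows how per-cluster stability scores of this kind can be obtained, following the procedure in the method description: k-means on one metric, bootstrap resampling, and the mean of the best-matching Jaccard coefficients. It is a simplified sketch under stated assumptions and will not reproduce the numbers above exactly: all function names are hypothetical, the k-means start is deterministic, replicates with too few distinct values are skipped, and each original cluster is compared only on the instances that were drawn.

```python
import random


def kmeans_1d(values, k, max_iter=100):
    """Lloyd's k-means on one-dimensional data; returns one label per value."""
    distinct = sorted(set(values))
    if len(distinct) < k:
        raise ValueError("fewer distinct values than clusters")
    # Deterministic start (an assumption): centres spread over the sorted values.
    centres = [distinct[(2 * c + 1) * len(distinct) // (2 * k)] for c in range(k)]
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_stability(values, k, replicates=1000, seed=0):
    """Mean best-match Jaccard coefficient of each original cluster."""
    rng = random.Random(seed)
    n = len(values)
    labels = kmeans_1d(values, k)
    original = [{i for i in range(n) if labels[i] == c} for c in range(k)]
    totals, counts = [0.0] * k, [0] * k
    for _ in range(replicates):
        sample = [rng.randrange(n) for _ in range(n)]   # draw with replacement
        try:
            boot_labels = kmeans_1d([values[i] for i in sample], k)
        except ValueError:
            continue   # too few distinct values in this replicate
        boot = [{i for i, lab in zip(sample, boot_labels) if lab == c}
                for c in range(k)]
        drawn = set(sample)
        for c, members in enumerate(original):
            kept = members & drawn   # compare only instances that were drawn
            if kept:
                totals[c] += max(jaccard(kept, b) for b in boot)
                counts[c] += 1
    per_cluster = [t / m if m else 0.0 for t, m in zip(totals, counts)]
    return per_cluster, sum(per_cluster) / k


def stability_category(score):
    """Map a stability score to the categories of the method description."""
    if score < 0.60:
        return "Unstable"
    if score <= 0.75:
        return "Doubtful"
    if score <= 0.85:
        return "Stable"
    return "Highly Stable"


rin = [9, 8.5, 8.3, 8.1, 8.1, 7.9, 7.8, 7.3, 6.4, 6.3, 6.2, 5.5, 4.7, 4.6, 4, 2.5]
per_cluster, mean_score = bootstrap_stability(rin, k=3, replicates=200)
print(per_cluster, mean_score, stability_category(mean_score))
print(stability_category(0.6923333333333334))   # DegFact mean from the CSV above
```

In the CSV above, the Mean_stability column is the arithmetic mean of the five per-cluster scores, and both means fall in the Doubtful band.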
CSV file or images with the results of the analysis of the
goodness of the classifications of the metrics.
Example of CSV for the goodness: the columns are the name of the
metric, the silhouette width for each of the K clusters, the average
silhouette width, and the number of instances in each cluster
Metric,Cluster_1_SilScore,Cluster_2_SilScore,Cluster_3_SilScore,Cluster_4_SilScore,Cluster_5_SilScore,Avg_Silhouette_Width,Cluster_1_Size,Cluster_2_Size,Cluster_3_Size,Cluster_4_Size,Cluster_5_Size
DegFact,0.718287037037037,0.0375000000000007,0.904545454545454,0.585791823535685,0.521618857725795,0.591246,4,2,2,5,3
RIN,0,0.682539682539682,0.615758840004668,0.433333333333333,0.348714574898785,0.5282413,1,3,4,3,5
Example of output format (extracted from our use case on RNA
quality metrics).
Figure: Comparison of stability of the metrics using K=3 (left) and
K=5 (right)
The metrics are highly stable with K=3, but are doubtful with K=5.
This means that applying these metrics to this dataset is more
effective when classifying the instances into three groups than
into five.
Figure: Comparison of goodness of the classifications of the metric
DegFact using K=3 (left) and K=5 (right)
The silhouette width for K=3 is 0.74, which indicates a strong
clustering structure, whereas the structure is only reasonable with
K=5 (score 0.58). Using K=3 is therefore more appropriate for this
metric.
The REST API
The documentation of the REST API is available on our
API page.
Browser Compatibility
The online interface has been successfully tested in the following
web browsers on desktop computers.
| Operating system | Google Chrome | Safari | Mozilla Firefox | Microsoft Edge |
|---|---|---|---|---|
| Windows 10 | 71.0.3578.98 | Not tested | 64.0 | 2.17134.1.0 |
| Linux (Ubuntu 18.04) | 68.0.3440.106 | Not tested | Quantum | Not tested |
| Mac OS 10.13.6 | 71.0.3578.98 | 12.0.2 | 64.0 | Not tested |