R package evaluomeR how-to guide
evaluomeR 0.99.3
The evaluomeR package permits to evaluate the reliability of metrics by analysing the stability and goodness of the classifications of such metrics. The method takes the measurements of the metrics for the dataset and evaluates the reliability of the metrics according to the following analyses: Correlations, Stability and Goodness of classifications.
Correlations: Calculation of Pearson correlation coefficient between every pair of metrics available in order to quantify their interrelationship degree. The score is in the range [-1,1].
Stability: This analysis permits to estimate whether the clustering is meaningfully affected by small variations in the sample (Milligan and Cheng 1996). First, a clustering using the k-means algorithm is carried out. The value of K can be provided by the user. Then, the stability index is the mean of the Jaccard coefficient (Jaccard 1901) values of a number of bootstrap replicates. The values are in the range [0,1], having the following meaning:
Goodness of classifications: The goodness of the classifications are assessed by validating the clusters generated. For this purpose, we use the Silhouette width as validity index. This index computes and compares the quality of the clustering outputs found by the different metrics, thus enabling to measure the goodness of the classification for both instances and metrics. More precisely, this goodness measurement provides an assessment of how similar an instance is to other instances from the same cluster and dissimilar to the rest of clusters. The average on all the instances quantifies how the instances appropriately are clustered. Kaufman and Rousseeuw (Kaufman and Rousseeuw 2009) suggested the interpretation of the global Silhouette width score as the effectiveness of the clustering structure. The values are in the range [0,1], having the following meaning:
The installation of evaluomeR package is performed via Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("evaluomeR")
The package evaluomeR depends on the following CRAN packages for the calculus: fpc (Hennig 2018), cluster (Maechler et al. 2018), corrplot (Wei and Simko 2017). Moreover, this package also depends on grDevices, graphics, stats and utils from R Core (R Core Team 2018) for plotting and on the Bioconductor package SummarizedExperiment (Morgan et al. 2018) for input/output data.
The input is a SummarizedExperiment
object. The assay contained in SummarizedExperiment
must follow a certain structure, see Table 1: A valid header must be specified. The first column of the header is the ID or name of the instance of the dataset (e.g., ontology, pathway, etc.) on which the metrics are measured. The other columns of the header contains the names of the metrics. The rows contains the measurements of the metrics for each instance in the dataset.
ID | MetricNameA | MetricNameB | MetricNameC | … |
---|---|---|---|---|
instance1 | 1.2 | 6.4 | 0.5 | … |
instance2 | 2.4 | 5.4 | 0.8 | … |
instance3 | 1.9 | 8.9 | 1.1 | … |
In our package we provide three different sample input data:
ont-metrics: Structural ontology metrics, 19 metrics measuring structural aspects of bio-ontologies have been analysed on two different corpora of ontologies: OBO Foundry and AgroPortal (Franco et al. 2019).
rna-metrics: RNA quality metrics for the assessment of gene expression differences, 2 quality metrics from 16 aliquots of a unique batch of RNA Samples. The metrics are Degradation Factor (DegFact) and RNA Integrity Number (RIN) (Imbeaud et al. 2005).
biopathways-metrics: Metrics for biological pathways, 2 metrics that quantitative characterizations of the importance of regulation in biochemical pathway systems, including systems designed for applications in synthetic biology or metabolic engineering. The metrics are reachability and efficiency (Davis and Voit 2018).
The user shall run the loadSample
method to load evaluomeR sample input data. This requires to provide the descriptor of the desired dataset. The descriptor can take the following values: “ont-metrics”, “rna-metrics” or “biopathways-metrics”.
library(evaluomeR)
ontMetrics <- loadSample("ont-metrics")
rnaMetrics <- loadSample("rna-metrics")
biopathwaysMetrics <- loadSample("biopathways-metrics")
We provide the correlations
function to evaluate the correlations among the metrics defined in the SummarizedExperiment
:
library(evaluomeR)
rnaMetrics <- loadSample("rna-metrics")
correlationSE <- correlations(rnaMetrics, margins = c(4,4,12,10))
# Access the correlation matrix via its first assay:
# assay(correlationSE,1)
The calculation of the stability indices is performed by stability
and stabilityRange
functions.
The stability index analysis is performed by the stability
function. For instance, running a stability analysis for the metrics of rnaMetrics
with a number of 100
bootstrap replicates with a k-means cluster whose k
is 2 (note that k
must be inside [2,15] range):
stabilityData <- stability(rnaMetrics, k=2, bs = 100)
The stability
function returns the stabilityData
object, a SummarizedExperiment
that contains an assay with the information shown in the plot:
assay(stabilityData, 1)
Metric | Mean_stability_k_2 |
---|---|
RIN | 0.839702380952381 |
DegFact | 0.858073232323232 |
The plot represents the stability mean from each metric for a given k
value. This mean is calculated by performing the average of every stability index from k
ranges [1,k] for each metric.
The stabilityRange
function is an iterative method of stability
function. It performs a stability analysis for a range of k
values (k.range
).
For instance, analyzing the stability of rnaMetrics
in range [2,4], with bs=100
:
stabilityRangeData = stabilityRange(rnaMetrics, k.range=c(2,4), bs = 100)
Two kind of graphs are plotted in stabilityRange
function. The first type (titled as “St. Indices for k=X across metrics”) shows, for every k
value, the stability indices across the metrics. The second kind (titled as St. Indices for metric ‘X’ in range [x,y]), shows a plot of the behaviour of each metric across the k
range.
There are two methods to calculate the goodness of classifications: quality
and qualityRange
.
This method plots how the metrics behave for the current k
value, according to the average silhouette width. Also, it will plot how the clusters are grouped for each metric (one plot per metric). For instance, running a quality analysis for the two metrics of rnaMetrics
dataset, being k=4
:
qualityData = quality(rnaMetrics, k = 4)
The data of the first plot titled as “Qual. Indices for k=4 across metrics” according to Silhouette avg. width, is stored in Avg_Silhouette_Width column from the first assay of the SummarizedExperiment
, qualityData
. The other three plots titled by their metric name display the input rows grouped by colours for each cluster, along with their Silhouette width scores.
The variable qualityData
contains information about the clusters of each metric: The average silhouette width per cluster, the overall average sihouette width (taking into account all the clusters) and the number of individuals per cluster:
assay(qualityData,1)
Metric | Cluster_1_SilScore | Cluster_2_SilScore | Cluster_3_SilScore | Cluster_4_SilScore | Avg_Silhouette_Width | Cluster_1_Size | Cluster_2_Size | Cluster_3_Size | Cluster_4_Size |
---|---|---|---|---|---|---|---|---|---|
RIN | 0 | 0.682539682539682 | 0.661697073729871 | 0.68338517747747 | 0.635093047647393 | 1 | 3 | 4 | 8 |
DegFact | 0.759196481622952 | 0.59496499852177 | 0.600198799385732 | 0.521618857725795 | 0.634170498361632 | 5 | 3 | 5 | 3 |
The qualityRange
function is an iterative method that uses the same functionality of quality
for a range of values (k.range
), instead for one unique k
value. This methods allows to analyse the goodness of the classifications of the metric for different values of the range.
In the next example we will keep using the rnaMetrics
dataset, and a k.range
set to [4,6].
k.range = c(4,6)
qualityRangeData = qualityRange(rnaMetrics, k.range)
The qualityRange
function also returns two kind of plots, as seen in Stability range section. One for each k
in the k.range
, showing the quality indices (goodness of the classification) across the metrics, and a second type of plot to show each metric with its respective quality index in each k
value.
The qualityRangeData
object returned by qualityRange
is list of SummarizedExperiment
, whose size is diff(k.range)+1
. In the example shown above, the size of qualityRangeData
is 3, since the array length would contain the dataframes from k=4
to k=6
.
diff(k.range)+1
## [1] 3
length(qualityRangeData)
## [1] 3
The user can access a specific dataframe for a given k
value in three different ways: by dollar notation, brackets notation or using our wrapper method getDataQualityRange
. For instance, if the user wishes to retrieve the dataframe which contains information of k=5
, being the k.range
[4,6]:
k5Data = qualityRangeData$k_5
k5Data = qualityRangeData[["k_5"]]
k5Data = getDataQualityRange(qualityRangeData, 5)
assay(k5Data, 1)
Metric | Cluster_1_SilScore | Cluster_2_SilScore | Cluster_3_SilScore | Cluster_4_SilScore | Cluster_5_SilScore | Avg_Silhouette_Width | Cluster_1_Size | Cluster_2_Size | Cluster_3_Size | Cluster_4_Size | Cluster_5_Size |
---|---|---|---|---|---|---|---|---|---|---|---|
RIN | 0 | 0.682539682539682 | 0.615758840004668 | 0.433333333333333 | 0.348714574898785 | 0.472139205133228 | 1 | 3 | 4 | 3 | 5 |
DegFact | 0.718287037037037 | 0.0375000000000007 | 0.904545454545454 | 0.585791823535685 | 0.521618857725795 | 0.578190921755929 | 4 | 2 | 2 | 5 | 3 |
Once the user believes to have found a proper k
value, then the user can run the quality
function to see further silhouette information on the plots.
In this section we describe a series of parameters that are shared among our analysis functions: correlations
, stability
, stabilityRange
, quality
and qualityRange
.
The generation of the images can be disabled by setting to FALSE
the parameter getImages
:
stabilityData <- stability(rnaMetrics, k=5, bs = 50, getImages = FALSE)
This prevents from generating any graph, performing only the calculus. By default getImages
is set to TRUE
.
The plots can be stored in disk by setting a path
to an existing directory:
location <- "/home/user/directory" # Set a path to a valid directory
qualityRangeData <- qualityRange(data=rnaMetrics, k.range=c(2,3), getImages = FALSE, path=location)
The default value of path
is NULL
.
The plots can be labeled by the label
parameter. The value of label
is put into the main title of the plots:
stabilityRangeData <- stabilityRange(data=rnaMetrics, k.range=c(6,7), bs=20, getImages = TRUE, label="Experiment 1:")
Additionally, if the user establishes both label
and path
parameters, the label
text is appended to the begining of name of the plot file. The default value of label
is NULL
.
The source code is available at github. For bug/error reports please refer to joseantonio.bernabe1@um.es. Questions regarding functionality can be sent to joseantonio.bernabe1@um.es, mfranco@um.es or jmvivomo@um.es.
The package ‘evaluomeR’ is licensed under GPL-3.
Currently there is no literature for evaluomeR. Please cite the R package, the github or the website. This package will be updated as soon as a citation is available.
The evaluomeR functionality can also be access through a web interface1 Evaluome web an API REST2 API documentation.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
## [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
## [5] LC_TIME=Spanish_Spain.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] evaluomeR_0.99.3 SummarizedExperiment_1.12.0
## [3] DelayedArray_0.8.0 BiocParallel_1.16.6
## [5] matrixStats_0.54.0 Biobase_2.42.0
## [7] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
## [9] IRanges_2.16.0 S4Vectors_0.20.1
## [11] BiocGenerics_0.28.0 magrittr_1.5
## [13] kableExtra_1.0.1 BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] mclust_5.4.2 Rcpp_1.0.0 mvtnorm_1.0-8
## [4] lattice_0.20-38 class_7.3-14 digest_0.6.18
## [7] R6_2.4.0 evaluate_0.13 highr_0.7
## [10] httr_1.4.0 pillar_1.3.1 Rdpack_0.10-1
## [13] zlibbioc_1.28.0 rlang_0.3.1 diptest_0.75-7
## [16] rstudioapi_0.9.0 kernlab_0.9-27 Matrix_1.2-15
## [19] rmarkdown_1.11 webshot_0.5.1 readr_1.3.1
## [22] stringr_1.4.0 RCurl_1.95-4.11 munsell_0.5.0
## [25] compiler_3.5.2 xfun_0.4 pkgconfig_2.0.2
## [28] htmltools_0.3.6 nnet_7.3-12 tibble_2.0.1
## [31] GenomeInfoDbData_1.2.0 bookdown_0.9 viridisLite_0.3.0
## [34] crayon_1.3.4 MASS_7.3-51.1 bitops_1.0-6
## [37] grid_3.5.2 scales_1.0.0 bibtex_0.4.2
## [40] stringi_1.2.4 XVector_0.22.0 flexmix_2.3-14
## [43] robustbase_0.93-3 xml2_1.2.0 tools_3.5.2
## [46] fpc_2.1-11.1 glue_1.3.0 trimcluster_0.1-2.1
## [49] DEoptimR_1.0-8 hms_0.4.2 yaml_2.2.0
## [52] colorspace_1.4-0 cluster_2.0.7-1 BiocManager_1.30.4
## [55] prabclus_2.2-7 gbRd_0.4-11 rvest_0.3.2
## [58] corrplot_0.84 knitr_1.21 modeltools_0.2-22
Davis, Jacob D, and Eberhard O Voit. 2018. “Metrics for Regulated Biochemical Pathway Systems.” Bioinformatics. doi:10.1093/bioinformatics/bty942.
Franco, Manuel, Juana María Vivo, Manuel Quesada-Martínez, Astrid Duque-Ramos, and Jesualdo Tomás Fernández-Breis. 2019. “Evaluation of ontology structural metrics based on public repository data.” Bioinformatics, February. doi:10.1093/bib/bbz009.
Hennig, Christian. 2018. R Package “Fpc”: Flexible Procedures for Clustering. https://cran.r-project.org/package=fpc.
Imbeaud, Sandrine, Esther Graudens, Virginie Boulanger, Xavier Barlet, Patrick Zaborski, Eric Eveno, Odilo Mueller, Andreas Schroeder, and Charles Auffray. 2005. “Towards Standardization of Rna Quality Assessment Using User-Independent Classifiers of Microcapillary Electrophoresis Traces.” Nucleic Acids Research 33 (6). Oxford University Press: e56–e56.
Jaccard, Paul. 1901. “Distribution de La Flore Alpine Dans Le Bassin Des Dranses et Dans Quelques Régions Voisines.” Bull Soc Vaudoise Sci Nat 37: 241–72.
Kaufman, Leonard, and Peter J Rousseeuw. 2009. Finding Groups in Data: An Introduction to Cluster Analysis. Vol. 344. John Wiley & Sons.
Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, Kurt Hornik, Matthias Studer, Pierre Roudier, Juan Gonzalez, and Kamil Kozlowski. 2018. R Package “Cluster”: “Finding Groups in Data”: Cluster Analysis Extended Rousseeuw et Al". https://cran.r-project.org/package=cluster.
Milligan, Glenn W, and Richard Cheng. 1996. “Measuring the Influence of Individual Data Points in a Cluster Analysis.” Journal of Classification 13 (2). Springer: 315–35.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2018. SummarizedExperiment: SummarizedExperiment Container.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wei, Taiyun, and Viliam Simko. 2017. R Package “Corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.