METRICS Tool v1.0

Please answer all conditions for the relevant sections first, then all active items, to calculate the METRICS score.
Please note that the default option is "No".
? : Stands for explanation of items and conditions.
C : Stands for conditional items or sections.
Column headers: Items/Conditions | Definitions | Weights | Options
Study Design
Item#1
?>>>Whether any guideline or checklist (e.g., the CLEAR checklist) is used in designing and reporting, as appropriate for the study design (e.g., hand-crafted radiomics or deep learning pipeline).
Adherence to radiomics and/or machine learning-specific checklists or guidelines
Yes No
Item#2
?>>>Whether inclusion and exclusion criteria are explicitly defined. These should lead to a representative study sample that matches the general population of interest for the study aim.
Eligibility criteria that describe a representative study population
Yes No
Item#3
?>>>Whether the reference standard or outcome measure is robust and representative of current clinical practice. Preferred examples of high-quality reference standards are histopathology, well-established clinical and genomic markers, the latest versions of prognostic tools, guideline-based follow-up, or consensus-based expert opinions. Examples of poor-quality reference standards are those based on qualitative image evaluation, on images that are later used for feature extraction, or on outdated versions of prognostic tools.
High-quality reference standard with a clear definition
Yes No
Imaging Data
Item#4
?>>>Whether more than one institution is involved as a diagnostic imaging data source for radiomics analysis.
Multi-center
Yes No
Item#5
?>>>Whether the source of the radiomics data is an imaging technique that reflects established standardization approaches, such as acquisition protocol guidelines (e.g., PI-RADS specifications).
Clinical translatability of the imaging data source for radiomics analysis
Yes No
Item#6
?>>>Whether the image acquisition protocol is clearly reported to ensure the replicability of the method.
Imaging protocol with acquisition parameters
Yes No
Item#7
?>>>Whether the time interval between the diagnostic imaging exams (used as input for the radiomics analysis) and the acquisition of the outcome measure/reference standard is appropriate to validate the presence or absence of the target conditions at the time of the diagnostic imaging exams.
The interval between imaging used and reference standard
Yes No
Segmentation
C>>>Please Note: This entire section is conditional. It is applicable only for studies including region/volume of interest labeling or segmentation. If the study includes no segmentation, the exclusion of this section from scoring will not affect the final score on the percentage scale.
Condition#1
?>>>"Segmentation" refers to i) Fine delineation of a region or volume of interest; ii) Rough delineation with bounding boxes; or, iii) cropping the image around a region of interest.
Does the study include segmentation?
Yes No
Condition#2
?>>>"Fully automated segmentation" refers to segmentation process without any human intervention.
Does the study include fully automated segmentation?
Yes No
Item#8
?>>>Whether the rules or the method of the segmentation are defined (e.g., margin shrinkage, peri-tumoral sampling, details of segmentation regardless of whether manual, semi-automated or automated methods are used). In the case of DL-based radiomics, the segmentation can refer to the rough delineation with bounding boxes or cropping the image around a region of interest.
Transparent description of segmentation methodology
Yes No
Item#9
?>>>If a segmentation technique that does not require any human intervention is used, examples of its results should be presented, and a formal assessment of its accuracy against domain expert annotations should be included in the study (e.g., DICE score or Jaccard index compared with a radiologist's semantic annotation; see the sketch after this item). Any intervention on the annotation in terms of volume or area should be considered use of a semi-automated segmentation technique. This item also applies to the use of segmentation models previously validated on other datasets.
Formal evaluation of fully automated segmentation
C>>>Please Note: This item is conditional. It is applicable only for studies implementing automated segmentation. In the case of manual or semi-automated segmentation, the exclusion of this item from scoring will not affect the final score on the percentage scale.
Yes No
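A minimal sketch for Item#9, assuming the predicted and reference masks are NumPy arrays of identical shape; the function and variable names are illustrative assumptions, not part of METRICS:

```python
# Hypothetical sketch: overlap metrics between an automated segmentation
# and a domain expert's reference annotation.
import numpy as np

def dice_score(pred: np.ndarray, ref: np.ndarray) -> float:
    """DICE = 2|A ∩ B| / (|A| + |B|)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

def jaccard_index(pred: np.ndarray, ref: np.ndarray) -> float:
    """Jaccard = |A ∩ B| / |A ∪ B|."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    return np.logical_and(pred, ref).sum() / union if union else 1.0
```

In practice, these scores would be reported per case (e.g., mean with SD across the test set) rather than as a single pooled value.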
Item#10
?>>>Whether the final segmentation in the test set is produced by a single reader (manually or with a semi-automated tool) or by an entirely automated tool, to better reflect clinical practice.
Test set segmentation masks produced by a single reader or automated tool
Yes No
Image Processing and Feature Extraction
Condition#3
?>>>"Hand-crafted radiomic features" (i.e., traditional radiomic features) are created in advance by human experts or mathematicians.
Does the study include hand-crafted feature extraction?
Yes No
Item#11
?>>>Whether preprocessing of the images is appropriately performed considering the imaging modality (e.g., gray level normalization for MRI, image registration in the case of multiple contrasts or modalities) and the feature extraction techniques (i.e., 2D or 3D) that are used. For instance, in the case of a large slice thickness (e.g., ≥5 mm), extreme upsampling of the volume (e.g., to 1 × 1 × 1 mm³) might be inappropriate. In such a case, 2D feature extraction could be preferable, ensuring in-plane isotropy of the pixels. On the other hand, isotropic voxels should be targeted in 3D feature extraction, to allow for rotational invariance of texture features. Also, whether gray level discretization parameters (bin width, along with the resulting gray level range, or bin count) are described in full detail (see the sketch after this item). A description of the different image types used (e.g., original, filtered) should also be included (e.g., high and low pass filter combinations for wavelet decomposition, sigma values for Laplacian of Gaussian edge enhancement filtering). If a fixed image window is used, it should be stated.
Appropriate use of image preprocessing techniques with transparent description
Yes No
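A minimal sketch of the two gray level discretization strategies mentioned in Item#11; the array name, the bin width of 25, and the bin count of 32 are illustrative assumptions:

```python
# Hypothetical sketch: gray level discretization of ROI voxel intensities.
import numpy as np

def discretize_fixed_bin_width(roi, bin_width=25.0):
    # Bin indices start at 1; the resulting gray level range depends on the
    # ROI intensity range, which is why Item#11 asks for it to be reported.
    return (np.floor((roi - roi.min()) / bin_width) + 1).astype(int)

def discretize_fixed_bin_count(roi, n_bins=32):
    # Assumes roi.max() > roi.min(); the maximum intensity is mapped into
    # the top bin so that exactly n_bins gray levels result.
    binned = np.floor(n_bins * (roi - roi.min()) / (roi.max() - roi.min())) + 1
    return np.clip(binned, 1, n_bins).astype(int)
```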
Item#12
?>>>Whether standardized software (e.g., compliant with IBSI) was used for feature extraction, including its version number.
Use of standardized feature extraction software
C>>>Please Note: This item is conditional. It is applicable only for studies with hand-crafted radiomic features. In the case of DL studies, the exclusion of this item from scoring will not affect the final score on the percentage scale.
Yes No
Item#13
?>>>Whether feature types (e.g., hand-crafted, deep features) and feature classes (for hand-crafted features) are described, and whether a default configuration statement is provided for the remaining feature extraction parameters. A file presenting the complete configuration of these settings should be included in the study materials (e.g., a parameter file such as in YAML format, or a screenshot if the software offers no dedicated file). In the case of DL, the neural network architecture along with all image operations should be described.
Transparent reporting of feature extraction parameters, otherwise providing a default configuration statement
Yes No
Feature Processing
Condition#4
?>>>"Tabular data" refers to data that is organized in a table with rows and columns.
Does the study include tabular data?
Yes No
Condition#5
?>>>"End-to-end deep learning" refers to the use of deep learning to directly process the image data and produce a classification or regression model.
Does the study include end-to-end deep learning?
Yes No
Item#14
?>>>Whether unstable features are removed via test-retest analysis, reproducibility analysis across different segmentations, or stability analysis (i.e., image perturbations). Instability may be caused by random noise introduced by manual or even automated image segmentation, or may be exposed in a scan-rescan setting. The specific methods used should be clearly presented, with specific results for each component in multi-step feature removal pipelines (see the sketch after this item).
Removal of non-robust features
C>>>Please Note: This item is conditional. It is applicable only for studies with tabular data (i.e., numeric radiomic features in a tabulated format, as usually seen in hand-crafted and some deep learning-based studies using deep features).
Yes No
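A minimal sketch of one possible stability filter, using the concordance correlation coefficient (CCC) between matched test and retest feature values; the 0.90 cutoff and the array names are illustrative assumptions (ICC is an equally common choice):

```python
# Hypothetical sketch: keep only features reproducible across a test-retest
# (or re-segmentation) pair of extractions.
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    # Lin's concordance correlation coefficient.
    cov = np.cov(x, y, ddof=1)[0, 1]
    return 2 * cov / (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2)

def stable_feature_mask(X_a, X_b, cutoff=0.90):
    # X_a, X_b: (n_patients, n_features) matched feature matrices.
    return np.array([ccc(X_a[:, j], X_b[:, j]) >= cutoff
                     for j in range(X_a.shape[1])])
```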
Item#15
?>>>Whether dimensionality is reduced by selecting the most informative features, such as with algorithm-based feature selection (e.g., LASSO coefficients, Random Forest feature importance), univariate correlation, collinearity, or variance analysis. The specific methods used should be clearly presented, with specific results for each component in multi-step feature removal pipelines (see the sketch after this item).
Removal of redundant features
C>>>Please Note: This item is conditional. It is applicable only for studies with tabular data (i.e., numeric radiomic features in a tabulated format, as usually seen in hand-crafted and some deep learning-based studies using deep features).
Yes No
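A minimal sketch of one of the options named in Item#15, LASSO-based selection with scikit-learn; variable names are illustrative and, per Item#18, X_train/y_train must come from the training split only (for classification tasks, L1-penalized logistic regression would replace LassoCV):

```python
# Hypothetical sketch: retain only features with non-zero LASSO coefficients.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
pipe.fit(X_train, y_train)  # a continuous outcome is assumed here

selected = np.flatnonzero(pipe.named_steps["lassocv"].coef_)
print(f"{selected.size} of {X_train.shape[1]} features retained")
```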
Item#16
?>>>Whether the number of instances and features in the final training data set is appropriate for the research question and modeling algorithm. This should be demonstrated by statistical means, empirically through consistency of performance in validation and testing, or based on previous evidence in the literature (see the sketch after this item).
Appropriateness of dimensionality compared to data size
C>>>Please Note: This item is conditional. It is applicable only for studies with tabular data (i.e., numeric radiomic features in a tabulated format, as usually seen in hand-crafted and some deep learning-based studies using deep features).
Yes No
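A minimal sketch of a quick dimensionality check using the common events-per-variable heuristic (roughly 10 events per candidate feature); the heuristic and variable names are illustrative assumptions, as METRICS itself accepts statistical, empirical, or literature-based justification:

```python
# Hypothetical sketch: back-of-the-envelope events-per-variable (EPV) check.
# X_train, y_train are assumed NumPy arrays with a binary (0/1) outcome.
n_events = int(min(y_train.sum(), (1 - y_train).sum()))  # minority-class count
epv = n_events / X_train.shape[1]
print(f"Events per variable: {epv:.1f} (a common rule of thumb targets >= 10)")
```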
Item#17
?>>>Whether the consistency of the DL pipeline's performance has been assessed in a test-retest setting, for example with a scan-rescan approach, segmentations by different readers, or stability analysis (i.e., image perturbations). The specific methods used should be clearly presented.
Robustness assessment of end-to-end deep learning pipelines
C>>>Please Note: This item is conditional. It is applicable only for studies employing a DL pipeline for the entire feature extraction, processing, and modeling process (e.g., computer vision models such as convolutional networks or transformers), without explicit conversion of DL-based image features to a tabular format for further processing or analysis.
Yes No
Preparation for Modeling
Item#18
?>>>Whether the training-validation-test data split is done at the very beginning of the analysis pipeline, prior to any processing step. The data split should be random but reproducible (e.g., with a fixed random seed), preferably without altering the outcome variable distribution in the test set (e.g., using a stratified data split). Moreover, the data split should be at the patient level, not the scan level (i.e., different scans of the same patient should be in the same set). Proper data partitioning should guarantee that all data processing (e.g., scaling, missing value imputation, oversampling or undersampling) is done blinded to the test set data. These techniques should be fitted exclusively on the training (or development) data sets and then used to transform the test data at the time of inference. If a resampling technique (e.g., cross-validation) is used instead of a single training-validation split, test data should still be handled separately from the resampling procedure (see the sketch after this item).
Proper data partitioning process
Yes No
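A minimal sketch of a reproducible, patient-level split with preprocessing fitted on training data only; array names are illustrative (one row per scan, grouped by patient):

```python
# Hypothetical sketch: patient-level, seeded split; scaling blinded to test set.
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

scaler = StandardScaler().fit(X[train_idx])  # fitted on training data only
X_train, X_test = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
y_train, y_test = y[train_idx], y[test_idx]
```

GroupShuffleSplit does not stratify by outcome; scikit-learn's StratifiedGroupKFold can be used when a stratified, grouped split is needed.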
Item#19
?>>>Whether potential confounding factors were analyzed, identified if present, and removed if necessary (e.g., if they have a strong influence on generalizability). These may include different distributions of patient characteristics (e.g., gender, lesion stage or grade) across sites or scanners (see the sketch after this item).
Handling of confounding factors
Yes No
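A minimal sketch of a simple confounder screen, testing whether the outcome distribution differs across sites or scanners; variable names are illustrative, and a full confounding analysis would go beyond a single association test:

```python
# Hypothetical sketch: chi-squared test of outcome vs. acquisition site.
import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(pd.Series(site_labels, name="site"),
                    pd.Series(y, name="outcome"))
chi2, p, dof, _ = chi2_contingency(table)
print(f"Outcome vs. site: chi2={chi2:.2f}, p={p:.3f}")  # small p suggests imbalance
```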
Metrics and Comparison
Item#20
?>>>Whether appropriate accuracy metrics are reported, such as the area under the curve (AUC) for receiver operating characteristic (ROC) or precision-recall (PRC) curves and confusion matrix-derived accuracy metrics (e.g., specificity, sensitivity, precision, F1 score) for classification tasks, and MSE, RMSE, or MAE for regression tasks. For classification tasks, the confusion matrix should always be included to allow the calculation of additional metrics. If a DL network is trained, loss curves should be presented (see the sketch after this item).
Use of appropriate performance evaluation metrics for task
Yes No
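A minimal sketch of the classification metrics named in Item#20 with scikit-learn; the fitted model, the test arrays, and the fixed 0.5 threshold are illustrative assumptions:

```python
# Hypothetical sketch: ROC/PRC AUCs plus confusion matrix-derived metrics.
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, roc_auc_score)

y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)  # illustrative fixed threshold

print("ROC AUC:", roc_auc_score(y_test, y_prob))
print("PRC AUC (average precision):", average_precision_score(y_test, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))  # always report
print(classification_report(y_test, y_pred))  # sensitivity/recall, precision, F1
```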
Item#21
?>>>Whether uncertainty measures are included in the analysis, such as the 95% confidence interval (CI), standard deviation (SD), or standard error (SE). The methodology used to derive these measures (e.g., bootstrapping with replacement) should be reported (see the sketch after this item).
Consideration of uncertainty
Yes No
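A minimal sketch of a bootstrap 95% CI for the AUC; the 2000-resample count is an arbitrary illustrative choice, and y_test/y_prob are assumed NumPy arrays from the previous sketch:

```python
# Hypothetical sketch: percentile bootstrap (resampling with replacement).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_test), len(y_test))  # resample with replacement
    if len(np.unique(y_test[idx])) < 2:              # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y_test[idx], y_prob[idx]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC 95% CI: [{lo:.3f}, {hi:.3f}]")
```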
Item#22
?>>>Whether the final model's calibration is assessed (see the sketch after this item).
Calibration assessment
Yes No
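A minimal sketch of calibration assessment via a reliability curve and the Brier score; y_test/y_prob are assumed from the earlier sketches:

```python
# Hypothetical sketch: calibration curve data plus an overall Brier score.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=10)
print("Brier score:", brier_score_loss(y_test, y_prob))
# A well-calibrated model has frac_pos close to mean_pred in every bin;
# plotting the pairs against the diagonal yields the calibration plot.
```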
Item#23
?>>>Use of a single imaging set (such as a single MRI sequence rather than multiple, or a single phase in a dynamic contrast-enhanced scan) should be preferred, as multi-parametric imaging may unnecessarily increase data dimensionality and risk of overfitting. Therefore, in the case of multi-parametric studies, uni-parametric evaluations should also be performed to justify the need for a multi-parametric approach by formally comparing their performance (e.g., DeLong’s or McNemar’s tests). This item is also intended to reward studies that experimentally justify the use of more complex models compared to simpler alternatives, in regard to input data type.
Use of uni-parametric imaging or proof of its inferiority
Yes No
Item#24
?>>>Whether a non-radiomic method that is representative of the clinical practice is included in the analysis for comparison purposes. Non-radiomic methods might include semantic features, RADS or RECIST scoring, and simple volume or size evaluations. If no non-radiomics method is available, proof of improved diagnostic accuracy (e.g., improved performance of a radiologist assisted by the model’s output) or patient outcome (e.g., decision analysis, overall survival) should be provided. In any case, the comparison should be done with an appropriate statistical method to evaluate the added practical and clinical value of the model (e.g., DeLong’s test for AUC comparison, decision curve analysis for net benefit comparison, Net Reclassification Index). Furthermore, in case of multiple comparisons, multiple testing correction methods (e.g., Bonferroni) should be considered in order to reduce the false discovery rate provided that the statistical comparison is done with a frequentist approach (rather than Bayesian).
Comparison with a non-radiomic approach or proof of added clinical value
Yes No
Item#25
?>>>Whether a comparison with a simple baseline reference model (such as a Zero Rules/No Information Rate classifier) was performed. The use of machine learning methods should be justified by proof of increased performance. In any case, the comparison should be done with an appropriate statistical method (e.g., DeLong's test for AUC comparison, Net Reclassification Index), as in the sketch after this item. Furthermore, in the case of multiple comparisons, multiple testing correction methods (e.g., Bonferroni, Benjamini–Hochberg, or Tukey) should be considered in order to reduce the false discovery rate, provided that the statistical comparison is done with a frequentist approach (rather than Bayesian).
Comparison with simple or classical statistical models
Yes No
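A minimal sketch comparing a model against a No Information Rate-style baseline, with McNemar's test on the paired predictions; DeLong's test has no scikit-learn implementation, so McNemar's test is shown here as one of the statistical options Item#25 allows (y_pred is assumed from the Item#20 sketch):

```python
# Hypothetical sketch: DummyClassifier baseline plus McNemar's paired test.
import numpy as np
from sklearn.dummy import DummyClassifier
from statsmodels.stats.contingency_tables import mcnemar

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_base = baseline.predict(X_test)

model_ok, base_ok = (y_pred == y_test), (y_base == y_test)
table = np.array([[np.sum(model_ok & base_ok),  np.sum(model_ok & ~base_ok)],
                  [np.sum(~model_ok & base_ok), np.sum(~model_ok & ~base_ok)]])
print(mcnemar(table, exact=True))  # p-value for the paired accuracy difference
```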
Testing
Item#26
?>>>Whether the model is tested on an independent data set that is sampled from the same source as the training and/or validation sets.
Internal testing
Yes No
Item#27
?>>>Whether the model is tested with independent data from other institution(s). This also applies to studies validating previously published models trained at another institution.
External testing
Yes No
Open Science
Item#28
?>>>Whether any imaging, segmentation, clinical, or radiomics analysis data is shared with the public.
Data availability
Yes No
Item#29
?>>>Whether all scripts related to automatic segmentation and/or modeling are shared with the public. These should include clear instructions for their implementation (e.g., accompanying documentation, tutorials).
Code availability
Yes No
Item#30
?>>>Whether the final model is shared in the form of a raw model file or as a ready-to-use system. If automated segmentation was employed, the corresponding trained model should also be made available to allow replication. These should include clear instructions for their usage (e.g., accompanying documentation, tutorials).
Model availability
Yes No
Total METRICS score: 0.0%
?>>>0% ≤ score < 20%: "very low"; 20% ≤ score < 40%: "low";
40% ≤ score < 60%: "moderate"; 60% ≤ score < 80%: "good";
80% ≤ score ≤ 100%: "excellent" quality.
Quality category:
?>>>Enter an ID for the publication (e.g., Pub1, Pub2; the paper's DOI; or first author name and year). This field is OPTIONAL but useful for managing a series of papers, such as in systematic reviews.
Publication ID:

If you publish any work which uses this tool, please cite the following publication:

Kocak B, Akinci D'Antonoli T, Mercaldo N, et al. METhodological RadiomICs Score (METRICS):
a quality scoring tool for radiomics research endorsed by EuSoMII. Insights Imaging. 2024;15(1):8.
Published 2024 Jan 17. doi:10.1186/s13244-023-01572-w