Find the complete set of guidelines and more in the GitHub repository and CodeOcean code capsule

For each broad topic below: what to be on the lookout for, the consequences, and the recommendation(s), each marked as a Requirement or a Recommendation.
Data

Be on the lookout for:
  • Data size & quality
  • Appropriate partitioning, dependence between train and test data
  • Class imbalance
  • No access to data

Consequences:
  • Data not representative of domain application
  • Unreliable or biased performance evaluation
  • Cannot check data credibility

Recommendation(s):

Data size & distribution are representative of the domain. Requirement

Independence of optimization (training) and evaluation (testing) sets. Requirement

This is especially important for meta-algorithms, where the multiple training sets must be shown to be independent of the evaluation (testing) sets.
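
As an illustration only, a minimal scikit-learn sketch of a group-aware split; the feature matrix, labels and group IDs below are synthetic placeholders, and any equivalent tooling works:

```python
# Minimal sketch: split so that related samples (same group, e.g. protein
# family or patient) never span both the training and the evaluation set.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # hypothetical feature matrix
y = rng.integers(0, 2, size=100)        # hypothetical binary labels
groups = rng.integers(0, 20, size=100)  # hypothetical group IDs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides of the split.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```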

Release data, preferably via appropriate long-term repositories, including the exact splits. Requirement
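
A minimal sketch of one way to record the exact split for deposition alongside the data; the file name and index values are hypothetical:

```python
# Minimal sketch: persist the exact train/test split so it can be deposited
# with the data (hypothetical file name and indices).
import json

split_record = {
    "train_ids": [0, 1, 2, 5, 7],   # indices or accession IDs of training samples
    "test_ids": [3, 4, 6],          # indices or accession IDs of test samples
    "random_state": 42,             # seed used to generate the split
}
with open("splits.json", "w") as fh:
    json.dump(split_record, fh, indent=2)
```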

Optimization

Be on the lookout for:
  • Overfitting, underfitting and illegal parameter tuning
  • Imprecise parameters and protocols given

Consequences:
  • Over- or under-optimistic performance reported
  • The model fits noise or misses relevant relationships
  • Results are not reproducible

Recommendation(s):

Clear statement that evaluation sets were not used for feature selection, pre-processing steps or parameter tuning. Requirement
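
One way to guarantee this is to fit all pre-processing, feature selection and tuning inside a pipeline evaluated with nested cross-validation; a minimal scikit-learn sketch on synthetic data:

```python
# Minimal sketch: scaling, feature selection and hyper-parameter tuning all
# happen inside the pipeline, so they are fit on training folds only and the
# evaluation folds never influence them.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC()),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)

# Nested cross-validation: the outer folds are never seen during tuning.
scores = cross_val_score(search, X, y, cv=5)
print(scores.mean(), scores.std())
```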

Appropriate metrics to show there is no over- or underfitting, i.e. a comparison of training and testing error. Requirement
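
A minimal sketch, on synthetic data, of reporting training and test error side by side; a large gap points to overfitting, two equally poor scores to underfitting:

```python
# Minimal sketch: report training and test accuracy together.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("training accuracy:", model.score(X_tr, y_tr))
print("test accuracy:    ", model.score(X_te, y_te))
```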

Release definitions of all algorithmic hyper-parameters, parameters and optimization protocol. Requirement
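
A minimal sketch of exporting every hyper-parameter of the final model for release; the model and file name are placeholders:

```python
# Minimal sketch: export all hyper-parameters of the final (placeholder) model
# so they can be published with the paper.
import json
from sklearn.svm import SVC

model = SVC(C=10, kernel="rbf", gamma=0.01)
with open("hyperparameters.json", "w") as fh:
    json.dump(model.get_params(), fh, indent=2, default=str)
```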

For neural networks, release definitions of the training and learning curves. Recommendation
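
A minimal sketch of recording per-epoch training and validation loss so the curves can be released; the network and data are synthetic placeholders:

```python
# Minimal sketch: record training and validation loss per epoch and save the
# learning curves to a file for release.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
train_curve, val_curve = [], []
for epoch in range(50):
    net.partial_fit(X_tr, y_tr, classes=np.unique(y))
    train_curve.append(log_loss(y_tr, net.predict_proba(X_tr)))
    val_curve.append(log_loss(y_val, net.predict_proba(X_val)))

np.savetxt("learning_curves.csv",
           np.column_stack([train_curve, val_curve]),
           delimiter=",", header="train_loss,val_loss", comments="")
```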

Include explicit model validation techniques, such as N-fold cross-validation. Recommendation
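
A minimal sketch of explicit stratified 5-fold cross-validation on synthetic data, reporting every fold rather than a single split:

```python
# Minimal sketch: stratified 5-fold cross-validation with per-fold scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("per-fold accuracy:", scores)
print("mean ± sd: %.3f ± %.3f" % (scores.mean(), scores.std()))
```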

Model

Be on the lookout for:
  • Unclear if black box or transparent model
  • No access to: resulting source code, trained models & data
  • Execution time is impractical

Consequences:
  • A transparent model shows no explainable behaviour
  • Cannot cross-compare methods, reproduce results, or check data credibility
  • Model takes too much time to produce results

Recommendation(s):

Describe the choice of black box or interpretable model. If interpretable, show examples of its explainable behaviour. Requirement
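
For an interpretable model, one illustration of "showing examples" is to report its largest coefficients; a minimal sketch with a placeholder logistic regression on synthetic features:

```python
# Minimal sketch: expose the explanation an interpretable model provides,
# here the top coefficients of a (placeholder) logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)
ranking = np.argsort(np.abs(model.coef_[0]))[::-1]
for i in ranking[:5]:
    print(f"{feature_names[i]}: coefficient = {model.coef_[0][i]:+.3f}")
```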

Release of: documented source code + models + executable + UI/webserver + software containers. Recommendation

Report execution time averaged across many repeats. If computationally demanding, compare to similar methods. Recommendation
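
A minimal sketch of timing prediction over many repeats and reporting mean ± standard deviation; the model and data are placeholders:

```python
# Minimal sketch: report execution time as mean ± standard deviation over
# many repeats instead of a single measurement.
import statistics
import timeit
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

times = timeit.repeat(lambda: model.predict(X), repeat=20, number=1)
print(f"prediction time: {statistics.mean(times):.4f} s "
      f"± {statistics.stdev(times):.4f} s over {len(times)} repeats")
```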

Evaluation

Be on the lookout for:
  • Performance measures inadequate
  • No comparisons to baselines or other methods
  • Highly variable performance

Consequences:
  • Biased performance measures reported
  • The method is falsely claimed as state-of-the-art
  • Unpredictable performance in production

Recommendation(s):

Compare with public methods & simple models (baselines). Requirement
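
A minimal sketch of reporting a trivial majority-class baseline next to the proposed model, here on synthetic data:

```python
# Minimal sketch: always report a simple baseline alongside the model,
# here scikit-learn's majority-class predictor.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("baseline accuracy:", baseline.score(X_te, y_te))
print("model accuracy:   ", model.score(X_te, y_te))
```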

Adoption of community-validated measures and benchmark datasets for evaluation. Requirement
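
As one possibility, a minimal sketch computing a few widely used, community-validated measures (MCC, balanced accuracy, AUROC) on hypothetical predictions:

```python
# Minimal sketch: report several community-validated measures rather than
# plain accuracy alone (labels, predictions and scores are hypothetical).
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.7, 0.3, 0.9, 0.8, 0.4, 0.2, 0.6, 0.1]

print("MCC:              ", matthews_corrcoef(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("AUROC:            ", roc_auc_score(y_true, y_score))
```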

Comparison of related methods and alternatives on the same dataset. Recommendation

Evaluate performance on a final independent hold-out set. Recommendation

Confidence intervals/error intervals to gauge prediction robustness. Requirement
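
A minimal sketch of a percentile bootstrap over the test set to attach a 95% confidence interval to the reported score; labels and predictions are synthetic placeholders:

```python
# Minimal sketch: percentile bootstrap confidence interval for accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # hypothetical labels
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)  # ~85% correct

scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))       # resample with replacement
    scores.append(accuracy_score(y_true[idx], y_pred[idx]))

low, high = np.percentile(scores, [2.5, 97.5])
print(f"accuracy = {accuracy_score(y_true, y_pred):.3f} (95% CI {low:.3f}-{high:.3f})")
```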