Find the complete set of guidelines and more in the GitHub repository and CodeOcean code capsule

For each broad topic below: what to be on the lookout for, the consequences, and the recommendation(s), each marked as a Requirement or a Recommendation.
Data

Be on the lookout for:
  • Data size & quality
  • Appropriate partitioning, dependence between train and test data
  • Class imbalance
  • No access to data

Consequences:
  • Data not representative of domain application
  • Unreliable or biased performance evaluation
  • Cannot check data credibility

Recommendation(s):

Data size & distribution are representative of the domain. Requirement

Independence of optimization (training) and evaluation (testing) sets. Requirement

This is especially important for meta-algorithms, where the multiple training sets must be shown to be independent of the evaluation (testing) sets.
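
As an illustration only, a minimal scikit-learn sketch of a group-aware split; the feature matrix, labels and group IDs below are synthetic placeholders, and any equivalent tooling works:

```python
# Minimal sketch: split so that related samples (same group, e.g. protein
# family or patient) never span both the training and the evaluation set.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # hypothetical feature matrix
y = rng.integers(0, 2, size=100)        # hypothetical binary labels
groups = rng.integers(0, 20, size=100)  # hypothetical group IDs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides of the split.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```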

Release data, preferably via appropriate long-term repositories, including the exact splits. Requirement
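
A minimal sketch of one way to record the exact split for deposition alongside the data; the file name and index values are hypothetical:

```python
# Minimal sketch: persist the exact train/test split so it can be deposited
# with the data (hypothetical file name and indices).
import json

split_record = {
    "train_ids": [0, 1, 2, 5, 7],   # indices or accession IDs of training samples
    "test_ids": [3, 4, 6],          # indices or accession IDs of test samples
    "random_state": 42,             # seed used to generate the split
}
with open("splits.json", "w") as fh:
    json.dump(split_record, fh, indent=2)
```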

Optimization

Be on the lookout for:
  • Overfitting, underfitting and illegal parameter tuning
  • Imprecise parameters and protocols given

Consequences:
  • Over- or under-optimistic performance reported
  • The model fits noise or misses relevant relationships
  • Results are not reproducible

Recommendation(s):

Clear statement that evaluation sets were not used for feature selection, pre-processing steps or parameter tuning. Requirement
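
One way to guarantee this is to fit all pre-processing, feature selection and tuning inside a pipeline evaluated with nested cross-validation; a minimal scikit-learn sketch on synthetic data:

```python
# Minimal sketch: scaling, feature selection and hyper-parameter tuning all
# happen inside the pipeline, so they are fit on training folds only and the
# evaluation folds never influence them.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC()),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)

# Nested cross-validation: the outer folds are never seen during tuning.
scores = cross_val_score(search, X, y, cv=5)
print(scores.mean(), scores.std())
```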

Appropriate metrics to show there is no over- or underfitting, i.e. a comparison of training and testing error. Requirement
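
A minimal sketch, on synthetic data, of reporting training and test error side by side; a large gap points to overfitting, two equally poor scores to underfitting:

```python
# Minimal sketch: report training and test accuracy together.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("training accuracy:", model.score(X_tr, y_tr))
print("test accuracy:    ", model.score(X_te, y_te))
```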

Release definitions of all algorithmic hyper-parameters, parameters and optimization protocol. Requirement
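
A minimal sketch of exporting every hyper-parameter of the final model for release; the model and file name are placeholders:

```python
# Minimal sketch: export all hyper-parameters of the final (placeholder) model
# so they can be published with the paper.
import json
from sklearn.svm import SVC

model = SVC(C=10, kernel="rbf", gamma=0.01)
with open("hyperparameters.json", "w") as fh:
    json.dump(model.get_params(), fh, indent=2, default=str)
```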

For neural networks, release definitions of the training and learning curves. Recommendation
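
A minimal sketch of recording per-epoch training and validation loss so the curves can be released; the network and data are synthetic placeholders:

```python
# Minimal sketch: record training and validation loss per epoch and save the
# learning curves to a file for release.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
train_curve, val_curve = [], []
for epoch in range(50):
    net.partial_fit(X_tr, y_tr, classes=np.unique(y))
    train_curve.append(log_loss(y_tr, net.predict_proba(X_tr)))
    val_curve.append(log_loss(y_val, net.predict_proba(X_val)))

np.savetxt("learning_curves.csv",
           np.column_stack([train_curve, val_curve]),
           delimiter=",", header="train_loss,val_loss", comments="")
```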

Include explicit model validation techniques, such as N-fold cross-validation. Recommendation
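
A minimal sketch of explicit stratified 5-fold cross-validation on synthetic data, reporting every fold rather than a single split:

```python
# Minimal sketch: stratified 5-fold cross-validation with per-fold scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("per-fold accuracy:", scores)
print("mean ± sd: %.3f ± %.3f" % (scores.mean(), scores.std()))
```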

Model

Be on the lookout for:
  • Unclear if black box or transparent model
  • No access to: resulting source code, trained models & data
  • Execution time is impractical

Consequences:
  • A transparent model shows no explainable behaviour
  • Cannot cross-compare methods, reproduce results, or check data credibility
  • Model takes too much time to produce results

Recommendation(s):

Describe the choice of black box or interpretable model. If interpretable, show examples of its explainable behaviour. Requirement
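
For an interpretable model, one illustration of "showing examples" is to report its largest coefficients; a minimal sketch with a placeholder logistic regression on synthetic features:

```python
# Minimal sketch: expose the explanation an interpretable model provides,
# here the top coefficients of a (placeholder) logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)
ranking = np.argsort(np.abs(model.coef_[0]))[::-1]
for i in ranking[:5]:
    print(f"{feature_names[i]}: coefficient = {model.coef_[0][i]:+.3f}")
```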

Release of: documented source code + models + executable + UI/webserver + software containers. Recommendation

Report execution time averaged across many repeats. If computationally demanding, compare to similar methods. Recommendation
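
A minimal sketch of timing prediction over many repeats and reporting mean ± standard deviation; the model and data are placeholders:

```python
# Minimal sketch: report execution time as mean ± standard deviation over
# many repeats instead of a single measurement.
import statistics
import timeit
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

times = timeit.repeat(lambda: model.predict(X), repeat=20, number=1)
print(f"prediction time: {statistics.mean(times):.4f} s "
      f"± {statistics.stdev(times):.4f} s over {len(times)} repeats")
```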

Evaluation

Be on the lookout for:
  • Performance measures inadequate
  • No comparisons to baselines or other methods
  • Highly variable performance

Consequences:
  • Biased performance measures reported
  • The method is falsely claimed as state-of-the-art
  • Unpredictable performance in production

Recommendation(s):

Compare with public methods & simple models (baselines). Requirement
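
A minimal sketch of reporting a trivial majority-class baseline next to the proposed model, here on synthetic data:

```python
# Minimal sketch: always report a simple baseline alongside the model,
# here scikit-learn's majority-class predictor.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("baseline accuracy:", baseline.score(X_te, y_te))
print("model accuracy:   ", model.score(X_te, y_te))
```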

Adoption of community-validated measures and benchmark datasets for evaluation. Requirement
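
As one possibility, a minimal sketch computing a few widely used, community-validated measures (MCC, balanced accuracy, AUROC) on hypothetical predictions:

```python
# Minimal sketch: report several community-validated measures rather than
# plain accuracy alone (labels, predictions and scores are hypothetical).
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.7, 0.3, 0.9, 0.8, 0.4, 0.2, 0.6, 0.1]

print("MCC:              ", matthews_corrcoef(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("AUROC:            ", roc_auc_score(y_true, y_score))
```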

Comparison of related methods and alternatives on the same dataset. Recommendation

Evaluate performance on a final independent hold-out set. Recommendation

Confidence intervals/error intervals to gauge prediction robustness. Requirement
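
A minimal sketch of a percentile bootstrap over the test set to attach a 95% confidence interval to the reported score; labels and predictions are synthetic placeholders:

```python
# Minimal sketch: percentile bootstrap confidence interval for accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # hypothetical labels
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)  # ~85% correct

scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))       # resample with replacement
    scores.append(accuracy_score(y_true[idx], y_pred[idx]))

low, high = np.percentile(scores, [2.5, 97.5])
print(f"accuracy = {accuracy_score(y_true, y_pred):.3f} (95% CI {low:.3f}-{high:.3f})")
```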