A significant challenge in high-throughput phenotyping of in vivo knockout mice is ensuring that phenotype calls are robust and reliable. Central to this problem is selecting a statistical analysis that models both the experimental design (the workflow, and the way control mice are selected for comparison with knockout animals) and the sources of variation. We recently proposed a mixed model suitable for small batch-oriented studies in which controls are not phenotyped concurrently with mutants. Here we evaluate this method, across a range of workflows used at mouse phenotyping centers, both for its sensitivity to detect phenotypic effects and for its control of false positives. We found that sensitivity and false-positive control depend on the workflow. We show that phenotypes in control mice fluctuate unexpectedly between batches, and that this can inflate the false-positive rate of phenotype calls when only a small number of batches are tested, because the effect of the knockout becomes confounded with temporal fluctuations in the control mice. This effect was observed in both behavioural and physiological assays. Based on this analysis, we recommend two robust approaches (workflow and accompanying control strategy) and associated analyses for use in high-throughput phenotyping pipelines. Our results demonstrate the importance of modelling all sources of variability in high-throughput phenotyping studies.
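As a hypothetical illustration (not the authors' code), the kind of mixed model described above can be sketched with `statsmodels`: genotype enters as a fixed effect and batch as a random intercept, so that between-batch fluctuations in controls are absorbed by the random effect rather than being confounded with the knockout effect. All data, effect sizes, and group sizes below are simulated assumptions for demonstration only.

```python
# Sketch of a linear mixed model for batch-oriented phenotyping data:
# fixed effect = genotype, random intercept = batch.
# Simulated data; numbers are illustrative assumptions, not study values.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for b in range(20):  # 20 batches; more batches help separate batch from genotype
    batch_shift = rng.normal(0.0, 0.5)  # temporal fluctuation shared within a batch
    for genotype, n in (("control", 6), ("knockout", 2)):
        effect = 1.0 if genotype == "knockout" else 0.0  # assumed true knockout effect
        for _ in range(n):
            rows.append({
                "batch": f"b{b}",
                "genotype": genotype,
                "phenotype": 10.0 + effect + batch_shift + rng.normal(0.0, 1.0),
            })
df = pd.DataFrame(rows)

# Fit: phenotype ~ genotype with a random intercept per batch
result = smf.mixedlm("phenotype ~ genotype", df, groups=df["batch"]).fit()
print(result.summary())
```

With very few batches, the batch variance is poorly estimated and the genotype effect becomes entangled with batch-to-batch drift, which is one way the false-positive inflation described above can arise.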