The Average Hazard Ratio – A Good Effect Measure for Time-to-event Endpoints when the Proportional Hazard Assumption is Violated?
Summary
Background: In many clinical trial applications, the endpoint of interest is a time-to-event endpoint. In this case, group differences are usually expressed by the hazard ratio and commonly assessed by the logrank test, which is optimal under the proportional hazard assumption. However, there are many situations in which this assumption is violated. Especially in applications where a full population and several subgroups, or a composite time-to-first-event endpoint and several of its components, are considered, the proportional hazard assumption usually does not hold simultaneously for all test problems under investigation. As an alternative effect measure, Kalbfleisch and Prentice proposed the so-called ‘average hazard ratio’. The average hazard ratio is based on a flexible weighting function that modifies the influence of time, and it has a meaningful interpretation even in the case of non-proportional hazards. Despite this favorable property, it is hardly ever used in practice, whereas the standard hazard ratio is commonly reported in clinical trials regardless of whether the proportional hazard assumption holds.
Objectives: There are two main approaches to constructing estimators and tests for the average hazard ratio: the first relies on weighted Cox regression and the second on a simple plug-in estimator. The aim of this work is to give a systematic comparison of these two approaches and the standard logrank test for different time-to-event settings with proportional and non-proportional hazards, and to illustrate the pros and cons in application.
Methods: We conduct a systematic comparative study based on Monte-Carlo simulations and a real clinical trial example.
Results: Our results suggest that the properties of the average hazard ratio depend on the underlying weighting function. The two approaches to constructing estimators and related tests show very similar performance for adequately chosen weights. In general, the average hazard ratio is a more valid effect measure than the standard hazard ratio under non-proportional hazards, and the corresponding tests provide a power advantage over the common logrank test.
Conclusions: As non-proportional hazards are often encountered in clinical practice and the average hazard ratio tests often outperform the common logrank test, this approach should be used more routinely in applications.
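To make the weighted definition concrete, the following minimal sketch evaluates a population-level average hazard ratio numerically for two known hazard functions. It is an illustration under stated assumptions, not either of the two estimation approaches compared in this work: the function name `average_hazard_ratio`, its parameters, and the weighting distribution G(t) = 1 - S1(t)S2(t) are choices made here for illustration only. With that weight, the quantity reduces to the ratio of the integrals of h1(t)S1(t)S2(t) and h2(t)S1(t)S2(t).

```python
import math


def average_hazard_ratio(hazard1, hazard2, t_max, n_steps=20000):
    """Approximate the average hazard ratio on [0, t_max].

    Assumption (not from the source): the weighting distribution is
    G(t) = 1 - S1(t) * S2(t), so dG(t) = (h1 + h2) * S1 * S2 dt and
    AHR = int h1 * S1 * S2 dt / int h2 * S1 * S2 dt.
    hazard1 and hazard2 are callables returning the hazard at time t.
    """
    dt = t_max / n_steps
    cum1 = cum2 = 0.0  # cumulative hazards at the left edge of each step
    num = den = 0.0
    for i in range(n_steps):
        t_mid = (i + 0.5) * dt  # midpoint rule
        h1, h2 = hazard1(t_mid), hazard2(t_mid)
        # Joint survival S1 * S2 evaluated at the midpoint of the step.
        joint_surv = math.exp(-(cum1 + 0.5 * h1 * dt + cum2 + 0.5 * h2 * dt))
        num += h1 * joint_surv * dt
        den += h2 * joint_surv * dt
        cum1 += h1 * dt
        cum2 += h2 * dt
    return num / den


# Proportional hazards: the average hazard ratio equals the hazard ratio.
ahr_ph = average_hazard_ratio(lambda t: 0.2, lambda t: 0.1, t_max=50.0)

# Non-proportional hazards: the hazard ratio is 2 before t = 5 and 0.5 after.
ahr_nph = average_hazard_ratio(
    lambda t: 0.2 if t < 5.0 else 0.05, lambda t: 0.1, t_max=50.0
)
```

Under proportional hazards the result coincides with the constant hazard ratio, whereas in the non-proportional case it lies between the smallest and largest hazard ratio over time, with the weighting function determining how strongly each time period contributes.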