Using machine learning to estimate the heterogeneity of treatment effects (HTE) in randomized experiments is a common practice when seeking to understand how units differ in their response to a treatment. These HTE models are often used by Web platforms to construct personalized policies (for instance, by enabling a feature only for those users who are estimated to have a positive treatment effect). Unfortunately, many HTE models come with no guarantee of calibration for subgroup effects: that effect estimates of a particular size will, on average, be that size. We provide a simple way to calibrate black-box estimates of HTEs to known unbiased average effect estimates, ensuring that their sign and magnitude approximate experimental benchmarks. Our method is broadly in the vein of the popular Platt (1999) scaling technique used in supervised learning. It requires no data beyond that necessary for estimating HTEs, and it scales trivially to arbitrarily large datasets. Our technique enables the use of stacking to ensemble estimates from multiple HTE models based on their out-of-sample estimates, allowing for better performance than the constituent models in the ensemble. Simulation evidence shows that our method is effective at improving HTE estimates. We show that this matters in the real world by applying our method to an experiment run at Facebook.
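To illustrate the flavor of Platt-style recalibration described above, the sketch below fits a simple linear map from black-box HTE scores to unbiased effect estimates. This is a minimal, hypothetical example, not the paper's exact estimator: the simulated experiment, the deliberately miscalibrated scores, and the use of Horvitz–Thompson pseudo-outcomes (which are unbiased for the conditional average treatment effect under random assignment) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated randomized experiment (hypothetical data, assignment prob p = 0.5).
n = 10_000
x = rng.normal(size=n)
tau_true = 0.5 + x                         # true heterogeneous treatment effect
w = rng.binomial(1, 0.5, size=n)           # random treatment assignment
y = x + w * tau_true + rng.normal(size=n)  # observed outcome

# A deliberately miscalibrated black-box HTE score: right ordering of units,
# wrong scale and offset, standing in for any model's raw output.
tau_hat = 0.1 * tau_true + 2.0 + rng.normal(scale=0.1, size=n)

# Horvitz-Thompson pseudo-outcome: its conditional expectation given x equals
# the true effect, so it serves as an unbiased calibration target.
p = 0.5
pseudo = y * (w / p - (1 - w) / (1 - p))

# Platt-style linear recalibration: tau_cal = a + b * tau_hat, fit by least
# squares of the unbiased pseudo-outcomes on the raw scores.
b, a = np.polyfit(tau_hat, pseudo, 1)
tau_cal = a + b * tau_hat

# With an intercept, least squares matches means: the average calibrated
# effect equals the unbiased experimental estimate of the average effect.
print(np.mean(tau_cal), np.mean(pseudo))
```

Because the calibration regression includes an intercept, the mean of the calibrated estimates equals the unbiased average-effect estimate exactly, which is the sense in which magnitudes are anchored to experimental benchmarks.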