Why don’t we frame everything in terms of a linear regression model?
$$ y_i = \alpha + \beta_1 D_i + \beta_2X_i + \varepsilon_i $$
Identification in this model depends on an assumption about $\text{Cov}(X_i, \varepsilon_i)$, typically that it equals zero.
It’s not clear how $\varepsilon_i$ relates to the potential outcomes, so it’s hard to reason about $\varepsilon_i$ and, in turn, about $\text{Cov}(X_i, \varepsilon_i)$.
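
To make the point concrete, here is a rough sketch of what $\varepsilon_i$ would have to contain in the simplest case. It rests on assumptions that are not part of the model above: $D_i$ is binary, the covariate $X_i$ is dropped, and $Y_i(0)$, $Y_i(1)$ and $\tau_i = Y_i(1) - Y_i(0)$ are notation introduced here for the potential outcomes and the individual treatment effect. The observed outcome is
$$ y_i = Y_i(0) + D_i \tau_i $$
If we define $\alpha = E[Y_i(0)]$ and $\beta_1 = E[\tau_i]$, then adding and subtracting these terms gives
$$ y_i = \alpha + \beta_1 D_i + \underbrace{\big(Y_i(0) - E[Y_i(0)]\big) + D_i\big(\tau_i - E[\tau_i]\big)}_{\varepsilon_i} $$
Under these assumptions the error term bundles two different things: the deviation of the untreated outcome from its mean, and treatment effect heterogeneity that only switches on for treated units. Because $X_i$ is dropped here, the relevant condition is on $\text{Cov}(D_i, \varepsilon_i)$, and it is really a statement about how treatment relates to the potential outcomes, which the regression notation hides.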