How CounterFact evaluates a decision policy
CounterFact uses logged decision data to evaluate a candidate policy before rollout and shows whether the evidence is strong enough to rely on.
Logged decision data
- Decisions
- Actions
- Context
- Outcomes
- Outcome horizon
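As a minimal sketch of what one logged decision might contain, the record below mirrors the five items listed above. The class name `LoggedDecision`, its field names, and the helper `is_outcome_mature` are hypothetical and chosen for illustration; they are not CounterFact's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class LoggedDecision:
    """One logged decision (hypothetical schema, for illustration only)."""
    decision_id: str                 # which decision this row describes
    decided_at: datetime             # when the action was chosen
    context: Dict[str, Any]          # what was known at decision time
    action: str                      # the action that was actually taken
    outcome: Optional[float]         # None until the outcome is observed
    outcome_horizon: timedelta       # how long the outcome needs to mature


def is_outcome_mature(record: LoggedDecision, now: datetime) -> bool:
    """True when an outcome exists and the full horizon has elapsed."""
    if record.outcome is None:
        return False
    return now - record.decided_at >= record.outcome_horizon
```

Keeping the outcome horizon on the record makes it possible to exclude decisions whose outcomes have not had time to mature.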
Candidate policy
The proposed action rule or assignment.
CounterFact evaluation
Checks whether the logged decisions can support a pre-rollout read of the candidate policy.
Outcome read
What appears to change under the candidate policy.
Evidence verdict
Whether the logged evidence supports that read.
Next step
What to do with the result.
The result page also shows Why this verdict, Evaluation summary, and Estimate comparison as supporting detail.
What the result means
CounterFact separates the result into an Outcome read, which says what appears to change under the candidate policy, and an Evidence verdict, which says how strongly the logged data supports trusting that read. A promising read with weak evidence is not enough to act on, while a no-clear-change read with strong evidence can still be useful.
What CounterFact checks before trusting the read
- Policy coverage: Does the logged behavior cover the actions the candidate policy would take?
- Data readiness: Are decisions, actions, context, and outcomes usable?
- Estimator agreement: Do multiple estimators point to the same read?
- Precision check: Is the uncertainty interval narrow enough to interpret?
- Robustness: Does the result hold under stress and sensitivity checks?
- Outcome maturity: Are outcomes measured over the right horizon?
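To make the coverage and precision questions concrete, here is a rough sketch using inverse propensity weighting, a standard off-policy estimator. This page does not say which estimators CounterFact runs, so treat the technique as a stand-in. The sketch assumes each logged row carries the propensity of the logged action and that the candidate policy is deterministic. The function name `ips_summary` and the row layout are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# Assumed row layout: (context, logged_action, logged_propensity, outcome)
Row = Tuple[Dict[str, Any], str, float, float]


def ips_summary(
    rows: Sequence[Row],
    candidate_policy: Callable[[Dict[str, Any]], str],
) -> Tuple[float, float, float]:
    """Return (ips_estimate, effective_sample_size, match_rate)."""
    if not rows:
        raise ValueError("need at least one logged row")

    weights = []
    weighted_outcomes = []
    for context, action, propensity, outcome in rows:
        if propensity <= 0.0:
            raise ValueError("logged propensity must be positive")
        # A row only informs the estimate when the candidate policy
        # would have chosen the action that was actually logged.
        matches = candidate_policy(context) == action
        weight = 1.0 / propensity if matches else 0.0
        weights.append(weight)
        weighted_outcomes.append(weight * outcome)

    n = len(rows)
    estimate = sum(weighted_outcomes) / n

    # Kish effective sample size: small values signal weak coverage
    # and, typically, a wide interval around the estimate.
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    ess = (sum_w * sum_w) / sum_w2 if sum_w2 > 0 else 0.0

    match_rate = sum(1 for w in weights if w > 0) / n
    return estimate, ess, match_rate
```

A low match rate or a small effective sample size relative to the number of logged rows suggests the logged behavior rarely took the actions the candidate policy would take, which is the situation the policy coverage check is meant to catch.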
What the verdicts mean
Reliable
Strong offline read; still validate before rollout.
Directional
Useful for prioritization, not deployment proof.
Limited Evidence
Logging or data gaps block a trustworthy read.
Outside Scope
The setup does not fit this evaluation approach.
A no-clear-change Outcome read can still be useful when the evidence supports that the candidate policy is unlikely to move the outcome much.
Limited Evidence and Outside Scope are honest verdicts, not failures.
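One way to picture how the checks could feed the four verdicts is the sketch below. The mapping is an assumption made for illustration: the page describes what each verdict means but does not publish the rule that produces it, so `CheckResults`, its fields, and `evidence_verdict` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    RELIABLE = "Reliable"
    DIRECTIONAL = "Directional"
    LIMITED_EVIDENCE = "Limited Evidence"
    OUTSIDE_SCOPE = "Outside Scope"


@dataclass(frozen=True)
class CheckResults:
    """Pass/fail flags for each check (hypothetical structure)."""
    in_scope: bool
    policy_coverage: bool
    data_readiness: bool
    estimator_agreement: bool
    precision: bool
    robustness: bool
    outcome_maturity: bool


def evidence_verdict(checks: CheckResults) -> Verdict:
    """Illustrative mapping only; not CounterFact's actual rule."""
    # The setup does not fit this evaluation approach at all.
    if not checks.in_scope:
        return Verdict.OUTSIDE_SCOPE
    # Logging or data gaps block a trustworthy read.
    if not (checks.policy_coverage and checks.data_readiness
            and checks.outcome_maturity):
        return Verdict.LIMITED_EVIDENCE
    # Every remaining check passes: a strong offline read.
    if checks.estimator_agreement and checks.precision and checks.robustness:
        return Verdict.RELIABLE
    # Usable data, but the read is not firm enough to count as proof.
    return Verdict.DIRECTIONAL
```

Note that the verdict is computed without looking at the Outcome read itself, which matches the separation described above: the direction of the estimate and the strength of the evidence are judged independently.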
What CounterFact does not claim
CounterFact does not guarantee production impact or replace rollout monitoring. Not every dataset is suitable. Sometimes the right next step is better logging or instrumentation.
Related articles
Practical notes on logging quality, available actions, and where offline evaluation can fail.
Why Offline Wins Keep Dying in A/B Tests
Why strong offline metrics often fail in live tests, and what to examine before trusting them.
Candidate Sets: The Invisible Boundary of Offline Evaluation
Why not knowing which actions were available at decision time is one of the most dangerous failure modes in offline evaluation.
The Hidden Foundation of Offline Policy Evaluation
What propensities are, why they need to be logged at decision time, and what breaks when they are missing or wrong.
Beyond A/B Testing: How Off-Policy Evaluation Transforms Recommendation Systems
An earlier framing of CounterFact's roots in recommendation and ranking evaluation.