Why Most Backtests Lie: Five Pitfalls That Make Strategies Look Better Than They Are

Test 100 random strategies on the same data and the best one will look like genius. That is not a cautionary hypothetical. It is arithmetic, and it is the reason most backtests lie.

A Sharpe ratio of 2.0 sounds impressive. It implies annualized excess returns roughly twice annualized volatility, the kind of risk-adjusted performance that justifies a hedge-fund letter. But on a year or two of US equity data, with a few hundred strategy variations tried quietly in the background, a best-in-batch Sharpe of 2.0 is about what flipping coins would give you. The strategy is not the edge. The search is.

This is not a fringe view. Harvey, Liu and Zhu’s 2016 paper “…and the Cross-Section of Expected Returns” in the Review of Financial Studies catalogues 296 published “factors” that purport to predict equity returns and argues most are false discoveries from multiple testing. Their recommended significance threshold for a new factor is t > 3.0, not the conventional t > 2.0, and many famous anomalies do not clear it. McLean and Pontiff (2016) reconstructed 97 published anomalies and found returns were 26% lower in the window between the end of the original sample and publication, and 58% lower after publication. Some of that decay is real arbitrage, but a substantial share is publication bias collapsing back to truth.

The five pitfalls below are the mechanical reasons backtests overstate future returns. Most retail backtesting tools let you fall into all five at once.

Pitfall 1: Multiple testing, also known as selection bias on the historical record

The mechanism is simple. Try one strategy on pure noise and you get one mediocre Sharpe ratio. Try a thousand on the same noise and the single best of those thousand will look great, by luck alone. Bailey, Borwein, Lopez de Prado and Zhu (2014), “Pseudo-Mathematics and Financial Charlatanism”, show that roughly 1,000 random trials on pure noise produce a best apparent Sharpe ratio of about 3 even when the true expected return is zero. That figure corresponds to about a year of data, where the annualized Sharpe estimate has a standard error close to 1; longer samples shrink the number but do not eliminate it. Their “deflated Sharpe ratio” (Bailey and Lopez de Prado, 2014) corrects an observed Sharpe for the number of trials run, the sample length, and the skewness and kurtosis of returns. The result is a probability: a deflated Sharpe below your chosen confidence level (0.95 is a common choice) means the strategy is statistically indistinguishable from a lucky search.
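The arithmetic is easy to reproduce. The sketch below is illustrative only: it assumes Gaussian daily returns with zero mean and 1% daily volatility, one year of data per strategy, and a zero risk-free rate. The function names are made up for this example.

```python
import math
import random
import statistics


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate taken as zero)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * math.sqrt(periods_per_year)


def best_of_n_sharpe(n_strategies, n_days=252, seed=0):
    """Best annualized Sharpe among n_strategies series of pure Gaussian noise."""
    rng = random.Random(seed)
    best = -math.inf
    for _ in range(n_strategies):
        # True mean is zero, so any Sharpe measured here is luck.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualized_sharpe(returns))
    return best


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(best_of_n_sharpe(n), 2))
```

Because every run uses the same seed, each larger batch contains the smaller ones, so the printed "best" can only rise as N grows. None of the simulated strategies has any edge.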

Lopez de Prado’s 2018 piece “The 10 Reasons Most Machine Learning Funds Fail” extends this critique into the ML era. Backtest overfitting is one of his canonical failure modes, alongside cross-validation leakage and naive labeling. The point is that even sophisticated practitioners, with PhDs and clean data, ship strategies that decay the day they go live because the search effort was never disclosed.

Defense. Pre-register a hypothesis with an economic rationale. Use cross-validation that respects time order. Apply a Bonferroni correction or, better, the deflated Sharpe ratio. If you cannot count your own trials, count generously: include the parameter sweeps, the ticker filters, the lookback windows, and every “let me just try one more thing” that did not make it into the final notebook.
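For readers who want to try the correction, here is a minimal sketch of the deflated Sharpe ratio as it is usually stated following Bailey and Lopez de Prado (2014). It assumes per-period (not annualized) Sharpe inputs, raw kurtosis (3 for a normal distribution), and that you can estimate the spread of Sharpe ratios across your trials. The function and parameter names are illustrative, not taken from any library.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate best Sharpe expected from n_trials strategies with no skill.

    sharpe_std is the standard deviation of Sharpe estimates across the trials.
    """
    if n_trials < 2:
        return 0.0
    z1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_std, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the luck-only benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    # Variance adjustment for non-normal returns (reduces to 1 + SR^2 / 2 when normal).
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


if __name__ == "__main__":
    # Hypothetical inputs: daily Sharpe 0.10 over five years of daily data.
    print(deflated_sharpe(0.10, n_obs=1260, n_trials=1, sharpe_std=0.03))
    print(deflated_sharpe(0.10, n_obs=1260, n_trials=100, sharpe_std=0.03))
```

With these made-up inputs, the same daily Sharpe that passes comfortably as a single trial no longer clears a 0.95 bar once 100 trials are counted. Only the trial count changed.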

Pitfall 2: Survivorship bias

The mechanism is that historical databases that drop delisted, bankrupt or merged companies overstate the universe’s returns. Brown, Goetzmann, Ibbotson and Ross (1992) modelled this for mutual funds and showed survivorship can inflate apparent returns substantially; the canonical hedge-fund evidence (Brown, Goetzmann and Ibbotson, 1999) suggests an inflation of 1 to 4 percentage points per year and Sharpe ratios up to half a point higher. Carpenter and Lynch (1999) and Carhart, Carpenter, Lynch and Musto (2002) document similar effects for fund-performance studies.

For equities, the issue bites hardest in long backtests. If your stock screener pulls today’s S&P 500 and runs it back to 1990, you are testing a portfolio of survivors. The Enrons and Lehmans were silently removed. A “value” or “small-cap” strategy backtested this way looks heroic because it omits the names that went to zero.

Defense. Use a survivorship-bias-free database (CRSP for academics, Norgate or Sharadar for retail) that includes delisted tickers with full price histories. If your data source does not, document the limitation in the backtest output and discount your results. As a sanity check, run the same backtest on a current-membership universe and a delisted-included universe; the gap is the survivorship premium you accidentally captured.
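The sanity check can be rehearsed on synthetic data before you pay for a better database. The toy simulation below is a sketch under loud assumptions: every "stock" is a hypothetical log-normal random walk with made-up drift and volatility, and a stock counts as delisted once its price falls below 20% of its starting value. It says nothing about the size of the bias in real markets, only about its direction.

```python
import math
import random
import statistics


def survivorship_gap(n_stocks=2000, n_years=10, delist_below=0.2, seed=0):
    """Mean total return of the full universe vs. survivors only.

    Returns (full_universe_mean, survivor_mean, n_delisted).
    """
    rng = random.Random(seed)
    full_universe, survivors = [], []
    for _ in range(n_stocks):
        price = 1.0
        delisted = False
        for _ in range(n_years):
            # One year of log-normal price change (illustrative parameters).
            price *= math.exp(rng.gauss(0.03, 0.35))
            if price < delist_below:
                delisted = True  # position is closed at the delisting price
                break
        total_return = price - 1.0
        full_universe.append(total_return)
        if not delisted:
            survivors.append(total_return)
    n_delisted = n_stocks - len(survivors)
    return statistics.fmean(full_universe), statistics.fmean(survivors), n_delisted


if __name__ == "__main__":
    full, surv, n_delisted = survivorship_gap()
    print(f"delisted: {n_delisted}, full: {full:.3f}, survivors only: {surv:.3f}")
```

The survivors-only mean is higher by construction, because the worst outcomes were removed from the average. That difference is the gap the paragraph above asks you to measure on real data.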

Figure: Backtest overfitting. The "best of N tries" Sharpe ratio grows with the number of strategies tested even when none has a real edge. After Bailey and Lopez de Prado (2014).

Pitfall 3: Look-ahead bias

The mechanism is using information at time t that was not actually available at time t. The classic offenders are earnings restatements (the GAAP figure you see today may not be what was filed on the original 10-Q), index reconstitutions (the S&P 500 membership list you download today is not the list that was in force in January 1995), dividend and split adjustments applied retroactively, and macroeconomic releases revised months later (initial GDP prints differ materially from final estimates).

A subtler version is computing a signal from the daily close and then filling at that same close. The closing price is not known until the bar has ended, so a signal that depends on it can only be executed on the next bar, or at the close with slippage against an estimate made moments earlier. The same problem afflicts intraday strategies that use the high-low range as if it were known at the open.

Defense. Use point-in-time data sources where the value at any historical timestamp reflects what was visible at that timestamp, not a later revision. If point-in-time is unavailable, build in a strict one-bar lag: signals computed on day t’s data can only be acted on at day t+1’s open. For event data (earnings, dividends), use the announcement timestamp, not the period-end timestamp. The discipline costs basis points; the absence of it costs your strategy.
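A deliberately silly rule shows how much a one-bar peek is worth. The sketch below assumes a hypothetical "go long if the day's return is positive, short otherwise" signal applied to pure noise. With `lag=0` the rule trades the same bar it was computed from; with `lag=1` it waits for the next bar. It uses close-to-close returns for simplicity rather than the next open.

```python
import random
import statistics


def strategy_pnl(returns, lag):
    """Per-bar P&L of a sign-following rule.

    lag=0 uses the bar's own return to pick the position (look-ahead);
    lag=1 applies yesterday's signal to today's return.
    """
    signals = [1.0 if r > 0 else -1.0 for r in returns]
    return [signals[t - lag] * returns[t] for t in range(lag, len(returns))]


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 0.01) for _ in range(5000)]
    print("look-ahead mean:", statistics.fmean(strategy_pnl(noise, lag=0)))
    print("lagged mean:    ", statistics.fmean(strategy_pnl(noise, lag=1)))
```

The look-ahead version never loses, because it is simply collecting the absolute value of each return. The lagged version is what the rule would actually have earned on noise, which is roughly nothing.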

Pitfall 4: In-sample versus out-of-sample fitting

The mechanism is tuning parameters on the same data you use to evaluate the strategy. Choose a lookback window and a stop-loss threshold by gridding over the 1990 to 2020 sample, then “test” on 1990 to 2020, and you have not tested anything; you have measured how well your parameters fit the realised path. Pesaran and Timmermann (1995, 2002) explored this in the context of forecasting US stock returns and demonstrated that out-of-sample R-squared collapses dramatically once you stop letting the model see the test data.

Walk-forward analysis is the standard fix. You train on, say, 1990 to 2000, deploy those parameters in 2001, retrain at the end of each year, and roll forward. Combinatorial purged cross-validation (Lopez de Prado, 2018) is a more rigorous variant for cases where data leaks across folds. Either way, the principle is the same: parameters chosen on data X must be evaluated only on data not-X, with a strict embargo to handle autocorrelation.
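The walk-forward schedule is simple enough to write down before any returns are computed. The generator below is one possible sketch: it assumes an expanding (anchored) training window and an optional embargo gap between training and test data. The names `initial_train`, `test_size` and `embargo` are illustrative.

```python
def walk_forward_splits(n_obs, initial_train, test_size, embargo=0):
    """Yield (train_indices, test_indices) for an expanding walk-forward.

    Training always ends `embargo` observations before the test block starts,
    so serially correlated observations cannot straddle the boundary.
    """
    train_end = initial_train
    while train_end + embargo + test_size <= n_obs:
        test_start = train_end + embargo
        yield range(0, train_end), range(test_start, test_start + test_size)
        train_end += test_size


if __name__ == "__main__":
    years = list(range(1990, 2021))  # 31 annual observations
    for train, test in walk_forward_splits(len(years), initial_train=11, test_size=1):
        print(f"train {years[train[0]]}-{years[train[-1]]}, test {years[test[0]]}")
```

With annual data from 1990 to 2020, the first split trains on 1990 to 2000 and tests on 2001, matching the example above. A rolling window of fixed length would be an equally valid design; it only requires moving the start of the training range forward as well.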

Defense. Hold out a final out-of-sample window and never look at it until the very end. If results on it disappoint, do not iterate; instead, downgrade your conviction and start the search again with a fresh hypothesis. Anything else is laundering data through hindsight.

Pitfall 5: Regime change and non-stationarity

The mechanism is that the data-generating process changes over time. A pairs trade in oil and natural gas worked beautifully through 2007, then broke when shale fundamentally changed the supply curve. A 1990s momentum strategy that delivered double-digit alpha decayed steadily through the 2000s as quants arbitraged it away. The 60/40 portfolio’s diversification benefit collapsed in 2022 when stocks and bonds sold off together. None of these regime breaks were predictable from the in-sample data, but each one rendered prior backtests irrelevant.

McLean and Pontiff (2016) is the canonical evidence here at the cross-sectional level: 97 published anomalies, average returns 58% lower post-publication. Some of that is arbitrage (real alpha gets traded away as soon as it is documented), some is publication bias unwinding, and some is genuine non-stationarity. For long backtests, the older the data, the more you should suspect that today’s market is not the market that produced the returns.

Defense. Test across multiple regimes explicitly. If your strategy needs the entire 1990 to 2020 sample to look profitable, ask what its Sharpe was in the worst rolling five-year window. Look at strategy decay: is the alpha declining over time, even before publication? Use shorter, more recent samples to estimate robustness. And accept that any backtest projecting 30 years forward is implicitly assuming the future looks like the past, which is the assumption most likely to break the strategy.
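The worst-window question takes a few lines to answer. The sketch below assumes a plain list of periodic strategy returns and a zero risk-free rate. `window` and `step` are counted in observations, so a five-year window on daily data would be roughly `window=1260`. The function name is made up for this example.

```python
import math
import statistics


def rolling_sharpes(returns, window, step=1, periods_per_year=252):
    """Annualized Sharpe ratio for each rolling window of `window` observations."""
    sharpes = []
    for start in range(0, len(returns) - window + 1, step):
        chunk = returns[start:start + window]
        stdev = statistics.stdev(chunk)
        if stdev > 0:  # skip degenerate windows with no variation
            sharpes.append(statistics.fmean(chunk) / stdev * math.sqrt(periods_per_year))
    return sharpes


if __name__ == "__main__":
    # Hypothetical series: a rule that worked in the first half and reversed in the second.
    returns = [0.02, 0.0] * 50 + [-0.02, 0.0] * 50
    sharpes = rolling_sharpes(returns, window=20)
    print("best window: ", max(sharpes))
    print("worst window:", min(sharpes))
```

Reporting the minimum alongside the full-sample figure makes regime dependence visible. A strategy whose full-sample Sharpe is acceptable but whose worst window is deeply negative is leaning on a subset of the history.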

Practical implication. The good news for self-directed investors is that the discipline these papers prescribe does not require a quant desk. TradeWave detects repeating seasonal patterns over a lookback you choose, from 1 to 99 years, on either the calendar or the election cycle, so you can re-run the same pattern over different windows and watch whether a tendency holds in recent decades or only survives because of one ancient regime. Each pattern reports its hit-rate and an auditable record, not a verdict. A seasonal effect is a tendency, not a promise, and the point of varying the lookback is to make your own skepticism do the work. The hard part is remembering to apply it when a result looks too good.

Putting the pitfalls together

The five inflations compound. A backtest that commits all of them at once (multiple testing across hundreds of variations, a survivor-only universe, retroactively adjusted prices, in-sample parameter tuning, and a sample drawn from a single bull regime) can plausibly turn a true Sharpe of zero into a reported Sharpe of 2 or 3. This is rarely malicious. It is the default behaviour of unreflective backtesting, and most retail “I built a Sharpe-3 strategy” claims commit at least three of the five.

So the literature converges on a short checklist for self-defense:

Write the hypothesis first, in plain English, and explain why it should work economically before you touch the data.
Pick the validation scheme before you compute any returns: walk-forward windows, embargo periods, the metric that decides success.
Count your trials and correct for them. If you cannot run the deflated Sharpe formula directly, at minimum require t > 3 rather than t > 2.
Replicate with point-in-time data and a survivorship-free universe before you believe a published result.
Distrust the recent past most: a tendency that worked from 1990 to 2018 but stalled from 2019 to today is probably dying, not on sale.
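The t > 3 rule in the checklist translates directly into track-record length. The sketch below uses the common approximation that the t-statistic of mean return is the annualized Sharpe ratio times the square root of the number of years. It assumes roughly independent returns and ignores higher-order corrections, so treat the outputs as rough guides.

```python
import math


def t_stat(annual_sharpe, years):
    """Approximate t-statistic of mean return for a given Sharpe and sample length."""
    return annual_sharpe * math.sqrt(years)


def years_needed(annual_sharpe, t_threshold=3.0):
    """Years of data needed for the approximate t-statistic to reach t_threshold."""
    return (t_threshold / annual_sharpe) ** 2


if __name__ == "__main__":
    for sharpe in (0.5, 1.0, 2.0):
        print(sharpe, years_needed(sharpe, 2.0), years_needed(sharpe, 3.0))
```

Under this approximation, a strategy with an annualized Sharpe of 1.0 needs about four years of data to reach t = 2 and about nine years to reach t = 3. Raising the bar from 2 to 3 more than doubles the evidence required.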

None of this is exotic. Real edge requires either an economically motivated hypothesis tested honestly, or a sample large enough that the statistical bar is genuinely hard to clear by chance. The discipline that separates working quant practice from exhibitionist backtesting is just the willingness to count your trials, hold out real out-of-sample data, use point-in-time prices, test across regimes, and accept that most “discoveries” do not survive contact with the future.