An investor pulls up Apple. The chart tool shows that AAPL has gained ground from June through August in 8 of the last 10 years. The average summer return over that decade is 4.2 percent. The investor concludes there is a summer rally pattern in AAPL and starts planning a trade.
The problem with this story is not the data. The numbers are correct. The problem is that “8 out of 10” and “4.2 percent average” do not, by themselves, mean what they appear to mean. The math required to interpret them honestly is rarely done by retail investors and is often done sloppily by professionals. This article walks through the toolkit you need to read seasonal patterns without lying to yourself.
The setup matters because seasonality is one of the most heavily researched, most heavily discussed, and most consistently overinterpreted areas of financial pattern analysis. Some of the patterns are real. Most of what looks compelling is noise wearing a seasonal costume.
1. Sample size and statistical power
Stock returns are noisy. The standard deviation of monthly returns on the S&P 500 is roughly 4 percent; on individual large-cap stocks closer to 7 to 9 percent. The annualized standard deviation of stock returns sits around 15 percent for the broad market and considerably higher for individual names.
Suppose a true seasonal effect exists with a mean of 2 percent and the same noise as the broad market, a standard deviation of about 4 percent per month. How many observations would you need to reliably distinguish that effect from zero?
The answer is given by a power calculation. With a desired statistical power of 80 percent and a two-sided significance level of 5 percent, the standard formula gives you roughly 30 to 35 independent observations.
For an annual seasonal effect, those 30 observations are 30 years. For a monthly seasonal effect, the relevant noise is monthly, so you need 30 months that contain the season - which for “the month of August” is also 30 years.
A 10-year sample, in this regime, has roughly 35 percent power to detect a true 2-percent effect. That means that even when the effect genuinely exists, you have a 65 percent chance of failing to detect it. The flip side, which matters more here, is that when you do see a significant-looking pattern in 10 years of data, the size of that observed effect is heavily biased upward by chance. Ioannidis’s 2005 PLOS Medicine paper “Why Most Published Research Findings Are False” walks through exactly this dynamic in the medical literature. The financial application is the same: small samples produce loud, exaggerated, often spurious patterns.
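The two figures above can be reproduced with a short sketch. It assumes a two-sided z-test (a normal approximation, which is slightly optimistic for small samples), a 2 percent effect, and 4 percent monthly noise; the function names are illustrative, not from any library.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def required_n(effect, sigma, alpha=0.05, power=0.80):
    """Observations needed for a two-sided z-test to detect `effect`."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)

def power_at_n(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided z-test with n observations."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma / sqrt(n))  # effect measured in standard errors
    # The rejection probability in the opposite tail is negligible and ignored.
    return Z.cdf(shift - z_alpha)

n_needed = required_n(effect=2.0, sigma=4.0)
power_10y = power_at_n(effect=2.0, sigma=4.0, n=10)
print(n_needed, round(power_10y, 2))  # 32 observations, power about 0.35
```

Doubling the noise to 8 percent, closer to a single large-cap stock, quadruples the required sample, which is why single-ticker seasonality is even harder to establish than index seasonality.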
The practical takeaway is brutal. Most retail seasonality charts sit in the underpowered region. A 10-year window is a hint, not evidence. A 20-year window is starting to be informative. A 30-year window, with the adjustments described in the sections that follow, is where conclusions become defensible.
2. What a p-value actually means
The standard tool for evaluating a pattern is a p-value: a probability that summarizes the evidence against the hypothesis that the true effect is zero. P-values are also one of the most consistently misinterpreted concepts in applied statistics.
The correct definition: a p-value of 0.05 means that, assuming the true effect is zero, you would observe data this extreme or more extreme by chance roughly 5 percent of the time.
The incorrect interpretation that almost everyone slides into: “there is a 95 percent chance the effect is real.”
Steven Goodman’s 2008 paper “A Dirty Dozen: Twelve P-Value Misconceptions” in Seminars in Hematology lists the standard errors. The most damaging is the inversion: confusing P(data given no effect) with P(effect given data). These are not the same thing, and Bayes’ theorem is the bridge between them. Without specifying a prior probability that the effect is real, you cannot translate one into the other.
For seasonality this matters because the prior probability of any given pattern being real is low. Hundreds of patterns have been examined. A handful are real. If you start with a prior of, say, 10 percent that a randomly chosen seasonal pattern is genuine, then a single result significant at p=0.05 raises your posterior probability to about 33 percent if the test has low power (roughly 20 to 25 percent, plausible for a short sample), and only to about 64 percent even at 80 percent power. Not 95 percent. Not even close.
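The Bayes arithmetic can be checked in a few lines. This is a simplified sketch that treats "the test came back significant at alpha" as the evidence; the power values are illustrative assumptions, since the posterior depends on how likely a real effect is to pass the test. Under this model, a posterior near 33 percent corresponds to a power of about 0.225.

```python
def posterior_real(prior, power, alpha=0.05):
    """P(effect is real | test is significant), via Bayes' theorem."""
    true_positive = power * prior          # real effect and detected
    false_positive = alpha * (1 - prior)   # no effect but significant by chance
    return true_positive / (true_positive + false_positive)

# Prior of 10 percent, three assumed power levels.
for power in (0.225, 0.35, 0.80):
    print(power, round(posterior_real(prior=0.10, power=power), 2))
# 0.225 -> 0.33, 0.35 -> 0.44, 0.8 -> 0.64
```

Even a well-powered test leaves the posterior far below the 95 percent that the common misreading suggests.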
A useful rule of thumb: when someone says “this is statistically significant at p=0.05,” ask “compared to what null hypothesis, with how many tests run before this one was published, and what was the prior probability that the effect was real?” Most of the time the speaker has not thought about the second or third question.
3. Multiple testing: the factor zoo
Suppose you do not have a single seasonal hypothesis but rather a tool that lets you slice and dice. You can pick any ticker, any start date, any end date, any holding period. How many tests do you have available to you?
Take a single ticker. Try every monthly start and stop combination: that is 12 times 12 minus 12 = 132 windows. Try multiple holding periods, multiple averaging methods, multiple universes. The number of distinct tests grows fast.
If the true effect is zero, you expect roughly 5 percent of your tests to come back “significant” at p=0.05 by chance alone. That means in 132 monthly windows on one ticker, you expect to see about 7 windows that look statistically interesting purely from noise.
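The window count and the expected number of false hits are simple arithmetic, sketched here. The last line assumes the tests are independent, which overlapping windows are not, so treat that probability as a rough guide only.

```python
windows = 12 * 12 - 12   # start/end month pairs with start != end
alpha = 0.05

# Expected number of "significant" windows when the true effect is zero.
expected_false_hits = windows * alpha

# Chance of at least one false hit, assuming independent tests
# (an assumption: overlapping windows are in fact correlated).
p_at_least_one = 1 - (1 - alpha) ** windows

print(windows, round(expected_false_hits, 1), p_at_least_one > 0.99)
```

With this many windows, finding nothing "significant" would be the surprising outcome.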
The financial-economics literature has the same problem at scale. Campbell Harvey, Yan Liu and Heqing Zhu, in their 2016 paper “…and the Cross-Section of Expected Returns” in the Review of Financial Studies, surveyed the literature on cross-sectional return predictability. They found 296 published candidate factors. Their argument: after accounting for the multiple-testing problem - the fact that researchers tested many factors, published the ones that worked, and discarded the ones that did not - the conventional t > 2 statistical threshold is not enough. Many of the 296 factors do not survive a more stringent multiple-testing-adjusted hurdle.
The implication for the individual investor is simple. Whenever someone presents a seasonal pattern with a “statistically significant” tag, the relevant question is: how many alternatives were examined before this one was selected? If the answer is “I checked dozens of windows and this is the best one,” the headline p-value is meaningless.
4. Bonferroni and the false discovery rate
There are two practical ways to adjust for multiple testing.
The Bonferroni correction is the simple version: divide your significance threshold by the number of tests. If you tested 100 windows at a desired alpha of 0.05, treat 0.0005 as the new threshold for any one of them. This controls the family-wise error rate, the probability that at least one of your tests is a false positive.
Bonferroni is easy and conservative. Conservative means you will miss some real effects (low power) but you will also strongly suppress false ones. For exploratory pattern hunting, this is the right side to err on.
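A minimal sketch of the Bonferroni adjustment follows. The p-values are made up for illustration (three small ones among 100 tests), and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the per-test threshold and a significance flag per p-value."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

# Hypothetical p-values from 100 window tests.
p_values = [0.0001, 0.004, 0.03] + [0.5] * 97
threshold, flags = bonferroni_significant(p_values)
print(threshold, sum(flags))  # threshold 0.0005; only the 0.0001 result survives
```

Two of the three results that would pass an unadjusted 0.05 cutoff are discarded, which is the conservatism described above.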
The False Discovery Rate (FDR) approach, due to Yoav Benjamini and Yosef Hochberg in their 1995 Journal of the Royal Statistical Society paper “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” is more nuanced. Rather than controlling the chance of any false positive, FDR controls the expected proportion of your “discoveries” that are false. The Benjamini-Hochberg procedure ranks your N p-values from smallest to largest, finds the largest k such that the k-th smallest p-value is at or below k/N times your desired FDR, and treats the k smallest p-values as significant.
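The step-up rule is easier to see in code. This is a plain-Python sketch of the Benjamini-Hochberg procedure as described above, with hypothetical p-values; it assumes the tests are independent or positively dependent, the setting the procedure was designed for.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per p-value under the BH step-up procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            k_max = rank  # keep the largest rank that clears its threshold
    reject = [False] * n
    for idx in order[:k_max]:
        reject[idx] = True  # everything ranked at or below k_max is a discovery
    return reject

# Hypothetical p-values from 10 tests, deliberately unsorted.
p_values = [0.3, 0.001, 0.019, 0.6, 0.012, 0.2, 0.014, 0.5, 0.4, 0.7]
reject = benjamini_hochberg(p_values)
print(sum(reject))  # 4 discoveries
```

Note that 0.012 misses its own threshold (2/10 × 0.05 = 0.01) but is still counted, because a larger p-value further down the ranking clears its threshold. Bonferroni on the same data, with a cutoff of 0.005, would keep only the 0.001 result.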
In practice, FDR control is what serious empirical asset-pricing work uses. The Harvey-Liu-Zhu paper above explicitly uses these techniques. For an individual investor evaluating seasonal patterns, even a back-of-envelope Bonferroni-style adjustment helps: divide the headline alpha by the rough number of tests examined, and treat the result as the meaningful threshold.
5. Regime breaks
A 30-year sample sounds like enough data, but only if those 30 years are reasonably IID - independently and identically distributed. Financial markets are not.
Several documented regime shifts complicate seasonal analysis:
The day-of-week effect, where Mondays underperformed and Fridays outperformed, was widely reported in data through the 1980s and largely disappeared after 1990. Connolly’s 1989 work in the Journal of Financial and Quantitative Analysis had already questioned how robust the effect was, and subsequent updates have confirmed that it faded.
The pre-2008 commodity-equity correlation was meaningfully different from the post-2008 correlation. The financialization of commodity markets, partly through ETFs, changed the relationship.
The pre-2010 correlation between US equity returns and Treasury yields was different from the post-2010 correlation. Quantitative easing changed the conditional behavior of both.
For seasonal analysis, this means a 30-year sample is sometimes really two 15-year samples that happen to be glued together. A pattern that exists in the first half but not the second half is probably not a pattern; it is a regime that ended. A pattern that exists in both halves with similar magnitude is much more credible.
The practical step is to test in subsamples. Split the sample period in half, or into thirds, or by event boundaries (pre-2000, 2000 to 2008, post-2008). If the effect is consistent across subsamples, it is more likely real. If it appears only in one subsample, the safer interpretation is that you have found a regime, not a seasonality.
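A subsample check can be sketched as follows. The returns are invented for illustration (years labeled 1 to 12, in percent), chosen so the effect lives entirely in the first half; the helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(returns):
    """Mean, standard error, and t-statistic of a list of seasonal returns."""
    m = mean(returns)
    se = stdev(returns) / sqrt(len(returns))
    return m, se, m / se

def split_halves(returns_by_year):
    """Split a {year: return} mapping chronologically into two halves."""
    years = sorted(returns_by_year)
    mid = len(years) // 2
    first = [returns_by_year[y] for y in years[:mid]]
    second = [returns_by_year[y] for y in years[mid:]]
    return first, second

# Made-up seasonal-window returns in percent, not real market data.
returns_by_year = {
    1: 5.1, 2: 3.4, 3: 6.0, 4: 2.2, 5: 4.8, 6: 3.9,
    7: 0.5, 8: -2.1, 9: 1.2, 10: -0.8, 11: 0.3, 12: -1.0,
}

first, second = split_halves(returns_by_year)
mean_1, se_1, t_1 = summarize(first)
mean_2, se_2, t_2 = summarize(second)
print(round(mean_1, 2), round(t_1, 1))
print(round(mean_2, 2), round(t_2, 1))
```

The full-sample average here is positive, yet the second half shows nothing. That is the signature of a regime that ended rather than a seasonality. The same functions work with event boundaries: pass year lists split at 2000 or 2008 instead of the midpoint.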
6. Look-ahead and survivorship bias
Two further hazards bear mentioning. The first is survivorship bias: testing only on the companies that are still around. Survivors of the dot-com bust still trade on the exchanges; the failures do not. If you test a pattern using only currently-listed companies, you have implicitly conditioned on survival.
The second is look-ahead bias: using information in your test that would not have been available in real time. The familiar form is revised data. Earnings restatements, dividend adjustments, and ticker changes all happen retroactively. A backtest that uses today’s data to make decisions in 2010 is not a backtest of what was knowable in 2010. The full treatment of these problems belongs in a separate discussion of backtesting hygiene; we covered them in our piece on why backtests fail.
7. A defensible workflow
The recipe for evaluating a candidate seasonal pattern without fooling yourself:
- State the hypothesis before looking at the data. Pre-specifying what you are testing makes you accountable. “I am testing whether AAPL has positive returns from June 1 to August 31” is a statement that can be falsified. “I sliced through the data and found a window that looked good” is not.
- Pre-specify the sample period and universe. If you are testing a US-equities seasonal pattern, decide upfront whether the test runs from 1990 to 2024 or some other window, and which universe (S&P 500, Russell 3000, NYSE since 1970). Decide before you see the result.
- Apply the test. Calculate the effect size, its standard error, and the p-value. Report all three.
- Replicate in a different sample. The cleanest validation is an out-of-sample test on a sample period or universe that was not used to develop the hypothesis. McLean and Pontiff’s 2016 Journal of Finance paper “Does Academic Research Destroy Stock Return Predictability?” found that anomaly returns are on average 26 percent lower out-of-sample and 58 percent lower post-publication. Out-of-sample is where most patterns reveal themselves.
- If the effect requires explanations after the fact, treat it as a hypothesis to test, not a confirmed pattern. “This works because of X” is a story; whether X is true is a separate question.
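For the "apply the test" step, one assumption-light option is an exact sign-flip test, sketched below. It is not the only reasonable choice (a t-test is the more common one). It assumes that under the null hypothesis each year's return is as likely to be positive as negative with the same magnitude, and it enumerates every sign pattern, so it is only practical for roughly 20 observations or fewer. The returns are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev

def sign_flip_p_value(returns):
    """Exact two-sided sign-flip test of the hypothesis that the mean is zero."""
    observed = abs(sum(returns))
    magnitudes = [abs(r) for r in returns]
    hits = total = 0
    for signs in product((1, -1), repeat=len(magnitudes)):
        total += 1
        flipped = abs(sum(s * m for s, m in zip(signs, magnitudes)))
        if flipped >= observed - 1e-9:  # small tolerance for float rounding
            hits += 1
    return hits / total

# Hypothetical returns (percent) for a window that was fixed in advance.
window_returns = [4.0, -1.5, 6.2, 3.1, -2.4, 5.5, 2.0, 7.3, -0.6, 3.8]

effect = mean(window_returns)
std_error = stdev(window_returns) / sqrt(len(window_returns))
p_value = sign_flip_p_value(window_returns)
print(round(effect, 2), round(std_error, 2), round(p_value, 4))
```

The p-value only means what it claims if the window was chosen before the data were examined; running this on the best of many windows brings back the multiple-testing problem.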
How to read a per-year bar chart honestly
When you see a bar chart of returns by year for a given window, four things matter more than the headline average:
Win rate. How often is the bar positive? An 8-of-10 win rate sounds great, but at 10 trials the binomial confidence interval on that win rate is roughly 50 to 95 percent. You cannot rule out a 50/50 underlying coin flip from 8/10.
Magnitude when positive vs when negative. A pattern that is +5 percent in good years and -2 percent in bad years has a different risk profile from one that is +3 percent in good years and -8 percent in bad years. The asymmetry matters as much as the win rate.
Consistency over decades. If you can split the sample into pre-2008 and post-2008, do so. Patterns that hold across both subsamples deserve more weight.
Sensitivity to outliers. Recompute the mean removing the single largest year. If the result drops by 50 percent, the original average was driven by one observation, and any forward-looking confidence in the pattern should be dialed back accordingly.
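The four checks above can be run on any per-year series. The sketch below uses a Wilson score interval for the win rate (one of several reasonable binomial intervals) and hypothetical yearly returns built to match the opening example: 8 positive years out of 10 and a 4.2 percent average.

```python
from math import comb, sqrt
from statistics import NormalDist, mean

def wilson_interval(wins, n, confidence=0.95):
    """Wilson score interval for a binomial win rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = wins / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def prob_at_least(wins, n, p=0.5):
    """Chance of `wins` or more up years out of n if each year is a coin flip."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1))

def chart_summary(returns):
    """Win rate, average up and down year, and mean without the best year."""
    ups = [r for r in returns if r > 0]
    downs = [r for r in returns if r <= 0]
    without_best = sorted(returns)[:-1]  # drop the single largest year
    return {
        "win_rate": len(ups) / len(returns),
        "avg_up": mean(ups) if ups else 0.0,
        "avg_down": mean(downs) if downs else 0.0,
        "mean": mean(returns),
        "mean_ex_best": mean(without_best),
    }

low, high = wilson_interval(8, 10)
tail = prob_at_least(8, 10)

# Hypothetical yearly returns in percent, not real AAPL data.
yearly = [5.0, 3.0, -2.0, 4.0, 18.0, 2.0, -1.0, 3.0, 6.0, 4.0]
summary = chart_summary(yearly)
print(round(low, 2), round(high, 2), round(tail, 3))
print(summary)
```

The interval runs from about 49 to 94 percent, and a fair coin produces 8 or more up years out of 10 about 5.5 percent of the time. In this made-up series, removing the single 18 percent year cuts the average from 4.2 to about 2.7 percent.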
Closing
Skepticism is your edge. The market is full of investors falling for seasonal patterns, repeating averages without confidence intervals, and confusing 10-year windows with valid samples. The disciplined investor outperforms partly because they reject 90 percent of what looks compelling. The remaining 10 percent, having survived a real test, is worth taking seriously.
The goal of statistical literacy in markets is not to never trade a pattern. It is to trade only the patterns that have survived honest scrutiny, sized for the residual uncertainty that always remains. That is a smaller list than the one your charting tool will offer you. That is the point.