Duped by the Birthday Paradox

A well-known counterintuitive fact about probability is the birthday problem: the odds that two people in a group share a birthday increase much more rapidly with the size of the group than most people expect. Most people’s intuition seems to be roughly linear: odds increase more or less uniformly with each additional person, and it’s only when you reach the 250s or so that a match gets truly likely. This is probably a result of self-centered thinking: the linear picture is much closer to the truth for matches with one particular birthday (i.e. your own) than it is for the group as a whole. In the group as a whole, though, each new arrival has n chances to find a match for his own birthday (where n is the current size of the group), so in reality the chances that two people in a group will share a birthday increase pretty rapidly as the group grows. The trick is that we’re operating over pairs rather than individuals, and the number of pairs grows quadratically: n people form n(n-1)/2 pairs.
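To make the pairs-versus-individuals point concrete, here is a minimal sketch of the classic calculation, assuming 365 equally likely birthdays and no leap years. The function names `shared_birthday_probability` and `pair_count` are made up for illustration.

```python
def pair_count(n):
    """Number of distinct pairs that can be formed from n people."""
    return n * (n - 1) // 2


def shared_birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes every day is equally likely (a simplification).
    """
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


# Group size, number of pairs, chance of at least one shared birthday.
for n in (10, 23, 50, 100):
    print(n, pair_count(n), round(shared_birthday_probability(n), 3))
```

With only 23 people there are already 253 pairs, which is why the chance of at least one match passes 50% so early.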

I was thinking about that because one of my Facebook friends has a birthday today. I don’t have many Facebook friends as these things go – just 144. And yet a lot of them seem to share birthdays.

  1. I share one with someone.
  2. There’s a pair in May (25th).
  3. There’s a pair in June (20th).
  4. There’s a pair in July (11th).
  5. There’re two pairs in August (1st and 11th).
  6. There’s a pair in September (26th).
  7. There’s a pair (21st) and a triple (6th) in October.
  8. There’s a pair (6th) in November.
  9. There’s a pair in December (20th).
  10. There’re two pairs in January (9th and 16th).
  11. There’re four pairs in February (5th, 16th, 17th, and 24th).

So, if you’re keeping count, that’s 35 people out of 144 who have a duplicated birthday. In other words, roughly one in every four people on my friends list shares a birthday with someone else on the list.

Incidentally, that triple is something you would expect by chance. For the generalized version of the problem, the probability of getting a triple in a group of 144 people is a hair less than 95%. So it’s really not unexpected, even though you might expect it to be unexpected.
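The triple figure can be sanity-checked by simulation. This is only a rough Monte Carlo sketch, assuming 365 equally likely days; `has_tuple` and `estimate_tuple_probability` are hypothetical helper names, and the estimate will wobble from run to run unless a seed is fixed.

```python
import random
from collections import Counter


def has_tuple(n, k=3, days=365, rng=random):
    """True if, among n random birthdays, some day is shared by at least k people."""
    counts = Counter(rng.randrange(days) for _ in range(n))
    return max(counts.values(), default=0) >= k


def estimate_tuple_probability(n, k=3, days=365, trials=2000, seed=None):
    """Fraction of simulated groups of n that contain at least one k-way match."""
    rng = random.Random(seed)
    hits = sum(has_tuple(n, k, days, rng) for _ in range(trials))
    return hits / trials


# Estimated chance of at least one triple among 144 people.
print(estimate_tuple_probability(144))
```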

I would have said that 35 is well above what you would expect by chance, but then I had a friend take a look at his Facebook account, and he got an even higher number. He has significantly more friends than I do (415), and so he gets cool things like quadruples that I can only dream of, but when you break it down, there are 241 people (67 pairs, 25 triples, 8 quadruples) who have duplicated birthdays, meaning more than half of his friends have duplicates.
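Both head counts can be re-derived from the group tallies. The snippet below is just bookkeeping; the dictionary layout and the name `people_with_shared_birthdays` are illustrative choices, and the 16 pairs for my list come from adding up the pairs listed above (my own shared birthday included).

```python
def people_with_shared_birthdays(group_counts):
    """Total people covered, given {group size: number of groups of that size}."""
    return sum(size * count for size, count in group_counts.items())


mine = {2: 16, 3: 1}            # 16 pairs and 1 triple among 144 friends
friend = {2: 67, 3: 25, 4: 8}   # pairs, triples, quadruples among 415 friends

for label, groups, total in (("me", mine, 144), ("friend", friend, 415)):
    shared = people_with_shared_birthdays(groups)
    print(label, shared, round(shared / total, 3))
```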

Two data points isn’t a trend, but it is at least suggestive that my intuition was off about how many shared birthdays there should be.

The punchline? When you think about the problem carefully, my 35 shared birthdays (and my friend’s 241) both actually undershoot the mathematically expected numbers by a bit. Here’s how it works:

  1. The probability that someone shares my birthday is $\frac{1}{365}$
  2. So the probability that someone does not share my birthday is $1 - \frac{1}{365}$
  3. In a group of n people, the probability that none of them shares my birthday is just that multiplied by itself n-1 times (n-1 because I’m in the group too!):

$$(1 - \frac{1}{365})^{n-1}$$

  4. Implying that the probability that at least one person shares my birthday is:

$$(1 - (1 - \frac{1}{365})^{n-1})$$

  5. Which is also the fraction of people in a group of n people that we expect to share birthdays, since the same probability applies to every member of the group. Multiplying by n (linearity of expectation) turns that fraction into the expected number of people with a shared birthday:

$$n(1 - (1 - \frac{1}{365})^{n-1})$$

And that’s it. There are nitpicks you could make about biology and leap years and twins and stuff like that. This isn’t an accurate model of human birthday distributions. But it’s an accurate statement of how many duplicates you could expect to draw out of a hat with 365 chits if you drew n times.

And, wouldncha know? It comes up with 47 for a group of 144:

$$144(1 - (1 - \frac{1}{365})^{143}) \approx 46.7$$

And, wouldncha know? It comes up with 282 for a group of 415:

$$415(1 - (1 - \frac{1}{365})^{414}) \approx 281.7$$

In both cases, the real lists actually undershoot: 35 against an expected 47 for me, and 241 against an expected 282 for my friend.
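The closed-form expression is easy to evaluate directly. Here is a small sketch, assuming the same idealized 365-day year; `expected_duplicates` is a name chosen for illustration.

```python
def expected_duplicates(n, days=365):
    """Expected number of people in a group of n who share a birthday with someone else."""
    if n < 1:
        return 0.0
    # Chance that at least one of the other n-1 people matches a given person.
    p_shared = 1.0 - (1.0 - 1.0 / days) ** (n - 1)
    return n * p_shared


# Group size, observed count from the friends lists, expected count.
for n, observed in ((144, 35), (415, 241)):
    print(n, observed, round(expected_duplicates(n), 1))
```

Both observed counts come in below the computed expectation, which is the undershoot described above.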

A quick Python program to show that it works:

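from random import randint

def numDuplicates(n,N=365):
  # Draw n random "birthdays" from days 1..N, then count the people whose day occurs more than once
  birthdays = [randint(1,N) for r in range(n)]
  return len([i for i in [birthdays.count(j) for j in birthdays] if i > 1])

l = [numDuplicates(144) for i in range(1000)]
print(round(sum(l) / float(len(l))))  # average over the 1000 trials, rounded to a whole person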

That does 1000 trials at a time and averages the result. When I run this repeatedly, I get 46 and 47 almost exclusively, which is as it should be.

numDuplicates does all the work, of course. It takes the size of the group (n) and the number of days in our year (N, defaults to 365) and first generates a list of n random integers from 1 to N to represent everyone’s (randomly chosen) "birthday." Then it makes a second list that records, for each person, how many times that person’s birthday occurs in the first list:

[birthdays.count(j) for j in birthdays]

Then it throws away any member of that list that isn’t greater than one:

[i for i in [birthdays.count(j) for j in birthdays] if i > 1]

This gives us only the members of the list that are duplicates. All that remains is to count those:

len([i for i in [birthdays.count(j) for j in birthdays] if i > 1])
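The nested list comprehension rescans the whole list once per person, which is fine for 144 people but quadratic in general. As an alternative sketch (not the version used above), `collections.Counter` can tally each day once. The names `count_shared`, `count_shared_quadratic`, and `num_duplicates_fast` are hypothetical.

```python
from collections import Counter
from random import randint


def count_shared(birthdays):
    """Number of people whose birthday appears more than once in the list."""
    counts = Counter(birthdays)
    # A day shared by c > 1 people contributes all c of them.
    return sum(c for c in counts.values() if c > 1)


def count_shared_quadratic(birthdays):
    """Same count, written with the nested-comprehension approach from the post."""
    return len([i for i in [birthdays.count(j) for j in birthdays] if i > 1])


def num_duplicates_fast(n, days=365):
    """One simulated group of n people with uniformly random birthdays."""
    birthdays = [randint(1, days) for _ in range(n)]
    return count_shared(birthdays)


trials = [num_duplicates_fast(144) for _ in range(1000)]
print(sum(trials) / len(trials))
```

Both counting functions return the same number for any list; only the amount of work differs.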

That’s the story of how I got duped by the birthday paradox, even though I already had a good intuition about how it works.

3 thoughts on “Duped by the Birthday Paradox”

  1. I love the birthday paradox. I actually found an analog useful in a political argument once. Specifically someone was using the fact that the likelihood of an individual American being subjected to a terrorist attack is very small to imply government anti-terrorism policies were vastly over prioritized. However one must realize that the government is tasked with defending all American interests which includes its citizens. In principle all of us. The likelihood that some American will be subject to a terrorist act in say the scope of a year is actually quite high as evidenced by the nightly news.

    It really is interesting how so many of our self-perspective biases impact our ability to think clearly and avoid fallacious reasoning.

  2. That’s a good explanation of the diverging interests between individual citizens and government agencies on the terrorism issue. An individual citizen is extremely unlikely to be the victim of a terrorist attack, and much more likely to be the victim of police abuse, so it’s in his interests to strengthen civil rights protections against the authorities. The authorities are more likely to be castigated for failing to prevent an attack than they are to be abused by other authorities, so they naturally prioritize getting the tools they need to help them prevent attacks.

    I’m not sure I see how the birthday paradox applies to whether terrorism prevention is over-prioritized/-funded in general, though. Was your approach that the marginal cost of an additional (unprevented) terrorist attack is less than the distributed cost of prevention? I think that argument is probably easy to win against the average citizen just because they either (a) overestimate the marginal cost of an additional (unprevented) terrorist attack or (b) underestimate the distributed cost of preventing each attack up to that one (or, if you prefer, overestimate the effectiveness of existing methods at preventing attacks). In reality, I don’t know how to resolve that issue, since we don’t have access to all the information we need to do the real cost-benefit, but I tend toward the camp that thinks we overprioritize terrorism prevention too, so I’d be interested to hear the argument.

  3. Thanks for correcting me. The analogy I drew was not strictly to the probability calculations of the birthday paradox. A better example is the likelihood that some individual among all participants will win a lottery pool twice. Rather I was drawing a comparison to the, as you say, “self-centered thinking” of individuals when estimating their individual probability of being a victim of a terrorist attack rather than the likelihood of some American or American interest being subject to an attack. In this case if we can assume a simplified independent probability that a given individual is attacked, then the likelihood of some individual among all being attacked is just additive. Nothing complex there.

    Of course the priority debate is a very reasonable one to have in general. My point was simply that the debate over appropriate priority of anti-terrorism efforts should not really be based on the probability that an individual American would be subject to a terrorist attack. Your analysis is much more sophisticated and apropos to the debate. While I think we can reasonably discuss how many resources are directed to the issue, I tend to favor a very aggressive stance on anti-terrorism efforts simply because I think the proclivity of terrorists is to escalate when they sense weakness in the target. This is especially true when there is no realistic possibility of accommodation or negotiation with the group. So for example, the strategy for dealing with the terrorists of Northern Ireland would need to be different from that required to deal with ISIS.
