The Hannah Montana Problem

A technique for uncovering double lives so effective that "Hannah Montana: The Movie" would have been a short film. Learn the step-by-step math behind the likelihood of exclusive relationships and the optimized algorithm’s code for scale.


A client provides you the guest lists of thousands of parties in the USA. He assures you that at each party every attendee met every other attendee at least once. The client suspects that one guest, Hannah Montana, is not who she purports to be. In fact, he is convinced that Hannah Montana is another celebrity in disguise. His budget can afford a limited number of private investigations. It’s your job to find the celebrities most likely masquerading as Hannah Montana.

This is a question of cooccurrence. If two people meet at a party, we can be certain that they are different people. Our search is narrowed to people that have never attended the same party as Hannah Montana. To rank this group you count the number of parties attended by each. Your logic says that those who attend more parties without meeting Hannah Montana are more likely to be the double-life celebrity.

You rank them and give your client this final list. Months pass by and you think about the problem you had tackled. You wonder, What is the likelihood any of those people were Hannah Montana? The raw counts felt contextless, and the percent of parties each attended wasn’t that helpful. You get to work.

 

The Likelihood of an Exclusive Relationship

An exclusive relationship can be summarized as “if one is true, the other must be false“. If there is an exclusive relationship between Hannah Montana’s attendance and another partygoer’s, then when Hannah Montana is there the other is not. If we can estimate the likelihood there is an exclusive relationship between her and the other, then we have a proxy for the likelihood that they are the same person.

There are three factors that really influence your suspicion that someone is Hannah Montana:

  1. The number of parties Hannah attended

  2. The number of parties the other attended

  3. The total number of parties

If there were 1000 parties, Hannah attends 3 and the other attends 2, you would expect that they never end up at the same party. But if the attendances were 500 and 400, you would become very suspicious. We therefore need a function that accepts those three variables.

$$P(a_1, a_2, N)$$

We are left with a combinations problem. How many combinations of 3 parties are there? These are all of the party combinations for Hannah, and it’s the traditional combinations formula.

$${N \choose a_1}$$

If Hannah attends 3 parties, then there are 997 parties the other can attend without running into Hannah.

$${N - a_1 \choose a_2}$$

We know that this is a subset of the party combinations the other could be attending. The other person could have attended

$${N \choose a_2}$$

With these we can find the percent of the other partygoer’s combinations that do not include any of Hannah’s 3 attended parties.

$$\frac{N - a_1 \choose a_2}{N \choose a_2}$$

Values closer to 1 mean that there are lots of parties that do not overlap with any of Hannah’s parties. Values closer to 0 mean that there are very few party combinations that do not intersect with Hannah’s parties.

Another interpretation is that values closer to 1 means there is low confidence in an exclusive relationship; it was very probable that the other person attended unique parties. Values close to 0 mean there is high confidence in an exclusive relationship; the coincidence that the other person attended that handful of Hannah-less parties is too great.

$$P(a_1, a_2, N) = 1 - \frac{N - a_1 \choose a_2}{N \choose a_2}$$

This is the likelihood of an exclusive relationship.

 

Generalization

The above proof outlined the formula for two elements having an exclusive relationship. But what if we’d like to check if there is an exclusive relationship between three elements: a relationship where all three cannot exist at the same time as one another? What if we wanted to generalize to an exclusive relationship between an arbitrary number of elements?

For conceptual assistance, your client’s investigators found that Hannah Montana was actually living a double life and her true name is Miley Cyrus. But he isn’t done. The private detectives noted there were significant periods of time where neither Hannah Montana nor Miley Cyrus were accounted for. He believes there is something even more sinister going on: Hannah Montana is living not a double life, but a triple life. You want to apply your new formula, but it’s not quite ready. You need to generalize it to support more than two samples of parties.

We now have a function that accepts an array of integers as one argument and an integer as another. The array of integers are each a suspect’s number of parties attended.

$$P(A, N)$$

$$A = [a_1 a_2 … a_n]$$

After calculating the likelihood of Hannah and Miley not occurring together, what are the odds that the third only goes to parties that the former two do not attend? Just as before, we must deduct the parties attended by the previous people.

$${N - a_1 - a_2 \choose a_3}$$

What proportion of potential party attendances is this for the third person?

$$\frac{N - a_1 - a_2 \choose a_3}{N \choose a_3}$$

If we multiply this by the proportion for Miley, we have a number that states the likelihood that Miley only attends parties that Hannah does not AND the third only attends parties that neither Hannah nor Miley attend.

$$\frac{N - a_1 \choose a_2}{N \choose a_2}\cdot\frac{N - a_1 - a_2 \choose a_3}{N \choose a_3}$$

This logic continues for any arbitrary number of people. As a result, our formula becomes the product function

$$P(A, N) = 1 - \prod_{i=1}^{n}{\frac{N - \sum_{j=1}^{i-1}{a_j}\choose a_i}{N \choose a_i}}$$

Note that the first value of the array will be choosing from the entire population of parties because no parties have been claimed yet, so this will always come out to 1.

 

Optimization

You discover that your client is Oswald Granger, an investigative journalist hell-bent on unraveling the Hannah Montana’s double life. While you found little evidence that Miley was living a third life, Oswald’s single-minded determination was perverted into a variety of paranoia. He adds more guest lists and rants about how Miley is living under the guise of three, four, even five celebrities. He loved your likelihood list and feels it adds credibility to his claims. As lists are added and the number of people in on this ploy grows, your formula’s computation becomes slower and slower, even crashing from overflow errors. To keep up with the requests, you crack open a notepad and try to optimize your formula.

There are two principles to be aware of when optimizing for scientific computing:

  1. Fewer operations are better

  2. Division is the devil

Here is the formula for choosing.

$${n \choose r}=\frac{n!}{r!(n-r)!}$$

Factorials are hugely expensive computations, so minimizing the number and size of them is key.

Numerator

$$\prod_{i=1}^{n}{N - \sum_{j=1}^{i-1}{a_j}\choose a_i}$$

Start by expanding the numerator of the product function.

$$\frac{N!}{a_1!(N-a_1)!}\cdot\frac{(N-a_1)!}{ a_2!(N - a_1 - a_2)!}\cdot\frac{(N-a_1-a_2)!}{a_3!(N-a_1-a_2-a_3)!}\cdot\frac{(N-a_1-a_2-a_3)!}{a_4!(N-a_1-a_2-a_3-a_4)!}$$

Begin by identifying the cancellation pattern: $$(n-r)!$$is always canceled in the next sample's numerator.

$$\frac{N!}{a_1!}\cdot\frac{1}{a_2!}\cdot\frac{1}{a_3!}\cdot\frac{1}{a_4!(N-a_1-a_2-a_3-a_4)!}$$

leaving us with

$$\frac{N!}{a_1!a_2!a_3!a_4!(N-a_1-a_2-a_3-a_4)!}$$

Denominator

$$\prod_{i=1}^{n}{N \choose a_i}$$

Can be expanded to

$$\frac{N!}{a_1!(N-a_1)!}\cdot\frac{N!}{a_2!(N-a_2)!}\cdot\frac{N!}{a_3!(N-a_3)!}\cdot\frac{N!}{a_4!(N-a_4)!}$$

And simplified as

$$\frac{(N!)^4}{ a_1!a_2!a_3!a_4!(N-a_1)!(N-a_2)!(N-a_3)!(N-a_4)!}$$

Together

Multiply the simplified numerator and denominator

$$\frac{N!}{a_1!a_2!a_3!a_4!(N-a_1-a_2-a_3-a_4)!}\div \frac{(N!)^4}{ a_1!a_2!a_3!a_4!(N-a_1)!(N-a_2)!(N-a_3)!(N-a_4)!}$$

$$\frac{N!}{a_1!a_2!a_3!a_4!(N-a_1-a_2-a_3-a_4)!}\cdot \frac{ a_1!a_2!a_3!a_4!(N-a_1)!(N-a_2)!(N-a_3)!(N-a_4)!}{(N!)^4}$$

Eliminate the cancellation pattern and deduct from the exponent

$$\frac{1}{(N-a_1-a_2-a_3-a_4)!}\cdot \frac{(N-a_1)!(N-a_2)!(N-a_3)!(N-a_4)!}{(N!)^3}$$

$$\frac{(N-a_1)!(N-a_2)!(N-a_3)!(N-a_4)!}{(N!)^3(N-a_1-a_2-a_3-a_4)!}$$

Factorial Property

Consider the property

$$(N-1)!=\frac{N!}{N}$$

This property can be generalized to say

$$(N-a)!=\frac{N!}{N\cdot(N-1)\cdot...\cdot(N-a+1)}$$

$$=\frac{N!}{\prod_{i=1}^a{N-i+1}}$$

$$=\frac{N!}{\prod_{i=N-a+1}^N{i}}$$

Together (Cont.)

$$\frac{(N-a_1)!(N-a_2)!(N-a_3)!(N-a_4)!}{(N!)^3(N-a_1-a_2-a_3-a_4)!}$$

Substitute factorial property where possible

$$\frac{\frac{N!}{\prod_{i=N-a_1+1}^N{i}}\cdot\frac{N!}{\prod_{i=N-a_2+1}^N{i}}\cdot\frac{N!}{\prod_{i=N-a_3+1}^N{i}}\cdot\frac{N!}{\prod_{i=N-a_4+1}^N{i}}}{\frac{N!}{\prod_{i=N-S_1-S_2-S_3-S_4+1}^N{i}}(N!)^3}$$

Simplify

$$\frac{\frac{(N!)^4}{\prod_{i=N-a_1+1}^N{i}\cdot\prod_{i=N-a_2+1}^N{i}\cdot\prod_{i=N-a_3+1}^N{i}\cdot\prod_{i=N-a_4+1}^N{i}}}{\frac{(N!)^4}{\prod_{i=N-a_1-a_2-a_3-a_4+1}^N{i}}}$$

$$\frac{\prod_{i=N-a_1-a_2-a_3-a_4+1}^N{i}}{\prod_{i=N-a_1+1}^N{i}\cdot\prod_{i=N-a_2+1}^N{i}\cdot\prod_{i=N-a_3+1}^N{i}\cdot\prod_{i=N-a_4+1}^N{i}}$$

This can be conceptualized as the product of the last values up to a number.

$$H(N, a)=\prod_{x=N-a+1}^N{x}$$

Substitute this function within our optimized function.

$$\frac{H(N, a_1 + a_2 + a_3 + a_4)}{H(N, a_1)\cdot H(N,a_2)\cdot H(N, a_3)\cdot H(N,a_4)}$$

And with that we can generalize.

$$P(A,N) = 1 - \frac{H(N, \sum_{i=1}^{n}{a_i})}{\prod_{i=1}^{n}{H(N, a_i)}}$$

 

We have minimized the number of divisions and, in some cases, reduced the total number of operations. However, there are many redundancies in these calculations.

 

Algorithm Optimization

At this point the most expensive operation is in the H function. Large values of a mean lots of multiplications. Many of these multiplications will be redundant. For example,

$$H(10, 4)=10*9*8*7$$

and

$$H(10, 5)=10*9*8*7*6$$

It can be stated that

$$H(10, 5)=H(10, 4) * 6$$

Sorting the values of A is paramount. After sorting, we can use the results of H for smaller values of A as inputs for calculations of H for larger values of A.

Values of A may not be adjacent integers like 4 and 5. Therefore, remembering the last value of a used to calculate H and the result of the computation is required. With this, we only have to calculate the H up to that a.

$$H(10, 4)=10*9*8*7$$

$$H(10, 6)=10 * 9 * 8 * 7 * 6 * 5$$

$$H(10, 6)=H(10, 4) * 6 * 5$$

$$=H(10, 4) * H(6, 2)$$

 

Code

Additional language implementations (Java, Rust, TypeScript, etc.) can be found on GitHub.

 

Benchmarking

The benchmarking experiment was run with the following configuration

  • Macbook Pro 2016

  • macOS Monterey 12.5.1

  • Python 3.10.4

  • 2.9GHz Dual-Core Intel Core i5

The benchmarking code and documentation can be found on GitHub. The following parameters were used for the data below

  • Trials per population size: 50

  • Iterations per trial: 50

  • Population sizes: 10 thru 2500

The dotted, magenta line is the runtime change from unoptimized to Formula & Algorithm optimized
Population Sampled is the sum of the array divided by the population size

From the plots above we can confidently state that

  1. The optimized formula & algorithm exceeds the unoptimized formula’s performance as the population size increases.

  2. The optimized formula & algorithm exceeds the unoptimized formula’s performance as the percent of population sampled increases.

  3. The optimized formula alone offers little improvement over the unoptimized formula.

  4. Optimization strategies’ performances converge as the number of elements in the array increases.

  5. The result of the algorithm is largely unrelated to the performance of any optimization strategy.

 

Extras

  • I initially ran into the Hannah Montana problem at work. I was tasked with taking a dozens of sales tables each with a dozen to more than one hundred columns. I had to conform each of these tables to a standardized table so their contents could be compared.

    These tables have similar content: sale prices, date of sale, etc. However, the names could vary wildly. Only a few weeks into the job, I didn’t know much about the datasets. The sales terminology used across many of the column names was outside of my knowledge domain. This was made more difficult when Sale Price could also be Price or Sale Amount or simply Amount. Twenty tables may use Sale Price while only two use Amount. Checking or discovering which columns may be related became an arduous task, evaluating how likely it was they were related was even more difficult. It was this obstacle that prompted me to develop the original formula for the likelihood of an exclusive relationship between columns of many tables.

    Weeks later I revisited the formula and derived the generalization and optimizations.

  • The Likelihood of an Exclusive Relationship and its generalization assume that the order of samples does not affect the result.

    $$P(a_1, a_2, N) = 1 - \frac{N - a_1 \choose a_2}{N \choose a_2}$$

    $$P(a_2, a_1, N) = 1 - \frac{N - a_2 \choose a_1}{N \choose a_1}$$

    $$P(a_1, a_2, N) = P(a_2, a_1, N)$$

    $$1 - \frac{N - a_2 \choose a_1}{N \choose a_1} = 1 - \frac{N - a_1 \choose a_2}{N \choose a_2}$$

    $$\frac{N - a_1 \choose a_2}{N \choose a_2} = \frac{N - a_2 \choose a_1}{N \choose a_1}$$

    $${N - a_1 \choose a_2}{N \choose a_1} = {N - a_2 \choose a_1}{N \choose a_2}$$

    Apply the combinations formula

    $$\frac{(N - a_1)!}{a_2!(N - a_1 - a_2)!}\cdot\frac{N!}{a_1!(N - a_1)!} = \frac{(N - a_2)!}{a_1!(N - a_2 - a_1)!}\cdot\frac{N!}{a_2!(N - a_2)!}$$

    Eliminate the factorials shared by numerator/denominator.

    $$\frac{N!}{a_1!a_2!(N - a_1 - a_2)!} = \frac{N!}{a_2!a_1!(N - a_2 - a_1)!}$$

    $$\frac{N!}{N!} = \frac{a_1!a_2!(N - a_1 - a_2)!}{a_2!a_1!(N - a_2 - a_1)!}$$

    $$1=1$$

    The commutative property holds true.

  • $$a \text{, } a_i$$

    Sample

    The parties that Hannah Montana and Miley Cyrus attend can each be described as a sample of the population of parties.

    $$A$$

    An array of sample sizes for each partygoer.

    $$n$$

    The number of sample sizes in the array.

    $$N$$

    Population Size

    The total number of parties one may attend.

 
Next
Next

Cagen