Analysing the COVID-19 “Binary Search” testing algorithm via Monte Carlo method (Part 2)

3 min readApr 11, 2020

Link to Part 1. (You won’t understand much of what this post is about if you don’t read Part 1)

Where we are, and where we are going

So now we have a bare bones system that takes the code for an algorithm, it’s input space, and spits out a graph of T(N). Great! Now let’s get down to implementing the actual COVID-19 testing algorithm.

Implementing the algorithm

Here’s what I’ve gathered from the tweet above.

Take N people. Mix their blood, and check for COVID.
If positive, we split the group into two of N/2 each. Recursively test.
If negative, stop testing. None of these people are positive (probably… since COVID testing isn’t really 100% accurate. But we have to make assumptions.), so no more tests are required. This is where we’re supposed to gain performance.

Now that we have the algorithm, we need the input space. Let K% of the tested population be actually infected with COVID-19. To generate a sample from this space we need a data structure that can do the following

Store the infected/non-infected status of N people, out of which K% are infected
Supports bool pooled_test() that will tell us if the group as a whole is infected
Supports something like(Group, Group) half_split() returns two new groups with N/2 elements each.

To generate a sample, we will generate a group like this and mark K% of them as infected.

Writing up the data structure

We store the indices of the infected. When queried, given left and right bounds, we check via binary search if there are elements in the array that lie between left and right. This gives us logarithmic query time, and group splitting is just done down the middle, so splitting takes constant time.

Might as well try to keep the complexity low, given that Monte Carlo is such an expensive place. (Money Printer go brrrrr)

Writing up the algorithm

Pretty simple stuff.

Link to final notebook

Last visit to Monte Carlo

Doesn’t look like a 10x improvement to me, more like 2x. But well, I’m not really a doctor so I will reserve my judgement. And that too at 5% infection rate. Which is a low estimate of the population that will be tested, because if you have gathered up people for testing, wouldn’t you have guessed that more of them are infected?

Also looks quite linear to me. Question to the reader: Why the squiggly nature of the line? (Hint: floors and ceilings in binary search)

Conclusion

And there ends our brief trip to Monte Carlo. Now we have a probabilistic algorithm performance analyser.

Some possible improvements

Fit complexity classes, via linear regression with changed bases. For example, if we hypothesise it’s a logarithmic function, fit a linear regression with bases as log(N) and 1.
Parallelize. I didn’t see much performance improvement while attempting this, maybe because of memory allocation overheads, but please feel free to message me if you have good ideas!

And that brings us to the end of the series! Feel free to send ideas and PRs to the GitHub repo.