Analysing the COVID-19 “Binary Search” testing algorithm via Monte Carlo method (Part 2)
Link to Part 1. (You won’t understand much of what this post is about if you don’t read Part 1)
Where we are, and where we are going
So now we have a bare bones system that takes the code for an algorithm, it’s input space, and spits out a graph of T(N)
. Great! Now let’s get down to implementing the actual COVID-19 testing algorithm.
Implementing the algorithm
Here’s what I’ve gathered from the tweet above.
- Take
N
people. Mix their blood, and check for COVID. - If positive, we split the group into two of
N/2
each. Recursively test. - If negative, stop testing. None of these people are positive (probably… since COVID testing isn’t really 100% accurate. But we have to make assumptions.), so no more tests are required. This is where we’re supposed to gain performance.
Now that we have the algorithm, we need the input space. Let K%
of the tested population be actually infected with COVID-19. To generate a sample from this space we need a data structure that can do the following
- Store the infected/non-infected status of
N
people, out of whichK%
are infected - Supports
bool pooled_test()
that will tell us if the group as a whole is infected - Supports something like
(Group, Group) half_split()
returns two new groups withN/2
elements each.
To generate a sample, we will generate a group like this and mark K%
of them as infected.
Writing up the data structure
We store the indices of the infected. When queried, given left
and right
bounds, we check via binary search if there are elements in the array that lie between left
and right
. This gives us logarithmic query time, and group splitting is just done down the middle, so splitting takes constant time.
Might as well try to keep the complexity low, given that Monte Carlo is such an expensive place. (Money Printer go brrrrr)
Writing up the algorithm
Pretty simple stuff.
Last visit to Monte Carlo
Doesn’t look like a 10x
improvement to me, more like 2x
. But well, I’m not really a doctor so I will reserve my judgement. And that too at 5%
infection rate. Which is a low estimate of the population that will be tested, because if you have gathered up people for testing, wouldn’t you have guessed that more of them are infected?
Also looks quite linear to me. Question to the reader: Why the squiggly nature of the line? (Hint: floors and ceilings in binary search)
Conclusion
And there ends our brief trip to Monte Carlo. Now we have a probabilistic algorithm performance analyser.
Some possible improvements
- Fit complexity classes, via linear regression with changed bases. For example, if we hypothesise it’s a logarithmic function, fit a linear regression with bases as
log(N)
and1
. - Parallelize. I didn’t see much performance improvement while attempting this, maybe because of memory allocation overheads, but please feel free to message me if you have good ideas!
And that brings us to the end of the series! Feel free to send ideas and PRs to the GitHub repo.