Modern Descartes - Essays by Brian Lee
http://moderndescartes.com/essays/
I seek, therefore I am.
en-us
Tue, 12 Dec 2017 18:38:16 -0000
AlphaGo Zero: an analysis
http://moderndescartes.com/essays/agz
<p>Where to begin?</p>
<p>I think the most mindblowing part of this paper is just how simple the core idea is: if you have an algorithm that can trade computation time for better assessments, and a way to learn from these better assessments, then you have a way to bootstrap yourself to a higher level. There's a lot of deserved buzz around how this is an extremely general RL technique that can be used for anything. </p>
<p>I think there are two things special to Go that make this RL technique viable.</p>
<p>First - Go piggybacks off of the success of convolutional networks. Out of the many kinds of networks, convnets for image processing are definitely the kind with the clearest theoretical justification and the most real-world success. Most games have a more arbitrary, less spatial/geometric sense of gameplay, and would require carefully designed network architectures. Go gets to use a vanilla convnet architecture with almost no modification.</p>
<p>Second - Monte Carlo Tree Search (MCTS) is the logical successor of minimax for turn-based games that have a large branching factor. MCTS has been investigated by computer Go researchers for 10 years now, and the community has had a long time to understand how MCTS behaves in favorable and unfavorable positions, and to discover algorithmic optimizations like virtual losses (more on this later). Another algorithmic breakthrough like MCTS will be needed to handle games that have a continuous time dimension.</p>
<h2>Convnets and TPUs</h2>
<p>What makes convnets so appropriate for Go? Well, a Go board is a reasonably large square grid, and there's nothing particularly special about any point, other than its distance from the edge of the board. That's exactly the kind of input that convnets do well on.</p>
<p>The recent development of residual network architectures has also allowed convnets to scale to ridiculous depths. The original AlphaGo used 12 layers of 3x3 convolutions, which meant that information could only have propagated a <a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a> of 24. To compensate, a set of handcrafted input features helped information propagate through the 19x19 board, by computing ladders and shared liberties of a long chain. But with resnets, the 40 (or 80) convolutions apparently eliminate the need for these handcrafted features. Resnets are clearly an upgrade from regular convnets, and it's unsurprising that the new AlphaGo uses them.</p>
<p>The downside of convnets is the rather large amount of computation they require. Compared to a fully connected network of the same size on a 19x19 board, convnets require roughly 14500 times fewer parameters[1]. The number of add-multiplies doesn't actually change, though, and the efficiency gains instead encourage researchers to just make their convnets bigger. The net result is a crapton of required computation. That, in turn, implies a lot of TPUs.</p>
<p><img src="/static/hummer.jpg" title="The modern convnet is like a Hummer: a complete gas guzzler" style="display: block; margin: 0 auto; width:100%;"/>
Fig. 1: The modern convnet</p>
<p>A lot of press has focused on the use of "4 TPUs" by AGZ, which is the number of TPUs that is used by the competition version of AGZ. This is the wrong thing to focus on. The important number is 2000 TPUs. This is the number of TPUs used during the self-play bootstrapping phase[2]. If you want to trade computation time for improved play, you'll need a lot of sustained computation time! There is almost certainly room for more efficient use of computation.</p>
<p>[1] Each pixel in the 19x19x256 output of one conv layer only looks at a 3x3x256 slice of the previous layer, and additionally, each of the 19x19 points shares the same weights. Thus, the equivalent fully connected network would have used 361^2 / 9 = 14480x more weights.</p>
<p>[2] This number isn't given in the paper, but it can be extrapolated from their numbers (5 million self-play games, 250 moves per game, 0.4s thinking time per move = 500 million compute seconds = 6000 compute days, which was done in 3 real days). Therefore, something like 2000 TPUs in parallel. Aja Huang has confirmed this number.</p>
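<p>The arithmetic in footnotes [1] and [2] can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it uses the figures quoted above, ignores bias terms, and the variable names are my own.</p>

```python
# Back-of-the-envelope checks for footnotes [1] and [2].
BOARD_POINTS = 19 * 19   # 361 points on the board
KERNEL_POINTS = 3 * 3    # each conv output looks at a 3x3 neighborhood
CHANNELS = 256           # channel count quoted in footnote [1]

# Weights in one layer mapping a 19x19x256 volume to a 19x19x256 volume.
# Bias terms are ignored (an assumption made for simplicity).
fully_connected_weights = (BOARD_POINTS * CHANNELS) ** 2
conv_weights = KERNEL_POINTS * CHANNELS * CHANNELS
weight_ratio = fully_connected_weights / conv_weights  # equals 361^2 / 9

# Self-play compute, using the figures quoted in footnote [2].
games = 5_000_000
moves_per_game = 250
seconds_per_move = 0.4
compute_seconds = games * moves_per_game * seconds_per_move
compute_days = compute_seconds / 86_400
tpus_in_parallel = compute_days / 3  # the run took 3 real days

print(f"weight ratio: {weight_ratio:.0f}x")
print(f"compute days: {compute_days:.0f}")
print(f"TPUs in parallel: {tpus_in_parallel:.0f}")
```

<p>The unrounded results are roughly 5,800 compute days and 1,900 TPUs, which round to the 6000 and 2000 quoted in the footnote.</p>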
<h2>MCTS with Virtual Losses</h2>
<p>In addition to having a lot of TPUs, it's important to optimize things so that the TPU is at 100% utilization. The traditional MCTS update algorithm, as described by the AGZ paper itself, goes like this:</p>
<ul>
<li>Pick a new variation in the game tree to explore, balancing between reading more deeply into favorable variations, and exploring new variations.</li>
<li>Ask the neural network what it thinks of that position. (It will return a value estimation, and the most likely followup moves).</li>
<li>Record the followup moves and value estimate. Increment the visit counts for all positions leading to that variation.</li>
</ul>
<p>Unfortunately, the algorithm as described here is an inherently sequential process. Because MCTS is deterministic, rerunning step 1 will always return the same variation, until the updated value estimates and visit counts have been incorporated. And yet, the supporting information in the AGZ paper describes batching up positions in multiples of 8, for optimal TPU throughput. </p>
<p>The simplest way to achieve this sort of parallelism is to just play 8 games in parallel. However, there are a few things that make this approach less simple than it initially seems. First is the ragged edges problem: what happens when your games don't end after the same number of moves? Second is the latency problem: for the purposes of delivering games to the training process, you would rather have 1 game completed every minute, than 8 games completed every 8 minutes. Third is that this method of parallelism severely hamstrings your competition strength, where you want to focus all of your computation time on one game.</p>
<p>So we'd like to figure out how to get a TPU (or four) to operate concurrently on one game tree. That method is virtual losses, and it works as follows:</p>
<ul>
<li>Pick a new variation in the game tree to explore, balancing between reading more deeply into favorable variations, and exploring new variations. Increment the visit count for all positions leading to that variation.</li>
<li>Ask the neural network what it thinks of that position. (It will return a value estimation, and the most likely followup moves).</li>
<li>Record the followup moves and value estimate. Adding the real value estimate to the already-incremented visit counts brings the statistics back in line with the standard MCTS algorithm.</li>
</ul>
<p>The method is called virtual losses, because by incrementing visit counts without adding the value estimate, you are adding a value estimate of "0" - or in other words, a (virtual) loss. The net effect is that you can now rerun the first step because it will give you a different variation each time. Therefore, you can run this step repeatedly to get a batch of 8 positions to send to the TPU, and even while the TPU is working, you can continue to run the first step to prepare the next batch.</p>
<p>It seems like a small detail, but virtual losses allow MCTS to scale horizontally by a factor of about 50x. 32x comes from 4 TPUs and a batch size of 8 all working on one game tree, and another 2x for allowing both CPU and TPU to work concurrently.</p>
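<p>Here is a minimal sketch of the virtual loss idea described in this section. It is not AGZ's implementation: the <code>Node</code> class, the scoring formula, the priors, and the constant 0.5 standing in for the network's value output are all assumptions made for illustration, and the sketch ignores the sign flip between the two players' perspectives.</p>

```python
import math


class Node:
    """One position in the search tree. Values are win probabilities in [0, 1]."""

    def __init__(self, prior, parent=None):
        self.prior = prior        # policy prior for the move leading here
        self.parent = parent
        self.children = {}        # maps a move to its child Node
        self.visits = 0           # N: visit count
        self.total_value = 0.0    # W: sum of backed-up value estimates

    def score(self):
        # Mean value so far plus a prior-weighted exploration bonus.
        q = self.total_value / self.visits if self.visits else 0.0
        u = self.prior * math.sqrt(self.parent.visits + 1) / (1 + self.visits)
        return q + u


def select_leaf(root, use_virtual_loss):
    """Step 1: walk down the tree, always following the best-scoring child."""
    node, path = root, [root]
    while node.children:
        node = max(node.children.values(), key=Node.score)
        path.append(node)
    if use_virtual_loss:
        # Count the visit now but add no value: equivalent to recording a loss.
        for visited in path:
            visited.visits += 1
    return node


def backup(leaf, value, used_virtual_loss):
    """Step 3: add the real value estimate along the path back to the root."""
    node = leaf
    while node is not None:
        if not used_virtual_loss:
            node.visits += 1      # visit was not pre-counted during selection
        node.total_value += value
        node = node.parent


def make_root():
    """Build a toy root with three candidate moves (hypothetical priors)."""
    root = Node(prior=1.0)
    for move, prior in {"A": 0.40, "B": 0.35, "C": 0.25}.items():
        root.children[move] = Node(prior, parent=root)
    return root


def collect_batch(root, batch_size, use_virtual_loss):
    """Rerun step 1 several times before any evaluation results come back."""
    return [select_leaf(root, use_virtual_loss) for _ in range(batch_size)]


def move_of(leaf):
    """Look up which move leads from the parent to this leaf."""
    return next(m for m, child in leaf.parent.children.items() if child is leaf)


plain_root = make_root()
plain_batch = [move_of(n) for n in collect_batch(plain_root, 3, use_virtual_loss=False)]

vl_root = make_root()
vl_leaves = collect_batch(vl_root, 3, use_virtual_loss=True)
vl_batch = [move_of(n) for n in vl_leaves]

# Stand-in for the batched network call: every position is scored 0.5.
for leaf in vl_leaves:
    backup(leaf, value=0.5, used_virtual_loss=True)

print(plain_batch)  # the same move three times
print(vl_batch)     # three different moves
```

<p>Without virtual losses, selection is deterministic and keeps returning the same leaf, so the batch is useless. With them, each selection makes the chosen path look slightly worse, and the next selection moves on to a different variation.</p>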
<h2>MCTS in the Random Regime</h2>
<p>The second mindblowing part of the AGZ paper was that this bootstrapping actually worked, even starting from random noise. The MCTS would just be averaging a bunch of random value estimates at the start, so how would it make any progress at all?</p>
<p>Here's how I think it would work:</p>
<ul>
<li>A bunch of completely random games are played. The value net learns that just counting black stones and white stones correlates with who wins. MCTS is also probably good enough to figure out that if black passes early, then white will also pass to end the game and win by komi, at least according to Tromp-Taylor rules. The training data thus lacks any passes, and the policy network learns not to pass.</li>
<li>MCTS starts to figure out that capture is a good move, because the value net has learned to count stones -> the policy network starts to capture stones. Simultaneously, the value net starts to learn that spots surrounded by stones of one color can be considered to belong to that color.</li>
<li>???</li>
<li>Beat Lee Sedol.</li>
</ul>
<p>Joking aside, the insight here is that there are only two ways to impart true signal into this system: during MCTS (if two passes are executed), and during training of the value net. The value net will therefore be the first to learn; MCTS will then play moves that guide the game towards high value states (taking the opponent's moves into consideration); and only then will the policy network start outputting those moves. The resulting games are slightly more complex, allowing the value network to learn more sophisticated ideas. In a way, the whole thing reminds me of Fibonacci numbers: your game strength is a combination of the value network from last generation and the policy network from two generations past.</p>
<h2>Conclusion</h2>
<p>AlphaGo Zero's reinforcement learning algorithm is an accomplishment that I think should have happened a decade from now, but was made possible today because of Go's amenability to convnets and MCTS.</p>
<h2>Miscellaneous thoughts</h2>
<ul>
<li>
<p>The first AlphaGo paper trained the policy network first, then froze those weights while training the weights of a value branch. The current AlphaGo concurrently computes both policy and value halves, and trains with a combined loss function. This is really elegant in several ways: the two objectives regularize each other; it halves the computation time required; and it integrates perfectly with MCTS, which requires evaluating both policy and value parts for each variation investigated.</p>
</li>
<li>
<p>The reason 400 games were played to evaluate whether a candidate was better than its predecessor is that, with 400 games, a 55% win rate threshold sits exactly 2 standard deviations above chance. In particular, the variance of a binomial process is \(Np(1-p)\), so if the null hypothesis is \(p = 0.5\) (the candidates do not differ), then the expected variance is 400 * 0.5 * 0.5 = 100. The standard deviation is thus +/- 10, and 200 wins +/- 10 corresponds to a 50% +/- 2.5% win rate. A 55% win rate is therefore 2 standard deviations above the mean.</p>
</li>
<li>
<p>There were a lot of interconnected hyperparameters, making this far from a drop-in algorithm:</p>
<ul>
<li>the number of moves randomly sampled at the start to generate enough variation, compared to the size of the board, the typical length of game, and the typical branching factor/entropy introduced by each move.</li>
<li>the size of the board, the dispersion factor of the Dirichlet noise, the \(L_2\) regularization strength that causes the policy net's outputs to be more dispersed, the number of MCTS searches that were used to overcome the magnitude of the priors, and the \(P_{UCT}\) constant (not given in paper) used to weight the policy priors against the MCTS searches.</li>
<li>the learning rate, batch size, and training speed as compared to the rate at which new games were generated.</li>
<li>the depth of old games that were revisited during training, compared to the rate at which new versions of AGZ were promoted.</li>
</ul>
</li>
<li>
<p>On the subject of sampling old games - I wonder if it's analogous to the way that strong Go players have an intuitive sense of whether a group is alive. They're almost always right, but maybe they need to read it out every so often to keep their intuition sharp.</p>
</li>
</ul>
http://moderndescartes.com/essays/agz
Trip Report: World AI Go Open
http://moderndescartes.com/essays/waigo
<p>From Aug 14-Aug 18, I was in Ordos City, China, to compete in the first World AI Go Open, with <a href="https://github.com/brilee/MuGo">MuGo</a>, a toy Go AI that I'd built for fun. Despite coming in 11th place out of 12 contestants, I had a blast there and I'm happy to have met so many other Go AI developers. Many thanks to the WAIGO organizers for organizing this event and sponsoring many teams, including myself, to come to China.</p>
<p>I couldn't figure out a way to weave all these thoughts into a narrative, so here's a brain dump instead.</p>
<h2>Cool Tidbits</h2>
<ul>
<li>I got to meet all of the other Go AI developers. There was an interesting mix of company-sponsored teams, research groups, and solo developers. I was most struck by how long some of these people had been at Go AI - Lim Jaebum and Tristan Cazenave had apparently met at a Go AI competition 20 (!!!) years ago. A lot of people were poking to see if I'd come to the next major Go AI meetup in Tokyo. I'll have to see how much vacation time I have to spare :)</li>
<li>Mok Jinseok 9p interviewed me for a Korean documentary on Go AIs. He is apparently fluent in Korean, English, Chinese, and Japanese. It made me wonder: is this what happens when you take naturally intelligent humans and completely deprive them of a normal education? (Context: in Asia, aspiring professionals completely forgo normal schooling to dedicate 12 hours a day to studying Go, starting from age ~10).</li>
<li>I found out that Yuu Mizuhara, programmer of Panda Sensei (world's best tsumego solver AI) was sitting next to me at dinnertime. I then witnessed a hilarious exchange where Lim Jaebum said that his friend had invented a <a href="https://sahwal.com">tsumego generator</a>. We opened up Panda Sensei on one phone and the tsumego generator on another phone, and they faced off against each other. Panda Sensei won handily :D</li>
<li>OracleWQ, the 12th place bot, used alpha-beta minimax search on top of a value network only. The value network seemed to have picked up some weird quirks, like favoring positions with stones on the second line. I think this was the effect of exhaustively searching all moves, whether sensible or not, and then picking whichever move happened to get the value network to output a slightly noisier/higher evaluation. Without a strong prior imposed by a policy network, the AI played very strange moves.</li>
<li>Ordos City is north of the Great Wall, and therefore Mongolian in culture. Their cuisine is pretty heavily milk-based, and almost all of the dishes I ate had some milk / yogurt mixed in. I saw some unique instruments and throat singing performances at the closing ceremony. I also stopped by <a href="https://en.wikipedia.org/wiki/Mausoleum_of_Genghis_Khan">the Mausoleum of Genghis Khan</a>.</li>
</ul>
<h2>Technical Tidbits</h2>
<ul>
<li>Lim Jaebum, author of DolBaram, works for some Korean Go servers. He writes algorithms that figure out at the end of the game which groups are alive/dead/in seki. He noted that it was actually quite easy to judge group status for high-level games, but it was a different matter entirely to judge beginners' games! He noted that a useful heuristic is that if you run many Monte Carlo simulations on a board, and a group lives/dies 50% of the time, then it's probably in seki. That info can then be fed as a feature plane to a neural network. A more complicated heuristic to avoid upsetting seki is to never fill in the second-to-last liberty of a group, unless it matches a known dead shape. (There are only 20 or so of them.)</li>
<li>It seemed that Hideki Kato (coauthor of DeepZenGo), among others, shared the opinion that AlphaGo may have weaknesses that are not on display in life and death, capturing races, and seki, because AlphaGo may have learned to avoid such situations. In some ways, this is appropriate - we don't think that getting better at drunk driving is a reasonable solution to a high rate of car accidents; instead, we learn to avoid drunk driving. But Hideki expressed hope that we could solve the problem properly.</li>
<li>Hideki also pointed out that the current generation of Go AIs (excluding the latest versions of AlphaGo) had a very human playstyle, partly because they had been bootstrapped from human games. Thus, the maximum strength of human game-bootstrapped AIs is probably top human professional level, plus a little bit.</li>
<li>I asked Hideki why we didn't use only policy + value nets, without random rollouts. He responded that currently, value nets were unable to evaluate life and death (L+D) and that rollouts were necessary for that purpose. Tristan Cazenave had in fact brought a Go AI that used only policy/value nets. It did respectably well, but was eliminated in the quarterfinals.</li>
</ul>
<h2>Funny Tidbits</h2>
<ul>
<li>I was picked up from the Ordos City airport by Zhao Baolong 2p. I asked him who the strongest Go pro he'd ever defeated was, and he replied that he'd once defeated Ke Jie when Ke was 12 years old. Zhao went pro the same year Ke did, so I guess it's tough to live in that shadow.</li>
<li>Lim Jaebum spoke some English, but preferred to speak in Korean. I ended up translating his Korean to English, and then my partner translated my English into Chinese. This game of telephone happened back and forth whenever Jaebum wanted to ask a question.</li>
<li>Our prize money was delivered in cash (and 20% of the prize money was deducted for taxes, so it's not like they were trying to do things under the table). I ended up flying home with an inch-thick wad of 100CNY bills. I can only imagine that Hideki, DeepZenGo's only representative, flew home with several bricks of money.</li>
<li>I was asked to sign some fans and books and t-shirts, for the first time in my life. Woo!</li>
<li>On our flight out of Ordos, our airplane taxied down a runway, then made a u-turn and took off from the same runway. I guess you can do that when your airport sees 1 plane / hour...</li>
</ul>
<h2>FAQs</h2>
<ul>
<li>So you won a game? <a href="https://github.com/yenw/computer-go-dataset/tree/master/AI/The%201st%20World%20AI%20Go%20Open%202017">Yes, the records are here</a>. I recommend not looking at the game I won, actually. It's sort of painful to look at. Instead, check out MuGo vs. Abacus.</li>
<li>When did I start working on MuGo? How long have I been working on it? I started the project last June and worked on it until September or so. Since then, I've worked on it on and off.</li>
<li>What's the deal with Ordos? Is it actually a ghost town? Yes. On the ride from the airport, it was a good 5 minutes before we saw our first car on the road. The traffic lights at the intersections had literally been turned off, and all 10 lanes of highway sat virtually empty. The apartment complex we stayed at the first night easily had capacity for 50,000 people, but judging by the lights in the windows at night, only 5% of the apartments were occupied. The restaurant we went to that night was devoid of customers, and our plates/teacups came shrink-wrapped, as if we were the first customers they had ever served. [correction: apparently this is a thing in China and has nothing to do with Ordos. Yes, they re-shrink-wrap the plates after washing.] The rest of the week went fine, though. Overall, Ordos was quite beautiful, and felt to me like an empty amusement park that was regularly cleaned and maintained and staffed despite the lack of parkgoers.</li>
</ul>
http://moderndescartes.com/essays/waigo
The Goose That Stopped Laying Golden Eggs
http://moderndescartes.com/essays/golden_goose
<p>There are many companies that have had great brands and have subsequently destroyed their reputation by pushing shoddy products. <a href="https://www.reddit.com/r/AskReddit/comments/3o7p5z/which_company_or_brand_no_longer_has_the_quality/">This reddit thread</a> suggests brands like Craftsman, Pyrex, CNN, Breyers, RadioShack, among others. Was it just incompetence that led these companies to kill the goose that laid their golden eggs, or was there some other force at work?</p>
<p>In some cases, incompetence was certainly at work. But it takes a strange kind of incompetence to be able to build a strong reputation for quality over multiple decades, only to decide one day that you didn't like making nice things any more. What if we instead assume that these companies were competent? What could explain their actions?</p>
<h2>Dealing with an aging goose</h2>
<p>Consider the following: If you had a golden goose that was getting too old to lay eggs, what would you do? If you did nothing, you'd get a few golden eggs, and then you'd have an ordinary goose. You might wonder if there was something more you could do with your goose. You might even kill your goose in hopes of some gold, since you don't have much to lose.</p>
<p>The aging goose in this story is the brand name, and the golden eggs are the sales revenue. Each of these formerly reputable brands has had its target market shrink in a foreseeable way. Who needs quality tools (Craftsman) to fix things when cheap disposable stuff is so common? Who needs quality bakeware (Pyrex) when the women who would have been using them are now working jobs and using that income to eat out? Who needs a quality TV channel (CNN) now when nobody watches TV anymore? You cannot monetize a quality brand when there aren't enough people interested in quality products.</p>
<p>To make the metaphor more explicit, here's how you would slaughter your aging goose. Build subpar products and sell them at full price. In the decade it takes for people to catch on to your deception, you'll have pocketed a good sum of money and simultaneously have lost all of your brand reputation. This seems like a decent explanation for why a company might one day decide it didn't want to make nice things anymore.</p>
<h2>Goose demographics</h2>
<p>An interesting question is whether golden geese are living longer or shorter nowadays, and whether their numbers have changed. In the age of internet commerce and ubiquitous 5 star rating systems, it's unclear whether brands are meaningless or even more important than before.</p>
<p>There are a few different effects that are at play here.</p>
<ul>
<li>The increased efficiency of rating systems makes it possible for consumers to get unexpectedly decent products at a good price, thinning the top end of the market. This decreases the number of viable brands.</li>
<li>Brand liquidation strategies don't work as well as they used to, because bad press gets around so much more quickly. Brands are more likely to fade into obscurity than leave a trail of betrayed devotees.</li>
<li>To the extent that an online marketplace's rating algorithms are trustworthy, the marketplace itself becomes a meta-brand.</li>
<li>To the extent that online rating algorithms are untrustworthy, brand reputation reemerges as the only way people can trust what they're buying. (An interesting case study here is AmazonBasics. AmazonBasics targets products with razor-thin margins, so it seems unlikely that they intend to make much profit from them. It's more likely that it exists to provide customers a reliable product. So AmazonBasics is somewhat of a hedge against Amazon's rating algorithms not doing their job.)</li>
</ul>
<p>In general, I think that online marketplaces will eventually solve the ratings trustworthiness issues, leading to brand reputation being meaningless for indicating quality products. The result seems to be (from my limited vantage point) that brands are no longer about reliability and value, and more about lifestyle. It may also be that there are more brands than ever before, but they are targeting niche markets, thanks to the power of the internet.</p>
<h2>Other things to do with your goose</h2>
<p>If you don't like killing geese, there are a few other things you can do with your aging goose.</p>
<p>One strategy is to pivot into the Golden Goose Circus business, and charge spectators to look at the goose that could lay golden eggs. (For example: Michael Jordan retired and made more money from endorsements than he earned during his career.)</p>
<p>If you're a believer in cryonics, you could freeze your goose in liquid nitrogen and hope to revive him in the future (i.e. gracefully retire your brand name, and try to bring it back as "retro style" in a decade or two).</p>
<p>Anyway, the word "goose" has reached semantic saturation, so I'll stop with the goose analogies now.</p>
http://moderndescartes.com/essays/golden_goose
The asymptotic value of 2n choose n
http://moderndescartes.com/essays/2n_choose_n
<p>What is the asymptotic value of \({2n \choose n}\) as \(n\) goes to infinity?</p>
<p>There's a stupid solution that involves plugging in <a href="https://en.wikipedia.org/wiki/Stirling%27s_approximation">Stirling's approximation</a> three times and cleaning up the ensuing mess, but I thought of a better solution.</p>
<p>Let's start by figuring out the ratio of two adjacent terms. Taking \(n = 4, 5\) as an example, we have:</p>
<p>\({8 \choose 4} = \frac{8!}{4!4!}\)</p>
<p>\({10 \choose 5} = \frac{10!}{5!5!} = {8 \choose 4} \cdot \frac{10 \cdot 9}{5 \cdot 5}\)</p>
<p>So it seems that as we increment \(n\), we're multiplying by \(\frac{2n(2n-1)}{n^2} \approx 4\). Therefore, the dominant term in the asymptotic growth rate is \(4^n\).</p>
<p>Can we do better? Yes. Let's take a closer look at the approximation we made in the last step.</p>
<p>The approximation we took was to multiply by \(\frac{2n}{2n-1}\) and pretend it didn't happen. And that's for a single step, for \(n \rightarrow n+1\). When you aggregate all of these small errors from 1 to \(n\), you get an extended product:</p>
<p>\(P = \frac{2}{1} \cdot \frac{4}{3} \cdot \frac{6}{5} \cdot \frac{8}{7}\cdot\cdot\cdot \frac{2n}{2n-1}\)</p>
<p>So we're overestimating by a factor of \(P\). How can we estimate the value of this product? Well, it would be nice if we could cancel out the numerator and denominator of adjacent terms... What if we take the complementary product to fill in the gaps?</p>
<p>\(P' = \frac{3}{2} \cdot \frac{5}{4} \cdot \frac{7}{6} \cdot \frac{9}{8}\cdot\cdot\cdot \frac{2n-1}{2n-2}\)</p>
<p>\(P \cdot P' = \frac{2}{1} \cdot \frac{3}{2} \cdot \frac{4}{3} \cdot \frac{5}{4}\cdot\cdot\cdot \frac{2n-1}{2n-2} \cdot \frac{2n}{2n-1} = 2n \)</p>
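<p>Both the telescoping identity and the claim that \(P\) is exactly the factor by which \(4^n\) overestimates \({2n \choose n}\) can be checked with exact rational arithmetic. This is just a sanity check; the function names are my own.</p>

```python
from fractions import Fraction
from math import comb


def product_p(n):
    """P = (2/1)(4/3)(6/5)...(2n/(2n-1)), as an exact fraction."""
    result = Fraction(1)
    for k in range(1, n + 1):
        result *= Fraction(2 * k, 2 * k - 1)
    return result


def product_p_prime(n):
    """P' = (3/2)(5/4)(7/6)...((2n-1)/(2n-2)), as an exact fraction."""
    result = Fraction(1)
    for k in range(2, n + 1):
        result *= Fraction(2 * k - 1, 2 * k - 2)
    return result


for n in (1, 2, 5, 10, 50):
    # The zipper-like cancellation: P * P' collapses to 2n.
    assert product_p(n) * product_p_prime(n) == 2 * n
    # P is exactly the overestimate: 4^n / C(2n, n) = P.
    assert Fraction(4 ** n, comb(2 * n, n)) == product_p(n)

print("identities hold")
```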
<p>By multiplying these two products together, everything cancels out perfectly, in a zipper-like fashion. Our next approximation is to say that, since these two products are complementary, they each contribute a half of the final product. Each component product is therefore worth \(P \approx P' \approx \sqrt{2n}\), and our improved asymptotic value is \(\frac{4^n}{\sqrt{2n}}\).</p>
<p>It's definitely not true that the two halves are equal in value, though. As it turns out, there's an infinite product that describes the divergence between these two halves: the <a href="https://en.wikipedia.org/wiki/Wallis_product">Wallis product</a>. There's a nifty proof of this product by Euler - see the Wikipedia article for details.</p>
<p>\(W = \frac{P}{P'} = \frac{2}{1} \cdot \frac{2}{3} \cdot \frac{4}{3} \cdot \frac{4}{5} \cdot \frac{6}{5} \cdot \frac{6}{7} \cdot\cdot\cdot = \frac{\pi}{2}\)</p>
<p>Using the Wallis product, we can upgrade our approximation to an asymptotic equality: \(P \cdot P' \cdot W = P^2 = 2n \cdot \frac{\pi}{2} = \pi n\)</p>
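<p>A quick numerical comparison shows what the Wallis correction buys us. The sketch below divides the exact central binomial coefficient by the estimate \(\frac{4^n}{\sqrt{cn}}\), once with \(c = 2\) (assuming \(P = P'\)) and once with \(c = \pi\) (using \(P^2 = \pi n\)). The helper name is my own, and \(n\) is kept at or below 500 so that \(4^n\) still fits in a float.</p>

```python
import math


def ratio_to_estimate(n, constant):
    """Exact C(2n, n) divided by the estimate 4^n / sqrt(constant * n)."""
    exact = math.comb(2 * n, n)
    estimate = 4 ** n / math.sqrt(constant * n)
    return exact / estimate


for n in (10, 100, 500):
    crude = ratio_to_estimate(n, 2)          # assumes P = P' = sqrt(2n)
    refined = ratio_to_estimate(n, math.pi)  # uses P^2 = pi * n
    print(n, round(crude, 4), round(refined, 4))
```

<p>The refined ratio tends to 1 as \(n\) grows, while the crude ratio settles near \(\sqrt{2/\pi} \approx 0.80\), a constant-factor error that never goes away.</p>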
<p>The actual asymptotic value is therefore \(\frac{4^n}{\sqrt{\pi n}}\). This value can be confirmed by the brute-force Stirling's approximation solution.</p>
http://moderndescartes.com/essays/2n_choose_n
My Path to Machine Learning
http://moderndescartes.com/essays/my_ml_path
<p>3 years ago, I dropped out of graduate school and started working at HubSpot as a backend software engineer. After 2 years at HubSpot, I left to focus on machine learning, and now I'm employed as an engineer working on ML at Verily. This is the story of how I made my transition.</p>
<h2>The AlphaGo Matches</h2>
<p>I remember DeepMind's announcement that it had defeated a professional in the same way most people remember 9/11. My coworker ran into me in the hallway and said, "Hey Brian, did you hear that they beat a Go professional on a 19x19?". The last I'd heard, the best Go AI needed a four stone handicap - a significant difference in strength from top players - so my immediate response was, "How many stones handicap?". He said none. I didn't believe him, but I rushed to my desk nonetheless to check it out. My jaw dropped as I read about AlphaGo's 5-0 victory against Fan Hui. Somehow, Go AI had made a huge leap in strength with the addition of neural networks.</p>
<p>The next week saw intense debate among my Go playing friends about whether AlphaGo would be able to beat Lee Sedol in two months. I was of the rare opinion that it was over - humanity would lose. Of course, that didn't stop me from watching all five games live. I remember after Lee's third straight loss, the sinking feeling I had in my stomach that AlphaGo was just strictly better than humans - its neural networks had captured the essence of human intuition while backing it up with the extensive search that machines are so good at.</p>
<p>Go had gained its reputation as the ultimate anti-computer game because of its resistance to rules. For every Go proverb that suggests one rule, there's another proverb that suggests the opposite rule. The one constant in Go is balance, as exemplified by the wry proverb, "If either player has all four corners, then Black [traditionally the weaker player] should resign". Yet AlphaGo had managed to master Go on its own.</p>
<p>This idea that human intuition could be replicated by an algorithm is a scary one - how many jobs resist automation, simply because we can't specify a set of rules in software? What if we could just train a neural network on a million examples, and have it figure out a set of rules on its own? That was the world that AlphaGo's success was suggesting. <a href="http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html">Ilya Sutskever suggests</a> that if, somewhere in the world, there exists a savant who could do some task in a split second, then there probably exists a neural network that could do the same. I wanted to understand how machines could develop an intuition that mimicked a human's intuition.</p>
<h2>Learning the basics</h2>
<p>After leaving HubSpot in May 2016, I decided that my next project would be to fully understand the theory behind AlphaGo. I started with Michael Nielsen's excellent online textbook <a href="http://neuralnetworksanddeeplearning.com/">Neural Networks and Deep Learning</a>. Even though the material covered is pretty old, relative to the state of the art, the theory is explained well enough that I had no trouble understanding how and why the latest papers improved on the basic networks shown in the textbook.</p>
<p>I was lucky enough to have seen many of these ideas before in other contexts. Backpropagation, and TensorFlow's automated differentiation on directed acyclic graphs, are both hinted at by <a href="https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-16.html#%_sec_2.3.2">SICP's discussion of symbolic differentiation</a>. Convolution operations were basic ideas in the field of image processing, which I had gotten familiar with during my grad school research.</p>
<p>I found <a href="http://colah.github.io/">Chris Olah's blog</a> and <a href="http://karpathy.github.io/">Andrej Karpathy's blog</a> to be very useful resources in understanding the capabilities and workings of neural networks. More recently, Google Brain has started publishing a <a href="http://distill.pub/">series of visualizations on deep learning</a>. The <a href="https://www.tensorflow.org/tutorials/deep_cnn">TensorFlow tutorials</a> are also very instructive. It's also useful to read the primary literature on arXiv. (There's no particular reading list here; I dug through citations whenever I was curious about something.)</p>
<p>I spent a lot of time reading older papers on convnets and on previous attempts to write Go AIs - MCTS, as well as low-level considerations, like efficient representations of Go boards and quick computations of board updates and hashes. I also looked through the source code of <a href="https://github.com/pasky/pachi">Pachi</a> and its baby version <a href="https://github.com/pasky/michi">Michi</a>. Armed with this knowledge, I started implementing my own <a href="https://github.com/brilee/MuGo">mini-AlphaGo program, MuGo</a>.</p>
<p>The first time I got my policy network to work, my mind was blown. The moves were not particularly good, but the moves followed the basic patterns of local shape. As I implemented more input features, I could see deficiencies in play being fixed one by one. Along the way, I solved lots of miscellaneous engineering problems, like how to handle a training set that was much larger than my RAM, or how to store the processed training set on my SSD.</p>
<p>These months culminated with the US Go Congress in August, where I met Aja Huang (AlphaGo's lead programmer) and several other computer Go people. That was quite an inspirational week. </p>
<h2>The Recurse Center</h2>
<p>Based in NYC, the <a href="https://www.recurse.com">Recurse Center</a> is a self-directed educational retreat for programmers. I would describe RC as a place for people to learn all the things or work on all the projects they never had the time or courage to try. I attended for 3 months, starting in mid-August immediately after the Go Congress, and ending in mid-November of 2016.</p>
<p>I came in wanting to focus on ML and to work on projects with other people who were similarly inclined. Unsurprisingly, ML was a trendy topic there, alongside Rust and Haskell. I ended up figuring out many of my neural network issues with the help of other Recursers (numerical stability issues; <a href="/essays/initializing_weights_DCNN">weight initialization issues</a>; <a href="/essays/convnet_edge_detection">seemingly useless input features</a>), and I in turn helped advise others on designing and debugging their models' performance.</p>
<p>Some other projects I did with other Recursers included</p>
<ul>
<li>implementing a RNN</li>
<li>reading + discussing the Gatys style transfer paper</li>
<li>helping optimize a Chess engine (written in Go, funnily enough)</li>
<li>getting familiar with various profiling tools and <a href="/essays/flamegraphs">performance visualization techniques</a></li>
<li>digging into <a href="/essays/bitpacking_compression">why my dataset for MuGo compressed poorly</a></li>
<li>rewriting MuGo's board representation</li>
<li>giving a few talks (on AlphaGo; on <a href="/essays/why_rest">REST</a>)</li>
<li>implementing coroutines in C</li>
<li>implementing HyperLogLog in Go</li>
</ul>
<p>RC's main benefit was in improving my general programming level. Recursers tend to come from a wide variety of backgrounds, and have expertise in the most unusual of systems. For example, I had never done any GPU programming before, having relied on TensorFlow's nice wrappers. But I had little difficulty finding a few other people with graphics expertise who were happy to work on something basic with me. Another group of Recursers was working on independently implementing the <a href="https://raft.github.io/">Raft consensus protocol</a> in various languages, with the goal of getting their implementations talking to each other. I would have loved to join in if I had more time.</p>
<p>I'd gladly go to RC again the next time I get an opportunity. If the stuff I did sounds fun and you have 2-3 months of free time, you should <a href="https://www.recurse.com/apply">apply to RC too</a>.</p>
<h2>Finding a Job</h2>
<p>As RC came to an end, I thought about what sort of ML job would be best for me. I had learned quite a bit about the different aspects of ML via my MuGo project - obtaining and processing data sources, feature engineering and model design, and scaling up training of the model. I decided that I wanted a project of similar breadth that was more difficult in some of these aspects. That could mean more machines for training, handling much larger quantities of data, or using new models. I also wanted to be in an environment where I could learn from more experienced practitioners of ML.</p>
<p>This limited my options to the big companies that were doing ML on a larger scale: Google, OpenAI, Facebook, Uber ATC, Microsoft Research, Baidu, and of course DeepMind, the company behind AlphaGo. There were a handful of other companies doing "big data" systems that also seemed interesting: Heap and Kensho, among others. I sent off my resume and pinged my friends at these companies.</p>
<p>In the meantime, I started brushing up on the basics of ML that I'd skipped because of my focus on neural networks and AlphaGo. I spent the most time on <a href="https://www.amazon.com/dp/3319110799">Axler's Linear Algebra Done Right</a>, but I also watched all of <a href="https://www.coursera.org/learn/machine-learning/">Andrew Ng's Coursera class</a>, and skimmed through the <a href="http://projects.iq.harvard.edu/stat110/handouts">final exams for Harvard's introductory statistics class</a>. I continued to read papers that popped up via HN or through other recommendation sources like <a href="https://jack-clark.net/import-ai/">Jack Clark's Import AI newsletter</a>.</p>
<p>Here are notes from my interviews:</p>
<p><strong>Kensho</strong>: Applied here because their company had a heavy academic bent. I think I slipped through the cracks as they switched their hiring pipeline software, so I didn't finish my interviews.</p>
<p><strong>Heap</strong>: Applied here because of their experience working with large quantities of semistructured data. They had the most difficult yet practical interviews of any of the companies I'd applied to, with the exception of OpenAI. They were a pleasure to work with, but I ultimately turned down their offer for Google.</p>
<p><strong>Baidu</strong>: Applied here mostly because Andrew Ng was working there. No response. I didn't know anybody here; resume probably went straight to the trash.</p>
<p><strong>MS Research</strong>: Applied here because I'd seen some interesting research on language translation using RNNs. Initially, there was no response. I then had Dan Luu refer me, and I got a hilarious email from not even a recruiter, but a contractor whose job it was to recruit people to talk to recruiters. This contractor asked me to submit my personal information via Word Doc. Some of the fields in this Word document were literally checkboxes - I wasn't sure if the intent was for me to print out the documents, check the boxes, and send the scanned document, or for me to replace ☐ 'BALLOT BOX' (U+2610) with ☑ 'BALLOT BOX WITH CHECK' (U+2611). I did the latter but it probably didn't matter.</p>
<p><strong>Facebook</strong>: Applied here because Facebook has some intriguing technology with automatic image labeling + captioning, as well as newsfeed filtering problems. I had a friend in the Applied ML department refer me, and got a call from a very skeptical-sounding recruiter who seemed fixated on how many years of experience I had. I got a call back the next day saying they were proceeding with more experienced candidates. This annoyed me because clearly the referral had obligated them to call me, but the recruiter came in with preconceived notions and didn't attempt to gain any bits of information during our conversation.</p>
<p><strong>DeepMind</strong>: Applied here because of AlphaGo and other basic research around deep learning techniques. Started with some quizzes on math and CS - didn't have much trouble with these. The quizzes were literally quizzes, involving ~50 questions covering the breadth of an undergraduate education. I guess the UK takes its undergraduate education more seriously than the US does.</p>
<p>I told DeepMind that I wanted a Research Engineer position, which is a research assistant who knows enough about both software engineering and research to accelerate the research. They agreed that I was suitable and gave me additional quizzes on undergraduate statistics, and a coding test that explored the efficient generation and utilization of random numbers. I really liked that coding test - it was reasonable and I learned something useful.</p>
<p>Then the interview process went somewhat off the rails; I talked to at least 4 different engineering managers of various kinds with no indication of when the process would be over. This was in addition to the 3 people I had already had quizzes with. I actually lost count of how many people I talked to. Eventually I ran out of intelligent questions to ask, and when the last manager asked if I had any questions for him, I sat in silence for a minute, then literally asked the question, "so when is this process going to be over?". I guess my rejection was the answer to that question.</p>
<p><strong>OpenAI</strong>: Applied here because of their talent and their mission (<a href="https://openai.com/about/">AI for the masses and not just for AmaGooBookSoft</a>). I didn't get a response until I pinged Greg Brockman, who I knew via the Chemistry Olympiad in high school. It's a small world. I chatted briefly with one of their research engineers, and then was flown in for a weeklong onsite. Apparently this was common; I met many others there who were there for many weeks working on various projects. The idea is to get to know each candidate as much as possible, and to also spread the wealth, in terms of opportunities to work with world-class researchers. </p>
<p>The week I spent at OpenAI was incredible, and I regret not clearing more of my calendar to spend another week there. I got to meet many of the top researchers in the world, and I learned something new at every lunch and dinner.</p>
<p><strong>Uber ATC</strong>: Applied here because self-driving cars. I started off interviewing for a ML role, but I think I flunked my conversational interview because I referred to the k-means clustering algorithm as sort of hacky. I meant this in the sense that it was a one-off algorithm that didn't really share many ideas with other models in ML, and because its behavior was <a href="https://en.wikipedia.org/wiki/K-means%2B%2B">highly dependent on initial randomization</a>, among other quirks. But I was too tired that day to express this more eloquently. Uber ATC is Serious Business and they don't do Hacky Things there.</p>
<p>Anyway, they were nice enough to let me continue interviewing for a regular engineering position, and I was happy enough with that, since I would probably be working closely with many aspects of their self-driving car project, anyway. I did another phone screen which focused more on coding style and testing than on algorithms. The onsite, again, was mostly discussion: my past projects, my thoughts on effective software processes, system design conversations. There was just the one old-school hard algorithmic question. Given the nature of what they were building, their engineers tended to be low-level embedded C++ engineers, and they seemed to lean on the conservative side. Again, the impression was of Serious Business. They appeared to have a healthy work environment, in contrast to the Susan Fowler exposé that erupted recently. Perhaps Uber ATC was geographically isolated from the drama at Uber HQ. I got an offer from Uber but ultimately declined in favor of Google.</p>
<p><strong>Google</strong>: I had a friend refer me, and got a call back from a recruiter who explained the process to me. I was going to be scheduled for a phone screen, but was then fast-tracked to the onsites, for whatever reason. (Maybe my Google Search history? My blog? High school math competitions? Who knows.) The onsite consisted of four algorithmic interviews of various kinds and one system design interview. All in all, the interview problems were not as hard as I had been expecting them to be. The questions never explicitly asked about data structures, but their solutions lent themselves to thinking about data structures. I enjoyed chatting with each of my interviewers, and it repeatedly blew my mind how large Google was, and the scale at which it operated. This part of the interview process was pleasantly speedy, in contrast to the rumors I had heard about Google's recruitment pipeline being a mess.</p>
<p>After my interview, the team fit process at Google was actually quite a pleasure to work through. I had an incredible recruiter who talked through what I wanted to do, and who went to great lengths to find the right teams to talk with. My eventual choice to work at Google was in no small part due to his efforts. The recruiter's incentive structure was quite interesting, actually. He claimed that his bonus depended solely on getting me to join Google, and a different committee would be responsible for negotiating my offer. Thus, it set up a relationship where my recruiter was actually incentivized to argue to the compensation committee on my behalf. Maybe the good cop/bad cop routine was designed to let my guard down, but it was nice to work with someone who wasn't simultaneously trying to sell the company while stonewalling my negotiating attempts.</p>
<p>As mentioned earlier, the team I'll be joining at Google is actually within <a href="https://verily.com">Verily</a>, which is one of the Alphabet companies.</p>
<h2>The Path Ahead</h2>
<p>It's been one year now since AlphaGo first declared victory over Lee Sedol, and it's been quite a journey since then.</p>
<p>I realized while at OpenAI that I have so much more to learn. I had been reading papers and implementing toy systems, but there was a fundamental maturity of understanding that I lacked. I guess that's just to be expected; almost every researcher I met at OpenAI had a PhD in ML and had spent at least 5 years thinking about these ideas.</p>
<p>I left OpenAI's office with the impression that after reading the entirety of <a href="http://www.deeplearningbook.org/">Goodfellow/Bengio/Courville's DL book</a> and another year or two of practical experience, I might begin to actually understand ML. But it'll be tough. Chapter 2 (linear algebra) summarizes all of Linear Algebra Done Right in a dense 22 pages, then goes on to mention more things I hadn't known about. Given that that's the chapter I expected to understand most thoroughly, I'm afraid to think of how many ideas in the other chapters will just fly over my head because I can't fully appreciate the basics.</p>
<p>I ran into Ian Goodfellow later and he told me that those introductory chapters, although dense, are probably the most important chapters because they will stay relevant for decades to come, while the other chapters will be outdated in a few years' time. So I'll just have to slog through them.</p>
<p>Along those lines, I think the most important advice I can give to other people is to not shy away from the math. In many cases, if you understand the basic statistical, numerical computation, and probabilistic ideas, the applications to ML are one simple step away.</p>http://moderndescartes.com/essays/my_ml_pathManipulating Probability Distribution Functionshttp://moderndescartes.com/essays/probability_manipulations<h2>Introduction</h2>
<p>A probability distribution function (PDF) is a function that describes the relative likelihood that a given event will occur. PDFs come in two flavors: discrete and continuous. Discrete PDFs are called probability mass functions (PMF).</p>
<p>In this essay I'll talk about different ways you can manipulate PDFs. I'll then use those manipulations to answer some questions: "Given independent samples from a distribution, what is the distribution describing the max of these samples?", and "If a process is memoryless, then how long will I wait for the next event to happen? How many events will happen in a given period of time?" (Answer: the exponential distribution and the Poisson distribution.)</p>
<h2>Basics</h2>
<p>PDFs are generally normalized. This means that each PDF comes with a constant factor such that it integrates to 1. Similarly, the terms in a PMF will add up to 1. </p>
<p>Given a PDF, you can integrate it to get a cumulative distribution function (CDF). If a PDF describes the probability that an outcome is exactly X, then a CDF describes the probability that an outcome is less than or equal to X. The complementary CDF, sometimes called the tail probability, is 1 minus the CDF, and describes the probability that an outcome is greater than X.</p>
<p>By reversing the integral (i.e. by taking a derivative), you can turn a CDF back into a PDF.</p>
<p>Given two independent events X and Y, the probability of both x and y happening is P(x, y) = P(x)P(y).</p>
<p>Given two nonindependent events X and Y, the probability of both X and Y happening is P(x, y) = P(x | y)P(y) = P(y | x)P(x). P(x | y) notates "the probability of x happening, given that y has already happened". Incidentally, the symmetry of x and y in the above equations is the proof of <a href="https://en.wikipedia.org/wiki/Bayes'_theorem">Bayes' rule</a>.</p>
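<p>Spelled out, equating the two factorizations and dividing by P(y) (assuming P(y) is nonzero) gives the usual statement of the rule:</p>
<p>$$P(x \mid y) = \frac{P(y \mid x)P(x)}{P(y)}$$</p>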
<p>Given two independent PDFs X(x) and Y(y), the likelihood of an outcome z = x + y is given by the convolution of X and Y. This is a fancy way of saying "the sum of the probabilities for all combinations of x, y that add up to z". So, if you have two dice with PMF [1/6, 1/6, 1/6, 1/6, 1/6, 1/6], then the outcome 10 can be achieved in three ways - (6, 4), (5, 5), or (4, 6) - and the probability of each of those cases is 1/36. Overall, the likelihood of rolling a 10 with two dice is thus 3/36.</p>
<p>Formalized, a convolution looks like this for the discrete case:</p>
<p>$$(P * Q)(z) = \sum_x P(x)Q(z-x)$$</p>
<p>and like this for the continuous case:</p>
<p>$$(P * Q)(z) = \int P(x)Q(z-x)dx$$</p>
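<p>As a quick check of the discrete formula, here is a minimal Python sketch of the two-dice example. The helper name <code>convolve</code> and the list-based PMF representation are my own choices for illustration; exact fractions are used so that the result can be compared against 3/36 without rounding.</p>

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution of two PMFs given as lists indexed from 0."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            # Every pair (i, j) contributes to the outcome i + j.
            out[i + j] += pi * qj
    return out

die = [Fraction(1, 6)] * 6      # index 0 is face 1, index 5 is face 6
two_dice = convolve(die, die)   # index 0 is a total of 2, index 10 is a total of 12

prob_of_10 = two_dice[10 - 2]
print(prob_of_10)  # 1/12, which is 3/36 in lowest terms
```

<p>The nested loop is a direct transcription of the sum over all (x, y) pairs that add up to z; for long PMFs an FFT-based convolution would be faster, but the idea is the same.</p>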
<p>With these tools, we should be able to answer the order statistic problem and derive the Poisson distribution.</p>
<h2>Order statistics</h2>
<p>Consider the following: If the strength of an earthquake is drawn randomly from some PDF, and we had 5 earthquakes, then what is the PDF that describes the strength of the largest earthquake? Or, a similar question that comes up in the analysis of <a href="/essays/hyperloglog">HyperLogLog</a>, or also Bitcoin mining: Given N random numbers, what is the PMF describing the maximum number of leading zeros in their binary representations? Both questions are examples of order statistics.</p>
<p>Let's call our original PDF P(x). It would be great if we could just exponentiate our PDF, so that the solution is \(P(x)^N\). Alas, this isn't quite what we want, because \(P(x)^N\) gives us the probability that <em>all</em> \(N\) earthquakes are exactly of strength \(x\), rather than the probability that the <em>biggest</em> earthquake is strength \(x\).</p>
<p>Instead, let's integrate our PDF to get the CDF \(Q(x)\), describing the probability that an earthquake's strength is less than or equal to \(x\). Now, when we exponentiate \(Q(x)\), \(Q(x)^N\) describes the probability that all \(N\) earthquakes have strength less than or equal to \(x\).</p>
<p>Now that we have a CDF that describes the probability that all events are less than or equal to \(x\), we can take its derivative to get a new PDF describing the probability that the largest event is exactly equal to \(x\). \(\frac{d}{dx} [Q(x)^N]\) is our final answer.</p>
<p>Let's use this to solve the leading zeros problem described earlier. Flajolet's HyperLogLog paper gives the solution to this problem without much explanation (left as an exercise to the reader, etc.).</p>
<p><img src="/static/hyperloglog_maxzeros_eq.png" title="Flajolet's equation for likelihood of k max zeros from v draws" style="display: block; margin: 0 auto;"/></p>
<p>The PMF describing the probability of having \(k\) leading zeros is \(P(k) = 2^{-(k+1)}\), or [1/2, 1/4, 1/8, 1/16...]. The "integral" of this series is the sequence of partial sums, which are \(Q(k) = 1 - 2^{-(k+1)}\). Exponentiating gives \(Q(k)^N = (1 - 2^{-(k+1)})^N\). Finally, "taking the derivative" of a discrete CDF is equivalent to subtracting adjacent terms, yielding MaxZeros\((k, N) = (1 - 2^{-(k+1)})^N - (1 - 2^{-k})^N\). That looks like what's in the paper! (The off by one error is because the paper defines their \(k\) as being the number of leading zeros plus one.)</p>
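<p>The formula can be sanity-checked with a small simulation. The sketch below is illustrative only: the function names, the 32-bit word size, and the choice of 16 draws per trial are assumptions of mine, not anything taken from the HyperLogLog paper.</p>

```python
import random

def max_zeros_pmf(k, n):
    """Probability that the max leading-zero count over n draws equals k."""
    return (1 - 2.0 ** -(k + 1)) ** n - (1 - 2.0 ** -k) ** n

def simulate_max_zeros(n, trials, bits=32, seed=0):
    """Empirical distribution of the max leading-zero count over n random integers."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        # bits - bit_length() is the number of leading zero bits
        # of a value stored in a fixed-width word.
        m = max(bits - rng.getrandbits(bits).bit_length() for _ in range(n))
        counts[m] = counts.get(m, 0) + 1
    return {k: c / trials for k, c in sorted(counts.items())}

n = 16
empirical = simulate_max_zeros(n, trials=20_000)
for k, freq in empirical.items():
    # Columns: k, simulated frequency, predicted probability.
    print(k, round(freq, 4), round(max_zeros_pmf(k, n), 4))
```

<p>The two printed columns should agree to within sampling noise, and summing <code>max_zeros_pmf</code> over all k telescopes to 1, as a PMF should.</p>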
<h2>Memoryless processes</h2>
<p>A memoryless process is one for which previous events (or lack of events) don't affect subsequent events. For example, a slot machine should in theory be memoryless. For a memoryless process, what is the PDF that describes the time in between events?</p>
<p>Let's start by writing an equation that describes the memoryless property. Call our PDF \(f(t)\), where \(t\) is the time to the next event. Let's say some time \(h\) passes without an event happening. From here, the probability that an event happens at time \(t+h\) is \(f(t+h) = f(t)\left(1 - \int_0^h f(s) ds\right)\). This follows from P(x, y) = P(x | y)P(y), where event x is "event happens at time \(t+h\)" and event y is "\(h\) time passed with no event happening".</p>
<p>If we call \(f(t)\)'s integral \(F(t)\), then we can simplify as follows:</p>
<p>$$f(t+h) - f(t) = -f(t)(F(h) - F(0))$$</p>
<p>Now, divide both sides by \(h\). This looks familiar - both sides contain the definition of a derivative!</p>
<p>$$\frac{f(t+h) - f(t)}{h} = -f(t)\frac{F(h) - F(0)}{h}$$</p>
<p>Since this equation is valid for all positive \(h\), we take the limit as \(h\) goes to zero, and substitute the appropriate derivatives, yielding \(f'(t) = -f(t)f(0)\). Since \(f(0)\) is a constant, we'll call that constant \(\lambda\), and the only function \(f(t)\) satisfying this differential equation is \(f(t) = \lambda e^{-\lambda t}\).</p>
<p>This is called the <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a>. Lambda is a parameter that specifies the rate of events. The above derivation shows that the exponential distribution is the <em>only</em> distribution satisfying the memoryless property.</p>
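<p>To see that the exponential distribution really does satisfy the memoryless equation we started from, here is a small numeric check. The values of <code>lam</code>, <code>t</code>, and <code>h</code> are arbitrary picks for illustration.</p>

```python
import math

def pdf(t, lam):
    """Exponential PDF f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)

def cdf(t, lam):
    """Integral of the PDF from 0 to t."""
    return 1.0 - math.exp(-lam * t)

lam, t, h = 1.5, 0.7, 0.4

# Memoryless property: f(t + h) = f(t) * (1 - integral of f from 0 to h)
lhs = pdf(t + h, lam)
rhs = pdf(t, lam) * (1.0 - cdf(h, lam))
print(lhs, rhs)
```

<p>The two numbers agree up to floating point error, and the same holds for any other positive choice of the three values.</p>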
<h2>Deriving the Poisson Distribution</h2>
<p>The exponential distribution answers the question, "How long do I wait until the next event happens?". A similar, related question is, "In one unit of time, how many events can I expect to occur?" The answer to this question is given by the <a href="https://en.wikipedia.org/wiki/Poisson_distribution">Poisson distribution</a>.</p>
<p>How might we use the exponential distribution to deduce the Poisson distribution? Let's start with the easiest case. The probability that 0 events happen within 1 time unit is equal to the probability that the wait time is greater than 1 time unit. </p>
<p>$$P(0) = \int_1^\infty \lambda e^{-\lambda x}dx = e^{-\lambda}$$</p>
<p>Moving up, there are many ways that 1 event can happen - the wait time could be 0.2, 0.3, 0.7, or anything less than 1. But importantly, it's not "at least 1 event", but rather "exactly 1 event". So we have to make sure that in the remaining time, no events happen. So what we end up with is a sort of convolution, where we sum over all possible combinations of events x, 1 - x, where x is the time until first event, and there is no event in the remaining 1 - x time.</p>
<p>$$
\begin{align}
P(1) &= \int_0^1 \lambda e^{-\lambda x}\left(\int_{1-x}^\infty \lambda e^{-\lambda t}dt\right) dx \\
&= \int_0^1 \lambda e^{-\lambda x} e^{-\lambda (1-x)} dx \\
&= \int_0^1 \lambda e^{-\lambda} dx\\
&= \lambda e^{-\lambda}\\
\end{align}
$$ </p>
<p>More generally, if we had some PDF describing the wait time for k events, then we could use the same strategy - sum over all combinations of events x, 1 - x, where x is the time until k events and there is no event in the remaining 1 - x time. As it turns out, deducing the PDF describing wait time to k events is pretty easy to do: take the convolution of the exponential distribution with itself, (k - 1) times. The first convolution gives you the time to 2 events, the second convolution gives you the time to 3 events, and so on. These repeated convolutions give rise to the <a href="https://en.wikipedia.org/wiki/Erlang_distribution">Erlang distribution</a>. The repeated convolutions aren't that hard to calculate so I'll leave them as an exercise for the reader :)</p>
<p>$$\textrm{Exp * Exp * Exp (k times)} = \frac{\lambda^k x^{k-1}}{(k-1)!}e^{-\lambda x}$$</p>
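<p>For anyone who wants to check their work, here is a sketch of the first convolution and the inductive step. Both densities are zero for negative arguments, so the convolution integral only runs from 0 to \(x\); the variable \(s\) is just a dummy integration variable.</p>
<p>$$
\begin{align}
(\textrm{Exp} * \textrm{Exp})(x) &= \int_0^x \lambda e^{-\lambda s} \lambda e^{-\lambda (x-s)} ds = \lambda^2 e^{-\lambda x} \int_0^x ds = \lambda^2 x e^{-\lambda x}\\
\int_0^x \frac{\lambda^k s^{k-1}}{(k-1)!}e^{-\lambda s} \lambda e^{-\lambda (x-s)} ds &= \frac{\lambda^{k+1} e^{-\lambda x}}{(k-1)!} \int_0^x s^{k-1} ds = \frac{\lambda^{k+1} x^k}{k!}e^{-\lambda x}\\
\end{align}
$$</p>
<p>The first line matches the formula above at k = 2, and the second line shows that one more convolution takes the k-event density to the (k + 1)-event density.</p>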
<p>Substituting the Erlang distribution into our calculation above, we have:</p>
<p>$$
\begin{align}
P(k) &= \int_0^1 \frac{\lambda^k x^{k-1}}{(k-1)!}e^{-\lambda x} \left(\int_{1-x}^\infty \lambda e^{-\lambda t}dt\right) dx\\
&= \int_0^1 \frac{\lambda^k x^{k-1}}{(k-1)!}e^{-\lambda x} e^{-\lambda (1-x)} dx\\
&= \frac{\lambda^k}{(k-1)!e^\lambda} \int_0^1 x^{k-1} dx\\
&= \frac{\lambda^k}{k!e^\lambda}\\
\end{align}
$$</p>
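<p>As a numerical cross-check of the derivation, the sketch below evaluates the first line of the integral directly with a midpoint rule and compares it to the closed form. The function names, the step count, and the rate of 2.5 are my own illustrative choices.</p>

```python
import math

def erlang_pdf(x, k, lam):
    """PDF of the wait time until the k-th event (k >= 1)."""
    return lam ** k * x ** (k - 1) / math.factorial(k - 1) * math.exp(-lam * x)

def poisson_by_integration(k, lam, steps=10_000):
    """Midpoint-rule estimate of P(exactly k events in one time unit), k >= 1."""
    dx = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        # Probability that no further event occurs in the remaining 1 - x time.
        no_more_events = math.exp(-lam * (1.0 - x))
        total += erlang_pdf(x, k, lam) * no_more_events * dx
    return total

def poisson_pmf(k, lam):
    """Closed form derived above: lam^k / (k! e^lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.5
for k in range(1, 6):
    print(k, poisson_by_integration(k, lam), poisson_pmf(k, lam))
```

<p>The two columns agree closely for each k, which is reassuring given how much cancellation happens in the algebra.</p>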
<p>And there we have it: the Poisson distribution.</p>http://moderndescartes.com/essays/probability_manipulationsNotes on Musical Notationhttp://moderndescartes.com/essays/musical_notation<p>I love video game music. I also love playing piano, especially classical music. So it's only natural that I've tried to find sheet music for some of my favorite video game music. It exists, but unfortunately it tends not to be done very well. Typically, the notes aren't all correct, but even if they are, they're often notated improperly. Music notation is intricate enough that you need years of experience to internalize the patterns in the notes.</p>
<p>Here are some notes on common notation mistakes I see in transcriptions. As you'll see, properly executed music notation is very dense with semantic information.</p>
<p>I'll assume you have a rudimentary understanding of music notation (treble and bass clefs, note durations, accidentals) and scales/circle of fifths.</p>
<h2>Properly indicating voice leading</h2>
<p>Voice leading is the movement of "voices", or individual notes, within a chord. For example, in this two chord progression, the B resolves upwards to a C, and the F resolves downwards to an E. The musical tension generated by the B-F tritone turns into a C-E major third consonance.</p>
<p><img src="/static/musical_notation/B65-C.png" title="B65 to C" style="display: block; margin: 0 auto;"/>
<audio controls style="display:block; margin: auto"><source src="/static/musical_notation/B65-C.ogg" type="audio/ogg"></audio></p>
<p>When notating music, you should always follow this rule: sharps resolve upwards, and flats resolve downwards! If your name is not Maurice Ravel and you find yourself applying a sharp, and then immediately cancelling it with a natural, then you've probably made a mistake.</p>
<p>When operating in a key signature that already has many flats or sharps, never make use of alternate notes that happen to spell the same. If you need to sharp an F sharp, then you write F double sharp, not G natural. Same goes for notes that are only half-steps apart. If you need to sharp a B, then you write B sharp, not C natural. </p>
<p>In this chord progression, flats are used when the note resolves downwards, and sharps are used when the note resolves upwards. The same chord progression is shown in B, C, and Db major, so that you can see how accidentals interact with existing key signatures.</p>
<p><img src="/static/musical_notation/chord_progression.png" title="A harmonic progression" style="display: block; margin: 0 auto; max-width: 100%"/>
<img src="/static/musical_notation/chord_progression2.png" title="A harmonic progression" style="display: block; margin: 0 auto; max-width: 100%"/>
<img src="/static/musical_notation/chord_progression3.png" title="A harmonic progression" style="display: block; margin: 0 auto; max-width: 100%"/>
<audio controls style="display:block; margin: auto"><source src="/static/musical_notation/chord_progression.ogg" type="audio/ogg"></audio></p>
<p>Diminished 7th chords tend to be full of accidentals, and the correct choice of how to write them depends on how that diminished 7th chord resolves.</p>
<p><img src="/static/musical_notation/diminished7-down.png" title="B7dim to C" style="display: block; margin: 0 auto;"/></p>
<p><img src="/static/musical_notation/diminished7-up.png" title="B7dim to C6" style="display: block; margin: 0 auto;"/></p>
<p>When done properly, an experienced musician will subconsciously make use of the information contained within the choice of accidentals to predict the chord that comes next. When this expectation is violated, it instead leaves a sense that something is not quite right.</p>
<p>To get some more practice, I'd recommend analyzing Beethoven's Moonlight sonata. You'll see that every accidental resolves in the direction it is expected to.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/YmVCcF42f-0" frameborder="0" allowfullscreen style="display:block; margin: auto"></iframe>
<h2>Adding reminder accidentals</h2>
<p>The official rule is that accidentals are reset with every new bar of music. But often, you'll see a note redundantly labelled with an accidental, to remind you that an accidental from a previous measure is no longer in effect. This is okay and even encouraged. </p>
<h2>Respecting voices</h2>
<p>Traditionally, piano music is written on two staffs. Normally, these two staffs are thought of as "left hand" and "right hand". This is often true for simple music, but in general this is the wrong way to think about it. Instead, think of the two staffs as a playground for different voices to move around on. There may be many voices, and the pianist decides which hand will play each voice. It is more important for music notation to capture the intent of the music, than it is to recommend a division of the notes between left and right hands.</p>
<p>For example, this excerpt from a transcription of To Zanarkand is incorrect, because the first measure implies that the lower voice suddenly stops, while the upper voice suddenly picks up some harmonization.</p>
<p><img src="/static/musical_notation/tozanarkand_bad.png" title="To Zanarkand, incorrectly transcribed" style="display: block; margin: 0 auto; max-width: 100%"/></p>
<p>Instead, this reworked version leaves both voices intact, and adds some phrasing for good measure.</p>
<p><img src="/static/musical_notation/tozanarkand_good.png" title="To Zanarkand, correctly transcribed" style="display: block; margin: 0 auto; max-width: 100%"/></p>
<p><audio controls style="display:block; margin: auto"><source src="/static/musical_notation/tozanarkand.ogg" type="audio/ogg"></audio></p>
<p>When a voice transitions across staffs, cross-staff beams should be used as appropriate. It's up to the editor to decide what placement of notes on ledger lines will result in the least visual clutter.</p>
<p>Using the Moonlight Sonata again as an example, we have three voices: the lower bass line, the undulating middle voice, and the upper melody voice. You'll see in the sheet music that the middle voice jumps around the staffs as is convenient, even though in practice, the middle voice is always played by the right hand. And whenever the melody happens to fall on the bass clef, the treble clef does not get a rest mark - since after all, the melody isn't resting!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/YmVCcF42f-0" frameborder="0" allowfullscreen style="display:block; margin: auto"></iframe>
<h2>Respecting the beat</h2>
<p>Occasionally your music will be syncopated, and the notes will fall off-beat, instead of on beat. In such cases, your musical notation must still respect the beat!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/QImFm4Y_QPM" frameborder="0" allowfullscreen style="display:block; margin: auto"></iframe>
<p>In Beethoven's Appassionata (timestamp 0:51), we have a sequence of quarter notes in the right hand, notated using a row of quarter notes. So far, so good. The left hand also consists of a sequence of quarter notes, but every third quarter note is split into two eighth notes, and then tied back together again. What gives? </p>
<p>The answer is that the time signature is 12/8, so we have four beats each consisting of three eighth notes. By splitting up the left hand's quarter notes, their relationship to the beat of the music is emphasized. This split-tied notation is in fact easier for me to read than the alternative.</p>
<p>If your music has syncopation, don't let the notated music lose sight of the beat! If you ignore the beat, then it generates a lot of confusion when it comes to sightreading the music.</p>http://moderndescartes.com/essays/musical_notationUsing Linear Counting, LogLog, and HyperLogLog to Estimate Cardinalityhttp://moderndescartes.com/essays/hyperloglog<h2>Introduction</h2>
<p>Cardinality estimation algorithms answer the following question: in a large list or stream of items, how many unique items are there? One simple application is in counting unique visitors to a website.</p>
<p>The simplest possible implementation uses a set to weed out multiple occurrences of the same item:</p>
<pre><code>def compute_cardinality(items):
    return len(set(items))
</code></pre>
<p>Sets are typically implemented as hash tables, and as such, the size of the hash table will be proportional to the number of unique items, \(N\), contained within it. This solution takes \(O(N)\) time and \(O(N)\) space.</p>
<p>Can we do better? We can, if we relax the requirement that our answer be exact! After all, who really cares if you have 17,000 unique IPs or 17,001 unique IPs visiting your website?</p>
<p>I'll talk about Linear Counting and LogLog, two simple algorithms for estimating cardinality. HyperLogLog then improves on LogLog by reusing ideas from Linear Counting at very low and very high load factors. </p>
<h2>Linear Counting</h2>
<p>Sets are inefficient because of their strict correctness guarantees. When two items hash to the same location within the hash table, you'd need some way to decide whether the two items were actually the same, or merely a hash collision. This typically means storing the original item and performing a full equality comparison whenever a hash collision occurs.</p>
<p>Linear Counting instead embraces hash collisions, and doesn't bother storing the original items. Instead, it allocates a hash table with capacity B bits, where B is on the same scale as N, and initializes the bits to 0. When an item comes in, it's hashed, and the corresponding bit in the table is flipped to 1. By relying on hashes, deduplication of identical items is naturally handled. Linear Counting is still \(O(N)\) in both space and time, but is about 100 times more efficient than the set solution, using only 1 bit per slot, while the set solution uses 64 bits for a pointer, plus the size of the item itself.</p>
<p>When \(N \ll B\), then collisions are infrequent, and the number of bits set to 1 is a pretty good estimate of the cardinality.</p>
<p>When \(N \approx B\), collisions are inevitable. But the trick is that by looking at how full the hash table is, you can estimate how many collisions there should have been, and extrapolate back to \(N\). (Warning: this extrapolation only works if the hash function outputs suitably random bits!)</p>
<p>When \(N \gg B\), every bit will be set to 1, and you won't be able to estimate \(N\).</p>
<p>How can we extrapolate the table occupancy back to \(N\)? Let's call the expected fraction of empty slots after \(n\) unique items \(p_n\). When adding the \(n+1\)th unique item to this hash table, with probability \(p_n\), a new slot is occupied, and with probability \((1-p_n)\), a collision occurs. If you write this equation and simplify, you end up with \(p_{n+1} = p_n (1 - 1/B)\). Since \(p_0 = 1\), we have \(p_N = (1 - 1/B)^N\).</p>
<p>Solving this equation for \(N\), we end up with our extrapolation relationship:</p>
<p>$$N = \frac{\log p_N}{\log (1 - 1/B)} \approx -B\log p_N$$</p>
<p>where the approximation comes from keeping only the first term of the Taylor expansion \(\log(1 + x) = x - x^2 / 2 + x^3 / 3 - x^4 / 4 \ldots \), with \(x = -1/B\).</p>
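<p>To spell out the two steps above (a sketch; \(E_n\) is my own shorthand for the expected number of empty slots after \(n\) unique items, so that \(p_n = E_n / B\)): the next unique item lands in an empty slot with probability \(p_n\), so on average it removes \(p_n\) empty slots.</p>
<p>$$E_{n+1} = E_n - p_n \quad\Rightarrow\quad p_{n+1} = p_n - \frac{p_n}{B} = p_n\left(1 - \frac{1}{B}\right)$$</p>
<p>For large \(B\), \(\log(1 - 1/B) \approx -1/B\), which gives the simplified form:</p>
<p>$$N = \frac{\log p_N}{\log(1 - 1/B)} \approx \frac{\log p_N}{-1/B} = -B \log p_N$$</p>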
<p>This is a very simple algorithm. The biggest problem with this approach is that you need to know roughly how big \(N\) is ahead of time so that you can allocate an appropriately sized hash table. The size of the hash table is typically chosen such that the load factor, \(N/B\), is between 2 and 10.</p>
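<p>Here is a minimal sketch of Linear Counting in Python. The function name, the choice of SHA-1 as the hash, and the one-byte-per-slot table are my own assumptions made for readability; a real implementation would pack the slots into actual bits.</p>

```python
import hashlib
import math


def linear_counting_estimate(items, num_bits):
    """Estimate the number of unique items with a B-slot table (B = num_bits)."""
    # One byte per slot keeps the sketch readable; a real table would pack 8 slots per byte.
    table = bytearray(num_bits)
    for item in items:
        # Hypothetical hash choice: first 8 bytes of SHA-1, reduced modulo B.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % num_bits
        table[slot] = 1
    empty_slots = num_bits - sum(table)
    if empty_slots == 0:
        raise ValueError("table is saturated; N is too large for this B")
    # Extrapolate with N ~ -B * log(p_N), where p_N is the fraction of empty slots.
    return -num_bits * math.log(empty_slots / num_bits)
```

<p>Feeding the same items several times does not change the answer, since duplicates hash to the same slot.</p>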
<h2>LogLog</h2>
<p>LogLog uses the following technique: count the number of leading zeros in the hashed item, then keep track of the max leading zeros seen so far. Intuitively, if a hash function outputs random bits, then the probability of getting \(k\) leading zeros is about \(2^{-k}\). On average, you'll need to process ~32 items before you see one with 5 leading zeros, and ~1024 items before you see one with 10 leading zeros. If the most leading zeros you ever see is \(k\), then your estimate is simply \(2^k\).</p>
<p>Theoretically speaking, this algorithm is ridiculously efficient. To keep track of \(N\) items, you merely need to store a single number that is about \(\log N\) in magnitude. And to store that single number requires \(\log \log N\) bits. (Hence, the name LogLog!)</p>
<p>Practically speaking, this is a very noisy estimate. You might have seen an unusually large or small number of leading zeros, and the resulting estimate would be off by a factor of two for each extra or missing leading zero. </p>
<p>The plot below shows the probability that the maximum number of leading zeros is x, given 1024 distinct items. You can see that the probability peaks at around x = 10, as expected - but the distribution is quite wide, with x = 9, 11 or 12 being at least half as likely as x = 10. (The distribution peaks 0.5 units higher than expected, so this implies that \(2^k\) will overshoot by a factor of \(2^{0.5}\), or about 1.4. This shows up as a correction factor later.)</p>
<p><img src="/static/loglog_maxzeros.png" title="LogLog max leading zero distribution" style="display: block; margin: 0 auto; max-width: 100%"/></p>
<p>That the distribution tends to skew to the right side is even more troublesome, since we're going to be exponentiating \(k\), accentuating those errors.</p>
<p>To reduce the error, the obvious next step is to take an average. By averaging the estimates given by many different hash functions, you can obtain a more accurate \(k\), and hence a more reliable estimate of \(N\).</p>
<p>The problem with this is that hash functions are expensive to compute, and to find many hash functions whose output is pairwise independent is difficult. To get around these problems, LogLog uses a trick which is one of the most ingenious I've ever seen, effectively turning one hash function into many.</p>
<p>The trick is as follows: let's say that the hash function outputs 32 bits. Then, split up those bits as 5 + 27. Use the first 5 bits to decide which of \(2^5 = 32\) buckets you'll use. Use the remaining 27 bits as the actual hash function. From one 32-bit hash function, you end up with one of 32 27-bit hash functions. (The 5 + 27 split can be fiddled with, but for the purposes of this essay I'll use 5 + 27.)</p>
<p>Applied to LogLog, this means that your \(N\) items are randomly assigned to one of 32 disjoint subsets, and you can keep 32 separate max-leading-zero tallies. The maximum number of leading zeros you can detect will drop from 32 to 27, but this isn't that big of a deal, as you can always switch to a 64-bit hash function if you really need to handle large cardinalities.</p>
<p>The final LogLog algorithm is then:</p>
<ul>
<li>hash each item.</li>
<li>use the first 5 bits of the hash to decide which of 32 buckets to use</li>
<li>use the remaining 27 bits of the hash to update the max-leading-zeros count for that bucket</li>
<li>average the max leading zeros seen in each bucket to get \(k\)</li>
<li>return an estimate of \(32 \alpha \cdot 2^k\), where \(\alpha \approx 0.7\) is a correction factor</li>
</ul>
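<p>The steps above can be sketched in Python roughly as follows. The helper names and the use of SHA-1 truncated to 32 bits are assumptions of mine, and <code>ALPHA</code> is just the approximate 0.7 quoted above, so expect the estimate to be noisy.</p>

```python
import hashlib

BUCKET_BITS = 5
NUM_BUCKETS = 1 << BUCKET_BITS      # 32 buckets
VALUE_BITS = 32 - BUCKET_BITS       # 27 bits left for counting leading zeros
ALPHA = 0.7                         # approximate correction factor from the essay


def hash32(item):
    """Hypothetical 32-bit hash: the first 4 bytes of SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucket_counts(items):
    """Max leading zeros seen in each of the 32 buckets."""
    counts = [0] * NUM_BUCKETS
    for item in items:
        h = hash32(item)
        bucket = h >> VALUE_BITS                    # first 5 bits pick the bucket
        value = h & ((1 << VALUE_BITS) - 1)         # remaining 27 bits
        leading_zeros = VALUE_BITS - value.bit_length()
        counts[bucket] = max(counts[bucket], leading_zeros)
    return counts


def loglog_estimate(counts):
    """Average the per-bucket counts to get k, then return 32 * alpha * 2**k."""
    k = sum(counts) / len(counts)
    return NUM_BUCKETS * ALPHA * 2 ** k
```

<p>Only the 32 small integers in <code>counts</code> need to be kept around, no matter how many items are processed.</p>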
<h2>HyperLogLog</h2>
<p>HyperLogLog improves on LogLog in two primary ways. First, it uses a harmonic mean when combining the 32 buckets. This reduces the impact of unusually high max-leading-zero counts. Second, it makes corrections for two extreme cases - the small end, when not all of the 32 buckets are occupied, and the large end, when hash collisions cause underestimates. These corrections are accomplished using ideas from Linear Counting.</p>
<p>The original LogLog algorithm first applies the arithmetic mean to \(k_i\), then exponentiates the mean, which ends up being the same as taking the geometric mean of \(2^{k_i}\). (\(k_i\) is the max-leading-zeros count for bucket \(i\).) HyperLogLog instead takes the harmonic mean of \(2^{k_i}\). The last two steps of LogLog are changed as follows:</p>
<ul>
<li>take the harmonic mean \(HM = \frac{32}{\Sigma 2^{-k_i}}\), over each bucket's count \(k_i\).</li>
<li>return an estimate \( E = 32 \alpha \cdot HM\)</li>
</ul>
<p>The harmonic mean tends to ignore numbers as they go to infinity: 2 = HM(2, 2) = HM(1.5, 3) = HM(1.2, 6) = HM(1, infinity). So in this sense it's a good way to discount the effects of exponentiating a noisy number. I don't know if it's the "right" way, or whether it's merely good enough. Either way, it constricts the error bounds by about 25%.</p>
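<p>A small Python illustration of this effect, using made-up bucket counts (three typical buckets and one outlier); the function names are mine:</p>

```python
def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


def geometric_mean_of_powers(counts):
    """LogLog's combination: 2 raised to the arithmetic mean of the counts."""
    return 2 ** (sum(counts) / len(counts))


# Three typical buckets and one outlier with an unusually high count.
counts = [5, 5, 5, 12]
loglog_style = geometric_mean_of_powers(counts)               # 2 ** 6.75, about 107.6
hyperloglog_style = harmonic_mean([2 ** k for k in counts])   # about 42.6
```

<p>Without the outlier both means would be 32, so the harmonic mean is pulled upward far less by the one noisy bucket.</p>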
<p>Bucketing helps LogLog get more accurate estimates, but it does come with a drawback - what if a bucket is untouched? Take the edge case with 0 items. The estimate should be 0, but is instead \(32\alpha\), because the best guess for "max leading zeros = 0" is "1 item in that bucket".</p>
<p>We can fix this by saying, "if there are buckets with max-leading-zero count = 0, then return the number of buckets with positive max-leading-zero count". That sounds reasonable, but what if some buckets are untouched, and other buckets are touched twice? That sounds awfully similar to the Linear Counting problem!</p>
<p>Recall that Linear Counting used a hash table, with each slot being a single bit, and that there was a formula to compensate for hash collisions, based on the percent of occupied slots. In this case, each of the 32 buckets can be considered a slot, with a positive leading zero count indicating an occupied slot.</p>
<p>The revised algorithm merely appends a new step:</p>
<ul>
<li>if the estimate E is less than 2.5 * 32 and there are buckets with max-leading-zero count of zero, then instead return \(-32 \cdot \log(V/32)\), where V is the number of buckets with max-leading-zero count = 0.</li>
</ul>
<p>The threshold of 2.5x comes from the recommended load factor for Linear Counting.</p>
<p>At the other extreme, when the number of unique items starts to approach \(2^{32}\), the range of the hash function, then collisions start becoming significant. How can we model the expected number of collisions? Well, Linear Counting makes another appearance! This time, the hash table is of size \(2^{32}\), with \(E\) slots occupied. After compensating for collisions, the true number of unique elements is \(-2^{32}\log(1 - E/2^{32})\). When \(E \ll 2^{32}\), this expression simplifies to simply \(E\). This correction is not as interesting because if you are playing around the upper limits of 32-bit hash functions, then you should probably just switch to a 64-bit hash function.</p>
<p>The final HyperLogLog algorithm is as follows:</p>
<ul>
<li>hash each item.</li>
<li>use the first 5 bits of the hash to decide which of 32 buckets to use</li>
<li>use the remaining 27 bits of the hash to update the max-leading-zeros count for that bucket</li>
<li>take the harmonic mean \(HM = \frac{32}{\Sigma 2^{-k_i}}\), over each bucket's count \(k_i\).</li>
<li>let \( E = 32 \alpha \cdot HM\)</li>
<li>if \(E < 2.5 * 32\) and number of buckets with zero count V > 0: return \(-32 \cdot \log(V/32)\)</li>
<li>else if \( E > 2^{32}/30\) : return \(-2^{32}\log(1 - E/2^{32})\)</li>
<li>else: return E</li>
</ul>
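<p>Putting the list above together, here is a hedged Python sketch. It makes one assumption that differs from the wording above: each bucket stores the number of leading zeros plus one (the position of the first 1 bit, as in the Flajolet et al. paper), so that a stored 0 unambiguously means "untouched" and \(\alpha \approx 0.7\) is the matching constant. The function names and the SHA-1 based hash are mine.</p>

```python
import hashlib
import math

NUM_BUCKETS = 32          # 2 ** 5
VALUE_BITS = 27           # bits left over after choosing a bucket
ALPHA = 0.7               # approximate correction factor from the essay
HASH_RANGE = 2 ** 32


def registers_for(items):
    """Per-bucket maximum of (leading zeros + 1); 0 means the bucket was never touched."""
    registers = [0] * NUM_BUCKETS
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")       # hypothetical 32-bit hash
        bucket = h >> VALUE_BITS
        value = h & ((1 << VALUE_BITS) - 1)
        rank = VALUE_BITS - value.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers


def hyperloglog_estimate(registers):
    harmonic = NUM_BUCKETS / sum(2.0 ** -k for k in registers)
    estimate = NUM_BUCKETS * ALPHA * harmonic
    untouched = registers.count(0)
    if estimate < 2.5 * NUM_BUCKETS and untouched > 0:
        # Small range: fall back to Linear Counting over the 32 buckets.
        return -NUM_BUCKETS * math.log(untouched / NUM_BUCKETS)
    if estimate > HASH_RANGE / 30:
        # Large range: compensate for collisions in the 32-bit hash space.
        return -HASH_RANGE * math.log(1 - estimate / HASH_RANGE)
    return estimate
```

<p>With only 32 buckets the estimate is still rough; practical implementations use many more buckets to tighten the error.</p>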
<h2>Distributing HyperLogLog</h2>
<p>Distributing HyperLogLog is trivial. If your items are spread across multiple machines, then have each machine calculate the max-leading-zero bucket counts for the items on that machine. Then, combine the bucket counts by taking the maximum value for each bucket, and continue with the combined buckets.</p>
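<p>The merge step is short enough to sketch directly. The function name and the toy four-bucket lists are illustrative only:</p>

```python
def merge_bucket_counts(*per_machine_counts):
    """Combine bucket counts from several machines by taking the max of each bucket."""
    return [max(column) for column in zip(*per_machine_counts)]


machine_a = [3, 0, 7, 1]
machine_b = [2, 5, 7, 0]
merged = merge_bucket_counts(machine_a, machine_b)   # [3, 5, 7, 1]
```

<p>Because taking a maximum is associative and commutative, the machines can be merged in any order or in a tree.</p>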
<hr />
<h2>References</h2>
<ul>
<li>Whang et al - A Linear-Time Probabilistic Counting Algorithm for Database Applications</li>
<li>Flajolet et al - HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm</li>
<li>Heule, Nunkesser, Hall - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm</li>
</ul>http://moderndescartes.com/essays/hyperloglogEstimating the Birthday Paradox Cutoffhttp://moderndescartes.com/essays/birthday_paradox<h2>Introduction</h2>
<p>The Birthday Paradox states that in a room of 23 people, it is more likely than not that two people have the same birthday. It is a "paradox" because 23 is (unexpectedly) much smaller than 365, the range of possible birthdays. The birthday paradox shows up in many places in computer science, so it's not just a fun fact.</p>
<p>In this essay I'll talk about how to estimate this cutoff point for arbitrary ranges, and for arbitrary probabilities.</p>
<h2>Derivation</h2>
<p>Let's start by calling the size of the space \(N\). We'd like to figure out the \(k\) such that \(k\) randomly chosen numbers from the range \(1...N\) will have a probability \(p\) of avoiding collision. For the original birthday paradox, \(N = 365\), \(p = 0.5\), and we would like to reproduce \(k = 23\).</p>
<p>We'll select numbers one by one to calculate the probability of a collision. The first selection avoids collision with probability 1, but the second selection has a slightly smaller \(\frac{N-1}{N}\) probability of avoiding collision. Assuming that no collision occurred after the second selection, the third selection has \(\frac{N-2}{N}\) probability of avoiding collision, and so on until the \(k\)th selection with \(\frac{N-k+1}{N}\) probability. Taking the product of each event's probability, we find that the probability of avoiding all collisions is:</p>
<p>$$\frac{N}{N}\cdot \frac{N-1}{N}\cdot\cdot\cdot \frac{N-k+1}{N}$$</p>
<p>At this point, we pull out our first trick: the binomial theorem approximation, when \(|n\epsilon| \ll 1\). </p>
<p>$$(1 + \epsilon)^n = 1 + n\epsilon + \frac{n(n-1)}{2}\epsilon^2 + ... \approx 1 + n\epsilon$$</p>
<p>We can make the above approximation because if \(n\epsilon \ll 1\), then \(n^2\epsilon^2 \ll n\epsilon\) and can be ignored. In our case, \(\epsilon = -\frac{1}{N}\).</p>
<p>For example, \(\left(\frac{364}{365}\right)^2 = 0.994528\), while \(0.994521 = \frac{363}{365}\). Almost identical!</p>
<p>Our product of probabilities is thus approximately equal to</p>
<p>$$..\approx \left(\frac{N-1}{N}\right)^0 \cdot \left(\frac{N-1}{N}\right)^1 \cdot \left(\frac{N-1}{N}\right)^2 \cdot \cdot \cdot \left(\frac{N-1}{N}\right)^{k-1} = \left(\frac{N-1}{N}\right)^\frac{k(k-1)}{2}$$</p>
<p>While this approximation is valid for each individual term in the product, it might not be valid in the aggregate. In other words, it's true that \(0.99^5 \approx 0.95\), but it's less true that \(0.99^{25} \approx 0.75\), and very untrue that \(0.99^{50} \approx 0.5\). And if we want to solve our problem for arbitrary \(p\), we'll run into this problem for certain! To overcome this problem, we'll use one of the many limits involving \(e\):</p>
<p>$$e = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^n$$</p>
<p>Raising both sides to the power \(\ln p\), we get a rearranged version of this equation:</p>
<p>$$p = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^{n\ln p} = \lim_{n\to\infty}\left(1 - \frac{1}{n}\right)^{-n\ln p}$$</p>
<p>The second equality follows from another application of our first trick, \((1 + \epsilon)^m \approx 1 + m\epsilon\), this time with exponent \(m = -1\) and \(\epsilon = -\frac{1}{n}\), so that \(\left(1 - \frac{1}{n}\right)^{-1} \approx 1 + \frac{1}{n}\).</p>
<p>\(N=365\) is close enough to infinity, so we can say that</p>
<p>$$p \approx \left(\frac{N-1}{N}\right)^{-N\ln p}$$</p>
<p>This looks pretty similar to our previous equation, so all that's left is to set the exponents equal, and solve:</p>
<p>$$-N\ln p = \frac{k(k-1)}{2}$$</p>
<p>$$-2N\ln p = k(k-1) \approx (k-0.5)^2$$</p>
<p>$$k = \sqrt{-2N\ln p} + 0.5$$</p>
<p>Substituting \(N = 365, p = 0.5\), we get \(k = 22.9944\), matching the known solution.</p>
<p>If you're not concerned about constant factors, all you need to remember is that for some given \(p\), \(k\) scales as \(\sqrt{N}\).</p>
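<p>As a sanity check, the formula can be compared against the exact product in a few lines of Python. The function names here are my own:</p>

```python
import math


def birthday_cutoff(space_size, p_no_collision):
    """Approximate k at which k random picks from space_size avoid collision with probability p."""
    return math.sqrt(-2 * space_size * math.log(p_no_collision)) + 0.5


def exact_no_collision_probability(space_size, k):
    """Exact product N/N * (N-1)/N * ... * (N-k+1)/N."""
    probability = 1.0
    for i in range(k):
        probability *= (space_size - i) / space_size
    return probability


estimate = birthday_cutoff(365, 0.5)                  # about 22.99
p_22 = exact_no_collision_probability(365, 22)        # still above 0.5
p_23 = exact_no_collision_probability(365, 23)        # just below 0.5
```

<p>The exact probability crosses 0.5 between 22 and 23 people, which is where the approximation lands.</p>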
<h2>Conclusion</h2>
<p>Birthday paradox problems show up whenever a randomized strategy is used to assign objects to bins, so it's worth knowing the derivation.</p>
<p>For example, some distributed systems avoid coordinating ID generation by randomly selecting IDs from a large space. In order to minimize the possibility that two actors accidentally choose the same ID, the range of IDs must be large. But how large? The equation above tells you that. Of course, this assumes that <a href="https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d">your random number generator is working properly</a>.</p>
<p>Another place this derivation shows up is in calculating false positive probabilities for <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>.</p>
<p>The two key tricks in this derivation are</p>
<ul>
<li>using the Binomial approximation to simplify complex fractions into powers of a common term, \(\frac{N-1}{N}\)</li>
<li>using an equation involving \(e\) to ensure correctness as the powers of \(\left(\frac{N-1}{N}\right)^k\) grow large</li>
</ul>http://moderndescartes.com/essays/birthday_paradoxBitpacking and Compression of Sparse Datasetshttp://moderndescartes.com/essays/bitpacking_compression<h2>Introduction</h2>
<p>I'm working on a neural network to play Go. To train it, I'm using a dataset of 100,000 game records (which, when played through from start to finish, results in about 30,000,000 game positions). Each position is represented as a 19x19x28 array of floats, which is then handed to the GPU for neural network computation. It's slow and wasteful to reprocess my dataset each time I train my neural network, so I precompute each position's representation ahead of time and write it to disk.</p>
<p>Unfortunately, I discovered while scaling up my data pipeline that my processed dataset wouldn't fit on my SSD: the raw representation of the dataset is 3e7 positions * 19 * 19 * 28 floats * 4 bytes per float = 1.2 TB. I decided that to fix this problem, I'd just compress the data. My data actually consists entirely of 1s and 0s, but I'm using 32-bit floats in my arrays because TensorFlow / my GPU expect floats as input. So, in theory, I should be able to represent my data as bits instead of floats, resulting in a 32x compression. Additionally, many of the 19x19 planes are one-hot encodings with a max value of 8 - i.e., transforming [0, 1, 3, 11] into [[0,0,0,0,0,0,0,0], [1,0,0,0,0,0,0,0], [0,0,1,0,0,0,0,0], [0,0,0,0,0,0,0,1]], so I expected another ~8x compression.</p>
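<p>For concreteness, here is a small Python sketch of that one-hot encoding and of the size arithmetic. The <code>one_hot</code> helper is hypothetical, and its convention (0 maps to all zeros, values above 8 clamp to the last slot) is inferred from the example above rather than taken from the actual code.</p>

```python
def one_hot(value, size=8):
    """Encode value as a one-hot row of length size; 0 maps to all zeros, large values clamp to size."""
    row = [0] * size
    if value > 0:
        row[min(value, size) - 1] = 1
    return row


encoded = [one_hot(v) for v in [0, 1, 3, 11]]

# Raw dataset size: positions * 19 * 19 * 28 floats * 4 bytes per float.
raw_bytes = 30_000_000 * 19 * 19 * 28 * 4   # about 1.2e12 bytes, i.e. 1.2 TB
```

<p>Each encoded row has at most a single 1 in it, which is why these planes are so compressible.</p>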
<p>The easiest compression algorithm to call from Python was gzip, so I started there. Gzip with default parameters gave me a 120x compression ratio. By my reasoning, gzip nabbed the 32x for the float32->bit conversion and another 4x compression on top of that. That sounded about right, and I was so enamoured by the huge compression ratio that, at first, I didn't mind the 16 hours it took to preprocess the full dataset.</p>
<p>The problem with taking 16 hours to preprocess the data is that I have to rerun it every time I want to change my position representation -- an important step in improving my AI. Making a change and then waiting a whole day to find out the results is frustrating and really slows down progress. This seems like a fairly parallelizable problem, so I figured the problem was Python using just one of my six cores -- it's all the Global Interpreter Lock's fault, right? But blaming a lack of parallelism is an easy way to miss easier performance wins.</p>
<h2>Optimizing the data pipeline</h2>
<p>Recently, I decided to investigate ways to make this preprocessing faster. I started by taking a small slice of the complete dataset, which yielded 2GB of uncompressed data (17MB compressed). By profiling the data processing code, I found that processing took 19 seconds (11%), casting the numpy array to bytes took 2 seconds (1%), and compressing/writing the bytes to disk took 152 seconds (88%). That means that I should focus on optimizing the compress/write part of the code. If I were to optimize this part completely, I could speed up my preprocessing by a factor of ~8x.</p>
<p>The easiest thing to try first was swapping out gzip for another compression library. I wanted something that prioritized speed over compression, so I started with Google's <a href="https://github.com/google/snappy">snappy</a> library. Snappy compressed 9 times faster than gzip, but only achieved 10x compression, relative to gzip's 120x compression. While I was impressed by snappy's speed, I didn't want my full preprocessed dataset to take up 100GB. I'm sure snappy works well for the web / RPC traffic that Google designed it for, but it wasn't right for this task.</p>
<p>Compression algorithms have to balance speed vs compression, so I started looking for something in between gzip and snappy. I then discovered that gzip offered multiple compression levels, and that Python's gzip wrapper defaulted to maximum compression. That explained why my original gzip implementation was so slow. Switching to <code>compresslevel=6</code> made the whole run about 4x faster than <code>compresslevel=9</code> (the compression step itself was about 6x faster), and the compression ratio only dropped from 120x to 80x. Not bad. If I had stopped here, I'd be pretty happy with this tradeoff.</p>
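<p>If you want to try the compression levels yourself, a minimal sketch using only the standard library looks like this. The helper name and the stand-in data are made up for illustration; actual sizes and timings will depend on your data and machine.</p>

```python
import array
import gzip
import time


def gzip_size_and_seconds(data, level):
    """Compress data at the given gzip level; return (compressed size, elapsed seconds)."""
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    return len(compressed), time.perf_counter() - start


# Stand-in data: float32 values that are all 0.0 or 1.0, like the position arrays.
values = array.array("f", [1.0 if i % 7 == 0 else 0.0 for i in range(100_000)])
raw = values.tobytes()

for level in (1, 6, 9):
    size, seconds = gzip_size_and_seconds(raw, level)
    print(f"level={level} size={size} bytes time={seconds:.4f}s")
```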
<table class="pure-table pure-table-striped"><thead>
<tr>
<th>compression</th>
<th style="text-align:right">time to process (s)</th>
<th style="text-align:right">time to convert to bytes (s)</th>
<th style="text-align:right">time to compress and write (s)</th>
<th style="text-align:right">total time (s)</th>
<th style="text-align:right">output size (bytes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>none</td>
<td style="text-align:right">18.44</td>
<td style="text-align:right">2.26</td>
<td style="text-align:right">9.73</td>
<td style="text-align:right">30.82</td>
<td style="text-align:right">2028398365</td>
</tr>
<tr>
<td>gzip9</td>
<td style="text-align:right">19.56</td>
<td style="text-align:right">1.91</td>
<td style="text-align:right">152.03</td>
<td style="text-align:right">173.92</td>
<td style="text-align:right">16985552</td>
</tr>
<tr>
<td>gzip6</td>
<td style="text-align:right">19.19</td>
<td style="text-align:right">1.69</td>
<td style="text-align:right">25.42</td>
<td style="text-align:right">46.76</td>
<td style="text-align:right">24124586</td>
</tr>
<tr>
<td>snappy</td>
<td style="text-align:right">19.06</td>
<td style="text-align:right">1.59</td>
<td style="text-align:right">17.91</td>
<td style="text-align:right">39.06</td>
<td style="text-align:right">201098302</td>
</tr>
</tbody>
</table>
<h2>Manually compressing the data</h2>
<p>But wait, there's more! I had been assuming that because gzip was a compression algorithm, it was supposed to be able to figure out ridiculously obvious things like <a href="https://xkcd.com/257/">"my data consists entirely of 32-bit representations of 0.0 and 1.0"</a>. Apparently, this is not the case.</p>
<p>Converting my float32's to uint8's is a 4x compression. Bitpacking each value into one bit gives a 32x compression. What happens when I run gzip after bitpacking my data?</p>
<p>It turns out that gzipping after bitpacking yields a 1000x compression. Even on its highest compression settings, gzip was leaving an 8x compression on the table when applied to the raw data. <strong>It turns out that if you know the structure of your own data, you can very easily do much, much better than a generic compression algorithm</strong>, on both speed and compression.</p>
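<p>Here is a rough, pure-Python sketch of what "bitpack, then gzip" means. The helper names and the stand-in data are my own, and this is not the code from the experiment; with numpy arrays, <code>np.packbits</code> does the same packing in bulk.</p>

```python
import array
import gzip


def pack_bits(values):
    """Pack a sequence of 0/1 values into bytes, 8 values per byte, first value in the high bit."""
    packed = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def unpack_bits(packed, count):
    """Inverse of pack_bits: recover the first count values as floats."""
    return [float((packed[i // 8] >> (7 - i % 8)) & 1) for i in range(count)]


values = [1.0 if i % 7 == 0 else 0.0 for i in range(80_000)]
raw = array.array("f", values).tobytes()      # 4 bytes per value
packed = pack_bits(values)                    # 1 bit per value: 32x smaller
compressed = gzip.compress(packed, compresslevel=6)
```

<p>Reading the data back is the reverse: decompress, unpack the bits, and cast to float32 just before handing the array to the network.</p>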
<p>I investigated all possible combinations of bitpacking and compression algorithms, yielding the following table. (Half bitpack refers to converting float32 to uint8.)</p>
<table class="pure-table pure-table-striped">
<thead>
<tr>
<th>bitpack</th>
<th>compression</th>
<th style="text-align:right">time to process (s)</th>
<th style="text-align:right">time to convert to bytes (s)</th>
<th style="text-align:right">time to compress and write (s)</th>
<th style="text-align:right">total time (s)</th>
<th style="text-align:right">output size (bytes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>none</td>
<td>none</td>
<td style="text-align:right">18.44</td>
<td style="text-align:right">2.26</td>
<td style="text-align:right">9.73</td>
<td style="text-align:right">30.82</td>
<td style="text-align:right">2028398365</td>
</tr>
<tr>
<td>none</td>
<td>gzip9</td>
<td style="text-align:right">19.56</td>
<td style="text-align:right">1.91</td>
<td style="text-align:right">152.03</td>
<td style="text-align:right">173.92</td>
<td style="text-align:right">16985552</td>
</tr>
<tr>
<td>none</td>
<td>gzip6</td>
<td style="text-align:right">19.19</td>
<td style="text-align:right">1.69</td>
<td style="text-align:right">25.42</td>
<td style="text-align:right">46.76</td>
<td style="text-align:right">24124586</td>
</tr>
<tr>
<td>none</td>
<td>snappy</td>
<td style="text-align:right">19.06</td>
<td style="text-align:right">1.59</td>
<td style="text-align:right">17.91</td>
<td style="text-align:right">39.06</td>
<td style="text-align:right">201098302</td>
</tr>
<tr>
<td>half</td>
<td>none</td>
<td style="text-align:right">19.43</td>
<td style="text-align:right">0.95</td>
<td style="text-align:right">1.67</td>
<td style="text-align:right">22.35</td>
<td style="text-align:right">515996085</td>
</tr>
<tr>
<td>half</td>
<td>gzip9</td>
<td style="text-align:right">19.75</td>
<td style="text-align:right">0.95</td>
<td style="text-align:right">61.10</td>
<td style="text-align:right">82.04</td>
<td style="text-align:right">6439748</td>
</tr>
<tr>
<td>half</td>
<td>gzip6</td>
<td style="text-align:right">19.39</td>
<td style="text-align:right">0.95</td>
<td style="text-align:right">10.31</td>
<td style="text-align:right">30.91</td>
<td style="text-align:right">11969127</td>
</tr>
<tr>
<td>half</td>
<td>snappy</td>
<td style="text-align:right">18.72</td>
<td style="text-align:right">0.90</td>
<td style="text-align:right">3.12</td>
<td style="text-align:right">23.03</td>
<td style="text-align:right">46599353</td>
</tr>
<tr>
<td>full</td>
<td>none</td>
<td style="text-align:right">20.56</td>
<td style="text-align:right">4.01</td>
<td style="text-align:right">0.23</td>
<td style="text-align:right">25.01</td>
<td style="text-align:right">64499522</td>
</tr>
<tr>
<td>full</td>
<td>gzip9</td>
<td style="text-align:right">19.25</td>
<td style="text-align:right">3.94</td>
<td style="text-align:right">9.65</td>
<td style="text-align:right">33.04</td>
<td style="text-align:right">2163007</td>
</tr>
<tr>
<td>full</td>
<td>gzip6</td>
<td style="text-align:right">19.01</td>
<td style="text-align:right">3.97</td>
<td style="text-align:right">1.45</td>
<td style="text-align:right">24.65</td>
<td style="text-align:right">2782370</td>
</tr>
<tr>
<td>full</td>
<td>snappy</td>
<td style="text-align:right">19.20</td>
<td style="text-align:right">3.89</td>
<td style="text-align:right">0.46</td>
<td style="text-align:right">23.76</td>
<td style="text-align:right">7964800</td>
</tr>
</tbody>
</table>
<p>In this table, we see that the code to process the positions takes about 19 seconds. Interestingly enough, creating a bytearray from a numpy array of float32s (~2 seconds) is actually slower than casting that numpy array to uint8, then creating a bytearray which is 4x smaller (~1 second). Compression times varied widely, but all compression algorithms got much faster when they had less raw data to work through. </p>
<p>The clear winner was fully packing bits, followed by gzip (compression level 6). This yields a 6x smaller file, and the convert-and-compress step runs 28x faster than in my original gzip implementation. The overall runtime dropped from 174 seconds to 25 seconds - a 7x speedup. Compression and writing is now so fast that there's no point in further optimizing it. Instead, my data processing code is now the slow part. I'll probably optimize this in the future; it is the same code that needs to run every time I want to evaluate a game position using my neural network.</p>
<h2>Conclusions</h2>
<ul>
<li>There are a lot of compression algorithms available. The gzip family is readily available, and you can tune the balance between compression and speed.</li>
<li>If you know the structure of your data, you can easily do a better and faster job of compressing than a generic compression algorithm.</li>
</ul>
<p>All supporting code can be found on <a href="https://github.com/brilee/MuGo/compare/compression_experiments">a git branch</a>.</p>http://moderndescartes.com/essays/bitpacking_compression