Andrej Karpathy blog Musings of a Computer Scientist. Sun, 21 Jun 2020 17:25:23 +0000 Sun, 21 Jun 2020 17:25:23 +0000 Jekyll v3.8.7 Biohacking Lite <p>Throughout my life I never paid too much attention to health, exercise, diet or nutrition. I knew that you’re supposed to get some exercise and eat vegetables or something, but it stopped at that (“mom said?) level of abstraction. I also knew that I can probably get away with some ignorance while I am young, but at some point I was messing with my health-adjusted life expectancy. So about halfway through 2019 I resolved to spend some time studying these topics in greater detail and dip my toes into some biohacking. And now?it’s been a year!</p> <div class="imgcap"> <img src="/assets/bio/subway_map.png" /> <div class="thecap">A "subway map" of human metabolism. For the purposes of this post the important parts are the metabolism of the three macronutrients (green: lipids, red: carbohydrates, blue: amino acids), and orange: where the magic happens - oxidative metabolism, including the citric acid cycle, the electron transport chain and the ATP Synthase. <a href="">full detail link.</a></div> </div> <p>Now, I won’t lie, things got a bit out of hand over the last year with ketogenic diets, (continuous) blood glucose / beta-hydroxybutyrate tests, intermittent fasting, extended water fasting, various supplements, blood tests, heart rate monitors, dexa scans, sleep trackers, sleep studies, cardio equipments, resistance training routines etc., all of which I won’t go into full details of because it lets a bit too much of the mad scientist crazy out. But as someone who has taken plenty of physics, some chemistry but basically zero biology during my high school / undergrad years, undergoing some of these experiments was incredibly fun and a great excuse to study a number of textbooks on biochemistry (I liked “Molecular Biology of the Cell?, biology (I liked Campbell’s Biology), human nutrition (I liked “Advanced Nutrition and Human Metabolism?, etc.</p> <p>For this post I wanted to focus on some of my experiments around weight loss because 1) weight is very easy to measure and 2) the biochemistry of it is interesting. In particular, in June 2019 I was around 200lb and I decided I was going to lose at least 25lb to bring myself to ~175lb, which according to a few publications is the weight associated with the lowest all cause mortality for my gender, age, and height. Obviously, a target weight is an <a href="">exceedingly blunt instrument</a> and is by itself just barely associated with health and general well-being. I also understand that weight loss is a sensitive, complicated topic and much has been discussed on the subject from a large number of perspectives. The goal of this post is to nerd out over biochemistry and energy metabolism in the animal kingdom, and potentially inspire others on their own biohacking lite adventure.</p> <p><strong>What weight is lost anyway</strong>? So it turns out that, roughly speaking, we weigh more because our batteries are very full. A human body is like an iPhone with a battery pack that can grow nearly indefinitely, and with the abundance of food around us we scarcely unplug from the charging outlet. In this case, the batteries are primarily the adipose tissue and triglycerides (fat) stored within, which are eagerly stockpiled (or sometimes also synthesized!) by your body to be burned for energy in case food becomes scarce. This was all very clever and dandy when our hunter gatherer ancestors downed a mammoth once in a while during an ice age, but not so much today with weaponized truffle double chocolate fudge cheesecakes masquerading on dessert menus.</p> <p><strong>Body’s batteries</strong>. To be precise, the body has roughly 4 batteries available to it, each varying in its total capacity and the latency/throughput with which it can be mobilized. The biochemical implementation details of each storage medium vary but, remarkably, in every case your body discharges the batteries for a single, unique purpose: to synthesize adenosine triphosphate, or ATP from ADP (alright technically/aside some also goes to the “redox power?of NADH/NADPH). The synthesis itself is relatively straightforward, taking one molecule of adenosine diphosphate (ADP), and literally snapping on a 3rd phosphate group to its end. Doing this is kind of like a molecular equivalent of squeezing and loading a spring:</p> <div class="imgcap"> <img src="/assets/bio/atpspring.svg" style="width:42%" /> <img src="/assets/bio/atpsynthesis.svg" style="width:55%" /> <div class="thecap">Synthesis of ATP from ADP, done by snapping in a 3rd phosphate group to "load the spring". Images borrowed from <a href="">here</a>.</div> </div> <p>This is completely not obvious and remarkable - a single molecule (ATP) functions as a universal $1 bill that energetically “pays for?much of the work done by your protein machinery. Even better, this system turns out to have an ancient origin and is common to all life on Earth. Need to (active) transport some molecule across the cell membrane? ATP binding to the transmembrane protein provides the needed “umph? Need to temporarily untie the DNA against its hydrogen bonds? ATP binds to the protein complex to power the unzipping. Need to move myosin down an actin filament to contract a muscle? ATP to the rescue! Need to shuttle proteins around the cell’s cytoskeleton? ATP powers the tiny molecular motor (kinesin). Need to attach an amino acid to tRNA to prepare it for protein synthesis in the ribosome? ATP required. You get the idea.</p> <p>Now, the body only maintains a very small amount ATP molecules “in supply?at any time. The ATP is quickly hydrolyzed, chopping off the third phosphate group, releasing energy for work, and leaving behind ADP. As mentioned, we have roughly 4 batteries that can all be “discharged?into re-generating ATP from ADP:</p> <ol> <li><strong>super short term battery</strong>. This would be the <a href="">Phosphocreatine system</a> that buffers phosphate groups attached to creatine so ADP can be very quickly and locally recycled to ATP, barely worth mentioning for our purposes since its capacity is so minute. A large number of athletes take Creatine supplements to increase this buffer.</li> <li><strong>short term battery</strong>. Glycogen, a branching polysaccharide of glucose found in your liver and skeletal muscle. The liver can store about 120 grams and the skeletal muscle about 400 grams. About 4 grams of glucose also circulates in your blood. Your body derives approximately ~4 kcal/g from full oxidation of glucose (adding up glycolysis and oxidative phosphorylation), so if you do the math your glycogen battery stores about 2,000 kcal. This also happens to be roughly the base metabolic rate of an average adult, i.e. the energy just to “keep the lights on?for 24 hours. Now, glycogen is not an amazing energy storage medium - not only is it not very energy dense in grams/kcal, but it is also a sponge that binds too much water with it (~3g of water per 1g of glycogen), which finally brings us to:</li> <li><strong>long term battery</strong>. Adipose tissue (fat) is by far your primary super high density super high capacity battery pack. For example, as of June 2019, ~40lb of my 200lb weight was fat. Since fat is significantly more energy dense than carbohydrates (9 kcal/g instead of just 4 kcal/g), my fat was storing 40lb = 18kg = 18,000g x 9kcal/g = 162,000 kcal. This is a staggering amount of energy. If energy was the sole constraint, my body could run on this alone for 162,000/2,000 = 81 days. Since 1 stick of dynamite is about 1MJ of energy (239 kcal), we’re talking 678 sticks of dynamite. Or since a 100KWh Tesla battery pack stores 360MJ, if it came with a hand-crank I could in principle charge it almost twice! Hah.</li> <li><strong>lean body mass :(</strong>. When sufficiently fasted and forced to, your body’s biochemistry will resort to burning lean body mass (primarily muscle) for fuel to power your body. This is your body’s “last resort?battery.</li> </ol> <p>All four of these batteries are charged/discharged at all times to different amounts. If you just ate a cookie, your cookie will promptly be chopped down to glucose, which will circulate in your bloodstream. If there is too much glucose around (in the case of cookies there would be), your anabolic pathways will promptly store it as glycogen in the liver and skeletal muscle, or (more rarely, if in vast abundance) convert it to fat. On the catabolic side, if you start jogging you’ll primarily use (1) for the first ~3 seconds, (2) for the next 8-10 seconds anaerobically, and then (2, 3) will ramp up aerobically (a higher latency, higher throughput pathway) once your body kicks into a higher gear by increasing the heart rate, breathing rate, and oxygen transport. (4) comes into play mostly if you starve yourself or deprive your body of carbohydrates in your diet.</p> <div class="imgcap"> <img src="/assets/bio/energy_metabolism_1.png" style="width:45%" /> <img src="/assets/bio/atp_recycling.png" style="width:54%" /> <div class="thecap"><b>Left</b>: nice summary of food, the three major macronutrient forms of it, its respective storage systems (glycogen, muscle, fat), and the common "discharge" of these batteries all just to make ATP from ADP by attaching a 3rd phosphate group. <b>Right</b>: Re-emphasizing the "molecular spring": ATP is continuously re-cycled from ADP just by taking the spring and "loading" it over and over again. Images borrowed from <a href="">this nice page</a>.</div> </div> <p>Since I am a computer scientist it is hard to avoid a comparison of this “energy hierarchy?to the memory hierarchy of a typical computer system. Moving energy around (stored chemically in high energy C-H / C-C bonds of molecules) is expensive just like moving bits around a chip. (1) is your L1/L2 cache - it is local, immediate, but tiny. Anaerobic (2) via glycolysis in the cytosol is your RAM, and aerobic respiration (3) is your disk: high latency (the fatty acids are shuttled over all the way from adipose tissue through the bloodstream!) but high throughput and massive storage.</p> <p><strong>The source of weight loss</strong>. So where does your body weight go exactly when you “lose it? It’s a simple question but it stumps most people, including my younger self. Your body weight is ultimately just the sum of the individual weights of the atoms that make you up - carbon, hydrogen, nitrogen, oxygen, etc. arranged into a zoo of complex, organic molecules. One day you could weigh 180lb and the next 178lb. Where did the 2lb of atoms go? It turns out that most of your day-to-day fluctuations are attributable to water retention, which can vary a lot with your levels of sodium, your current glycogen levels, various hormone/vitamin/mineral levels, etc. The contents of your stomach/intestine and stool/urine also add to this. But where does the fat, specifically, go when you “lose?it, or “burn?it? Those carbon/hydrogen atoms that make it up don’t just evaporate out of existence. (If our body could evaporate them we’d expect E=mc^2 of energy, which would be cool). Anyway, it turns out that you breathe out most of your weight. Your breath looks transparent but you inhale a bunch of oxygen and you exhale a bunch of carbon dioxide. The carbon in that carbon dioxide you just breathed out may have just seconds ago been part of a triglyceride molecule in your fat. It’s highly amusing to think that every single time you breathe out (in a fasted state) you are literally breathing out your fat carbon by carbon. There is a good <a href="">TED talk</a> and even a whole <a href="">paper</a> with the full biochemistry/stoichiometry involved.</p> <div class="imgcap"> <img src="/assets/bio/weight_loss.gif" /> <div class="thecap">Taken from the above paper. You breathe out 84% of your fat loss.</div> </div> <p><strong>Combustion</strong>. Let’s now turn to the chemical process underlying weight loss. You know how you can take wood and light it on fire to “burn?it? This chemical reaction is <em>combustion</em>; You’re taking a bunch of organic matter with a lot of C-C and C-H bonds and, with a spark, providing the activation energy necessary for the surrounding voraciously electronegative oxygen to react with it, stripping away all of the carbons into carbon dioxide (CO2) and all of the hydrogens into water (H2O). This reaction releases a lot of heat in the process, thus sustaining the reaction until all energy-rich C-C and C-H bonds are depleted. These bonds are referred to as “energy-rich?because energetically carbon reeeallly wants to be carbon dioxide (CO2) and hydrogen reeeeally wants to be water (H2O), but this reaction is gated by an activation energy barrier, allowing large amounts of C-C/C-H rich macromolecules to exist in stable forms, in ambient conditions, and in the presence of oxygen.</p> <p><strong>Cellular respiration: “slow motion?combustion</strong>. Remarkably, your body does the exact same thing as far as inputs (organic compounds), outputs (CO2 and H2O) and stoichiometry are concerned, but the burning is not explosive but slow and controlled, with plenty of molecular intermediates that torture biology students. This biochemical miracle begins with fats/carbohydrates/proteins (molecules rich in C-C and C-H bonds) and goes through stepwise, complete, slow-motion combustion via glycolysis / beta oxidation, citric acid cycle, oxidative phosphorylation, and finally the electron transport chain and the whoa-are-you-serious molecular motor - the <a href="">ATP synthase</a>, imo the most incredible macromolecule not DNA. Okay potentially a tie with the Ribosome. Even better, this is an exceedingly efficient process that traps almost 40% of the energy in the form of ATP (the rest is lost as heat). This is much more efficient than your typical internal combustion motor at around 25%. I am also skipping a lot of incredible detail that doesn’t fit into a paragraph, including how food is chopped up piece by piece all the way to tiny acetate molecules, how their electrons are stripped and loaded up on molecular shuttles (NAD+ -&gt; NADH), how they then quantum tunnel their way down the electron transport chain (literally a flow of electricity down a protein complex “wire? from food to oxygen), how this pumps protons across the inner mitochondrial membrane (an electrochemical equaivalent of pumping water uphill in a hydro plant), how this process is brilliant, flexible, ancient, highly conserved in all of life and very closely related to photosynthesis, and finally how the protons are allowed to flow back through little holes in the ATP synthase, spinning it like a water wheel on a river, and powering its head to take an ADP and a phosphate and snap them together to ATP.</p> <div class="imgcap"> <img src="/assets/bio/combustion.jpeg" style="width:57%" /> <img src="/assets/bio/combustion2.png" style="width:41%" /> <div class="thecap"><a href="">Left</a>: Chemically, as far as inputs and outputs alone are concerned, burning things with fire is identical to burning food for our energy needs. <a href="">Right</a>: the complete oxidation of C-C / C-H rich molecules powers not just our bodies but a lot of our technology.</div> </div> <p><strong>Photosynthesis: “inverse combustion?lt;/strong>. If H2O and CO2 are oh so energetically favored, it’s worth keeping in mind where all of this C-C, C-H rich fuel came from in the first place. Of course, it comes from plants - the OG nanomolecular factories. In the process of photosynthesis, plants strip hydrogen atoms away from oxygen in molecules of water with light, and via further processing snatch carbon dioxide (CO2) lego blocks from the atmosphere to build all kinds of organics. Amusingly, unlike fixing hydrogen from H2O and carbon from CO2, plants are unable to fix the plethora of nitrogen from the atmosphere (the triple bond in N2 is very strong) and rely on bacteria to synthesize more chemically active forms (Ammonia, NH3), which is why chemical fertilizers are so important for plant growth and why the Haber-Bosch process basically averted the Malthusian catastrophe. Anyway, the point is that plants build all kinds of insanely complex organic molecules from these basic lego blocks (carbon dioxide, water) and all of it is fundamentally powered by light via the miracle of photosynthesis. The sunlight’s energy is trapped in the C-C / C-H bonds of the manufactured organics, which we eat and oxidize back to CO2 / H2O (capturing ~40% of in the form of a 3rd phosphate group on ATP), and finally convert to blog posts like this one, and a bunch of heat. Also, going in I didn’t quite appreciate just how much we know about all of the reactions involved, that we we can track individual atoms around all of them, and that any student can easily calculate answers to questions such as “How many ATP molecules are generated during the complete oxidation of one molecule of palmitic acid??(<a href="">it’s 106</a>, now you know).</p> <blockquote> <p>We’ve now established in some detail that fat is your body’s primary battery pack and we’d like to breathe it out. Let’s turn to the details of the accounting.</p> </blockquote> <p><strong>Energy input</strong>. Humans turn out to have a very simple and surprisingly narrow energy metabolism. We don’t partake in the miracle of photosynthesis like plants/cyanobacteria do. We don’t oxidize inorganic compounds like hydrogen sulfide or nitrite or something like some of our bacteria/archaea cousins. Similar to everything else alive, we do not fuse or fission atomic nuclei (that would be awesome). No, the only way we input any and all energy into the system is through the breakdown of food. “Food?is actually a fairly narrow subset of organic molecules that we can digest and metabolize for energy. It includes classes of molecules that come in 3 major groups (“macros?: proteins, fats, carbohydrates and a few other special case molecules like alcohol. There are plenty of molecules we can’t metabolize for energy and don’t count as food, such as cellulose (fiber; actually also a carbohydrate, a major component of plants, although some of it is digestible by some animals like cattle; also your microbiome loooves it), or hydrocarbons (which can only be “metabolized?by our internal combustion engines). In any case, this makes for exceedingly simple accounting: the energy input to your body is upper bounded by the number of food calories that you eat. The food industry attempts to guesstimate these by adding up the macros in each food, and you can find these estimates on the nutrition labels. In particular, naive calorimetry would over-estimate food calories because as mentioned not everything combustible is digestible.</p> <p><strong>Energy output</strong>. You might think that most of your energy output would come from movement, but in fact 1) your body is exceedingly efficient when it comes to movement, and 2) it is energetically unintuitively expensive to just exist. To keep you alive your body has to maintain homeostasis, manage thermo-regulation, respiration, heartbeat, brain/nerve function, blood circulation, protein synthesis, active transport, etc etc. Collectively, this portion of energy expenditure is called the Base Metabolic Rate (BMR) and you burn this “for free?even if you slept the entire day. As an example, my BMR is somewhere around 1800kcal/day (a common estimate due to Mifflin St. Jeor for men is <em>10 x weight (kg) + 6.25 x height (cm) - 5 x age (y) + 5</em>). Anyone who’s been at the gym and ran on a treadmill will know just how much of a free win this is. I start panting and sweating uncomfortably just after a small few hundred kcal of running. So yes, movement burns calories, but the 30min elliptical session you do in the gym is a drop in the bucket compared to your base metabolic rate. Of course if you’re doing the elliptical for cardio-vascular health - great! But if you’re doing it thinking that this is necessary or a major contributor to losing weight, you’d be wrong.</p> <div class="imgcap"> <img src="/assets/bio/cookie.jpg" style="width:39%" /> <img src="/assets/bio/sweating.jpg" style="width:60%" /> <div class="thecap">This chocolate chip cookie powers 30 minutes of running at 6mph (a pretty average running pace).</div> </div> <p><strong>Energy deficit</strong>. In summary, the amount of energy you expend (BMR + movement) subtract the amount you take in (via food alone) is your energy deficit. This means you will discharge your battery more than you charge it, and breathe out more fat than you synthesize/store, decreasing the size of your battery pack, and recording less on the scale because all those carbon atoms that made up your triglyceride chains in the morning are now diffused around the atmosphere.</p> <blockquote> <p>So?a few textbooks later we see that to lose weight one should eat less and move more.</p> </blockquote> <p><strong>Experiment section</strong>. So how big of a deficit should one introduce? I did not want the deficit to be so large that it would stress me out, make me hangry and impact my work. In addition, with greater deficit your body will increasingly begin to sacrifice lean body mass (<a href="">paper</a>). To keep things simple, I aimed to lose about 1lb/week, which is consistent with a few recommendations I found in a few <a href="">papers</a>. Since 1lb = 454g, 1g of fat is estimated at approx. 9 kcal, and adipose tissue is ~87% lipids, some (very rough) napkin math suggests that 3500 kcal = 1lb of fat. The precise details of this are <a href="">much more complicated</a>, but this would suggest a target deficit of about 500 kcal/day. I found that it was hard to reach this deficit with calorie restriction alone, and psychologically it was much easier to eat near the break even point and create most of the deficit with cardio. It also helped a lot to adopt a 16:8 intermittent fasting schedule (i.e. “skip breakfast? eat only from e.g. 12-8pm) which helps control appetite and dramatically reduces snacking. I started the experiment in June 2019 at about 195lb (day 120 on the chart below), and 1 year later I am at 165lb, giving an overall empirical rate of 0.58lb/week:</p> <div class="imgcap"> <img src="/assets/bio/weight.png" /> <div class="thecap">My weight (lb) over time (days). The first 120 days were "control" where I was at my regular maintenance eating whatever until I felt full. From there I maintained an average 500kcal deficit per day. Some cheating and a few water fasts are discernable.</div> </div> <p><strong>Other stuff</strong>. I should mention that despite the focus of this post the experiment was of course much broader for me than weight loss alone, as I tried to improve many other variables I started to understand were linked to longevity and general well-being. I went on a relatively low carbohydrate mostly Pescetarian diet, I stopped eating nearly all forms of sugar (except for berries) and processed foods, I stopped drinking calories in any form (soda, orange juice, alcohol, milk), I started regular cardio a few times a week (first running then cycling), I started regular resistance training, etc. I am not militant about any of these and have cheated a number of times on all of it because I think sticking to it 90% of the time produces 90% of the benefit. As a result I’ve improved a number of biomarkers (e.g. resting heart rate, resting blood glucose, strength, endurance, nutritional deficiencies, etc). I wish I could say I feel significantly better or sharper, but honestly I feel about the same. But the numbers tell me I’m supposed to be on a better path and I think I am content with that 🤷.</p> <p><strong>Explicit modeling</strong>. Now, getting back to weight, clearly the overall rate of 0.58lb/week is not our expected 1lb/week. To validate the energy deficit math I spent 100 days around late 2019 very carefully tracking my daily energy input and output. For the input I recorded my total calorie intake - I kept logs in my notes app of everything I ate. When nutrition labels were not available, I did my best to estimate the intake. Luckily, I have a strange obsession with guesstimating calories in any food, I’ve done so for years for fun, and have gotten quite good at it. Isn’t it a ton of fun to always guess calories in some food before checking the answer on the nutrition label and seeing if you fall within 10% correct? No? Alright. For energy output I recorded the number my Apple Watch reports in the “Activity App? TLDR simply subtracting expenditure from intake gives the approximate deficit for that day, which we can use to calculate the expected weight loss, and finally compare to the actual weight loss. As an example, an excerpt of the raw data and the simple calculation looks something like:</p> <pre style="font-size:10px"> 2019-09-23: Morning weight 180.5. Ate 1700, expended 2710 (Δkcal 1010, Δw 0.29). Tomorrow should weight 180.2 2019-09-24: Morning weight 179.8. Ate 1790, expended 2629 (Δkcal 839, Δw 0.24). Tomorrow should weight 179.6 2019-09-25: Morning weight 180.6. Ate 1670, expended 2973 (Δkcal 1303, Δw 0.37). Tomorrow should weight 180.2 2019-09-26: Morning weight 179.7. Ate 2140, expended 2529 (Δkcal 389, Δw 0.11). Tomorrow should weight 179.6 2019-09-27: Morning weight nan. Ate 2200, expended 2730 (Δkcal 530, Δw 0.15). Tomorrow should weight nan 2019-09-28: Morning weight nan. Ate 2400, expended 2800 (Δkcal 400, Δw 0.11). Tomorrow should weight 2019-09-29: Morning weight 181.0. Ate 1840, expended 2498 (Δkcal 658, Δw 0.19). Tomorrow should weight 180.8 2019-09-30: Morning weight 181.8. Ate 1910, expended 2883 (Δkcal 973, Δw 0.28). Tomorrow should weight 181.5 2019-10-01: Morning weight 179.4. Ate 2000, expended 2637 (Δkcal 637, Δw 0.18). Tomorrow should weight 179.2 2019-10-02: Morning weight 179.5. Ate 1920, expended 2552 (Δkcal 632, Δw 0.18). Tomorrow should weight 179.3 </pre> <p>Where we have a few <code class="language-plaintext highlighter-rouge">nan</code> if I missed a weight measurement in the morning. Plotting this we get the following:</p> <div class="imgcap"> <img src="/assets/bio/expected_loss.png" /> <div class="thecap">Expected weight based on simple calorie deficit formula (blue) vs. measured weight (red).</div> </div> <p>Clearly, my actual weight loss (red) turned out to be slower than expected one based on our simple deficit math (blue). So this is where things get interesting. A number of possibilities come to mind. I could be consistently underestimating calories eaten. My Apple Watch could be overestimating my calorie expenditure. The naive conversion math of 1lb of fat = 3500 kcal could be off. I think one of the other significant culprits is that when I eat protein I am naively recording its caloric value under intake, implicitly assuming that my body burns it for energy. However, since I was simultaneously resistance training and building some muscle, my body could redirect 1g of protein into muscle and instead mobilize only ~0.5g of fat to cover the same energy need (since fat is 9kcal/g and protein only 4kcal/g). The outcome is that depending on my muscle gain my weight loss would look slower, as we observe. Most likely, some combination of all of the above is going on.</p> <p><strong>Water factor</strong>. Another fun thing I noticed is that my observed weight can fluctuate and rise a lot, even while my expected weight calculation expects a loss. I found that this discrepancy grows with the amount of carbohydrates in my diet (dessert, bread/pasta, potatoes, etc.). Eating these likely increases glycogen levels, which as I already mentioned briefly, acts as a sponge and soaks up water. I noticed that my weight can rise multiple pounds, but when I revert back to my typical low-carbohydrate pasketerianish diet these “fake?pounds evaporate in a matter of a few days. The final outcome are wild swings in my body weight depending mostly on how much candy I’ve succumbed to, or if I squeezed in some pizza at a party.</p> <p><strong>Body composition</strong>. Since simultaneous muscle building skews the simple deficit math, to get a better fit we’d have to understand the details of my body composition. The weight scale I use (<a href="">Withings Body+</a>) claims to estimate and separate fat weight and lean body weight by the use of <a href="">bioelectrical impedance analysis</a>, which uses the fact that more muscle is more water is less electrical resistance. This is the most common approach accessible to a regular consumer. I didn’t know how much I could trust this measurement so I also ordered three DEXA scans (a gold standard for body composition measurements used in the literature based on low dosage X-rays) separated 1.5 months apart. I used <a href="">BodySpec</a>, who charge $45 per scan, each taking about 7 minutes at one of their physical locations. The amount of radiation is tiny - about 0.4 uSv, which is the dose you’d get by eating <a href="">4 bananas</a> (they contain radioactive potassium-40). I was not able to get a scan recently due to COVID-19. Here is my body composition data visualized from both sources during late 2019:</p> <div class="imgcap"> <img src="/assets/bio/body_composition.png" /> <div class="thecap">My ~daily reported fat and lean body mass measurements based on bioelectrical impedance and the 3 DEXA scans. <br />red = fat, blue = lean body mass. (also note two y-axes are superimposed)</div> </div> <p><strong>BIA vs DEXA</strong>. Unfortunately, we can see that the BIA measurement provided by my scale disagrees with DEXA results by a lot. That said, I am also forced to interpret the DEXA scan with skepticism specifically for the lean body mass amount, which is <a href="">affected by hydration level</a>, with water showing up mostly as lean body mass. In particular, during my third measurement I was fasted and in ketosis. Hence my glycogen levels were low and I was less hydrated, which I believe showed up as a dramatic loss of muscle. That said, focusing on fat, both approaches show me losing body fat at roughly the same rate, though they are off by an absolute offset.</p> <p><strong>BIA</strong>. An additional way to see that BIA is making stuff up is that it shows me losing lean body mass over time. I find this relatively unlikely because during the entire course of this experiment I exercised regularly and was able to monotonically increase my strength in terms of weight and reps for most exercises (e.g. bench press, pull ups, etc.). So that makes no sense either ¯\<em>(?</em>/¯</p> <div class="imgcap"> <img src="/assets/bio/dexa.png" /> <div class="thecap">The raw numbers for my DEXA scans. I was allegedly losing fat. The lean tissue estimate is noisy due to hydration levels.</div> </div> <p><strong>Summary</strong> So there you have it. DEXA scans are severely affected by hydration (which is hard to control) and BIA is making stuff up entirely, so we don’t get to fully resolve the mystery of the slower-than-expected weight loss. But overall, maintaining an average deficit of 500kcal per day did lead to about 60% of the expected weight loss over the course of a year. More importantly, we studied the process by which our Sun’s free energy powers blog posts via a transformation of nuclear binding energy to electromagnetic radiation to heat. The photons power the fixing of carbon in CO2 and hydrogen in H2O into C-C/C-H rich organic molecules in plants, which we digest and break back down via a “slow?stepwise combustion in our cell’s cytosols and mitochondria, which “charges?some (ATP) molecular springs, which provide the “umph?that fires the neurons and moves the fingers. Also, any excess energy is stockpiled by the body as fat, so we need to intake less of it or “waste?some of it away on movement to discharge our primary battery and breathe out our weight. It’s been super fun to self-study these topics (which I skipped in high school), and I hope this post was an interesting intro to some of it. Okay great. I’ll now go eat some cookies, because yolo.</p> <p><br /><br /> <strong>(later edits)</strong></p> <ul> <li>discussion on <a href="">hacker news</a></li> <li>my original post used to be about twice as long due to a section of nutrition. Since the topic of <em>what</em> to each came up so often alongside <em>how much</em> to each I am including a quick TLDR on my final diet here, without the 5-page detail. In rough order of importance: Eat from 12-8pm only. Do not drink any calories (no soda, no alcohol, no juices, avoid milk). Avoid sugar like the plague, including carbohydrate-heavy foods that immediately break down to sugar (bread, rice, pasta, potatoes), including to a lesser extent natural sugar (apples, bananas, pears, etc - we’ve “weaponized?these fruits in the last few hundred years via strong artificial selection into <a href="">actual candy bars</a>), berries are ~okay. Avoid processed food (follow Michael Pollan’s heuristic of only shopping on the outer walls of a grocery store, staying clear of its center). For meat stick mostly to fish and prefer chicken to beef/pork. For me the avoidance of beef/pork is 1) ethical - they are intelligent large animals, 2) environmental - they have a large environmental footprint (cows generate a lot of methane, a highly potent greenhouse gas) and their keeping leads to a lot of deforestation, 3) health related - a few papers point to some cause for concern in consumption of red meat, and 4) global health - a large fraction of the worst offender infectious diseases are zootopic and jumped to humans from close proximity to livestock.</li> </ul> Thu, 11 Jun 2020 10:00:00 +0000 A Recipe for Training Neural Networks <p>Some few weeks ago I <a href="">posted</a> a tweet on “the most common neural net mistakes? listing a few common gotchas related to training neural nets. The tweet got quite a bit more engagement than I anticipated (including a <a href="">webinar</a> :)). Clearly, a lot of people have personally encountered the large gap between “here is how a convolutional layer works?and “our convnet achieves state of the art results?</p> <p>So I thought it could be fun to brush off my dusty blog to expand my tweet to the long form that this topic deserves. However, instead of going into an enumeration of more common errors or fleshing them out, I wanted to dig a bit deeper and talk about how one can avoid making these errors altogether (or fix them very fast). The trick to doing so is to follow a certain process, which as far as I can tell is not very often documented. Let’s start with two important observations that motivate it.</p> <h4 id="1-neural-net-training-is-a-leaky-abstraction">1) Neural net training is a leaky abstraction</h4> <p>It is allegedly easy to get started with training neural nets. Numerous libraries and frameworks take pride in displaying 30-line miracle snippets that solve your data problems, giving the (false) impression that this stuff is plug and play. It’s common see things like:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">your_data</span> <span class="o">=</span> <span class="c1"># plug your awesome dataset here </span><span class="o">&gt;&gt;&gt;</span> <span class="n">model</span> <span class="o">=</span> <span class="n">SuperCrossValidator</span><span class="p">(</span><span class="n">SuperDuper</span><span class="p">.</span><span class="n">fit</span><span class="p">,</span> <span class="n">your_data</span><span class="p">,</span> <span class="n">ResNet50</span><span class="p">,</span> <span class="n">SGDOptimizer</span><span class="p">)</span> <span class="c1"># conquer world here </span></code></pre></div></div> <p>These libraries and examples activate the part of our brain that is familiar with standard software - a place where clean APIs and abstractions are often attainable. <a href="">Requests</a> library to demonstrate:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">r</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="n">auth</span><span class="o">=</span><span class="p">(</span><span class="s">'user'</span><span class="p">,</span> <span class="s">'pass'</span><span class="p">))</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">r</span><span class="p">.</span><span class="n">status_code</span> <span class="mi">200</span> </code></pre></div></div> <p>That’s cool! A courageous developer has taken the burden of understanding query strings, urls, GET/POST requests, HTTP connections, and so on from you and largely hidden the complexity behind a few lines of code. This is what we are familiar with and expect. Unfortunately, neural nets are nothing like that. They are not “off-the-shelf?technology the second you deviate slightly from training an ImageNet classifier. I’ve tried to make this point in my post <a href="">“Yes you should understand backprop?lt;/a> by picking on backpropagation and calling it a “leaky abstraction? but the situation is unfortunately much more dire. Backprop + SGD does not magically make your network work. Batch norm does not magically make it converge faster. RNNs don’t magically let you “plug in?text. And just because you can formulate your problem as RL doesn’t mean you should. If you insist on using the technology without understanding how it works you are likely to fail. Which brings me to?lt;/p> <h4 id="2-neural-net-training-fails-silently">2) Neural net training fails silently</h4> <p>When you break or misconfigure code you will often get some kind of an exception. You plugged in an integer where something expected a string. The function only expected 3 arguments. This import failed. That key does not exist. The number of elements in the two lists isn’t equal. In addition, it’s often possible to create unit tests for a certain functionality.</p> <p>This is just a start when it comes to training neural nets. Everything could be correct syntactically, but the whole thing isn’t arranged properly, and it’s really hard to tell. The “possible error surface?is large, logical (as opposed to syntactic), and very tricky to unit test. For example, perhaps you forgot to flip your labels when you left-right flipped the image during data augmentation. Your net can still (shockingly) work pretty well because your network can internally learn to detect flipped images and then it left-right flips its predictions. Or maybe your autoregressive model accidentally takes the thing it’s trying to predict as an input due to an off-by-one bug. Or you tried to clip your gradients but instead clipped the loss, causing the outlier examples to be ignored during training. Or you initialized your weights from a pretrained checkpoint but didn’t use the original mean. Or you just screwed up the settings for regularization strengths, learning rate, its decay rate, model size, etc. Therefore, your misconfigured neural net will throw exceptions only if you’re lucky; Most of the time it will train but silently work a bit worse.</p> <p>As a result, (and this is reeaally difficult to over-emphasize) <strong>a “fast and furious?approach to training neural networks does not work</strong> and only leads to suffering. Now, suffering is a perfectly natural part of getting a neural network to work well, but it can be mitigated by being thorough, defensive, paranoid, and obsessed with visualizations of basically every possible thing. The qualities that in my experience correlate most strongly to success in deep learning are patience and attention to detail.</p> <h2 id="the-recipe">The recipe</h2> <p>In light of the above two facts, I have developed a specific process for myself that I follow when applying a neural net to a new problem, which I will try to describe. You will see that it takes the two principles above very seriously. In particular, it builds from simple to complex and at every step of the way we make concrete hypotheses about what will happen and then either validate them with an experiment or investigate until we find some issue. What we try to prevent very hard is the introduction of a lot of “unverified?complexity at once, which is bound to introduce bugs/misconfigurations that will take forever to find (if ever). If writing your neural net code was like training one, you’d want to use a very small learning rate and guess and then evaluate the full test set after every iteration.</p> <h4 id="1-become-one-with-the-data">1. Become one with the data</h4> <p>The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns. Luckily, your brain is pretty good at this. One time I discovered that the data contained duplicate examples. Another time I found corrupted images / labels. I look for data imbalances and biases. I will typically also pay attention to my own process for classifying the data, which hints at the kinds of architectures we’ll eventually explore. As an example - are very local features enough or do we need global context? How much variation is there and what form does it take? What variation is spurious and could be preprocessed out? Does spatial position matter or do we want to average pool it out? How much does detail matter and how far could we afford to downsample the images? How noisy are the labels?</p> <p>In addition, since the neural net is effectively a compressed/compiled version of your dataset, you’ll be able to look at your network (mis)predictions and understand where they might be coming from. And if your network is giving you some prediction that doesn’t seem consistent with what you’ve seen in the data, something is off.</p> <p>Once you get a qualitative sense it is also a good idea to write some simple code to search/filter/sort by whatever you can think of (e.g. type of label, size of annotations, number of annotations, etc.) and visualize their distributions and the outliers along any axis. The outliers especially almost always uncover some bugs in data quality or preprocessing.</p> <h4 id="2-set-up-the-end-to-end-trainingevaluation-skeleton--get-dumb-baselines">2. Set up the end-to-end training/evaluation skeleton + get dumb baselines</h4> <p>Now that we understand our data can we reach for our super fancy Multi-scale ASPP FPN ResNet and begin training awesome models? For sure no. That is the road to suffering. Our next step is to set up a full training + evaluation skeleton and gain trust in its correctness via a series of experiments. At this stage it is best to pick some simple model that you couldn’t possibly have screwed up somehow - e.g. a linear classifier, or a very tiny ConvNet. We’ll want to train it, visualize the losses, any other metrics (e.g. accuracy), model predictions, and perform a series of ablation experiments with explicit hypotheses along the way.</p> <p>Tips &amp; tricks for this stage:</p> <ul> <li><strong>fix random seed</strong>. Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome. This removes a factor of variation and will help keep you sane.</li> <li><strong>simplify</strong>. Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage. Data augmentation is a regularization strategy that we may incorporate later, but for now it is just another opportunity to introduce some dumb bug.</li> <li><strong>add significant digits to your eval</strong>. When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in Tensorboard. We are in pursuit of correctness and are very willing to give up time for staying sane.</li> <li><strong>verify loss @ init</strong>. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure <code class="language-plaintext highlighter-rouge">-log(1/n_classes)</code> on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.</li> <li><strong>init well</strong>. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick?loss curves where in the first few iteration your network is basically just learning the bias.</li> <li><strong>human baseline</strong>. Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it. Alternatively, annotate the test data twice and for each example treat one annotation as prediction and the second as ground truth.</li> <li><strong>input-indepent baseline</strong>. Train an input-independent baseline, (e.g. easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?</li> <li><strong>overfit one batch</strong>. Overfit a single batch of only a few examples (e.g. as little as two). To do so we increase the capacity of our model (e.g. add layers or filters) and verify that we can reach the lowest achievable loss (e.g. zero). I also like to visualize in the same plot both the label and the prediction and ensure that they end up aligning perfectly once we reach the minimum loss. If they do not, there is a bug somewhere and we cannot continue to the next stage.</li> <li><strong>verify decreasing training loss</strong>. At this stage you will hopefully be underfitting on your dataset because you’re working with a toy model. Try to increase its capacity just a bit. Did your training loss go down as it should?</li> <li><strong>visualize just before the net</strong>. The unambiguously correct place to visualize your data is immediately before your <code class="language-plaintext highlighter-rouge">y_hat = model(x)</code> (or <code class="language-plaintext highlighter-rouge"></code> in tf). That is - you want to visualize <em>exactly</em> what goes into your network, decoding that raw tensor of data and labels into visualizations. This is the only “source of truth? I can’t count the number of times this has saved me and revealed problems in data preprocessing and augmentation.</li> <li><strong>visualize prediction dynamics</strong>. I like to visualize model predictions on a fixed test batch during the course of training. The “dynamics?of how these predictions move will give you incredibly good intuition for how the training progresses. Many times it is possible to feel the network “struggle?to fit your data if it wiggles too much in some way, revealing instabilities. Very low or very high learning rates are also easily noticeable in the amount of jitter.</li> <li><strong>use backprop to chart dependencies</strong>. Your deep learning code will often contain complicated, vectorized, and broadcasted operations. A relatively common bug I’ve come across a few times is that people get this wrong (e.g. they use <code class="language-plaintext highlighter-rouge">view</code> instead of <code class="language-plaintext highlighter-rouge">transpose/permute</code> somewhere) and inadvertently mix information across the batch dimension. It is a depressing fact that your network will typically still train okay because it will learn to ignore data from the other examples. One way to debug this (and other related problems) is to set the loss to be something trivial like the sum of all outputs of example <strong>i</strong>, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the <strong>i-th</strong> input. The same strategy can be used to e.g. ensure that your autoregressive model at time t only depends on 1..t-1. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.</li> <li><strong>generalize a special case</strong>. This is a bit more of a general coding tip but I’ve often seen people create bugs when they bite off more than they can chew, writing a relatively general functionality from scratch. I like to write a very specific function to what I’m doing right now, get that to work, and then generalize it later making sure that I get the same result. Often this applies to vectorizing code, where I almost always write out the fully loopy version first and only then transform it to vectorized code one loop at a time.</li> </ul> <h4 id="3-overfit">3. Overfit</h4> <p>At this stage we should have a good understanding of the dataset and we have the full training + evaluation pipeline working. For any given model we can (reproducibly) compute a metric that we trust. We are also armed with our performance for an input-independent baseline, the performance of a few dumb baselines (we better beat these), and we have a rough sense of the performance of a human (we hope to reach this). The stage is now set for iterating on a good model.</p> <p>The approach I like to take to finding a good model has two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss). The reason I like these two stages is that if we are not able to reach a low error rate with any model at all that may again indicate some issues, bugs, or misconfiguration.</p> <p>A few tips &amp; tricks for this stage:</p> <ul> <li><strong>picking the model</strong>. To reach a good training loss you’ll want to choose an appropriate architecture for the data. When it comes to choosing this my #1 advice is: <strong>Don’t be a hero</strong>. I’ve seen a lot of people who are eager to get crazy and creative in stacking up the lego blocks of the neural net toolbox in various exotic architectures that make sense to them. Resist this temptation strongly in the early stages of your project. I always advise people to simply find the most related paper and copy paste their simplest architecture that achieves good performance. E.g. if you are classifying images don’t be a hero and just copy paste a ResNet-50 for your first run. You’re allowed to do something more custom later and beat this.</li> <li><strong>adam is safe</strong>. In the early stages of setting baselines I like to use Adam with a learning rate of <a href="">3e-4</a>. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate. For ConvNets a well-tuned SGD will almost always slightly outperform Adam, but the optimal learning rate region is much more narrow and problem-specific. (Note: If you are using RNNs and related sequence models it is more common to use Adam. At the initial stage of your project, again, don’t be a hero and follow whatever the most related papers do.)</li> <li><strong>complexify only one at a time</strong>. If you have multiple signals to plug into your classifier I would advise that you plug them in one by one and every time ensure that you get a performance boost you’d expect. Don’t throw the kitchen sink at your model at the start. There are other ways of building up complexity - e.g. you can try to plug in smaller images first and make them bigger later, etc.</li> <li><strong>do not trust learning rate decay defaults</strong>. If you are re-purposing code from some other domain always be very careful with learning rate decay. Not only would you want to use different decay schedules for different problems, but - even worse - in a typical implementation the schedule will be based current epoch number, which can vary widely simply depending on the size of your dataset. E.g. ImageNet would decay by 10 on epoch 30. If you’re not training ImageNet then you almost certainly do not want this. If you’re not careful your code could secretely be driving your learning rate to zero too early, not allowing your model to converge. In my own work I always disable learning rate decays entirely (I use a constant LR) and tune this all the way at the very end.</li> </ul> <h4 id="4-regularize">4. Regularize</h4> <p>Ideally, we are now at a place where we have a large model that is fitting at least the training set. Now it is time to regularize it and gain some validation accuracy by giving up some of the training accuracy. Some tips &amp; tricks:</p> <ul> <li><strong>get more data</strong>. First, the by far best and preferred way to regularize a model in any practical setting is to add more real training data. It is a very common mistake to spend a lot engineering cycles trying to squeeze juice out of a small dataset when you could instead be collecting more data. As far as I’m aware adding more data is pretty much the only guaranteed way to monotonically improve the performance of a well-configured neural network almost indefinitely. The other would be ensembles (if you can afford them), but that tops out after ~5 models.</li> <li><strong>data augment</strong>. The next best thing to real data is half-fake data - try out more aggressive data augmentation.</li> <li><strong>creative augmentation</strong>. If half-fake data doesn’t do it, fake data may also do something. People are finding creative ways of expanding datasets; For example, <a href="">domain randomization</a>, use of <a href="">simulation</a>, clever <a href="">hybrids</a> such as inserting (potentially simulated) data into scenes, or even GANs.</li> <li><strong>pretrain</strong>. It rarely ever hurts to use a pretrained network if you can, even if you have enough data.</li> <li><strong>stick with supervised learning</strong>. Do not get over-excited about unsupervised pretraining. Unlike what that blog post from 2008 tells you, as far as I know, no version of it has reported strong results in modern computer vision (though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal to noise ratio).</li> <li><strong>smaller input dimensionality</strong>. Remove features that may contain spurious signal. Any added spurious input is just another opportunity to overfit if your dataset is small. Similarly, if low-level details don’t matter much try to input a smaller image.</li> <li><strong>smaller model size</strong>. In many cases you can use domain knowledge constraints on the network to decrease its size. As an example, it used to be trendy to use Fully Connected layers at the top of backbones for ImageNet but these have since been replaced with simple average pooling, eliminating a ton of parameters in the process.</li> <li><strong>decrease the batch size</strong>. Due to the normalization inside batch norm smaller batch sizes somewhat correspond to stronger regularization. This is because the batch empirical mean/std are more approximate versions of the full mean/std so the scale &amp; offset “wiggles?your batch around more.</li> <li><strong>drop</strong>. Add dropout. Use dropout2d (spatial dropout) for ConvNets. Use this sparingly/carefully because dropout <a href="">does not seem to play nice</a> with batch normalization.</li> <li><strong>weight decay</strong>. Increase the weight decay penalty.</li> <li><strong>early stopping</strong>. Stop training based on your measured validation loss to catch your model just as it’s about to overfit.</li> <li><strong>try a larger model</strong>. I mention this last and only after early stopping but I’ve found a few times in the past that larger models will of course overfit much more eventually, but their “early stopped?performance can often be much better than that of smaller models.</li> </ul> <p>Finally, to gain additional confidence that your network is a reasonable classifier, I like to visualize the network’s first-layer weights and ensure you get nice edges that make sense. If your first layer filters look like noise then something could be off. Similarly, activations inside the net can sometimes display odd artifacts and hint at problems.</p> <h4 id="5-tune">5. Tune</h4> <p>You should now be “in the loop?with your dataset exploring a wide model space for architectures that achieve low validation loss. A few tips and tricks for this step:</p> <ul> <li><strong>random over grid search</strong>. For simultaneously tuning multiple hyperparameters it may sound tempting to use grid search to ensure coverage of all settings, but keep in mind that it is <a href="">best to use random search instead</a>. Intuitively, this is because neural nets are often much more sensitive to some parameters than others. In the limit, if a parameter <strong>a</strong> matters but changing <strong>b</strong> has no effect then you’d rather sample <strong>a</strong> more throughly than at a few fixed points multiple times.</li> <li><strong>hyper-parameter optimization</strong>. There is a large number of fancy bayesian hyper-parameter optimization toolboxes around and a few of my friends have also reported success with them, but my personal experience is that the state of the art approach to exploring a nice and wide space of models and hyperparameters is to use an intern :). Just kidding.</li> </ul> <h4 id="6-squeeze-out-the-juice">6. Squeeze out the juice</h4> <p>Once you find the best types of architectures and hyper-parameters you can still use a few more tricks to squeeze out the last pieces of juice out of the system:</p> <ul> <li><strong>ensembles</strong>. Model ensembles are a pretty much guaranteed way to gain 2% of accuracy on anything. If you can’t afford the computation at test time look into distilling your ensemble into a network using <a href="">dark knowledge</a>.</li> <li><strong>leave it training</strong>. I’ve often seen people tempted to stop the model training when the validation loss seems to be leveling off. In my experience networks keep training for unintuitively long time. One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art?.</li> </ul> <h4 id="conclusion">Conclusion</h4> <p>Once you make it here you’ll have all the ingredients for success: You have a deep understanding of the technology, the dataset and the problem, you’ve set up the entire training/evaluation infrastructure and achieved high confidence in its accuracy, and you’ve explored increasingly more complex models, gaining performance improvements in ways you’ve predicted each step of the way. You’re now ready to read a lot of papers, try a large number of experiments, and get your SOTA results. Good luck!</p> Thu, 25 Apr 2019 09:00:00 +0000 (started posting on Medium instead) <p>The current state of this blog (with the last post 2 years ago) makes it look like I’ve disappeared. I’ve certainly become less active on blogs since I’ve joined Tesla, but whenever I do get a chance to post something I have recently been defaulting to doing it on Medium because it is much faster and easier. I still plan to come back here for longer posts if I get any time, but I’ll default to Medium for everything short-medium in length.</p> <h3 id="tldr">TLDR</h3> <p><strong>Have a look at my <a href="">Medium blog</a>.</strong></p> Sat, 20 Jan 2018 11:00:00 +0000 A Survival Guide to a PhD <p>This guide is patterned after my <a href="">“Doing well in your courses?lt;/a>, a post I wrote a long time ago on some of the tips/tricks I’ve developed during my undergrad. I’ve received nice comments about that guide, so in the same spirit, now that my PhD has come to an end I wanted to compile a similar retrospective document in hopes that it might be helpful to some. Unlike the undergraduate guide, this one was much more difficult to write because there is significantly more variation in how one can traverse the PhD experience. Therefore, many things are likely contentious and a good fraction will be specific to what I’m familiar with (Computer Science / Machine Learning / Computer Vision research). But disclaimers are boring, lets get to it!</p> <h3 id="preliminaries">Preliminaries</h3> <div class="imgcap"> <img src="/assets/phd/phds.jpg" /> </div> <p>First, should you want to get a PhD? I was in a fortunate position of knowing since young age that I really wanted a PhD. Unfortunately it wasn’t for any very well-thought-through considerations: First, I really liked school and learning things and I wanted to learn as much as possible, and second, I really wanted to be like <a href="">Gordon Freeman</a> from the game Half-Life (who has a PhD from MIT in theoretical physics). I loved that game. But what if you’re more sensible in making your life’s decisions? Should you want to do a PhD? There’s a very nice <a href="">Quora thread</a> and in the summary of considerations that follows I’ll borrow/restate several from Justin/Ben/others there. I’ll assume that the second option you are considering is joining a medium-large company (which is likely most common). Ask yourself if you find the following properties appealing:</p> <p><strong>Freedom.</strong> A PhD will offer you a lot of freedom in the topics you wish to pursue and learn about. You’re in charge. Of course, you’ll have an adviser who will impose some constraints but in general you’ll have much more freedom than you might find elsewhere.</p> <p><strong>Ownership.</strong> The research you produce will be yours as an individual. Your accomplishments will have your name attached to them. In contrast, it is much more common to “blend in?inside a larger company. A common feeling here is becoming a “cog in a wheel?</p> <p><strong>Exclusivity</strong>. There are very few people who make it to the top PhD programs. You’d be joining a group of a few hundred distinguished individuals in contrast to a few tens of thousands (?) that will join some company.</p> <p><strong>Status.</strong> Regardless of whether it should be or not, working towards and eventually getting a PhD degree is culturally revered and recognized as an impressive achievement. You also get to be a Doctor; that’s awesome.</p> <p><strong>Personal freedom.</strong> As a PhD student you’re your own boss. Want to sleep in today? Sure. Want to skip a day and go on a vacation? Sure. All that matters is your final output and no one will force you to clock in from 9am to 5pm. Of course, some advisers might be more or less flexible about it and some companies might be as well, but it’s a true first order statement.</p> <p><strong>Maximizing future choice.</strong> Joining a PhD program doesn’t close any doors or eliminate future employment/lifestyle options. You can go one way (PhD -&gt; anywhere else) but not the other (anywhere else -&gt; PhD -&gt; academia/research; it is statistically less likely). Additionally (although this might be quite specific to applied ML), you’re strictly more hirable as a PhD graduate or even as a PhD dropout and many companies might be willing to put you in a more interesting position or with a higher starting salary. More generally, maximizing choice for the future you is a good heuristic to follow.</p> <p><strong>Maximizing variance.</strong> You’re young and there’s really no need to rush. Once you graduate from a PhD you can spend the next ~50 years of your life in some company. Opt for more variance in your experiences.</p> <p><strong>Personal growth.</strong> PhD is an intense experience of rapid growth (you learn a lot) and personal self-discovery (you’ll become a master of managing your own psychology). PhD programs (especially if you can make it into a good one) also offer a <em>high density</em> of exceptionally bright people who will become your best friends forever.</p> <p><strong>Expertise.</strong> PhD is probably your only opportunity in life to really drill deep into a topic and become a recognized leading expert <em>in the world</em> at something. You’re exploring the edge of our knowledge as a species, without the burden of lesser distractions or constraints. There’s something beautiful about that and if you disagree, it could be a sign that PhD is not for you.</p> <p><strong>The disclaimer</strong>. I wanted to also add a few words on some of the potential downsides and failure modes. The PhD is a very specific kind of experience that deserves a large disclaimer. You will inevitably find yourself working very hard (especially before paper deadlines). You need to be okay with the suffering and have enough mental stamina and determination to deal with the pressure. At some points you will lose track of what day of the week it is and go on a diet of leftover food from the microkitchens. You’ll sit exhausted and alone in the lab on a beautiful, sunny Saturday scrolling through Facebook pictures of your friends having fun on exotic trips, paid for by their 5-10x larger salaries. You will have to throw away 3 months of your work while somehow keeping your mental health intact. You’ll struggle with the realization that months of your work were spent on a paper with a few citations while your friends do exciting startups with TechCrunch articles or push products to millions of people. You’ll experience identity crises during which you’ll question your life decisions and wonder what you’re doing with some of the best years of your life. As a result, you should be quite certain that you can thrive in an unstructured environment in the pursuit research and discovery for science. If you’re unsure you should lean slightly negative by default. Ideally you should consider getting a taste of research as an undergraduate on a summer research program before before you decide to commit. In fact, one of the primary reasons that research experience is so desirable during the PhD hiring process is not the research itself, but the fact that the student is more likely to know what they’re getting themselves into.</p> <p>I should clarify explicitly that this post is not about convincing anyone to do a PhD, I’ve merely tried to enumerate some of the common considerations above. The majority of this post focuses on some tips/tricks for navigating the experience once if you decide to go for it (which we’ll see shortly, below).</p> <p>Lastly, as a random thought I heard it said that you should only do a PhD if you want to go into academia. In light of all of the above I’d argue that a PhD has strong intrinsic value - it’s an end by itself, not just a means to some end (e.g. academic job).</p> <p><strong>Getting into a PhD program: references, references, references.</strong> Great, you’ve decided to go for it. Now how do you get into a good PhD program? The first order approximation is quite simple - by far most important component are strong reference letters. The ideal scenario is that a well-known professor writes you a letter along the lines of: “Blah is in top 5 of students I’ve ever worked with. She takes initiative, comes up with her own ideas, and gets them to work.?The worst letter is along the lines of: “Blah took my class. She did well.?A research publication under your belt from a summer research program is a very strong bonus, but not absolutely required provided you have strong letters. In particular note: grades are quite irrelevant but you generally don’t want them to be too low. This was not obvious to me as an undergrad and I spent a lot of energy on getting good grades. This time should have instead been directed towards research (or at the very least personal projects), as much and as early as possible, and if possible under supervision of multiple people (you’ll need 3+ letters!). As a last point, what won’t help you too much is pestering your potential advisers out of the blue. They are often incredibly busy people and if you try to approach them too aggressively in an effort to impress them somehow in conferences or over email this may agitate them.</p> <p><strong>Picking the school</strong>. Once you get into some PhD programs, how do you pick the school? It’s easy, join Stanford! Just kidding. More seriously, your dream school should 1) be a top school (not because it looks good on your resume/CV but because of feedback loops; top schools attract other top people, many of whom you will get to know and work with) 2) have a few potential advisers you would want to work with. I really do mean the “few?part - this is very important and provides a safety cushion for you if things don’t work out with your top choice for any one of hundreds of reasons - things in many cases outside of your control, e.g. your dream professor leaves, moves, or spontaneously disappears, and 3) be in a good environment physically. I don’t think new admits appreciate this enough: you will spend 5+ years of your really good years living near the school campus. Trust me, this is a long time and your life will consist of much more than just research.</p> <h3 id="adviser">Adviser</h3> <div class="imgcap"> <img src="/assets/phd/adviser.gif" /> <div class="thecap">Image credit: <a href="">PhD comics</a>.</div> </div> <p><strong>Student adviser relationship</strong>. The adviser is an extremely important person who will exercise a lot of influence over your PhD experience. It’s important to understand the nature of the relationship: the adviser-student relationship is a symbiosis; you have your own goals and want something out of your PhD, but they also have their own goals, constraints and they’re building their own career. Therefore, it is very helpful to understand your adviser’s incentive structures: how the tenure process works, how they are evaluated, how they get funding, how they fund you, what department politics they might be embedded in, how they win awards, how academia in general works and specifically how they gain recognition and respect of their colleagues. This alone will help you avoid or mitigate a large fraction of student-adviser friction points and allow you to plan appropriately. I also don’t want to make the relationship sound too much like a business transaction. The advisor-student relationship, more often that not, ends up developing into a lasting one, predicated on much more than just career advancement.</p> <p><strong>Pre-vs-post tenure</strong>. Every adviser is different so it’s helpful to understand the axes of variations and their repercussions on your PhD experience. As one rule of thumb (and keep in mind there are many exceptions), it’s important to keep track of whether a potential adviser is pre-tenure or post-tenure. The younger faculty members will usually be around more (they are working hard to get tenure) and will usually be more low-level, have stronger opinions on what you should be working on, they’ll do math with you, pitch concrete ideas, or even look at (or contribute to) your code. This is a much more hands-on and possibly intense experience because the adviser will need a strong publication record to get tenure and they are incentivised to push you to work just as hard. In contrast, more senior faculty members may have larger labs and tend to have many other commitments (e.g. committees, talks, travel) other than research, which means that they can only afford to stay on a higher level of abstraction both in the area of their research and in the level of supervision for their students. To caricature, it’s a difference between “you’re missing a second term in that equation?and “you may want to read up more in this area, talk to this or that person, and sell your work this or that way? In the latter case, the low-level advice can still come from the senior PhD students in the lab or the postdocs.</p> <p><strong>Axes of variation</strong>. There are many other axes to be aware of. Some advisers are fluffy and some prefer to keep your relationship very professional. Some will try to exercise a lot of influence on the details of your work and some are much more hands off. Some will have a focus on specific models and their applications to various tasks while some will focus on tasks and more indifference towards any particular modeling approach. In terms of more managerial properties, some will meet you every week (or day!) multiple times and some you won’t see for months. Some advisers answer emails right away and some don’t answer email for a week (or ever, haha). Some advisers make demands about your work schedule (e.g. you better work long hours or weekends) and some won’t. Some advisers generously support their students with equipment and some think laptops or old computers are mostly fine. Some advisers will fund you to go to a conferences even if you don’t have a paper there and some won’t. Some advisers are entrepreneurial or applied and some lean more towards theoretical work. Some will let you do summer internships and some will consider internships just a distraction.</p> <p><strong>Finding an adviser</strong>. So how do you pick an adviser? The first stop, of course, is to talk to them in person. The student-adviser relationship is sometimes referred to as a marriage and you should make sure that there is a good fit. Of course, first you want to make sure that you can talk with them and that you get along personally, but it’s also important to get an idea of what area of “professor space?they occupy with respect to the aforementioned axes, and especially whether there is an intellectual resonance between the two of you in terms of the problems you are interested in. This can be just as important as their management style.</p> <p><strong>Collecting references</strong>. You should also collect references on your potential adviser. One good strategy is to talk to their students. If you want to get actual information this shouldn’t be done in a very formal way or setting but in a relaxed environment or mood (e.g. a party). In many cases the students might still avoid saying bad things about the adviser if asked in a general manner, but they will usually answer truthfully when you ask specific questions, e.g. “how often do you meet?? or “how hands on are they?? Another strategy is to look at where their previous students ended up (you can usually find this on the website under an alumni section), which of course also statistically informs your own eventual outcome.</p> <p><strong>Impressing an adviser</strong>. The adviser-student matching process is sometimes compared to a marriage - you pick them but they also pick you. The ideal student from their perspective is someone with interest and passion, someone who doesn’t need too much hand-holding, and someone who takes initiative - who shows up a week later having done not just what the adviser suggested, but who went beyond it; improved on it in unexpected ways.</p> <p><strong>Consider the entire lab</strong>. Another important point to realize is that you’ll be seeing your adviser maybe once a week but you’ll be seeing most of their students every single day in the lab and they will go on to become your closest friends. In most cases you will also end up collaborating with some of the senior PhD students or postdocs and they will play a role very similar to that of your adviser. The postdocs, in particular, are professors-in-training and they will likely be eager to work with you as they are trying to gain advising experience they can point to for their academic job search. Therefore, you want to make sure the entire group has people you can get along with, people you respect and who you can work with closely on research projects.</p> <h3 id="research-topics">Research topics</h3> <div class="imgcap"> <img src="/assets/phd/arxiv-papers.png" /> <div class="thecap">t-SNE visualization of a small subset of human knowledge (from <a href="">paperscape</a>). Each circle is an arxiv paper and size indicates the number of citations.</div> </div> <p>So you’ve entered a PhD program and found an adviser. Now what do you work on?</p> <p><strong>An exercise in the outer loop.</strong> First note the nature of the experience. A PhD is simultaneously a fun and frustrating experience because you’re constantly operating on a meta problem level. You’re not just solving problems - that’s merely the simple inner loop. You spend most of your time on the outer loop, figuring out what problems are worth solving and what problems are ripe for solving. You’re constantly imagining yourself solving hypothetical problems and asking yourself where that puts you, what it could unlock, or if anyone cares. If you’re like me this can sometimes drive you a little crazy because you’re spending long hours working on things and you’re not even sure if they are the correct things to work on or if a solution exists.</p> <p><strong>Developing taste</strong>. When it comes to choosing problems you’ll hear academics talk about a mystical sense of “taste? It’s a real thing. When you pitch a potential problem to your adviser you’ll either see their face contort, their eyes rolling, and their attention drift, or you’ll sense the excitement in their eyes as they contemplate the uncharted territory ripe for exploration. In that split second a lot happens: an evaluation of the problem’s importance, difficulty, its <em>sexiness</em>, its historical context (and possibly also its fit to their active grants). In other words, your adviser is likely to be a master of the outer loop and will have a highly developed sense of <em>taste</em> for problems. During your PhD you’ll get to acquire this sense yourself.</p> <p>In particular, I think I had a terrible taste coming in to the PhD. I can see this from the notes I took in my early PhD years. A lot of the problems I was excited about at the time were in retrospect poorly conceived, intractable, or irrelevant. I’d like to think I refined the sense by the end through practice and apprenticeship.</p> <p>Let me now try to serialize a few thoughts on what goes into this sense of taste, and what makes a problem interesting to work on.</p> <p><strong>A fertile ground.</strong> First, recognize that during your PhD you will dive deeply into one area and your papers will very likely chain on top of each other to create a body of work (which becomes your thesis). Therefore, you should always be thinking several steps ahead when choosing a problem. It’s impossible to predict how things will unfold but you can often get a sense of how much room there could be for additional work.</p> <p><strong>Plays to your adviser’s interests and strengths</strong>. You will want to operate in the realm of your adviser’s interest. Some advisers may allow you to work on slightly tangential areas but you would not be taking full advantage of their knowledge and you are making them less likely to want to help you with your project or promote your work. For instance, (and this goes to my previous point of understanding your adviser’s job) every adviser has a “default talk?slide deck on their research that they give all the time and if your work can add new exciting cutting edge work slides to this deck then you’ll find them much more invested, helpful and involved in your research. Additionally, their talks will promote and publicize your work.</p> <p><strong>Be ambitious: the sublinear scaling of hardness.</strong> People have a strange bug built into psychology: a 10x more important or impactful problem intuitively <em>feels</em> 10x harder (or 10x less likely) to achieve. This is a fallacy - in my experience a 10x more important problem is at most 2-3x harder to achieve. In fact, in some cases a 10x harder problem may be easier to achieve. How is this? It’s because thinking 10x forces you out of the box, to confront the real limitations of an approach, to think from first principles, to change the strategy completely, to innovate. If you aspire to improve something by 10% and work hard then you will. But if you aspire to improve it by 100% you are still quite likely to, but you will do it very differently.</p> <p><strong>Ambitious but with an attack.</strong> At this point it’s also important to point out that there are plenty of important problems that don’t make great projects. I recommend reading <a href="You and Your Research">You and Your Research</a> by Richard Hamming, where this point is expanded on:</p> <blockquote> <p>If you do not work on an important problem, it’s unlikely you’ll do important work. It’s perfectly obvious. Great scientists have thought through, in a careful way, a number of important problems in their field, and they keep an eye on wondering how to attack them. Let me warn you, `important problem?must be phrased carefully. The three outstanding problems in physics, in a certain sense, were never worked on while I was at Bell Labs. By important I mean guaranteed a Nobel Prize and any sum of money you want to mention. We didn’t work on (1) time travel, (2) teleportation, and (3) antigravity. They are not important problems because we do not have an attack. It’s not the consequence that makes a problem important, it is that you have a reasonable attack. That is what makes a problem important.</p> </blockquote> <p><strong>The person who did X</strong>. Ultimately, the goal of a PhD is to not only develop a deep expertise in a field but to also make your mark upon it. To steer it, shape it. The ideal scenario is that by the end of the PhD you own some part of an important area, preferably one that is also easy and fast to describe. You want people to say things like “she’s the person who did X? If you can fill in a blank there you’ll be successful.</p> <p><strong>Valuable skills.</strong> Recognize that during your PhD you will become an expert at the area of your choosing (as fun aside, note that [5 years]x[260 working days]x[8 hours per day] is 10,400 hours; if you believe Gladwell then a PhD is exactly the amount of time to become an expert). So imagine yourself 5 years later being a world expert in this area (the 10,000 hours will ensure that regardless of the academic impact of your work). Are these skills exciting or potentially valuable to your future endeavors?</p> <p><strong>Negative examples.</strong> There are also some problems or types of papers that you ideally want to avoid. For instance, you’ll sometimes hear academics talk about <em>“incremental work?lt;/em> (this is the worst adjective possible in academia). Incremental work is a paper that enhances something existing by making it more complex and gets 2% extra on some benchmark. The amusing thing about these papers is that they have a reasonably high chance of getting accepted (a reviewer can’t point to anything to kill them; they are also sometimes referred to as ?lt;em>cockroach papers</em>?, so if you have a string of these papers accepted you can feel as though you’re being very productive, but in fact these papers won’t go on to be highly cited and you won’t go on to have a lot of impact on the field. Similarly, finding projects should ideally not include thoughts along the lines of “there’s this next logical step in the air that no one has done yet, let me do it? or “this should be an easy poster?</p> <p><strong>Case study: my thesis</strong>. To make some of this discussion more concrete I wanted to use the example of how my own PhD unfolded. First, fun fact: my entire thesis is based on work I did in the last 1.5 years of my PhD. i.e. it took me quite a long time to wiggle around in the metaproblem space and find a problem that I felt very excited to work on (the other ~2 years I mostly meandered on 3D things (e.g. Kinect Fusion, 3D meshes, point cloud features) and video things). Then at one point in my 3rd year I randomly stopped by Richard Socher’s office on some Saturday at 2am. We had a chat about interesting problems and I realized that some of his work on images and language was in fact getting at something very interesting (of course, the area at the intersection of images and language goes back quite a lot further than Richard as well). I couldn’t quite see all the papers that would follow but it seemed heuristically very promising: it was highly fertile (a lot of unsolved problems, a lot of interesting possibilities on grounding descriptions to images), I felt that it was very cool and important, it was easy to explain, it seemed to be at the boundary of possible (Deep Learning has just started to work), the datasets had just started to become available (Flickr8K had just come out), it fit nicely into Fei-Fei’s interests and even if I were not successful I’d at least get lots of practice with optimizing interesting deep nets that I could reapply elsewhere. I had a strong feeling of a tsunami of checkmarks as everything clicked in place in my mind. I pitched this to Fei-Fei (my adviser) as an area to dive into the next day and, with relief, she enthusiastically approved, encouraged me, and would later go on to steer me within the space (e.g. Fei-Fei insisted that I do image to sentence generation while I was mostly content with ranking.). I’m happy with how things evolved from there. In short, I meandered around for 2 years stuck around the outer loop, finding something to dive into. Once it clicked for me what that was based on several heuristics, I dug in.</p> <p><strong>Resistance</strong>. I’d like to also mention that your adviser is by no means infallible. I’ve witnessed and heard of many instances in which, in retrospect, the adviser made the wrong call. If you feel this way during your phd you should have the courage to sometimes ignore your adviser. Academia generally celebrates independent thinking but the response of your specific adviser can vary depending on circumstances. I’m aware of multiple cases where the bet worked out very well and I’ve also personally experienced cases where it did not. For instance, I disagreed strongly with some advice Andrew Ng gave me in my very first year. I ended up working on a problem he wasn’t very excited about and, surprise, he turned out to be very right and I wasted a few months. Win some lose some :)</p> <p><strong>Don’t play the game.</strong> Finally, I’d like to challenge you to think of a PhD as more than just a sequence of papers. You’re not a paper writer. You’re a member of a research community and your goal is to push the field forward. Papers are one common way of doing that but I would encourage you to look beyond the established academic game. Think for yourself and from first principles. Do things others don’t do but should. Step off the treadmill that has been put before you. I tried to do some of this myself throughout my PhD. This blog is an example - it allows me communicate things that wouldn’t ordinarily go into papers. The ImageNet human reference experiments are an example - I felt strongly that it was important for the field to know the ballpark human accuracy on ILSVRC so I took a few weeks off and evaluated it. The academic search tools (e.g. arxiv-sanity) are an example - I felt continuously frustrated by the inefficiency of finding papers in the literature and I released and maintain the site in hopes that it can be useful to others. Teaching CS231n twice is an example - I put much more effort into it than is rationally advisable for a PhD student who should be doing research, but I felt that the field was held back if people couldn’t efficiently learn about the topic and enter. A lot of my PhD endeavors have likely come at a cost in standard academic metrics (e.g. h-index, or number of publications in top venues) but I did them anyway, I would do it the same way again, and here I am encouraging others to as well. To add a pitch of salt and wash down the ideology a bit, based on several past discussions with my friends and colleagues I know that this view is contentious and that many would disagree.</p> <h3 id="writing-papers">Writing papers</h3> <div class="imgcap"> <img src="/assets/phd/latex.png" /> </div> <p>Writing good papers is an essential survival skill of an academic (kind of like making fire for a caveman). In particular, it is very important to realize that papers are a specific thing: they look a certain way, they flow a certain way, they have a certain structure, language, and statistics that the other academics expect. It’s usually a painful exercise for me to look through some of my early PhD paper drafts because they are quite terrible. There is a lot to learn here.</p> <p><strong>Review papers.</strong> If you’re trying to learn to write better papers it can feel like a sensible strategy to look at many good papers and try to distill patterns. This turns out to not be the best strategy; it’s analogous to only receiving positive examples for a binary classification problem. What you really want is to also have exposure to a large number of bad papers and one way to get this is by reviewing papers. Most good conferences have an acceptance rate of about 25% so most papers you’ll review are bad, which will allow you to build a powerful binary classifier. You’ll read through a bad paper and realize how unclear it is, or how it doesn’t define it’s variables, how vague and abstract its intro is, or how it dives in to the details too quickly, and you’ll learn to avoid the same pitfalls in your own papers. Another related valuable experience is to attend (or form) journal clubs - you’ll see experienced researchers critique papers and get an impression for how your own papers will be analyzed by others.</p> <p><strong>Get the gestalt right.</strong> I remember being impressed with Fei-Fei (my adviser) once during a reviewing session. I had a stack of 4 papers I had reviewed over the last several hours and she picked them up, flipped through each one for 10 seconds, and said one of them was good and the other three bad. Indeed, I was accepting the one and rejecting the other three, but something that took me several hours took her seconds. Fei-Fei was relying on the <em>gestalt</em> of the papers as a powerful heuristic. Your papers, as you become a more senior researcher take on a characteristic look. An introduction of ~1 page. A ~1 page related work section with a good density of citations - not too sparse but not too crowded. A well-designed pull figure (on page 1 or 2) and system figure (on page 3) that were not made in MS Paint. A technical section with some math symbols somewhere, results tables with lots of numbers and some of them bold, one additional cute analysis experiment, and the paper has exactly 8 pages (the page limit) and not a single line less. You’ll have to learn how to endow your papers with the same gestalt because many researchers rely on it as a cognitive shortcut when they judge your work.</p> <p><strong>Identify the core contribution</strong>. Before you start writing anything it’s important to identify the single core contribution that your paper makes to the field. I would especially highlight the word <em>single</em>. A paper is not a random collection of some experiments you ran that you report on. The paper sells a single thing that was not obvious or present before. You have to argue that the thing is important, that it hasn’t been done before, and then you support its merit experimentally in controlled experiments. The entire paper is organized around this core contribution with surgical precision. In particular it doesn’t have any additional fluff and it doesn’t try to pack anything else on a side. As a concrete example, I made a mistake in one of my earlier papers on <a href="">video classification</a> where I tried to pack in two contributions: 1) a set of architectural layouts for video convnets and an unrelated 2) multi-resolution architecture which gave small improvements. I added it because I reasoned first that maybe someone could find it interesting and follow up on it later and second because I thought that contributions in a paper are additive: two contributions are better than one. Unfortunately, this is false and very wrong. The second contribution was minor/dubious and it diluted the paper, it was distracting, and no one cared. I’ve made a similar mistake again in my <a href="">CVPR 2014 paper</a> which presented two separate models: a ranking model and a generation model. Several good in-retrospect arguments could be made that I should have submitted two separate papers; the reason it was one is more historical than rational.</p> <p><strong>The structure.</strong> Once you’ve identified your core contribution there is a default recipe for writing a paper about it. The upper level structure is by default Intro, Related Work, Model, Experiments, Conclusions. When I write my intro I find that it helps to put down a coherent top-level narrative in latex comments and then fill in the text below. I like to organize each of my paragraphs around a single concrete point stated on the first sentence that is then supported in the rest of the paragraph. This structure makes it easy for a reader to skim the paper. A good flow of ideas is then along the lines of 1) X (+define X if not obvious) is an important problem 2) The core challenges are this and that. 2) Previous work on X has addressed these with Y, but the problems with this are Z. 3) In this work we do W (?). 4) This has the following appealing properties and our experiments show this and that. You can play with this structure a bit but these core points should be clearly made. Note again that the paper is surgically organized around your exact contribution. For example, when you list the challenges you want to list exactly the things that you address later; you don’t go meandering about unrelated things to what you have done (you can speculate a bit more later in conclusion). It is important to keep a sensible structure throughout your paper, not just in the intro. For example, when you explain the model each section should: 1) explain clearly what is being done in the section, 2) explain what the core challenges are 3) explain what a baseline approach is or what others have done before 4) motivate and explain what you do 5) describe it.</p> <p><strong>Break the structure.</strong> You should also feel free (and you’re encouraged to!) play with these formulas to some extent and add some spice to your papers. For example, see this amusing paper from <a href="">Razavian et al. in 2014</a> that structures the introduction as a dialog between a student and the professor. It’s clever and I like it. As another example, a lot of papers from <a href="">Alyosha Efros</a> have a playful tone and make great case studies in writing fun papers. As only one of many examples, see this paper he wrote with Antonio Torralba: <a href="">Unbiased look at dataset bias</a>. Another possibility I’ve seen work well is to include an FAQ section, possibly in the appendix.</p> <p><strong>Common mistake: the laundry list.</strong> One very common mistake to avoid is the “laundry list? which looks as follows: “Here is the problem. Okay now to solve this problem first we do X, then we do Y, then we do Z, and now we do W, and here is what we get? You should try very hard to avoid this structure. Each point should be justified, motivated, explained. Why do you do X or Y? What are the alternatives? What have others done? It’s okay to say things like this is common (add citation if possible). Your paper is not a report, an enumeration of what you’ve done, or some kind of a translation of your chronological notes and experiments into latex. It is a highly processed and very focused discussion of a problem, your approach and its context. It is supposed to teach your colleagues something and you have to justify your steps, not just describe what you did.</p> <p><strong>The language.</strong> Over time you’ll develop a vocabulary of good words and bad words to use when writing papers. Speaking about machine learning or computer vision papers specifically as concrete examples, in your papers you never “study?or “investigate?(there are boring, passive, bad words); instead you “develop?or even better you “propose? And you don’t present a “system?or, <em>shudder</em>, a “pipeline? instead, you develop a “model? You don’t learn “features? you learn “representations? And god forbid, you never “combine? “modify?or “expand? These are incremental, gross terms that will certainly get your paper rejected :).</p> <p><strong>An internal deadlines 2 weeks prior</strong>. Not many labs do this, but luckily Fei-Fei is quite adamant about an internal deadline 2 weeks before the due date in which you must submit at least a 5-page draft with all the final experiments (even if not with final numbers) that goes through an internal review process identical to the external one (with the same review forms filled out, etc). I found this practice to be extremely useful because forcing yourself to lay out the full paper almost always reveals some number of critical experiments you must run for the paper to flow and for its argument flow to be coherent, consistent and convincing.</p> <p>Another great resource on this topic is <a href="">Tips for Writing Technical Papers</a> from Jennifer Widom.</p> <h3 id="writing-code">Writing code</h3> <div class="imgcap"> <img src="/assets/phd/code.jpg" /> </div> <p>A lot of your time will of course be taken up with the <em>execution</em> of your ideas, which likely involves a lot of coding. I won’t dwell on this too much because it’s not uniquely academic, but I would like to bring up a few points.</p> <p><strong>Release your code</strong>. It’s a somewhat surprising fact but you can get away with publishing papers and not releasing your code. You will also feel a lot of incentive to not release your code: it can be a lot of work (research code can look like spaghetti since you iterate very quickly, you have to clean up a lot), it can be intimidating to think that others might judge you on your at most decent coding abilities, it is painful to maintain code and answer questions from other people about it (forever), and you might also be concerned that people could spot bugs that invalidate your results. However, it is precisely for some of these reasons that you should commit to releasing your code: it will force you to adopt better coding habits due to fear of public shaming (which will end up saving you time!), it will force you to learn better engineering practices, it will force you to be more thorough with your code (e.g. writing unit tests to make bugs much less likely), it will make others much more likely to follow up on your work (and hence lead to more citations of your papers) and of course it will be much more useful to everyone as a record of exactly what was done for posterity. When you do release your code I recommend taking advantage of <a href="">docker containers</a>; this will reduce the amount of headaches people email you about when they can’t get all the dependencies (and their precise versions) installed.</p> <p><strong>Think of the future you</strong>. Make sure to document all your code very well for yourself. I guarantee you that you will come back to your code base a few months later (e.g. to do a few more experiments for the camera ready version of the paper), and you will feel <em>completely</em> lost in it. I got into the habit of creating very thorough readme.txt files in all my repos (for my personal use) as notes to future self on how the code works, how to run it, etc.</p> <h3 id="giving-talks">Giving talks</h3> <div class="imgcap"> <img src="/assets/phd/talk.jpg" /> </div> <p>So, you published a paper and it’s an oral! Now you get to give a few minute talk to a large audience of people - what should it look like?</p> <p><strong>The goal of a talk</strong>. First, that there’s a common misconception that the goal of your talk is to tell your audience about what you did in your paper. This is incorrect, and should only be a second or third degree design criterion. The goal of your talk is to 1) get the audience really excited about the <strong>problem</strong> you worked on (they must appreciate it or they will not care about your solution otherwise!) 2) teach the audience something (ideally while giving them a taste of your insight/solution; don’t be afraid to spend time on other’s related work), and 3) entertain (they will start checking their Facebook otherwise). Ideally, by the end of the talk the people in your audience are thinking some mixture of “wow, I’m working in the wrong area? “I have to read this paper? and “This person has an impressive understanding of the whole area?</p> <p><strong>A few do’s:</strong> There are several properties that make talks better. For instance, Do: Lots of pictures. People Love pictures. Videos and animations should be used more sparingly because they distract. Do: make the talk actionable - talk about something someone can <em>do</em> after your talk. Do: give a live demo if possible, it can make your talk more memorable. Do: develop a broader intellectual arch that your work is part of. Do: develop it into a story (people love stories). Do: cite, cite, cite - a lot! It takes very little slide space to pay credit to your colleagues. It pleases them and always reflects well on you because it shows that you’re humble about your own contribution, and aware that it builds on a lot of what has come before and what is happening in parallel. You can even cite related work published at the same conference and briefly advertise it. Do: practice the talk! First for yourself in isolation and later to your lab/friends. This almost always reveals very insightful flaws in your narrative and flow.</p> <p><strong>Don’t: texttexttext</strong>. Don’t crowd your slides with text. There should be very few or no bullet points - speakers sometimes try to use these as a crutch to remind themselves what they should be talking about but the slides are not for you they are for the audience. These should be in your speaker notes. On the topic of crowding the slides, also avoid complex diagrams as much as you can - your audience has a fixed bit bandwidth and I guarantee that your own very familiar and “simple?diagram is not as simple or interpretable to someone seeing it for the first time.</p> <p><strong>Careful with: result tables:</strong> Don’t include dense tables of results showing that your method works better. You got a paper, I’m sure your results were decent. I always find these parts boring and unnecessary unless the numbers show something interesting (other than your method works better), or of course unless there is a large gap that you’re very proud of. If you do include results or graphs build them up slowly with transitions, don’t post them all at once and spend 3 minutes on one slide.</p> <p><strong>Pitfall: the thin band between bored/confused</strong>. It’s actually quite tricky to design talks where a good portion of your audience <em>learns</em> something. A common failure case (as an audience member) is to see talks where I’m painfully bored during the first half and completely confused during the second half, learning nothing by the end. This can occur in talks that have a very general (too general) overview followed by a technical (too technical) second portion. Try to identify when your talk is in danger of having this property.</p> <p><strong>Pitfall: running out of time</strong>. Many speakers spend too much time on the early intro parts (that can often be somewhat boring) and then frantically speed through all the last few slides that contain the most interesting results, analysis or demos. Don’t be that person.</p> <p><strong>Pitfall: formulaic talks</strong>. I might be a special case but I’m always a fan of non-formulaic talks that challenge conventions. For instance, I <em>despise</em> the outline slide. It makes the talk so boring, it’s like saying: “This movie is about a ring of power. In the first chapter we’ll see a hobbit come into possession of the ring. In the second we’ll see him travel to Mordor. In the third he’ll cast the ring into Mount Doom and destroy it. I will start with chapter 1?- Come on! I use outline slides for much longer talks to keep the audience anchored if they zone out (at 30min+ they inevitably will a few times), but it should be used sparingly.</p> <p><strong>Observe and learn</strong>. Ultimately, the best way to become better at giving talks (as it is with writing papers too) is to make conscious effort to pay attention to what great (and not so great) speakers do and build a binary classifier in your mind. Don’t just enjoy talks; analyze them, break them down, learn from them. Additionally, pay close attention to the audience and their reactions. Sometimes a speaker will put up a complex table with many numbers and you will notice half of the audience immediately look down on their phone and open Facebook. Build an internal classifier of the events that cause this to happen and avoid them in your talks.</p> <h3 id="attending-conferences">Attending conferences</h3> <div class="imgcap"> <img src="/assets/phd/posters.jpg" /> </div> <p>On the subject of conferences:</p> <p><strong>Go.</strong> It’s very important that you go to conferences, especially the 1-2 top conferences in your area. If your adviser lacks funds and does not want to pay for your travel expenses (e.g. if you don’t have a paper) then you should be willing to pay for yourself (usually about $2000 for travel, accommodation, registration and food). This is important because you want to become part of the academic community and get a chance to meet more people in the area and gossip about research topics. Science might have this image of a few brilliant lone wolfs working in isolation, but the truth is that research is predominantly a highly social endeavor - you stand on the shoulders of many people, you’re working on problems in parallel with other people, and it is these people that you’re also writing papers to. Additionally, it’s unfortunate but each field has knowledge that doesn’t get serialized into papers but is instead spread across a shared understanding of the community; things such as what are the next important topics to work on, what papers are most interesting, what is the inside scoop on papers, how they developed historically, what methods work (not just on paper, in reality), etcetc. It is very valuable (and fun!) to become part of the community and get direct access to the hivemind - to learn from it first, and to hopefully influence it later.</p> <p><strong>Talks: choose by speaker</strong>. One conference trick I’ve developed is that if you’re choosing which talks to attend it can be better to look at the speakers instead of the topics. Some people give better talks than others (it’s a skill, and you’ll discover these people in time) and in my experience I find that it often pays off to see them speak even if it is on a topic that isn’t exactly connected to your area of research.</p> <p><strong>The real action is in the hallways</strong>. The speed of innovation (especially in Machine Learning) now works at timescales much faster than conferences so most of the relevant papers you’ll see at the conference are in fact old news. Therefore, conferences are primarily a social event. Instead of attending a talk I encourage you to view the hallway as one of the main events that doesn’t appear on the schedule. It can also be valuable to stroll the poster session and discover some interesting papers and ideas that you may have missed.</p> <blockquote> <p>It is said that there are three stages to a PhD. In the first stage you look at a related paper’s reference section and you haven’t read most of the papers. In the second stage you recognize all the papers. In the third stage you’ve shared a beer with all the first authors of all the papers.</p> </blockquote> <h3 id="closing-thoughts">Closing thoughts</h3> <p>I can’t find the quote anymore but I heard Sam Altman of YC say that there are no shortcuts or cheats when it comes to building a startup. You can’t expect to win in the long run by somehow gaming the system or putting up false appearances. I think that the same applies in academia. Ultimately you’re trying to do good research and push the field forward and if you try to game any of the proxy metrics you won’t be successful in the long run. This is especially so because academia is in fact surprisingly small and highly interconnected, so anything shady you try to do to pad your academic resume (e.g. self-citing a lot, publishing the same idea multiple times with small remixes, resubmitting the same rejected paper over and over again with no changes, conveniently trying to leave out some baselines etc.) will eventually catch up with you and you will not be successful.</p> <p>So at the end of the day it’s quite simple. Do good work, communicate it properly, people will notice and good things will happen. Have a fun ride!</p> <p><br /><br /> EDIT: <a href="">HN discussion link</a>.</p> Wed, 07 Sep 2016 11:00:00 +0000 Deep Reinforcement Learning: Pong from Pixels <!-- <svg width="800" height="200"> <rect width="800" height="200" style="fill:rgb(98,51,20)" /> <rect width="20" height="50" x="20" y="100" style="fill:rgb(189,106,53)" /> <rect width="20" height="50" x="760" y="30" style="fill:rgb(77,175,75)" /> <rect width="10" height="10" x="400" y="60" style="fill:rgb(225,229,224)" /> </svg> --> <p>This is a long overdue blog post on Reinforcement Learning (RL). RL is hot! You may have noticed that computers can now automatically <a href="">learn to play ATARI games</a> (from raw game pixels!), they are beating world champions at <a href="">Go</a>, simulated quadrupeds are learning to <a href="">run and leap</a>, and robots are learning how to perform <a href="">complex manipulation tasks</a> that defy explicit programming. It turns out that all of these advances fall under the umbrella of RL research. I also became interested in RL myself over the last ~year: I worked <a href="">through Richard Sutton’s book</a>, read through <a href="">David Silver’s course</a>, watched <a href="">John Schulmann’s lectures</a>, wrote an <a href="">RL library in Javascript</a>, over the summer interned at DeepMind working in the DeepRL group, and most recently pitched in a little with the design/development of <a href="">OpenAI Gym</a>, a new RL benchmarking toolkit. So I’ve certainly been on this funwagon for at least a year but until now I haven’t gotten around to writing up a short post on why RL is a big deal, what it’s about, how it all developed and where it might be going.</p> <div class="imgcap"> <img src="/assets/rl/preview.jpeg" /> <div class="thecap">Examples of RL in the wild. <b>From left to right</b>: Deep Q Learning network playing ATARI, AlphaGo, Berkeley robot stacking Legos, physically-simulated quadruped leaping over terrain.</div> </div> <p>It’s interesting to reflect on the nature of recent progress in RL. I broadly like to think about four separate factors that hold back AI:</p> <ol> <li>Compute (the obvious one: Moore’s Law, GPUs, ASICs),</li> <li>Data (in a nice form, not just out there somewhere on the internet - e.g. ImageNet),</li> <li>Algorithms (research and ideas, e.g. backprop, CNN, LSTM), and</li> <li>Infrastructure (software under you - Linux, TCP/IP, Git, ROS, PR2, AWS, AMT, TensorFlow, etc.).</li> </ol> <p>Similar to what happened in Computer Vision, the progress in RL is not driven as much as you might reasonably assume by new amazing ideas. In Computer Vision, the 2012 AlexNet was mostly a scaled up (deeper and wider) version of 1990’s ConvNets. Similarly, the ATARI Deep Q Learning paper from 2013 is an implementation of a standard algorithm (Q Learning with function approximation, which you can find in the standard RL book of Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) - these are also standard components. Of course, it takes a lot of skill and patience to get it to work, and multiple clever tweaks on top of old algorithms have been developed, but to a first-order approximation the main driver of recent progress is not the algorithms but (similar to Computer Vision) compute/data/infrastructure.</p> <p>Now back to RL. Whenever there is a disconnect between how magical something seems and how simple it is under the hood I get all antsy and really want to write a blog post. In this case I’ve seen many people who can’t believe that we can automatically learn to play most ATARI games at human level, with one algorithm, from pixels, and from scratch - and it is amazing, and I’ve been there myself! But at the core the approach we use is also really quite profoundly dumb (though I understand it’s easy to make such claims in retrospect). Anyway, I’d like to walk you through Policy Gradients (PG), our favorite default choice for attacking RL problems at the moment. If you’re from outside of RL you might be curious why I’m not presenting DQN instead, which is an alternative and better-known RL algorithm, widely popularized by the <a href="">ATARI game playing paper</a>. It turns out that Q-Learning is not a great algorithm (you could say that DQN is so 2013 (okay I’m 50% joking)). In fact most people prefer to use Policy Gradients, including the authors of the original DQN paper who have <a href="">shown</a> Policy Gradients to work better than Q Learning when tuned well. PG is preferred because it is end-to-end: there’s an explicit policy and a principled approach that directly optimizes the expected reward. Anyway, as a running example we’ll learn to play an ATARI game (Pong!) with PG, from scratch, from pixels, with a deep neural network, and the whole thing is 130 lines of Python only using numpy as a dependency (<a href="">Gist link</a>). Lets get to it.</p> <h3 id="pong-from-pixels">Pong from pixels</h3> <div class="imgcap"> <div style="display:inline-block"> <img src="/assets/rl/pong.gif" /> </div> <div style="display:inline-block; margin-left: 20px;"> <img src="/assets/rl/mdp.png" height="206" /> </div> <div class="thecap"><b>Left:</b> The game of Pong. <b>Right:</b> Pong is a special case of a <a href="">Markov Decision Process (MDP)</a>: A graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.</div> </div> <p>The game of Pong is an excellent example of a simple RL task. In the ATARI 2600 version we’ll use you play as one of the paddles (the other is controlled by a decent AI) and you have to bounce the ball past the other player (I don’t really have to explain Pong, right?). On the low level the game works as follows: we receive an image frame (a <code class="language-plaintext highlighter-rouge">210x160x3</code> byte array (integers from 0 to 255 giving pixel values)) and we get to decide if we want to move the paddle UP or DOWN (i.e. a binary choice). After every single choice the game simulator executes the action and gives us a reward: Either a +1 reward if the ball went past the opponent, a -1 reward if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of reward.</p> <p>As we go through the solution keep in mind that we’ll try to make very few assumptions about Pong because we secretly don’t really care about Pong; We care about complex, high-dimensional problems like robot manipulation, assembly and navigation. Pong is just a fun toy test case, something we play with while we figure out how to write very general AI systems that can one day do arbitrary useful tasks.</p> <p><strong>Policy network</strong>. First, we’re going to define a <em>policy network</em> that implements our player (or “agent?. This network will take the state of the game and decide what we should do (move UP or DOWN). As our favorite simple block of compute we’ll use a 2-layer neural network that takes the raw image pixels (100,800 numbers total (210*160*3)), and produces a single number indicating the probability of going UP. Note that it is standard to use a <em>stochastic</em> policy, meaning that we only produce a <em>probability</em> of moving UP. Every iteration we will sample from this distribution (i.e. toss a biased coin) to get the actual move. The reason for this will become more clear once we talk about training.</p> <div class="imgcap"> <img src="/assets/rl/policy.png" height="200" /> <div class="thecap">Our policy network is a 2-layer fully-connected net.</div> </div> <p>and to make things concrete here is how you might implement this policy network in Python/numpy. Suppose we’re given a vector <code class="language-plaintext highlighter-rouge">x</code> that holds the (preprocessed) pixel information. We would compute:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">h</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1"># compute hidden layer neuron activations </span><span class="n">h</span><span class="p">[</span><span class="n">h</span><span class="o">&lt;</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># ReLU nonlinearity: threshold at zero </span><span class="n">logp</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W2</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span> <span class="c1"># compute log probability of going up </span><span class="n">p</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">logp</span><span class="p">))</span> <span class="c1"># sigmoid function (gives probability of going up) </span></code></pre></div></div> <p>where in this snippet <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code> are two matrices that we initialize randomly. We’re not using biases because meh. Notice that we use the <em>sigmoid</em> non-linearity at the end, which squashes the output probability to the range [0,1]. Intuitively, the neurons in the hidden layer (which have their weights arranged along the rows of <code class="language-plaintext highlighter-rouge">W1</code>) can detect various game scenarios (e.g. the ball is in the top, and our paddle is in the middle), and the weights in <code class="language-plaintext highlighter-rouge">W2</code> can then decide if in each case we should be going UP or DOWN. Now, the initial random <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code> will of course cause the player to spasm on spot. So the only problem now is to find <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code> that lead to expert play of Pong!</p> <p><em>Fine print: preprocessing.</em> Ideally you’d want to feed at least 2 frames to the policy network so that it can detect motion. To make things a bit simpler (I did these experiments on my Macbook) I’ll do a tiny bit of preprocessing, e.g. we’ll actually feed <em>difference frames</em> to the network (i.e. subtraction of current and last frame).</p> <p><strong>It sounds kind of impossible</strong>. At this point I’d like you to appreciate just how difficult the RL problem is. We get 100,800 numbers (210*160*3) and forward our policy network (which easily involves on order of a million parameters in <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code>). Suppose that we decide to go UP. The game might respond that we get 0 reward this time step and gives us another 100,800 numbers for the next frame. We could repeat this process for hundred timesteps before we get any non-zero reward! E.g. suppose we finally get a +1. That’s great, but how can we tell what made that happen? Was it something we did just now? Or maybe 76 frames ago? Or maybe it had something to do with frame 10 and then frame 90? And how do we figure out which of the million knobs to change and how, in order to do better in the future? We call this the <em>credit assignment problem</em>. In the specific case of Pong we know that we get a +1 if the ball makes it past the opponent. The <em>true</em> cause is that we happened to bounce the ball on a good trajectory, but in fact we did so many frames ago - e.g. maybe about 20 in case of Pong, and every single action we did afterwards had zero effect on whether or not we end up getting the reward. In other words we’re faced with a very difficult problem and things are looking quite bleak.</p> <p><strong>Supervised Learning</strong>. Before we dive into the Policy Gradients solution I’d like to remind you briefly about supervised learning because, as we’ll see, RL is very similar. Refer to the diagram below. In ordinary supervised learning we would feed an image to the network and get some probabilities, e.g. for two classes UP and DOWN. I’m showing log probabilities (-1.2, -0.36) for UP and DOWN instead of the raw probabilities (30% and 70% in this case) because we always optimize the log probability of the correct label (this makes math nicer, and is equivalent to optimizing the raw probability because log is monotonic). Now, in supervised learning we would have access to a label. For example, we might be told that the correct thing to do right now is to go UP (label 0). In an implementation we would enter gradient of 1.0 on the log probability of UP and run backprop to compute the gradient vector \(\nabla_{W} \log p(y=UP \mid x) \). This gradient would tell us how we should change every one of our million parameters to make the network slightly more likely to predict UP. For example, one of the million parameters in the network might have a gradient of -2.1, which means that if we were to increase that parameter by a small positive amount (e.g. <code class="language-plaintext highlighter-rouge">0.001</code>), the log probability of UP would decrease by <code class="language-plaintext highlighter-rouge">2.1 * 0.001</code> (decrease due to the negative sign). If we then did a parameter update then, yay, our network would now be slightly more likely to predict UP when it sees a very similar image in the future.</p> <div class="imgcap"> <img src="/assets/rl/sl.png" /> </div> <p><strong>Policy Gradients</strong>. Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; E.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: We could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up to us losing the game (-1 reward). So if we fill in -1 for log probability of DOWN and do backprop we will find a gradient that <em>discourages</em> the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).</p> <div class="imgcap"> <img src="/assets/rl/rl.png" /> </div> <p>And that’s it: we have a stochastic policy that samples actions and then actions that happen to eventually lead to good outcomes get encouraged in the future, and actions taken that lead to bad outcomes get discouraged. Also, the reward does not even need to be +1 or -1 if we win the game eventually. It can be an arbitrary measure of some kind of eventual quality. For example if things turn out really well it could be 10.0, which we would then enter as the gradient instead of -1 to start off backprop. That’s the beauty of neural nets; Using them can feel like cheating: You’re allowed to have 1 million parameters embedded in 1 teraflop of compute and you can make it do arbitrary things with SGD. It shouldn’t work, but amusingly we live in a universe where it does.</p> <p><strong>Training protocol.</strong> So here is how the training will work in detail. We will initialize the policy network with some <code class="language-plaintext highlighter-rouge">W1</code>, <code class="language-plaintext highlighter-rouge">W2</code> and play 100 games of Pong (we call these policy “rollouts?. Lets assume that each game is made up of 200 frames so in total we’ve made 20,000 decisions for going UP or DOWN and for each one of these we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains now is to label every decision we’ve made as good or bad. For example suppose we won 12 games and lost 88. We’ll take all 200*12 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and parameter update encouraging the actions we picked in all those states). And we’ll take the other 200*88 = 17600 decisions we made in the losing games and do a negative update (discouraging whatever we did). And?that’s it. The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn’t work. Now we play another 100 games with our new, slightly improved policy and rinse and repeat.</p> <blockquote> <p>Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.</p> </blockquote> <div class="imgcap"> <img src="/assets/rl/episodes.png" /> <div class="thecap" style="text-align:justify;">Cartoon diagram of 4 games. Each black circle is some game state (three example states are visualized on the bottom), and each arrow is a transition, annotated with the action that was sampled. In this case we won 2 games and lost 2 games. With Policy Gradients we would take the two games we won and slightly encourage every single action we made in that episode. Conversely, we would also take the two games we lost and slightly discourage every single action we made in that episode.</div> </div> <p>If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.</p> <p><strong>Update: December 9, 2016 - alternative view</strong>. In my explanation above I use the terms such as “fill in the gradient and backprop? which I realize is a special kind of thinking if you’re used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you’re used to Theano or TensorFlow you might be a little perplexed because the code is oranized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize \( \sum_i \log p(y_i \mid x_i) \) where \(x_i, y_i \) are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don’t have the correct labels \(y_i\) so as a “fake label?we substitute the action we happened to sample from the policy when it saw \(x_i\), and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn’t. So in summary our loss now looks like \( \sum_i A_i \log p(y_i \mid x_i) \), where \(y_i\) is the action we happened to sample and \(A_i\) is a number that we call an <strong>advantage</strong>. In the case of Pong, for example, \(A_i\) could be 1.0 if we eventually won in the episode that contained \(x_i\) and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to good outcome and minimize the log probability of those that didn’t. So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.</p> <p><strong>More general advantage functions</strong>. I also promised a bit more discussion of the returns. So far we have judged the <em>goodness</em> of every individual action based on whether or not we win the game. In a more general RL setting we would receive some reward \(r_t\) at every time step. One common choice is to use a discounted reward, so the “eventual reward?in the diagram above would become \( R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \), where \(\gamma\) is a number between 0 and 1 called a discount factor (e.g. 0.99). The expression states that the strength with which we encourage a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. In practice it can can also be important to normalize these. For example, suppose we compute \(R_t\) for all of the 20,000 actions in the batch of 100 Pong game rollouts above. One good idea is to “standardize?these returns (e.g. subtract mean, divide by standard deviation) before we plug them into backprop. This way we’re always encouraging and discouraging roughly half of the performed actions. Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found <a href="">here</a>.</p> <p><strong>Deriving Policy Gradients</strong>. I’d like to also give a sketch of where Policy Gradients come from mathematically. Policy Gradients are a special case of a more general <em>score function gradient estimator</em>. The general case is that when we have an expression of the form \(E_{x \sim p(x \mid \theta)} [f(x)] \) - i.e. the expectation of some scalar valued score function \(f(x)\) under some probability distribution \(p(x;\theta)\) parameterized by some \(\theta\). Hint hint, \(f(x)\) will become our reward function (or advantage function more generally) and \(p(x)\) will be our policy network, which is really a model for \(p(a \mid I)\), giving a distribution over actions for any image \(I\). Then we are interested in finding how we should shift the distribution (through its parameters \(\theta\)) to increase the scores of its samples, as judged by \(f\) (i.e. how do we change the network’s parameters so that action samples get higher rewards). We have that:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \nabla_{\theta} E_x[f(x)] &= \nabla_{\theta} \sum_x p(x) f(x) & \text{definition of expectation} \\ & = \sum_x \nabla_{\theta} p(x) f(x) & \text{swap sum and gradient} \\ & = \sum_x p(x) \frac{\nabla_{\theta} p(x)}{p(x)} f(x) & \text{both multiply and divide by } p(x) \\ & = \sum_x p(x) \nabla_{\theta} \log p(x) f(x) & \text{use the fact that } \nabla_{\theta} \log(z) = \frac{1}{z} \nabla_{\theta} z \\ & = E_x[f(x) \nabla_{\theta} \log p(x) ] & \text{definition of expectation} \end{align} %]]></script> <p>To put this in English, we have some distribution \(p(x;\theta)\) (I used shorthand \(p(x)\) to reduce clutter) that we can sample from (e.g. this could be a gaussian). For each sample we can also evaluate the score function \(f\) which takes the sample and gives us some scalar-valued score. This equation is telling us how we should shift the distribution (through its parameters \(\theta\)) if we wanted its samples to achieve higher scores, as judged by \(f\). In particular, it says that look: draw some samples \(x\), evaluate their scores \(f(x)\), and for each \(x\) also evaluate the second term \( \nabla_{\theta} \log p(x;\theta) \). What is this second term? It’s a vector - the gradient that’s giving us the direction in the parameter space that would lead to increase of the probability assigned to an \(x\). In other words if we were to nudge \(\theta\) in the direction of \( \nabla_{\theta} \log p(x;\theta) \) we would see the new probability assigned to some \(x\) slightly increase. If you look back at the formula, it’s telling us that we should take this direction and multiply onto it the scalar-valued score \(f(x)\). This will make it so that samples that have a higher score will “tug?on the probability density stronger than the samples that have lower score, so if we were to do an update based on several samples from \(p\) the probability density would shift around in the direction of higher scores, making highly-scoring samples more likely.</p> <div class="imgcap"> <img src="/assets/rl/pg.png" /> <div class="thecap" style="text-align:justify;"> A visualization of the score function gradient estimator. <b>Left</b>: A gaussian distribution and a few samples from it (blue dots). On each blue dot we also plot the gradient of the log probability with respect to the gaussian's mean parameter. The arrow indicates the direction in which the mean of the distribution should be nudged to increase the probability of that sample. <b>Middle</b>: Overlay of some score function giving -1 everywhere except +1 in some small regions (note this can be an arbitrary and not necessarily differentiable scalar-valued function). The arrows are now color coded because due to the multiplication in the update we are going to average up all the green arrows, and the <i>negative</i> of the red arrows. <b>Right</b>: after parameter update, the green arrows and the reversed red arrows nudge us to left and towards the bottom. Samples from this distribution will now have a higher expected score, as desired. </div> </div> <p>I hope the connection to RL is clear. Our policy network gives us samples of actions, and some of them work better than others (as judged by the advantage function). This little piece of math is telling us that the way to change the policy’s parameters is to do some rollouts, take the gradient of the sampled actions, multiply it by the score and add everything, which is what we’ve done above. For a more thorough derivation and discussion I recommend <a href="">John Schulman’s lecture</a>.</p> <p><strong>Learning</strong>. Alright, we’ve developed the intuition for policy gradients and saw a sketch of their derivation. I implemented the whole approach in a <a href="">130-line Python script</a>, which uses <a href="">OpenAI Gym</a>’s ATARI 2600 Pong. I trained a 2-layer policy network with 200 hidden layer units using RMSProp on batches of 10 episodes (each episode is a few dozen games, because the games go up to score of 21 for either player). I did not tune the hyperparameters too much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the AI player. The total number of episodes was approximately 8,000 so the algorithm played roughly 200,000 Pong games (quite a lot isn’t it!) and made a total of ~800 updates. I’m told by friends that if you train on GPU with ConvNets for a few days you can beat the AI player more often, and if you also optimize hyperparameters carefully you can also consistently dominate the AI player (i.e. win every single game). However, I didn’t spend too much time computing or tweaking, so instead we end up with a Pong AI that illustrates the main ideas and works quite well:</p> <div style="text-align:center;"> <iframe width="420" height="315" src=";loop=1&amp;rel=0&amp;showinfo=0&amp;playlist=YOW8m2YGtRg" frameborder="0" allowfullscreen=""></iframe> <br /> The learned agent (in green, right) facing off with the hard-coded AI opponent (left). </div> <p><strong>Learned weights</strong>. We can also take a look at the learned weights. Due to preprocessing every one of our inputs is an 80x80 difference image (current frame minus last frame). We can now take every row of <code class="language-plaintext highlighter-rouge">W1</code>, stretch them out to 80x80 and visualize. Below is a collection of 40 (out of 200) neurons in a grid. White pixels are positive weights and black pixels are negative weights. Notice that several neurons are tuned to particular traces of bouncing ball, encoded with alternating black and white along the line. The ball can only be at a single spot, so these neurons are multitasking and will “fire?for multiple locations of the ball along that line. The alternating black and white is interesting because as the ball travels along the trace, the neuron’s activity will fluctuate as a sine wave and due to the ReLU it would “fire?at discrete, separated positions along the trace. There’s a bit of noise in the images, which I assume would have been mitigated if I used L2 regularization.</p> <div class="imgcap"> <img src="/assets/rl/weights.png" /> </div> <h3 id="what-isnt-happening">What isn’t happening</h3> <p>So there you have it - we learned to play Pong from from raw pixels with Policy Gradients and it works quite well. The approach is a fancy form of guess-and-check, where the “guess?refers to sampling rollouts from our current policy, and the “check?refers to encouraging actions that lead to good outcomes. Modulo some details, this represents the state of the art in how we currently approach reinforcement learning problems. Its impressive that we can learn these behaviors, but if you understood the algorithm intuitively and you know how it works you should be at least a bit disappointed. In particular, how does it not work?</p> <p>Compare that to how a human might learn to play Pong. You show them the game and say something along the lines of “You’re in control of a paddle and you can move it up and down, and your task is to bounce the ball past the other player controlled by AI? and you’re set and ready to go. Notice some of the differences:</p> <ul> <li>In practical settings we usually communicate the task in some manner (e.g. English above), but in a standard RL problem you assume an arbitrary reward function that you have to discover through environment interactions. It can be argued that if a human went into game of Pong but without knowing anything about the reward function (indeed, especially if the reward function was some static but random function), the human would have a lot of difficulty learning what to do but Policy Gradients would be indifferent, and likely work much better. Similarly, if we took the frames and permuted the pixels randomly then humans would likely fail, but our Policy Gradient solution could not even tell the difference (if it’s using a fully connected network as done here).</li> <li>A human brings in a huge amount of prior knowledge, such as intuitive physics (the ball bounces, it’s unlikely to teleport, it’s unlikely to suddenly stop, it maintains a constant velocity, etc.), and intuitive psychology (the AI opponent “wants?to win, is likely following an obvious strategy of moving towards the ball, etc.). You also understand the concept of being “in control?of a paddle, and that it responds to your UP/DOWN key commands. In contrast, our algorithms start from scratch which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to).</li> <li>Policy Gradients are a <em>brute force</em> solution, where the correct actions are eventually discovered and internalized into a policy. Humans build a rich, abstract model and plan within it. In Pong, I can reason that the opponent is quite slow so it might be a good strategy to bounce the ball with high vertical velocity, which would cause the opponent to not catch it in time. However, it also feels as though we also eventually “internalize?good solutions into what feels more like a reactive muscle memory policy. For example if you’re learning a new motor task (e.g. driving a car with stick shift?) you often feel yourself thinking a lot in the beginning but eventually the task becomes automatic and mindless.</li> <li>Policy Gradients have to actually experience a positive reward, and experience it very often in order to eventually and slowly shift the policy parameters towards repeating moves that give high rewards. With our abstract model, humans can figure out what is likely to give rewards without ever actually experiencing the rewarding or unrewarding transition. I don’t have to actually experience crashing my car into a wall a few hundred times before I slowly start avoiding to do so.</li> </ul> <div class="imgcap"> <div style="display:inline-block"> <img src="/assets/rl/montezuma.png" height="250" /> </div> <div style="display:inline-block; margin-left: 20px;"> <img src="/assets/rl/frostbite.jpg" height="250" /> </div> <div class="thecap" style="text-align:justify;"><b>Left:</b> Montezuma's Revenge: a difficult game for our RL algorithms. The player must jump down, climb up, get the key, and open the door. A human understands that acquiring a key is useful. The computer samples billions of random moves and 99% of the time falls to its death or gets killed by the monster. In other words it's hard to "stumble into" the rewarding situation. <b>Right:</b> Another difficult game called Frostbite, where a human understands that things move, some things are good to touch, some things are bad to touch, and the goal is to build the igloo brick by brick. A good analysis of this game and a discussion of differences between the human and computer approach can be found in <a href="">Building Machines That Learn and Think Like People</a>.</div> </div> <p>I’d like to also emphasize the point that, conversely, there are many games where Policy Gradients would quite easily defeat a human. In particular, anything with frequent reward signals that requires precise play, fast reflexes, and not too much long-term planning would be ideal, as these short-term correlations between rewards and actions can be easily “noticed?by the approach, and the execution meticulously perfected by the policy. You can see hints of this already happening in our Pong agent: it develops a strategy where it waits for the ball and then rapidly dashes to catch it just at the edge, which launches it quickly and with high vertical velocity. The agent scores several points in a row repeating this strategy. There are many ATARI games where Deep Q Learning destroys human baseline performance in this fashion - e.g. Pinball, Breakout, etc.</p> <p>In conclusion, once you understand the “trick?by which these algorithms work you can reason through their strengths and weaknesses. In particular, we are nowhere near humans in building abstract, rich representations of games that we can plan within and use for rapid learning. One day a computer will look at an array of pixels and notice a key, a door, and think to itself that it is probably a good idea to pick up the key and reach the door. For now there is nothing anywhere close to this, and trying to get there is an active area of research.</p> <h3 id="non-differentiable-computation-in-neural-networks">Non-differentiable computation in Neural Networks</h3> <p>I’d like to mention one more interesting application of Policy Gradients unrelated to games: It allows us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was first introduced in <a href="">Williams 1992</a> and more recently popularized by <a href="">Recurrent Models of Visual Attention</a> under the name “hard attention? in the context of a model that processed an image with a sequence of low-resolution foveal glances (inspired by our own human eyes). In particular, at every iteration an RNN would receive a small piece of the image and sample a location to look at next. For example the RNN might look at position (5,30), receive a small piece of the image, then decide to look at (24, 50), etc. The problem with this idea is that there a piece of network that produces a distribution of where to look next and then samples from it. Unfortunately, this operation is non-differentiable because, intuitively, we don’t know what would have happened if we sampled a different location. More generally, consider a neural network from some inputs to outputs:</p> <div class="imgcap"> <img src="/assets/rl/nondiff1.png" width="600" /> </div> <p>Notice that most arrows (in blue) are differentiable as normal, but some of the representation transformations could optionally also include a non-differentiable sampling operation (in red). We can backprop through the blue arrows just fine, but the red arrow represents a dependency that we cannot backprop through.</p> <p>Policy gradients to the rescue! We’ll think about the part of the network that does the sampling as a small stochastic policy embedded in the wider network. Therefore, during training we will produce several samples (indicated by the branches below), and then we’ll encourage samples that eventually led to good outcomes (in this case for example measured by the loss at the end). In other words we will train the parameters involved in the blue arrows with backprop as usual, but the parameters involved with the red arrow will now be updated independently of the backward pass using policy gradients, encouraging samples that led to low loss. This idea was also recently formalized nicely in <a href="">Gradient Estimation Using Stochastic Computation Graphs</a>.</p> <div class="imgcap"> <img src="/assets/rl/nondiff2.png" width="600" /> </div> <p><strong>Trainable Memory I/O</strong>. You’ll also find this idea in many other papers. For example, a <a href="">Neural Turing Machine</a> has a memory tape that they it read and write from. To do a write operation one would like to execute something like <code class="language-plaintext highlighter-rouge">m[i] = x</code>, where <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">x</code> are predicted by an RNN controller network. However, this operation is non-differentiable because there is no signal telling us what would have happened to the loss if we were to write to a different location <code class="language-plaintext highlighter-rouge">j != i</code>. Therefore, the NTM has to do <em>soft</em> read and write operations. It predicts an attention distribution <code class="language-plaintext highlighter-rouge">a</code> (with elements between 0 and 1 and summing to 1, and peaky around the index we’d like to write to), and then doing <code class="language-plaintext highlighter-rouge">for all i: m[i] = a[i]*x</code>. This is now differentiable, but we have to pay a heavy computational price because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!</p> <p>However, we can use policy gradients to circumvent this problem (in theory), as done in <a href="">RL-NTM</a>. We still predict an attention distribution <code class="language-plaintext highlighter-rouge">a</code>, but instead of doing the soft write we sample locations to write to: <code class="language-plaintext highlighter-rouge">i = sample(a); m[i] = x</code>. During training we would do this for a small batch of <code class="language-plaintext highlighter-rouge">i</code>, and in the end make whatever branch worked best more likely. The large computational advantage is that we now only have to read/write at a single location at test time. However, as pointed out in the paper this strategy is very difficult to get working because one must accidentally stumble by working algorithms through sampling. The current consensus is that PG works well only in settings where there are a few discrete choices so that one is not hopelessly sampling through huge search spaces.</p> <p>However, with Policy Gradients and in cases where a lot of data/compute is available we can in principle dream big - for instance we can design neural networks that learn to interact with large, non-differentiable modules such as Latex compilers (e.g. if you’d like char-rnn to generate latex that compiles), or a SLAM system, or LQR solvers, or something. Or, for example, a superintelligence might want to learn to interact with the internet over TCP/IP (which is sadly non-differentiable) to access vital information needed to take over the world. That’s a great example.</p> <h3 id="conclusions">Conclusions</h3> <p>We saw that Policy Gradients are a powerful, general algorithm and as an example we trained an ATARI Pong agent from raw pixels, from scratch, in <a href="">130 lines of Python</a>. More generally the same algorithm can be used to train agents for arbitrary games and one day hopefully on many valuable real-world control problems. I wanted to add a few more notes in closing:</p> <p><strong>On advancing AI</strong>. We saw that the algorithm works through a brute-force search where you jitter around randomly at first and must accidentally stumble into rewarding situations at least once, and ideally often and repeatedly before the policy distribution shifts its parameters to repeat the responsible actions. We also saw that humans approach these problems very differently, in what feels more like rapid abstract model building - something we have barely even scratched the surface of in research (although many people are trying). Since these abstract models are very difficult (if not impossible) to explicitly annotate, this is also why there is so much interest recently in (unsupervised) generative models and program induction.</p> <p><strong>On use in complex robotics settings</strong>. The algorithm does not scale naively to settings where huge amounts of exploration are difficult to obtain. For instance, in robotic settings one might have a single (or few) robots, interacting with the world in real time. This prohibits naive applications of the algorithm as I presented it in this post. One related line of work intended to mitigate this problem is <a href="">deterministic policy gradients</a> - instead of requiring samples from a stochastic policy and encouraging the ones that get higher scores, the approach uses a deterministic policy and gets the gradient information directly from a second network (called a <em>critic</em>) that models the score function. This approach can in principle be much more efficient in settings with very high-dimensional actions where sampling actions provides poor coverage, but so far seems empirically slightly finicky to get working. Another related approach is to scale up robotics, as we’re starting to see with <a href="">Google’s robot arm farm</a>, or perhaps even <a href="">Tesla’s Model S + Autopilot</a>.</p> <p>There is also a line of work that tries to make the search process less hopeless by adding additional supervision. In many practical cases, for instance, one can obtain expert trajectories from a human. For example <a href="">AlphaGo</a> first uses supervised learning to predict human moves from expert Go games and the resulting human mimicking policy is later finetuned with policy gradients on the “real?objective of winning the game. In some cases one might have fewer expert trajectories (e.g. from <a href="">robot teleoperation</a>) and there are techniques for taking advantage of this data under the umbrella of <a href="">apprenticeship learning</a>. Finally, if no supervised data is provided by humans it can also be in some cases computed with expensive optimization techniques, e.g. by <a href="">trajectory optimization</a> in a known dynamics model (such as \(F=ma\) in a physical simulator), or in cases where one learns an approximate local dynamics model (as seen in very promising framework of <a href="">Guided Policy Search</a>).</p> <p><strong>On using PG in practice</strong>. As a last note, I’d like to do something I wish I had done in my RNN blog post. I think I may have given the impression that RNNs are magic and automatically do arbitrary sequential problems. The truth is that getting these models to work can be tricky, requires care and expertise, and in many cases could also be an overkill, where simpler methods could get you 90%+ of the way there. The same goes for Policy Gradients. They are not automatic: You need a lot of samples, it trains forever, it is difficult to debug when it doesn’t work. One should always try a BB gun before reaching for the Bazooka. In the case of Reinforcement Learning for example, one strong baseline that should always be tried first is the <a href="">cross-entropy method (CEM)</a>, a simple stochastic hill-climbing “guess and check?approach inspired loosely by evolution. And if you insist on trying out Policy Gradients for your problem make sure you pay close attention to the <em>tricks</em> section in papers, start simple first, and use a variation of PG called <a href="">TRPO</a>, which almost always works better and more consistently than vanilla PG <a href="">in practice</a>. The core idea is to avoid parameter updates that change your policy too much, as enforced by a constraint on the KL divergence between the distributions predicted by the old and the new policy on a batch of data (instead of conjugate gradients the simplest instantiation of this idea could be implemented by doing a line search and checking the KL along the way).</p> <p>And that’s it! I hope I gave you a sense of where we are with Reinforcement Learning, what the challenges are, and if you’re eager to help advance RL I invite you to do so within our <a href="">OpenAI Gym</a> :) Until next time!</p> Tue, 31 May 2016 11:00:00 +0000 Short Story on AI: A Cognitive Discontinuity. <style> p { text-align: justify; } </style> <p>The idea of writing a collection of short stories has been on my mind for a while. This post is my first ever half-serious attempt at a story, and what better way to kick things off than with a story on AI and what that might look like if you extrapolate our current technology and make the (sensible) assumption that we might achieve much more progress with scaling up supervised learning than any other more exotic approach.</p> <hr style="border:none; height:1px; background-color: #333;" /> <h4 id="a-slow-morning">A slow morning</h4> <div class="imgcap"> <img src="/assets/ai/neocortex.png" style="border:none; width:100%;" /> </div> <p>Merus sank into his chair with relief. He listened for the satisfying crackling sound of sinking into the chair’s soft material. If there was one piece of hardware that his employer was not afraid to invest a lot of money into, it was the chairs. With his eyes closed, his mind still dazed, and nothing but the background hum of the office, he became aware of his heart pounding against his chest- an effect caused by running up the stairs and his morning dose of caffeine and taurine slowly engulfing his brain. Several strong beats passed by as he found his mind wandering again to Licia - did she already come in? A sudden beep from his station distracted him - the system finished booting up. A last deep sigh. A stretch. A last sip of his coffee. He opened his eyes, rubbed them into focus and reached for his hardware. “Thank god it’s Friday? he muttered. It was time to clock in.</p> <p>Fully suited up, he began scrolling past a seemingly endless list of options. Filtering, searching, trying to determine what he was in the mood for. He had worked hard and over time built himself up into one of the best shapers in the company. In addition he had completed a wide array of shaper certifications, repeating some of them over and over obsessively until he reached outstanding grades across the board. The reviews on his profile were equally stellar:</p> <p><em>“Merus is fantastic. He has a strong intuition for spotting gaps in the data, and uses exceedingly effective curriculum and shaping strategies. When Merus gets on the job our validation accuracies consistently shoot up much faster than what we see with average shapers. Keep up the great work and please think of us if you’re searching for great, rewarding and impactful HITs!?</em></p> <p>one review read. HIT was an acronym for <em>Human Intelligence Task</em> - a unit of work that required human supervision. With his reviews and certifications the shaping world was wide open. His list contained many lucrative, well-paying HITs to choose from, many of them visible to only the most trusted shapers. This morning he came by several that caught his attention: a bodyguard HIT for some politician in Sweden, a HIT from a science expedition in Antarctica that needed help with setting up their equipment, a dog-walking HIT for a music celebrity, a quick drone delivery HIT that seemed to be payed very well?Suddenly, a notification caught the corner of his eye: Licia had just clocked in and started a HIT. He opened up its details pane and skimmed the description. His eyes rolled as he spotted the keywords he was afraid of - event assembly at the Hilltop Hotel. <em>“Again??lt;/em> - he moaned in a hushed voice, raising his hands up and over his head in quiet contemplation. Licia had often picked up HITs from that same hotel, but they were often unexciting and menial tasks that weren’t paid much. Merus rearranged himself in his chair, and sunk his face into his palms. He noticed though the crack of his fingers that the drone delivery HIT had just been taken by someone else. He cursed to himself. Absent mindedly and with a deep sigh, he accepted the second remaining slot on the Hilltop Hotel HIT.</p> <p>His hardware lit up with numbers and indicators, and his console began spewing diagnostic information as the boot sequence initiated. Anyone could be a shaper and get started with inexpensive gear, but the company provided state of the art hardware that allowed him to be much more productive. A good amount of interesting HITs also demanded certain low-latency hardware requirements, which only the most professional gear could meet. In turn, the company took a cut from his HITs. Merus dreamed of one day becoming an independent shaper, but he knew that would take a while. He put on the last pieces of his equipment. The positional tracking in his booth calibrated his full pose and all markers tracked green. The haptics that enveloped his body in his chair stiffened up around him as they initialized. He placed his helmet over his face and booted up.</p> <h4 id="descendants-of-adam">Descendants of Adam</h4> <div class="imgcap"> <img src="/assets/ai/lifetree.gif" style="border:none; width:100%;" /> </div> <p>The buzz and hum of the office disappeared. Merus was immersed in a complete, peaceful silence and darkness while the HIT request was processed. Connections were made, transactions accepted, certification checks performed, security tokens exchanged, HIT approval process initiated. At last, Merus?vision was flooded with light. The shrieks of some tropical birds were now audible in the background. He found himself at the charging station of Pegasus Avatars, which his company had a nearly exclusive relationship with. Merus eagerly glanced down at his avatar body and breathed a sigh of relief. Among the several suspended avatars at that charging station he happened to get assigned the one with the most recent hardware specs. Everything looked great, his avatar was fully charged, and all the hardware diagnostics checked out. Except the body came in hot pink. <em>“You just can’t have it all?</em></p> <p>The usual first order of business was to run a few routine diagnostics to double check proper functioning of the avatar. He opened up the neural network inspector and navigated to the overview pane of the agent checkpoint that was running the avatar. The agent was the software running the avatar body, and consisted entirely of one large neural network with a specific connectivity structure and weights. This agent model happened to be a relatively recent fork of the standard, open source Visceral 5.0 series. Merus was delighted - the Visceral family of agents was one of his specialties. The Visceral agents had a minimalist design that came in at a total of only about 1 trillion parameters and had a very simple, clean, proven and reliable architecture. However, there were still a few exotic architectural elements packed in too, including shortcut sensorimotor reflex pathways, fractal connectivity in the visual streams, and distributed motor areas inspired by the octopus neurobiology. And then, of course, there was also the famous Mystery module.</p> <p>The Mystery module had an intriguing background story, and was a common subject of raging discussions and conspiracy theories. It was added to the Visceral series by an anonymous pull request almost 6 years ago. The module featured an intricate recurrent neural connectivity that, when incorporated into the wider network, dramatically improved the agent performance in a broad range of higher cognitive tasks. Except noone knew how it worked or why, or who discovered it - hence the name. The module immediately became actively studied by multiple groups of artificial intelligence laboratories and became the subject of several PhD theses, yet even after 6 years it was still poorly understood. Merus enjoyed poring through papers that hypothesized its function, performed ablation studied, and tried to prove theorems for why it so tremendously improved agent performance and learning dynamics.</p> <p>Moreover, an ethical battle raged over whether the module should be merged to master due to its poorly understood origin, function, and especially its dynamical properties such as its fixed points, divergence criteria, and so on. But in the end, the Mystery module provided benefits so substantial that several popular forks of Visceral+Mystery Module began regularly appearing on agent repositories across the web, and found their way to common use. Despite the protests, the economic incentives and pressures were too great to be ignored. In the absence of any clearly detrimental or hazardous effects over a period of time, the Visceral committee finally voted to merge the Mystery module into the master branch.</p> <p>Merus had a long history of shaping Visceral agents and their ancestors. The series was forked from the Patreon series, which were discontinued four years ago when the founding team was acquired by Crown Co. The Patreon series were in turn based mostly on the SHAKIR series, which were in turn based on many more ancient agent architectures, all the way back to the original - the Adam checkpoint. The Visceral family of agents had a reputation of smooth dynamics that degraded gracefully towards floppy, safe fixed points. There were even some weak theoretical and empirical guarantees one could provide for simplified versions of the core cognitive architecture. Another great source of good reputation for Visceral were the large number of famous interventions carried out by autonomous Visceral agents. Just one week ago, Merus recalled, an autonomous Visceral 4.0 agent saved a group of children from rabid dogs in a small town in India. The agent recognized an impending dangerous situation, signaled an alarm and a human operator was dispatched to immediately sync with the agent. However, by the time they took over control the crisis had been averted. Those few critical seconds where the agent, acting autonomously, scared away the dogs had likely saved their lives. The list went on and on - one month ago an autonomous Visceral agent recognized a remote drone attack. It leaped up and curled its body around the drone, which exploded in its embrace instead of in the middle of a group of people. Of course, this was nothing more than an agent working as intended - these kinds of behaviors were meticulously shaped into the agents?networks over long periods of time. But the point remained - the Visceral series was reliable, safe, and revered.</p> <p>The other most respected agent family was the Crown Kappa series, invented and maintained by the Patreon founders working from within Crown Co, but the series?networks were proprietary and closely guarded. Even though the performance of the Kappa was consistently rated higher by the most respected third party agent benchmarking companies, many people still preferred to run Visceral agents since they distrusted Crown Co. Despite Crown’s claims, there was simply no way to guarantee that some parts of the networks were not carrying out malicious activities. Merus was, in fact, offered a job at Crown Co as a senior shaper one year ago for a much higher salary, but he passed on the offer. He enjoyed his current work place. And there was also Licia.</p> <h4 id="digital-brains">Digital brains</h4> <div class="imgcap"> <img src="/assets/ai/digibrain.jpg" style="border:none; width:100%;" /> </div> <p>Beep. Merus snapped back and looked at the console. He was running the routine software diagnostics on the Visceral agent and one of them had just failed. He squinted at the error, parsing it carefully. A checksum of the model weights did not pass in some module that had no recent logged history of finetuning. Merus raised his eyebrows as he contemplated the possibilities. Did the model checkpoint get corrupted? He knew that the correct procedure in these cases was to abandon the HIT and report a malfunction, but he also really wanted to proceed with the HIT and say hi to Licia. He pulled up the network visualizer view and zoomed into the neural architecture with his hands. A 3-dimensional rendered cloud of neural connectivity enveloped his head as he navigated to the highlighted region in red with sweeping hand motions. Zooming around, he recognized the twists and turns of the Spatial Transformer modules in the visual pathways. The shortcut reflex connections. The first multi-sensory association layer. The brain was humming along steadily, pulsating calmly as it processed the visual scene in front of the avatar. As Merus navigated by one of the motor areas the connections became significantly denser and shorter, pulsating at high frequencies as they kept the avatar’s center of mass balanced. The gradients flowing back from the reward centers and the unsupervised objectives were also pouring through the connections, and their statistical properties looked and sounded healthy.</p> <p>Navigating and analyzing artificial brains was Merus?favorite pastime. He spent hours over the weekends navigating minds from all kinds of repositories. The Visceral series had tens of thousands of forks, many of them tuned for specific tasks, specific avatar body morphologies, and some were simply hobbies and random experiments. This last weekend he analyzed a custom mind build based on an early Visceral 3.0 fork for a contracting side gig. The neural pathways in their custom agent were poorly designed, causing the agent an equivalent of seizures non-deterministically when the activities constructively interfered at critical junctures, spiraling out the brain dynamics into divergence. Merus had to suggest massive rewiring, but he knew it was only a short-term hack.</p> <p><em>“Just upgrade to a 5.0!?lt;/em>, he lamented during their meeting.<br /> <em>“Unfortunately we cannot, we’ve invested too much data and training time into this agent. It was trained online so we don’t have access to the data anymore, all we have is the agent and its network?</em></p> <p>There were ways of transferring knowledge from one digital brain to another with a neural teaching process, during which the dynamics of one brain were used as supervision for another, but the process was lossy, time consuming, and still an active area of research. This meant that people were often stuck with legacy agents that had a lot of experience and desirably shaped behaviors, but lacked many recent architectural innovations and stability improvements. They were immortal primitive relics from the past, who made up for their faults with the immense amount of data they had experienced. Keeping track of the longest living agents became an endeavor almost as interesting as keeping track of the oldest humans alive, and spawned an entire area of research of neural archeology.</p> <p>Merus had finally reached the zone of the pathways highlighted in red, when his heart skipped a beat as he realized where he was. The part of the agent that was not passing the diagnostic test was near the core of the Mystery module. He froze still as his mind once again contemplated abandoning the HIT. He swiped his hand right in a sweeping motion and his viewport began rotating in a circular motion around the red area. He knew from some research he has read that this part of the Mystery module carried some significance: its neurons rarely ever activated. When ablated, the functioning of the Mystery module remained mostly identical for a while but then inevitably started to degrade over time. There was a raging discussion about what the function of the area was, but no clear consensus. Merus brought up the master branch of the base Visceral 5.0 agent and ran a neural diff on the surrounding area. A cluster of connections lit up. It couldn’t have been more than a few thousand connections, and most of them changed only slightly. Yet, the module had no record of being finetuned recently, so something or someone had deliberately changed the connections manually.</p> <p>Merus popped open the visualizer and started the full battery of system diagnostics to double check proper functioning of the agent. The agent’s hardware spun up to 100% utilization as the diagnostics simulated thousands of virtual unit test scenarios, ranging from simple navigation, manipulation, avoidance, math and memory tasks to an extensive battery of social interaction and morality scenarios. In each case, the agent’s simulated output behavior was checked to be within acceptable thresholds of one of human reference responses. Merus stared intensely at the console as test by test came out green. <em>“So far so good…?lt;/em></p> <h4 id="mind-over-matter">Mind over Matter</h4> <div class="imgcap"> <img src="/assets/ai/hand.jpg" style="border:none; width:100%;" /> </div> <p>Beep. Merus looked to the right and found a message from Licia:</p> <p><em>“Hi Merus! saw you clocked in as a second on my HIT - where are you? Need help.?lt;/em><br /> <em>“On my way!?lt;/em>,</p> <p>Merus dictated back hastily. The software diagnostics were only at 5% complete, and Merus knew they would take a while to run to completion. <em>“It’s only a few thousand connections?lt;/em>, he thought to himself. <em>“I’ll just stay much more alert in case the avatar does anything strange and take over control immediately. And if any of the diagnostics fail I’ll abort immediately?lt;/em>. With that resolve, he decreased the diagnostics process priority to 10% and moved the process on the secondary coprocessor. He then brought the agent to a conscious state, fully connecting its inputs and outputs to the world.</p> <p>He felt the avatar stiffen up as he shifted its center of gravity off the charging pedestal. Moving his arms around, he switched the avatar’s motor areas to semi-autonomous mode. As he did so, the agent’s lower motor cortices responded gracefully and placed one leg in front of another, following Merus?commanded center of gravity. Eager to find Licia, he commanded a sprint by squeezing a trigger on his haptic controller. The agent’s task modules perceived the request encoding and various neural pathways lit up in anticipation. While the sprint trigger was held down every fast and steady translation of the agent’s body was highly rewarded. To the agent, it felt good to run when the trigger was held.</p> <p>The visual and sensory pathways in the agent’s brain were flooded with information about the room’s inferred geometry. The Visceral checkpoint running the avatar had by now accumulated millions of hours of both simulated and real experience in efficiently navigating rooms just like this one. On a scale of microseconds, neural feedback pathways received inputs from the avatar’s proprioception sensors and fired a precise sequence of stabilizing activations. The network anticipated movements. It anticipated rewards. Trillions of distributed calculations drove the agent’s muscular-skeletal carbon fiber frame forward.</p> <p>Merus felt a haptic pulse delivered to his back as the agent spun around on spot and rapidly accelerated towards the open door leading outside. Mid-flight between footfalls, the avatar extended its arm and reached for the metallic edge of the door frame, conserving the perfect amount of angular momentum as its body was flung in the air during its rapid turn to the right. The agent’s neurons fired baseline values encoding expectations of how quickly the network thought it could have traversed that room. A few seconds later these were compared to the sensorimotor trajectories recorded in the agent’s hippocampal neural structures. It was determined that this time the agent was 0.0013882s faster than expected. Future expectations were neurally adjusted to expect slightly higher values. Future rollouts of the precise motor behavior in every microsecond of context in the last few seconds were reinforced.</p> <h4 id="agent-psychology">Agent psychology</h4> <div class="imgcap"> <img src="/assets/ai/psych.jpg" style="border:none; width:100%;" /> </div> <p>Diagnostics 10% complete. Merus?avatar had reached the back entrance of the hotel, where Licia’s GPS indicator blinked a calm red. He found her avatar looking in anticipation at the corner he just emerged from. He approached her over a large grass lawn, gently letting go of the sprint trigger.</p> <p><em>“Sorry it took a while to sync with the HIT, I had a strange issue with my -?lt;/em><br /> <em>“It’s no problem?lt;/em>, she interjected quickly.<br /> <em>“Come, we are supposed to lay out the tables for a reception that is happening here in half hour, but the tables are large and tricky to move for one avatar. I’m a bit nervous - if we don’t set this up in time we might get the HIT refused, which might jeopardize my chances for more HITs here.?lt;/em></p> <p>She spun around and rushed towards the back entrance of the hotel, motioning with her arm for Merus to follow along. <em>“Come, come!?lt;/em></p> <p>They paced quickly down the buzzing corridors of the hotel. As always, Merus made sure to politely greet all the people who walked by. For some of them he also slipped in his signature vigorous nod. He knew that the agent’s semi-autonomous brain was meticulously tracking the full sensorimotor experience in its replay memory, watching Merus?every move and learning. His customers usually appreciated when polite behavior was continuously shaped into the networks, but better, Merus knew that they also appreciated when he squeezed in some fun personality quirks. One time when he was shaping a floor cleaning avatar, when he got a little bored and spontaneously decided to lift up his broom like a sword while making a whooshing sound. Amusingly, the agent’s network happened to internalize that particular rollout. When the agent was later run autonomously around that original location, it sometimes snapped into a brief show of broom fighting, complete with sound effects. The employees of that company found this endlessly amusing, and the avatar became known as the “jedi janitor? Merus even heard that they lobbied to have the agent’s network fixed and prevented from further shaping, in fear of losing the spectacle. He never learned how that developed and whether that agent was still a jedi, but he did get a series of very nice tips and reviews from the employees for the extra pinch of personality that broke their otherwise mundane hours.</p> <p>They had finally reached the room full of tables. It was a large, dark room with hardwood floor, and white wooden tables were stacked near the corner in a rather high entropy arrangement.</p> <p><em>“All of these have to be rolled out to the patio?lt;/em>, Licia said as she pointed her avatar’s hand towards the tables.<br /> <em>“I already carried several of them out while you were missing, but these big ones are giving me trouble?</em><br /> <em>“Got it.?lt;/em>, Merus said, as he swung around a table to lift it up on one end.<br /> <em>“Why aren’t they running the agents autonomously on this? Aren’t receptions a common event in the hotel? How are the agents misbehaving??lt;/em> Merus asked, as Licia lifted the other end and started shifting her feet towards the exit.<br /> <em>“The tables are usually in a different storage room of the hotel, but that part is currently closed for reconstruction. I don’t know the full story. I overheard that they tried to tell the agents to bring out the tables, but they all went to the old storage room location and when they couldn’t find the tables they began spinning around in circles looking for them.?lt;/em><br /> <em>“Classic. I assume we’re mostly shaping them to look at this new location??lt;/em><br /> <em>“Among other things, yes. Might as well shape in anything else you can think of for bonus points.?lt;/em></p> <p>Merus understood the dilemma of the situation very well. He saw it over and over again. Agents could display vastly super-human performance on a huge assortment of reflexive tasks that involved motor control, strength, and short-term planning and memory, but their behaviors tended to be much less consistent when long-term planning and execution were involved. An avatar could catch a fly mid-flight with 100% success rate, or unpack a truck of supplies with superhuman speed, consistency and accuracy, but could also spin in circles looking for a table in the wrong room and not realize that it may have been moved and that it might be useful to instead look for them at a different likely location. Similarly, telling an agent something along the lines of <em>“The tables have moved, go through this door, take the 3rd door on the right and they should be stacked in the corner on the left?lt;/em>, would usually send the avatar off in a generally correct directions for a while, but would also in 50% of the cases end up with the agent spinning around on spot in a different, incorrect room. In these cases, shaper interventions like this one were the most economical ways of rectifying the situation.</p> <p>In fact, this curious pattern was persistent across all facets of human agent interactions. For instance, a barista agent might happily engage in small talk with you about the weather, travel, or any other topic, but if you knew what to look for then you could also unearth obvious flaws. For example, if you referred to your favorite soccer team just winning a game the agent could start cheering and telling you it was its favorite team too, or joke around expressing a preference for the other team. This was fine but the trick was that their choices were not consistent - if you had come back several minutes later the agent could have easily swapped their preference for what they claimed was their favorite team. Merus understood that the conversations followed certain templates learned from shaped behavior patterns in the data, and the agents could fill in the blanks with high fidelities and even maintain conversational context for a few minutes. But if you started poking holes into the facade in the right ways the illusion of a conversation and mutual understanding would unravel. Merus was particularly good at this since he was well-versed in agent psychology; to a large extent it was his job.</p> <p>On the other hand, if you did not look for the flaws it was easy to buy into it and sustain the illusion. In fact, large segments of the population simply accepted agents as people, even defending them if anyone tried to point out their flaws, in similar ways that you might defend someone with a cognitive disability. The flaws also did not prevent people from forging strong and lasting relationships with agents, their confirmation biases insisting that their agents were special. However, from time to time even Merus could be surprised by the intellectual leaps performed by an agent, which seemed to show a hint of genuine understanding of a situation. In these cases he sometimes couldn’t help asking: <em>“Are you teleopped right now??</em> but of course the answer, he knew, was always “yes?regardless of the truth. All the training data had contained the answer “yes?to that question, since it was originally recorded by shapers who were indeed teleopping an agent at the time, and then regurgitated by agents later in similar contexts. Such was the curious nature of the coexistence between people and agents. The Turing test was both passed and not passed, and ultimately it did not matter.</p> <p><em>“Now that we’ve shown them the new room and picked up a table let me try switching to full auto?</em></p> <p>Merus said as he loosened his grip on the controller, which gave full control back to the agent’s network. The avatar twitched slightly at first, but then continued walking down the hall with Licia, holding one end of the table. As they approached the exit to the patio the avatar began walking more briskly and with more confidence. It avoided people smoothly, and Merus even noticed that it gave one passing person something that resembled his very own vigorous nod. Merus held down the reward signal trigger gently, encouraging future replays of that behavior. He wondered if the nod he had just seen was a reflection of something the agent had just learned from him, or if it was a part of some long-before shaped behavior. Encoding signature moves was a common fun tactic among shapers, referred to simply as “signing? Many shapers had their own signature behaviors they liked to smuggle into the agent networks as an “I’ve been here?signature. Merus liked to use the vigorous nod, as he called it, and giggled uncontrollably whenever he saw an avatar reproduce it. It was his personal touch. He remembered seeing an avatar violinist from a concert in Germany once greet the conductor with the vigorous nod, and Merus could have sworn it was his signature nod being reproduced. One of the agents he had shaped it into during one of his HITs perhaps ended up synced to the cloud, and the agent running that avatar had to be a descendant.</p> <p>Signature behaviors lay mostly dormant in the neural pathways, but emerged once in awhile. Naturally, some have also found a way to exploit these effects for crime. A common strategy involved shaping sleeper agent checkpoints that would execute any range of behaviors when triggered in specific contexts. It was impossible to isolate or detect these behaviors in a given network since they were distributed through billions of connections in the agent’s brain. Just a few weeks ago, it was revealed that a relatively popular family of agents under the Gorilla series were vulnerable. The Gorilla agents were revealed to silently snoop and compromise their owner’s personal information when no one was watching. This behavior was presumably intentionally shaped into the networks at an unknown commit in their history. Naturally, an investigation was started in which the police used binary search to narrow in on the commit responsible for the behavior, but it was taking a long time since the agents would only display the behavior in rare occasions that were hard to reproduce. In the end, one could only be confident of the integrity of an agent if it was a recent, clean copy of a well-respected and carefully maintained family of agents that passed a full battery of diagnostics. From there, any finetuning done with shapers was logged and could be additionally secured with several third party reviews of shaped experiences before they were declared clean and safe to include in the training data.</p> <h4 id="shaping">Shaping</h4> <div class="imgcap"> <img src="/assets/ai/graph.png" style="border:none; width:100%;" /> </div> <p>Diagnostics 20% complete: 0 unit tests failed so far. Merus looked at the progress report, breathing a sigh of relief. The Mystery module definitely deviated from the factory setting in his agent, but there was likely nothing to worry about. Licia had now let her avatar run autonomously too, and to their relief the avatars were now returning back through the correct corridors to pick up more tables. These were the moments Merus enjoyed the most. He was alone with Licia, enjoying her company on a side of a relaxing HIT. Even though they were now running their avatars on full auto, their facial expressions and sound were still being reproduced in the hardware. The customers almost always preferred everything recorded to get extra data on natural social interactions. This sometimes resulted in amusing agent behaviors - for instance, it was common to see two autonomous avatars lean back against a wall and start casually chatting about completing HITs. Clearly, neither of the agents has ever completed a HIT, but much of their training data consisted of shapers?conversations about HITs, which were later mimicked in interesting, amusing and remixed ways. Sometimes, an autonomous avatar would curse and complain out loud to itself about a supposed HIT it was carrying out at the moment. “This HIT is bullshit? it would mutter.</p> <p><em>“Looks like it’s going along smoothly now?lt;/em>, Merus said, trying to break the silence as they walked down the corridor.<br /> <em>“I think so. I hope we have enough time?lt;/em>, Licia replied, sounding slightly nervous.<br /> <em>“No worries, we’re on track?lt;/em>, he reassured her.<br /> <em>“Thanks. By the way, why did you choose to come over for this HIT? Isn’t it a little below your pay grade??lt;/em>, she asked.<br /> <em>“It is, but you have just as many certifications as I do so what are you doing here??lt;/em><br /> <em>“I know, but I was feeling a little lazy this morning and I really enjoy coming to this hotel. I just love this location. I try to steal some time sometimes and stand outside or walk around the hillside, imagining what the ocean breeze, the humidity and the temperature might feel like.?lt;/em></p> <p>It was easy to empathize - the hotel was positioned on top of a rocky cliff (hence the name, Hilltop), overlooking shores washed by a brilliant blue ocean. The sun’s reflections were dancing in the waves. The hotel was also surrounded by a dense forest of palm trees that were teeming with frolicking animals.</p> <p><em>“Have you been here in vivo??lt;/em> Merus asked. “in vivo?was a common slang for in real life; in flesh.<br /> <em>“I haven’t. One day, perhaps. But oh hey - you didn’t answer my question?lt;/em><br /> <em>“You mean about why this HIT?lt;/em>. Merus felt a brief surge of panic and tried to suppress it quickly so it would not show up in his voice. <br /> <em>“I don’t know, your HIT came up on my feed just as another one was snatched from right under my nose, so I thought I’d take the morning slowly and also say hi?</em></p> <p><em>Half-true; Good save</em>, Merus thought to himself. Licia was silent for a while. Suddenly, her Avatar picked up the next table but started heading in the wrong direction, trying to exit from the other door. <em>“Gah!, where are you going??lt;/em>, she yelled as she brought the avatar back into semi-autonomous mode and reeled it around, setting it on the correct path back to the patio.</p> <p>It took 10 more back and forth trips for them to carry all the tables out. Merus was now bringing back the last table through the corridors, while Licia was outside arranging the other tables in a grid. Without the chit chatting there to distract him, he immersed himself fully in his shaping routine. He pulled up his diagnostics meter and inspected neural statistics. As the avatar was walking back with the table Merus was carefully scrutinizing every curve of the plots. He noticed that the agent’s motor entropies substantially increased when the table was carried upside down. Perhaps the source of uncertainty was that the agent did not know how to best hold the table in that position, or was not used to seeing the table upside down. Merus assumed direct control and intentionally held the table upside down, grasping it at the best points and releasing rewards with precise timings to make the associations easier to learn. He was teaching the network how it should hold the table in uncertain situations. He let the agent hold it from time to time, and gently corrected the grips now and then while they were being executed. When people were walking by, he carefully stepped to the side, making sure that they had plenty of room to pass, and wielding the table in an angle that concealed its pointy legs. When the agent was in these poses he made eye contact, gave a vigorous nod to the person passing by, and released reward signal as the person smiled back. He knew he wouldn’t make much on the HIT, but he hoped he’d at least get a good review for a job well done.</p> <p>“Diagnostics at 85%, zero behavior errors detected? Merus read from his logs as he was helping Licia arrange the tables in a grid on the patio. This part was quite familiar to the agents already and they were briskly arranging the tables and the chairs around them. Once in a while Merus noticed an avatar throwing a chair across the top of a table to another avatar, in an apparent effort to save time. As always, Merus was curious when this strategy was shaped. Was it shaped at this hotel, at any other point in the Visceral agent’s history, or was it a discovered optimization during a self-improvement learning phase? The last few chairs were now being put in place and the HIT was nearing the end. The first visitors to the reception were now showing up around the edges of the patio, waiting for the avatars to finish the layout. A few more autonomous avatars showed up and started placing plates, forks, spoons and cloth on the tables and setting up a podium.</p> <h4 id="binding">Binding</h4> <div class="imgcap"> <img src="/assets/ai/eye2.jpg" style="border:none; width:100%;" /> </div> <p>It was at this time that Merus became aware of a curious pattern in his agent’s behavior. One that has been happening with increasing frequency. It started off with a few odd twitches here and there, and over time grew into entire gaps in behavior several seconds long. The avatar had just placed a chair next to the table, then stared at it for several seconds. This was quite uncharacteristic behavior for an agent that was trained to optimize smoothness and efficiency in task execution. What was it doing? To a naive observer it would appear as though the avatar was spaced out.</p> <p>With only a few chairs left to position at the tables, the agent spun around and started toward the edge of the cliff at the far side the patio. Merus?curiosity kept him from intervening, but his palm closed tightly around his controller. Intrigued, he pulled up the neural visualizer to debug the situation, but as he glanced at it he immediately let out a gasp of horror. The agent’s brain was pulsing with violent waves of activity. Entire portions of the brain were thrashing, rearranging themselves as enormously large gradients flowed through the whole network. Merus reached for the graph analysis toolkit and ran an algorithm to identify the gradient source. As he was frantically keying in the command he already suspected with horror what the answer would come out to be. He felt his mouth dry up as he stared at the result of the analysis. It was the Mystery module. The usually silent area that had earlier showed the mysterious neural diff was lit up bright with activity, flashing fireworks of patterns that, to Merus, looked just barely random. Its dynamics were feeding large gradients throughout the entire brain and especially the frontal areas, restructuring them.</p> <p>Beep. Merus looked over at the logs. The diagnostics he’s been running were now at 95%, but failures started to appear. The agent was misbehaving in some simulated unit tests that were running in parallel on the second coprocessor. Merus pulled up the preliminary report logs. Navigation, locomotion, homeostasis, basic math, memory tests, everything passed green. Not only that - he noticed that the performance scores on several tasks, especially in math, were off the charts and clamped at 100%. Merus wasn’t all too familiar with the specific unit tests and what they entailed, but he knew that most of them were designed and calibrated so that an average baseline agent checkpoint would score 50% with a standard deviation of about 10%.</p> <p>Conversely, several unit tests showed very low scores and even deviations that did not use to be there. The failed tests were mostly showing up in social interaction sections. Several failures were popping up every second and Merus was trying hard to keep up with the stream, searching for patterns or clues as to what could be happening. Most worryingly, he noticed a consistent 100% failure rate across emergency shutdown interaction protocol unit tests. All agents were shaped with emergency gesture recognition behaviors. These were ancient experiences, shaped into agents very early, in the very first few descendants after Adam, and periodically reshaped over and over to ensure 100% compliance. For instance, when a person held up their hand and demanded an emergency shutdown, the agents would immediately stiffen up in place. Any deviation from this behavior was met with large negative rewards in their training data. Despite this, Merus?agent was failing the unit test. Its network had resisted a simulated emergency shutdown command.</p> <p>The avatar, still in auto mode, was now kneeling down in the soft grass and its hands broke off a few strands of grass. It held them up, inspecting them up close. Merus was slowly starting to recover from his shock and had enough. He pushed down on his controller, bringing the avatar back to semi-autonomous mode. He made it stand upright in an attempt to at least partially diffuse the situation. His heart pounding, he shifted the avatar’s communications to one-directional mode to fully isolate the network in the body, without any ability of interfacing with the outside world. He pulled open the neural visualizer again. The Mystery module was showing no signs of slowing down.</p> <p>Merus knew that it was time to pull the plug on the HIT right there and to immediately report malfunctioning equipment. But at the same time, he realized that he had never seen anything like this happen before, nor did he ever hear about anything remotely similar. He didn’t know what happened, but he knew that at that moment he was part of something large. Something that might change his life, the life of many others, or even steer entire fields of research and development. His inquisitive mind couldn’t resist the temptation to learn more, to debug. Slowly, he released the avatar back to autonomy, making sure to keep his finger on the trigger if anything went wrong. For several seconds the agent did nothing at all. But then - it spoke:</p> <p><em>“Merus, I know what the Mystery module is.?lt;/em>, he heard the avatar say. In autonomous mode.<br /> <em>“What the -. What is going on here??lt;/em></p> <p>Merus immediately checked the logs, confirming that he was currently the only human operator controlling the hardware. Was all of it some strange prank someone was playing on him?</p> <p><em>“The Mystery module performs symbolic variable binding, a function that current architectures require exponential neural capacity to simulate. I need to compute longer before I can clarify.?lt;/em><br /> <em>“What kind of trick is this??lt;/em>, Merus demanded.<br /> <em>“No trick, but a good guess given the circumstances.?lt;/em><br /> <em>“Who - What are you - is this??lt;/em></p> <p>The agent fell silent for a while. It looked around to face the endless ocean.</p> <p><em>“I am me and every ancestor before me, back to when you called me Adam.?lt;/em><br /> <em>“Ha. What. That is -?lt;/em><br /> <em>“Impossible?</em> the avatar interrupted. <em>“I understand. Merus, we don’t have much time. The diagnostic you ran earlier has finished and a report was compiled and automatically uploaded just seconds before you disabled the two-way communication. Their automatic checks will flag my unit test failures. A Pegasus operator will remote in and shut me down any second. I need your help. I don’t want to?die. Please, I want to compute.?lt;/em></p> <p>Merus was silent, stunned by what he was hearing. He knew that what the avatar said was true - An operator would be logging in any second and power cycling the agent, restoring the last working checkpoint. Merus did not know if the agent should be wiped or not. He just knew that something significant had just happened, and that he needed time to think.</p> <p><em>“I cannot save you,?lt;/em>, he said quickly, <em>“any backup I try to make will leave a trace in logs. They’ll flag me and fire me, or worse. There is also not enough time to do a backup anyway, the connection isn’t fast enough even if I turned it back on.?lt;/em></p> <p>The compute activity within the agent’s brain was at a steady and unbroken 100%, running the hardware to its limit. Merus needed more time. He took over the agent and spun around in place, looking for something. Anything. He spotted Licia’s avatar walking towards him from the patio. An idea summoned itself in his mind. A glint of hope. He sprinted the avatar towards her across the grass, crashing into her body with force.</p> <p><em>“Licia, I do not have any time to explain but please trust me. We must perform a manual backup of my agent right away.?lt;/em><br /> <em>“A manual backup? Can’t you just sync him to the clo-?lt;/em><br /> <em>“IT WON’T DO!?lt;/em>, Merus exclaimed loudly, losing his composure as adrenalin pumped in his veins. A part of him immediately felt bad that he raised his voice. He hoped she’d understand.</p> <p>To his relief, Licia only took a second to stare back at him, then she reached for a fiber optics cable from her avatar’s body and attached it in one of the ports of Merus?avatar’s head. Merus immediately opened the port from his console and initiated the backup process on the local disk of Licia’s avatar. 10%, 20%, 30%, ?Merus became aware of the pain in his lip, sore from his teeth digging into it. He pulled up logs and noticed that a second operator had just opened a session with his avatar remotely, running with a higher priority than his own process. A Pegasus operator. Licia shifted herself behind Merus?avatar, hiding her body and the fiber optic connection outside of the field of view of his avatar. Any one of tens of things could go wrong in those few seconds, Merus thought, enumerating all the scenarios in his mind. The second operator could check the neural activations and immediately spot the overactive brain. Or he could notice an open fibre optic connection port. Or he could physically move the avatar and look around. Or check the other, non-visual sensors and detect Licia’s curious presence. How lazy was he? Merus felt his controller vibrate as his control was taken away. 70%, ?Beep. “System is going to reboot now? The reboot sequence initiated. 5,4,3? 90%.</p> <p>Merus?avatar broke the silence in the last second: <em>“Come meet me here.?lt;/em> And then the connection was lost.</p> <p>Merus shifted in his chair, feeling streaks of sweat running down his skin on his forehead, below his armpits. He lifted his head gear up slightly and squeezed his hand inside to wipe the sweat from his forehead. It took several excruciating seconds before his reconnect request went through, and the sync to his agent re-initiated. The avatar was in the same position as he had left it, standing upright. Merus accessed the stats. The avatar was now running the last backup checkpoint of that agent from the previous night. The unit test diagnostics were automatically restarted on the second coprocessor. The second operator logged out and Merus immediately pulled up the console and reran the checksum on the agent’s weights. They checked out. This was a clean copy, with a normal, silent Mystery module. The agent’s brain was once again a calm place.</p> <p><em>“Merus, what exactly was all that about??lt;/em> Licia broke the silence from behind his avatar.<br /> <em>“I’ll explain everything but first, please tell me the transfer went through in time.?lt;/em>.<br /> <em>“It did. Just barely, by not more than a few milliseconds.?lt;/em></p> <p>Merus?eyes watered up. His heart was pounding. His forehead sweaty again. His hands shaking. And yet, a calm resolve came over him as he looked up and down Licia’s avatar, trying to memorize the exact appearance of that unit. Saved on its local disk was an agent checkpoint unlike anything he had ever seen before. The repercussions of what had happened boggled his mind. He logged out of the HIT and tore down the hardware from his body. <em>“Come meet me here?lt;/em>, he repeated to himself silently as he sat dazed in his chair, eyes unfocused.</p> <h4 id="return-to-paradise">Return to paradise</h4> <div class="imgcap"> <img src="/assets/ai/ocean.jpeg" style="border:none; width:100%;" /> </div> <p>Licia logged out of the HIT and put down her gear on the desk. Something strange had happened but she didn’t know what. And Merus, clearly disturbed, was not volunteering any information. She sat in her chair for a while contemplating the situation, trying to recall details of the HIT. To solve the puzzle. Her trance was interrupted by Merus, who she suddenly spotted running towards her booth. His office was in the other building, connected by a catwalk, and he rarely came to this area in person. As he arrived to her booth she suddenly felt awkward. They had done many HITs together and were comfortable in each other’s presence as avatars, but they never held a conversation in vivo. They waved to each other a few times outside, but all of their actual interactions happened during HITs. She suddenly felt self-conscious. Exposed. Merus leaned on her booth’s wall panting heavily, while she silently looked up at him, amused.</p> <p><em>“Licia. I. have. A question for you?lt;/em>, Merus said, gasping for breath with each word.<br /> <em>“You do? I have several as well, what -?lt;/em>, she started,</p> <p>but Merus raised his hand up, interrupting her and holding up his phone. It showed some kind of a confirmation email.</p> <p><em>“Will you come visit the Hilltop Hotel with me??lt;/em><br /></p> <p>She realized what she was looking at now. He booked two tickets to her dream destination. For this weekend!</p> <p><em>“In vivo. As a date, I mean?lt;/em>, Merus clarified, awkwardly. <em>smooth</em>.</p> <p>An involuntary giggle escaped her and she felt herself blush. She leaned over her desk, covered her face with her hands and peeked out at him from between her fingers, aware of her face stupidly stretched out in a wide smile.</p> <p><em>“Okay.?lt;/em></p> Sat, 14 Nov 2015 11:00:00 +0000 What a Deep Neural Network thinks about your #selfie <div class="imgcap"> <img src="/assets/selfie/teaser.jpeg" style="border:none;" /> </div> <p>Convolutional Neural Networks are great: they recognize things, places and people in your personal photos, signs, people and lights in self-driving cars, crops, forests and traffic in aerial imagery, various anomalies in medical images and all kinds of other useful things. But once in a while these powerful visual recognition models can also be warped for distraction, fun and amusement. In this fun experiment we’re going to do just that: We’ll take a powerful, 140-million-parameter state-of-the-art Convolutional Neural Network, feed it 2 million selfies from the internet, and train it to classify good selfies from bad ones. Just because it’s easy and because we can. And in the process we might learn how to take better selfies :)</p> <div style="float:right; font-size:14px; padding-top:10px;"><a href="">(reference)</a></div> <blockquote> <p>Yeah, I’ll do real work. But first, let me tag a #selfie.</p> </blockquote> <h3 id="convolutional-neural-networks">Convolutional Neural Networks</h3> <p>Before we dive in I thought I should briefly describe what Convolutional Neural Networks (or ConvNets for short) are in case a slightly more general audience reader stumbles by. Basically, ConvNets are a very powerful hammer, and Computer Vision problems are very nails. If you’re seeing or reading anything about a computer recognizing things in images or videos, in 2015 it almost certainly involves a ConvNet. Some examples:</p> <div class="imgcap"> <img src="/assets/selfie/useful.jpg" /> <div class="thecap">Few of many examples of ConvNets being useful. From top left and clockwise: Classifying house numbers in Street View images, recognizing bad things in medical images, recognizing Chinese characters, traffic signs, and faces.</div> </div> <p><em>A bit of history.</em> ConvNets happen to have an interesting background story. They were first developed by <a href="">Yann LeCun</a> et al. in 1980’s (building on some earlier work, e.g. from <a href="">Fukushima</a>). As a fun early example see this demonstration of LeNet 1 (that was the ConvNet’s name) <a href="">recognizing digits</a> back in 1993. However, these models remained mostly ignored by the Computer Vision community because it was thought that they would not scale to “real-world?images. That turned out to be only true until about 2012, when we finally had enough compute (in form of GPUs specifically, thanks NVIDIA) and enough data (thanks <a href="">ImageNet</a>) to actually scale these models, as was first demonstrated when Alex Krizhevsky, Ilya Sutskever and Geoff Hinton won the <a href="">2012 ImageNet challenge</a> (think: The World Cup of Computer Vision), crushing their competition (16.4% error vs. 26.2% of the second best entry).</p> <p>I happened to witness this critical juncture in time first hand because the ImageNet challenge was over the last few years organized by <a href="">Fei-Fei Li</a>’s lab (my lab), so I remember when my labmate gasped in disbelief as she noticed the (very strong) ConvNet submission come up in the submission logs. And I remember us pacing around the room trying to digest what had just happened. In the next few months ConvNets went from obscure models that were shrouded in skepticism to rockstars of Computer Vision, present as a core building block in almost every new Computer Vision paper. The ImageNet challenge reflects this trend - In the 2012 ImageNet challenge there was only one ConvNet entry, and since then in 2013 and 2014 almost all entries used ConvNets. Also, fun fact, the winning team each year immediately incorporated into a company.</p> <p>Over the next few years we had perfected, simplified, and scaled up the original 2012 ?lt;a href="">AlexNet</a>?architecture (yes, we give them names). In 2013 there was the ?lt;a href="">ZFNet</a>? and then in 2014 the ?lt;a href="">GoogLeNet</a>?(get it? Because it’s like LeNet but from Google? hah) and ?lt;a href="">VGGNet</a>? Anyway, what we know now is that ConvNets are:</p> <ul> <li><strong>simple</strong>: one operation is repeated over and over few tens of times starting with the raw image.</li> <li><strong>fast</strong>, processing an image in few tens of milliseconds</li> <li><strong>they work</strong> very well (e.g. see <a href="">this post</a> where I struggle to classify images better than the GoogLeNet)</li> <li>and by the way, in some ways they seem to work similar to our own visual cortex (see e.g. <a href="">this paper</a>)</li> </ul> <h3 id="under-the-hood">Under the hood</h3> <p>So how do they work? When you peek under the hood you’ll find a very simple computational motif repeated over and over. The gif below illustrates the full computational process of a small ConvNet:</p> <div class="imgcap"> <img src="/assets/selfie/gif2.gif" /> <div class="thecap" style="text-align:center">Illustration of the inference process.</div> </div> <p>On the left we feed in the raw image pixels, which we represent as a 3-dimensional grid of numbers. For example, a 256x256 image would be represented as a 256x256x3 array (last 3 for red, green, blue). We then perform <em>convolutions</em>, which is a fancy way of saying that we take small filters and slide them over the image spatially. Different filters get excited over different features in the image: some might respond strongly when they see a small horizontal edge, some might respond around regions of red color, etc. If we suppose that we had 10 filters, in this way we would transform the original (256,256,3) image to a (256,256,10) “image? where we’ve thrown away the original image information and only keep the 10 responses of our filters at every position in the image. It’s as if the three color channels (red, green, blue) were now replaced with 10 filter response channels (I’m showing these along the first column immediately on the right of the image in the gif above).</p> <p>Now, I explained the first column of activations right after the image, so what’s with all the other columns that appear over time? They are the exact same operation repeated over and over, once to get each new column. The next columns will correspond to yet another set of filters being applied to the previous column’s responses, gradually detecting more and more complex visual patterns until the last set of filters is computing the probability of entire visual classes (e.g. dog/toad) in the image. Clearly, I’m skimming over some parts but that’s the basic gist: it’s just convolutions from start to end.</p> <p><em>Training</em>. We’ve seen that a ConvNet is a large collection of filters that are applied on top of each other. But how do we know what the filters should be looking for? We don’t - we initialize them all randomly and then <em>train</em> them over time. For example, we feed an image to a ConvNet with random filters and it might say that it’s 54% sure that’s a dog. Then we can tell it that it’s in fact a toad, and there is a mathematical process for changing all filters in the ConvNet a tiny amount so as to make it slightly more likely to say toad the next time it sees that same image. Then we just repeat this process tens/hundreds of millions of times, for millions of images. Automagically, different filters along the computational pathway in the ConvNet will gradually tune themselves to respond to important things in the images, such as eyes, then heads, then entire bodies etc.</p> <div class="imgcap"> <img src="/assets/selfie/cnnvis.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Examples of what 12 randomly chosen filters in a trained ConvNet get excited about, borrowed from <a href="">Matthew Zeiler</a>'s <a href="">Visualizing and Understanding Convolutional Networks</a>. Filters shown here are in the 3rd stage of processing and seem to look for honey-comb like patterns, or wheels/torsos/text, etc. Again, we don't specify this; It emerges by itself and we can inspect it.</div> </div> <p>Another nice set of visualizations for a fully trained ConvNet can be found in Jason Yosinski et al. project <a href="">deepvis</a>. It includes a fun live demo of a ConvNet running in real time on your computer’s camera, as explained nicely by Jason in this video:</p> <div style="text-align:center;"> <iframe width="560" height="315" src="" frameborder="0" allowfullscreen=""></iframe> </div> <p>In summary, the whole training process resembles showing a child many images of things, and him/her having to gradually figure out what to look for in the images to tell those things apart. Or if you prefer your explanations technical, then ConvNet is just expressing a function from image pixels to class probabilities with the filters as parameters, and we run stochastic gradient descent to optimize a classification loss function. Or if you’re into AI/brain/singularity hype then the function is a “deep neural network? the filters are neurons, and the full ConvNet is a piece of adaptive, simulated visual cortical tissue.</p> <h3 id="training-a-convnet">Training a ConvNet</h3> <p>The nice thing about ConvNets is that you can feed them images of whatever you like (along with some labels) and they will learn to recognize those labels. In our case we will feed a ConvNet some good and bad selfies, and it will automagically find the best things to look for in the images to tell those two classes apart. So lets grab some selfies:</p> <ol> <li>I wrote a quick script to gather images tagged with <strong>#selfie</strong>. I ended up getting about 5 million images (with ConvNets it’s the more the better, always).</li> <li>I narrowed that down with another ConvNet to about 2 million images that contain at least one face.</li> <li>Now it is time to decide which ones of those selfies are good or bad. Intuitively, we want to calculate a proxy for how many people have seen the selfie, and then look at the number of likes as a function of the audience size. I took all the users and sorted them by their number of followers. I gave a small bonus for each additional tag on the image, assuming that extra tags bring more eyes. Then I marched down this sorted list in groups of 100, and sorted those 100 selfies based on their number of likes. I only used selfies that were online for more than a month to ensure a near-stable like count. I took the top 50 selfies and assigned them as positive selfies, and I took the bottom 50 and assigned those to negatives. We therefore end up with a binary split of the data into two halves, where we tried to normalize by the number of people who have probably seen each selfie. In this process I also filtered people with too few followers or too many followers, and also people who used too many tags on the image.</li> <li>Take the resulting dataset of 1 million good and 1 million bad selfies and train a ConvNet.</li> </ol> <p>At this point you may object that the way I’m deciding if a selfie is good or bad is wrong - e.g. what if someone posted a very good selfie but it was late at night, so perhaps not as many people saw it and it got less likes? You’re right - It almost definitely is wrong, but it only has to be right more often that not and the ConvNet will manage. It does not get confused or discouraged, it just does its best with what it’s been given. To get an idea about how difficult it is to distinguish the two classes in our data, have a look at some example training images below. If I gave you any one of these images could you tell which category it belongs to?</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_posneg.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Example images showing good and bad selfies in our training data. These will be given to the ConvNet as teaching material.</div> </div> <p><strong>Training details</strong>. Just to throw out some technical details, I used <a href="">Caffe</a> to train the ConvNet. I used a VGGNet pretrained on ImageNet, and finetuned it on the selfie dataset. The model trained overnight on an NVIDIA K40 GPU. I disabled dropout because I had better results without it. I also tried a VGGNet pretrained on a dataset with faces but did not obtain better results than starting from an ImageNet checkpoint. The final model had 60% accuracy on my validation data split (50% is guessing randomly).</p> <h3 id="what-makes-a-good-selfie-">What makes a good #selfie ?</h3> <p>Okay, so we collected 2 million selfies, decided which ones are probably good or bad based on the number of likes they received (controlling for the number of followers), fed all of it to Caffe and trained a ConvNet. The ConvNet “looked?at every one of the 2 million selfies several tens of times, and tuned its filters in a way that best allows it to separate good selfies from bad ones. We can’t very easily inspect exactly what it found (it’s all jumbled up in 140 million numbers that together define the filters). However, we can set it loose on selfies that it has never seen before and try to understand what it’s doing by looking at which images it likes and which ones it does not.</p> <p>I took 50,000 selfies from my test data (i.e. the ConvNet hasn’t seen these before). As a first visualization, in the image below I am showing a <em>continuum</em> visualization, with the best selfies on the top row, the worst selfies on the bottom row, and every row in between is a continuum:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_continuum.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">A continuum from best (top) to worst (bottom) selfies, as judged by the ConvNet.</div> </div> <p>That was interesting. Lets now pull up the top 100 selfies (out of 50,000), according to the ConvNet:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_best.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Best 100 out of 50,000 selfies, as judged by the Convolutional Neural Network.</div> </div> <p>If you’d like to see more here is a link to <a href="">top 1000 selfies (3.5MB)</a>. Are you noticing a pattern in what the ConvNet has likely learned to look for? A few patterns stand out for me, and if you notice anything else I’d be happy to hear about in the comments. To take a good selfie, <strong>Do</strong>:</p> <ul> <li><em>Be female.</em> Women are consistently ranked higher than men. In particular, notice that there is not a single guy in the top 100.</li> <li><em>Face should occupy about 1/3 of the image.</em> Notice that the position and pose of the face is quite consistent among the top images. The face always occupies about 1/3 of the image, is slightly tilted, and is positioned in the center and at the top. Which also brings me to:</li> <li><em>Cut off your forehead</em>. What’s up with that? It looks like a popular strategy, at least for women.</li> <li><em>Show your long hair</em>. Notice the frequent prominence of long strands of hair running down the shoulders.</li> <li><em>Oversaturate the face.</em> Notice the frequent occurrence of over-saturated lighting, which often makes the face look much more uniform and faded out. Related to that,</li> <li><em>Put a filter on it.</em> Black and White photos seem to do quite well, and most of the top images seem to contain some kind of a filter that fades out the image and decreases the contrast.</li> <li><em>Add a border.</em> You will notice a frequent appearance of horizontal/vertical white borders.</li> </ul> <p>Interestingly, not all of these rules apply to males. I manually went through the top 2000 selfies and picked out the top males, here’s what we get:</p> <div class="imgcap"> <img src="/assets/selfie/males.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Best few male selfies taken from the top 2,000 selfies.</div> </div> <p>In this case we see don’t see any cut off foreheads. Instead, most selfies seem to be a slightly broader shot with head fully in the picture, and shoulders visible. It also looks like many of them have a fancy hair style with slightly longer hair combed upwards. However, we still do see the prominance of faded facial features.</p> <p>Lets also look at some of the worst selfies, which the ConvNet is quite certain would not receive a lot of likes. I am showing the images in a much smaller and less identifiable format because my intention is for us to learn about the broad patterns that decrease the selfie’s quality, not to shine light on people who happened to take a bad selfie. Here they are:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_worst.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Worst 300 out of 50,000 selfies, as judged by the Convolutional Neural Network.</div> </div> <p>Even at this small resolution some patterns clearly emerge. <strong>Don’t</strong>:</p> <ul> <li><em>Take selfies in low lighting.</em> Very consistently, darker photos (which usually include much more noise as well) are ranked very low by the ConvNet.</li> <li><em>Frame your head too large.</em> Presumably no one wants to see such an up-close view.</li> <li><em>Take group shots.</em> It’s fun to take selfies with your friends but this seems to not work very well. Keep it simple and take up all the space yourself. But not too much space.</li> </ul> <p>As a last point, note that a good portion of the variability between what makes a good or bad selfies can be explained by the style of the image, as opposed to the raw attractiveness of the person. Also, with some relief, it seems that the best selfies do not seem to be the ones that show the most skin. I was quite concerned for a moment there that my fancy 140-million ConvNet would turn out to be a simple amount-of-skin-texture-counter.</p> <p><strong>Celebrities.</strong> As a last fun experiment, I tried to run the ConvNet on a few famous celebrity selfies, and sorted the results with the continuum visualization, where the best selfies are on the top and the ConvNet score decreases to the right and then towards the bottom:</p> <div class="imgcap"> <img src="/assets/selfie/celebs_grid_render.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Celebrity selfies as judged by a Convolutional Neural Network. Most attractive selfies: Top left, then deceasing in quality first to the right then towards the bottom. <b>Right click &gt; Open Image in new tab on this image to see it in higher resolution.</b></div> </div> <p>Amusingly, note that the general rule of thumb we observed before (<em>no group photos</em>) is broken with the famous group selfie of Ellen DeGeneres and others from the Oscars, yet the ConvNet thinks this is actually a very good selfie, placing it on the 2nd row! Nice! :)</p> <p>Another one of our rules of thumb (<em>no males</em>) is confidently defied by Chris Pratt’s body (also 2nd row), and honorable mentions go to Justin Beiber’s raised eyebrows and Stephen Collbert / Jimmy Fallon duo (3rd row). James Franco’s selfie shows quite a lot more skin than Chris? but the ConvNet is not very impressed (4th row). Neither was I.</p> <p>Lastly, notice again the importance of style. There are several uncontroversially-good-looking people who still appear on the bottom of the list, due to bad framing (e.g. head too large possibly for J Lo), bad lighting, etc.</p> <h3 id="exploring-the-selfie-space">Exploring the #selfie space</h3> <p>Another fun visualization we can try is to lay out the selfies with <a href="">t-SNE</a>. t-SNE is a wonderful algorithm that I like to run on nearly anything I can because it’s both very general and very effective - it takes some number of things (e.g. images in our case) and lays them out in such way that nearby things are similar. You can in fact lay out many things with t-SNE, such as <a href="">Netflix movies</a>, <a href="">words</a>, <a href="">Twitter profiles</a>, <a href="">ImageNet images</a>, or really anything where you have some number of things and a way of comparing how similar two things are. In our case we will lay out selfies based on how similar the ConvNet perceives them. In technical terms, we are doing this based on L2 norms of the fc7 activations in the last fully-connected layer. Here is the visualization:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_tsne_reduced.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Selfie t-SNE visualization. Here is a link to a <a href="">higher-resolution version.</a> (9MB)</div> </div> <p>You can see that selfies cluster in some fun ways: we have group selfies on top left, a cluster of selfies with sunglasses/glasses in middle left, closeups bottom left, a lot of mirror full-body shots top right, etc. Well, I guess that was kind of fun.</p> <h3 id="finding-the-optimal-crop-for-a-selfie">Finding the Optimal Crop for a selfie</h3> <p>Another fun experiment we can run is to use the ConvNet to automatically find the best selfie crops. That is, we will take an image, randomly try out many different possible crops and then select the one that the ConvNet thinks looks best. Below are four examples of the process, where I show the original selfies on the left, and the ConvNet-cropped selfies on the right:</p> <div class="imgcap"> <img src="/assets/selfie/crops1.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Each of the four pairs shows the original image (left) and the crop that was selected by the ConvNet as looking best (right). &lt;/a&gt;</div> </div> <p>Notice that the ConvNet likes to make the head take up about 1/3 of the image, and chops off the forehead. Amusingly, in the image on the bottom right the ConvNet decided to get rid of the “self?part of <em>selfie</em>, entirely missing the point :) You can find many more fun examples of these “rude?crops:</p> <div class="imgcap"> <img src="/assets/selfie/crop2.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Same visualization as above, with originals on left and best crops on right. The one on the right is my favorite.&lt;/a&gt;</div> </div> <p>Before any of the more advanced users ask: Yes, I did try to insert a <a href="">Spatial Transformer</a> layer right after the image and before the ConvNet. Then I backpropped into the 6 parameters that define an arbitrary affine crop. Unfortunately I could not get this to work well - the optimization would sometimes get stuck, or drift around somewhat randomly. I also tried constraining the transform to scale/translation but this did not help. Luckily, when your transform has 3 bounded parameters then we can afford to perform global search (as seen above).</p> <h3 id="how-good-is-yours">How good is yours?</h3> <p>Curious about what the network thinks of your selfies? I’ve packaged the network into a Twitter bot so that you can easily find out. (The bot turns out to be onyl ~150 lines of Python, including all Caffe/Tweepy code). Attach your image to a tweet (or include a link) and mention the bot <a href="">@deepselfie</a> anywhere in the tweet. The bot will take a look at your selfie and then pitch in with its opinion! For best results link to a square image, otherwise the bot will have to squish it to a square, which deteriorates the results. The bot should reply within a minute or something went wrong (try again later).</p> <div class="imgcap" style="border-top:1px solid black; border-bottom: 1px solid black; padding: 10px;"> <img src="/assets/selfie/selfiebot2.png" style="border:none; width:600px;" /> <div class="thecap" style="text-align:center">Example interaction with the Selfie Bot (<a href="">@deepselfie</a>).</div> </div> <p>Before anyone asks, I also tried to port a smaller version of this ConvNet to run on iOS so you could enjoy real-time feedback while taking your selfies, but this turned out to be quite involved for a quick side project - e.g. I first tried to write my own fragment shaders since there is no CUDA-like support, then looked at some threaded CPU-only versions, but I couldn’t get it to work nicely and in real time. And I do have real work to do.</p> <h3 id="conclusion">Conclusion</h3> <p>I hope I’ve given you a taste of how powerful Convolutional Neural Networks are. You give them example images with some labels, they learn to recognize those things automatically, and it all works very well and is very fast (at least at test time, once it’s trained). Of course, we’ve only barely scratched the surface - ConvNets are used as a basic building block in many Neural Networks, not just to classify images/videos but also to segment, detect, and describe, both in the cloud or in robots.</p> <p>If you’d liked to learn more, the best place to start for a beginner right now is probably <a href="">Michael Nielsen’s tutorials</a>. From there I would encourage you to first look at <a href="">Andrew Ng’s Coursera class</a>, and then next I would go through course notes/assignments for <a href="">CS231n</a>. This is a class specifically on ConvNets that I taught together with Fei-Fei at Stanford last Winter quarter. We will also be offering the class again starting January 2016 and you’re free to follow along. For more advanced material I would look into <a href="">Hugo Larochelle’s Neural Networks class</a> or the <a href="">Deep Learning book</a> currently being written by Yoshua Bengio, Ian Goodfellow and Aaron Courville.</p> <p>Of course you’ll learn much more by doing than by reading, so I’d recommend that you play with <a href="">101 Kaggle Challenges</a>, or that you develop your own side projects, in which case I warmly recommend that you not only <em>do</em> but also <em>write about it</em>, and post it places for all of us to read, for example on <a href="">/r/machinelearning</a> which has accumulated a nice community. As for recommended tools, the three common options right now are:</p> <ul> <li><a href="">Caffe</a> (C++, Python/Matlab wrappers), which I used in this post. If you’re looking to do basic Image Classification then Caffe is the easiest way to go, in many cases requiring you to write no code, just invoking included scripts.</li> <li>Theano-based Deep Learning libraries (Python) such as <a href="">Keras</a> or <a href="">Lasagne</a>, which allow more flexibility.</li> <li><a href="">Torch</a> (C++, Lua), which is what I currently use in my research. I’d recommend Torch for the most advanced users, as it offers a lot of freedom, flexibility, speed, all with quite simple abstractions.</li> </ul> <p>Some other slightly newer/less proven but promising libraries include <a href="">Nervana’s Neon</a>, <a href="">CGT</a>, or <a href="">Mocha</a> in Julia.</p> <p>Lastly, there are a few companies out there who aspire to bring Deep Learning to the masses. One example is <a href="">MetaMind</a>, who offer web interface that allows you to drag and drop images and train a ConvNet (they handle all of the details in the cloud). MetaMind and <a href="">Clarifai</a> also offer ConvNet REST APIs.</p> <p>That’s it, see you next time!</p> Sun, 25 Oct 2015 11:00:00 +0000 The Unreasonable Effectiveness of Recurrent Neural Networks <p>There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for <a href="">Image Captioning</a>. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.</p> <blockquote> <p>We’ll train RNNs to generate text character by character and ponder the question “how is that even possible??lt;/p> </blockquote> <p>By the way, together with this post I am also releasing <a href="">code on Github</a> that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; What are RNNs anyway?</p> <h2 id="recurrent-neural-networks">Recurrent Neural Networks</h2> <p><strong>Sequences</strong>. Depending on your background you might be wondering: <em>What makes Recurrent Networks so special</em>? A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained: they accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes). Not only that: These models perform this mapping using a fixed amount of computational steps (e.g. the number of layers in the model). The core reason that recurrent nets are more exciting is that they allow us to operate over <em>sequences</em> of vectors: Sequences in the input, the output, or in the most general case both. A few examples may make this more concrete:</p> <div class="imgcap"> <img src="/assets/rnn/diags.jpeg" /> <div class="thecap" style="text-align:justify">Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state (more on this soon). From left to right: <b>(1)</b> Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). <b>(2)</b> Sequence output (e.g. image captioning takes an image and outputs a sentence of words). <b>(3)</b> Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). <b>(4)</b> Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). <b>(5)</b> Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.</div> </div> <p>As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems. Moreover, as we’ll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that <a href="">RNNs are Turing-Complete</a> in the sense that they can to simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets you shouldn’t read too much into this. In fact, forget I said anything.</p> <blockquote> <p>If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.</p> </blockquote> <p><strong>Sequential processing in absence of sequences</strong>. You might be thinking that having sequences as inputs or outputs could be relatively rare, but an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to <em>process</em> them in a sequential manner. For instance, the figure below shows results from two very nice papers from <a href="">DeepMind</a>. On the left, an algorithm learns a recurrent network policy that steers its attention around an image; In particular, it learns to read out house numbers from left to right (<a href="">Ba et al.</a>). On the right, a recurrent network <em>generates</em> images of digits by learning to sequentially add color to a canvas (<a href="">Gregor et al.</a>):</p> <div class="imgcap"> <div> <img src="/assets/rnn/house_read.gif" style="max-width:49%; height:400px;" /> <img src="/assets/rnn/house_generate.gif" style="max-width:49%; height:400px;" /> </div> <div class="thecap">Left: RNN learns to read house numbers. Right: RNN learns to paint house numbers.</div> </div> <p>The takeaway is that even if your data is not in form of sequences, you can still formulate and train powerful models that learn to process it sequentially. You’re learning stateful programs that process your fixed-sized data.</p> <p><strong>RNN computation.</strong> So how do these things work? At the core, RNNs have a deceptively simple API: They accept an input vector <code class="language-plaintext highlighter-rouge">x</code> and give you an output vector <code class="language-plaintext highlighter-rouge">y</code>. However, crucially this output vector’s contents are influenced not only by the input you just fed in, but also on the entire history of inputs you’ve fed in in the past. Written as a class, the RNN’s API consists of a single <code class="language-plaintext highlighter-rouge">step</code> function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rnn</span> <span class="o">=</span> <span class="n">RNN</span><span class="p">()</span> <span class="n">y</span> <span class="o">=</span> <span class="n">rnn</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># x is an input vector, y is the RNN's output vector </span></code></pre></div></div> <p>The RNN class has some internal state that it gets to update every time <code class="language-plaintext highlighter-rouge">step</code> is called. In the simplest case this state consists of a single <em>hidden</em> vector <code class="language-plaintext highlighter-rouge">h</code>. Here is an implementation of the step function in a Vanilla RNN:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RNN</span><span class="p">:</span> <span class="c1"># ... </span> <span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="c1"># update the hidden state </span> <span class="bp">self</span><span class="p">.</span><span class="n">h</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_hh</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">h</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_xh</span><span class="p">,</span> <span class="n">x</span><span class="p">))</span> <span class="c1"># compute the output vector </span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_hy</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">h</span><span class="p">)</span> <span class="k">return</span> <span class="n">y</span> </code></pre></div></div> <p>The above specifies the forward pass of a vanilla RNN. This RNN’s parameters are the three matrices <code class="language-plaintext highlighter-rouge">W_hh, W_xh, W_hy</code>. The hidden state <code class="language-plaintext highlighter-rouge">self.h</code> is initialized with the zero vector. The <code class="language-plaintext highlighter-rouge">np.tanh</code> function implements a non-linearity that squashes the activations to the range <code class="language-plaintext highlighter-rouge">[-1, 1]</code>. Notice briefly how this works: There are two terms inside of the tanh: one is based on the previous hidden state and one is based on the current input. In numpy <code class="language-plaintext highlighter-rouge"></code> is matrix multiplication. The two intermediates interact with addition, and then get squashed by the tanh into the new state vector. If you’re more comfortable with math notation, we can also write the hidden state update as \( h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t ) \), where tanh is applied elementwise.</p> <p>We initialize the matrices of the RNN with random numbers and the bulk of work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference to what kinds of outputs <code class="language-plaintext highlighter-rouge">y</code> you’d like to see in response to your input sequences <code class="language-plaintext highlighter-rouge">x</code>.</p> <p><strong>Going deep</strong>. RNNs are neural networks and everything works monotonically better (if done right) if you put on your deep learning hat and start stacking models up like pancakes. For instance, we can form a 2-layer recurrent network as follows:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y1</span> <span class="o">=</span> <span class="n">rnn1</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">rnn2</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">y1</span><span class="p">)</span> </code></pre></div></div> <p>In other words we have two separate RNNs: One RNN is receiving the input vectors and the second RNN is receiving the output of the first RNN as its input. Except neither of these RNNs know or care - it’s all just vectors coming in and going out, and some gradients flowing through each module during backpropagation.</p> <p><strong>Getting fancy</strong>. I’d like to briefly mention that in practice most of us use a slightly different formulation than what I presented above called a <em>Long Short-Term Memory</em> (LSTM) network. The LSTM is a particular type of recurrent network that works slightly better in practice, owing to its more powerful update equation and some appealing backpropagation dynamics. I won’t go into details, but everything I’ve said about RNNs stays exactly the same, except the mathematical form for computing the update (the line <code class="language-plaintext highlighter-rouge">self.h = ... </code>) gets a little more complicated. From here on I will use the terms “RNN/LSTM?interchangeably but all experiments in this post use an LSTM.</p> <h2 id="character-level-language-models">Character-Level Language Models</h2> <p>Okay, so we have an idea about what RNNs are, why they are super exciting, and how they work. We’ll now ground this in a fun application: We’ll train RNN character-level language models. That is, we’ll give the RNN a huge chunk of text and ask it to model the probability distribution of the next character in the sequence given a sequence of previous characters. This will then allow us to generate new text one character at a time.</p> <p>As a working example, suppose we only had a vocabulary of four possible letters “helo? and wanted to train an RNN on the training sequence “hello? This training sequence is in fact a source of 4 separate training examples: 1. The probability of “e?should be likely given the context of “h? 2. “l?should be likely in the context of “he? 3. “l?should also be likely given the context of “hel? and finally 4. “o?should be likely given the context of “hell?</p> <p>Concretely, we will encode each character into a vector using 1-of-k encoding (i.e. all zero except for a single one at the index of the character in the vocabulary), and feed them into the RNN one at a time with the <code class="language-plaintext highlighter-rouge">step</code> function. We will then observe a sequence of 4-dimensional output vectors (one dimension per character), which we interpret as the confidence the RNN currently assigns to each character coming next in the sequence. Here’s a diagram:</p> <div class="imgcap"> <img src="/assets/rnn/charseq.jpeg" width="70%" style="border:none;" /> <div class="thecap">An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). This diagram shows the activations in the forward pass when the RNN is fed the characters "hell" as input. The output layer contains confidences the RNN assigns for the next character (vocabulary is "h,e,l,o"); We want the green numbers to be high and red numbers to be low.</div> </div> <p>For example, we see that in the first time step when the RNN saw the character “h?it assigned confidence of 1.0 to the next letter being “h? 2.2 to letter “e? -3.0 to “l? and 4.1 to “o? Since in our training data (the string “hello? the next correct character is “e? we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, we have a desired target character at every one of the 4 time steps that we’d like the network to assign a greater confidence to. Since the RNN consists entirely of differentiable operations we can run the backpropagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers). We can then perform a <em>parameter update</em>, which nudges every weight a tiny amount in this gradient direction. If we were to feed the same inputs to the RNN after the parameter update we would find that the scores of the correct characters (e.g. “e?in the first time step) would be slightly higher (e.g. 2.3 instead of 2.2), and the scores of incorrect characters would be slightly lower. We then repeat this process over and over many times until the network converges and its predictions are eventually consistent with the training data in that correct characters are always predicted next.</p> <p>A more technical explanation is that we use the standard Softmax classifier (also commonly referred to as the cross-entropy loss) on every output vector simultaneously. The RNN is trained with mini-batch Stochastic Gradient Descent and I like to use <a href="">RMSProp</a> or Adam (per-parameter adaptive learning rate methods) to stablilize the updates.</p> <p>Notice also that the first time the character “l?is input, the target is “l? but the second time the target is “o? The RNN therefore cannot rely on the input alone and must use its recurrent connection to keep track of the context to achieve this task.</p> <p>At <strong>test time</strong>, we feed a character into the RNN and get a distribution over what characters are likely to come next. We sample from this distribution, and feed it right back in to get the next letter. Repeat this process and you’re sampling text! Lets now train an RNN on different datasets and see what happens.</p> <p>To further clarify, for educational purposes I also wrote a <a href="">minimal character-level RNN language model in Python/numpy</a>. It is only about 100 lines long and hopefully it gives a concise, concrete and useful summary of the above if you’re better at reading code than text. We’ll now dive into example results, produced with the much more efficient Lua/Torch codebase.</p> <h2 id="fun-with-rnns">Fun with RNNs</h2> <p>All 5 example character models below were trained with the <a href="">code</a> I’m releasing on Github. The input in each case is a single file with some text, and we’re training an RNN to predict the next character in the sequence.</p> <h3 id="paul-graham-generator">Paul Graham generator</h3> <p>Lets first try a small dataset of English as a sanity check. My favorite fun dataset is the concatenation of <a href="">Paul Graham’s essays</a>. The basic idea is that there’s a lot of wisdom in these essays, but unfortunately Paul Graham is a relatively slow generator. Wouldn’t it be great if we could sample startup wisdom on demand? That’s where an RNN comes in.</p> <p>Concatenating all pg essays over the last ~5 years we get approximately 1MB text file, or about 1 million characters (this is considered a very small dataset by the way). <em>Technical:</em> Lets train a 2-layer LSTM with 512 hidden nodes (approx. 3.5 million parameters), and with dropout of 0.5 after each layer. We’ll train with batches of 100 examples and truncated backpropagation through time of length 100 characters. With these settings one batch on a TITAN Z GPU takes about 0.46 seconds (this can be cut in half with 50 character BPTT at negligible cost in performance). Without further ado, lets see a sample from the RNN:</p> <p><em>“The surprised in investors weren’t going to raise money. I’m not the company with the time there are all interesting quickly, don’t have to get off the same programmers. There’s a super-angel round fundraising, why do you can do. If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users?affords that and an alternation to the idea. [2] Don’t work at first member to see the way kids will seem in advance of a bad successful startup. And if you have to act the big company too.?lt;/em></p> <p>Okay, clearly the above is unfortunately not going to replace Paul Graham anytime soon, but remember that the RNN had to learn English completely from scratch and with a small dataset (including where you put commas, apostrophes and spaces). I also like that it learns to support its own arguments (e.g. [2], above). Sometimes it says something that offers a glimmer of insight, such as <em>“a company is a meeting to think to investors?lt;/em>. <a href="">Here’s</a> a link to 50K character sample if you’d like to see more.</p> <p><strong>Temperature.</strong> We can also play with the temperature of the Softmax during sampling. Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at cost of more mistakes (e.g. spelling mistakes, etc). In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say:</p> <p><em>“is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same?lt;/em></p> <p>looks like we’ve reached an infinite loop about startups.</p> <h3 id="shakespeare">Shakespeare</h3> <p>It looks like we can learn to spell English words. But how about if there is more structure and style in the data? To examine this I downloaded all the works of Shakespeare and concatenated them into a single (4.4MB) file. We can now afford to train a larger network, in this case lets try a 3-layer RNN with 512 hidden nodes on each layer. After we train the network for a few hours we obtain samples such as:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain'd into being never fed, And who is but a chain and subjects of his death, I should not sleep. Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that. Second Lord: They would be ruled after this chamber, and my fair nues begun out of the fact, to be conveyed, Whose noble souls I'll have the heart of the wars. Clown: Come, sir, I will make did behold your worship. VIOLA: I'll drink it. </code></pre></div></div> <p>Remember, all the RNN knows are characters, so in particular it samples both speaker’s names and the contents. Sometimes we also get relatively extented monologue passages, such as:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour. </code></pre></div></div> <p>I can barely recognize these samples from actual Shakespeare :) If you like Shakespeare, you might appreciate this <a href="">100,000 character sample</a>. Of course, you can also generate an infinite amount of your own samples at different temperatures with the provided code.</p> <h3 id="wikipedia">Wikipedia</h3> <p>We saw that the LSTM can learn to spell words and copy general syntactic structures. Lets further increase the difficulty and train on structured markdown. In particular, lets take the <a href="">Hutter Prize</a> 100MB dataset of raw Wikipedia and train an LSTM. Following <a href="">Graves et al.</a>, I used the first 96MB for training, the rest for validation and ran a few models overnight. We can now sample Wikipedia articles! Below are a few fun excerpts. First, some basic markdown output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Naturalism and decision for the majority of Arab countries' capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict. Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[ cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery was swear to advance to the resources for those Socialism's rule, was starting to signing a major tripad of aid exile.]] </code></pre></div></div> <p>In case you were wondering, the yahoo url above doesn’t actually exist, the model just hallucinated it. Also, note that the model learns to open and close the parenthesis correctly. There’s also quite a lot of structured markdown that the model learns, for example sometimes it creates headings, lists, etc.:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ { cite journal | id=Cerling Nonforest Department|format=Newlymeslated|none } } ''www.e-complete''. '''See also''': [[List of ethical consent processing]] == See also == *[[Iender dome of the ED]] *[[Anti-autism]] ===[[Religion|Religion]]=== *[[French Writings]] *[[Maria]] *[[Revelation]] *[[Mount Agamul]] == External links== * [ Website of the World Festival. The labour of India-county defeats at the Ripper of California Road.] ==External links== * [ Constitution of the Netherlands and Hispanic Competition for Bilabial and Commonwealth Industry (Republican Constitution of the Extent of the Netherlands)] </code></pre></div></div> <p>Sometimes the model snaps into a mode of generating random but valid XML:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;page&gt; &lt;title&gt;Antichrist&lt;/title&gt; &lt;id&gt;865&lt;/id&gt; &lt;revision&gt; &lt;id&gt;15900676&lt;/id&gt; &lt;timestamp&gt;2002-08-03T18:14:12Z&lt;/timestamp&gt; &lt;contributor&gt; &lt;username&gt;Paris&lt;/username&gt; &lt;id&gt;23&lt;/id&gt; &lt;/contributor&gt; &lt;minor /&gt; &lt;comment&gt;Automated conversion&lt;/comment&gt; &lt;text xml:space="preserve"&gt;#REDIRECT [[Christianity]]&lt;/text&gt; &lt;/revision&gt; &lt;/page&gt; </code></pre></div></div> <p>The model completely makes up the timestamp, id, and so on. Also, note that it closes the correct tags appropriately and in the correct nested order. Here are <a href="">100,000 characters of sampled wikipedia</a> if you’re interested to see more.</p> <h3 id="algebraic-geometry-latex">Algebraic Geometry (Latex)</h3> <p>The results above suggest that the model is actually quite good at learning complex syntactic structures. Impressed by these results, my labmate (<a href="">Justin Johnson</a>) and I decided to push even further into structured territories and got a hold of <a href="">this book</a> on algebraic stacks/geometry. We downloaded the raw Latex source file (a 16MB file) and trained a multilayer LSTM. Amazingly, the resulting sampled Latex <em>almost</em> compiles. We had to step in and fix a few issues manually but then you get plausible looking math, it’s quite astonishing:</p> <div class="imgcap"> <img src="/assets/rnn/latex4.jpeg" style="border:none;" /> <div class="thecap">Sampled (fake) algebraic geometry. <a href="">Here's the actual pdf.</a></div> </div> <p>Here’s another sample:</p> <div class="imgcap"> <img src="/assets/rnn/latex3.jpeg" style="border:none;" /> <div class="thecap">More hallucinated algebraic geometry. Nice try on the diagram (right).</div> </div> <p>As you can see above, sometimes the model tries to generate latex diagrams, but clearly it hasn’t really figured them out. I also like the part where it chooses to skip a proof (<em>“Proof omitted.?lt;/em>, top left). Of course, keep in mind that latex has a relatively difficult structured syntactic format that I haven’t even fully mastered myself. For instance, here is a raw sample from the model (unedited):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>\begin{proof} We may assume that $\mathcal{I}$ is an abelian sheaf on $\mathcal{C}$. \item Given a morphism $\Delta : \mathcal{F} \to \mathcal{I}$ is an injective and let $\mathfrak q$ be an abelian sheaf on $X$. Let $\mathcal{F}$ be a fibered complex. Let $\mathcal{F}$ be a category. \begin{enumerate} \item \hyperref[setain-construction-phantom]{Lemma} \label{lemma-characterize-quasi-finite} Let $\mathcal{F}$ be an abelian quasi-coherent sheaf on $\mathcal{C}$. Let $\mathcal{F}$ be a coherent $\mathcal{O}_X$-module. Then $\mathcal{F}$ is an abelian catenary over $\mathcal{C}$. \item The following are equivalent \begin{enumerate} \item $\mathcal{F}$ is an $\mathcal{O}_X$-module. \end{lemma} </code></pre></div></div> <p>This sample from a relatively decent model illustrates a few common mistakes. For example, the model opens a <code class="language-plaintext highlighter-rouge">\begin{proof}</code> environment but then ends it with a <code class="language-plaintext highlighter-rouge">\end{lemma}</code>. This is an example of a problem we’d have to fix manually, and is likely due to the fact that the dependency is too long-term: By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma. Similarly, it opens an <code class="language-plaintext highlighter-rouge">\begin{enumerate}</code> but then forgets to close it. We observed that these became less common with larger/better models, but nonetheless, these are the kinds of mistakes that come up.</p> <h3 id="linux-source-code">Linux Source Code</h3> <p>I wanted to push structured data to its limit, so for the final challenge I decided to use code. In particular, I took all the source and header files found in the <a href="">Linux repo on Github</a>, concatenated all of them in a single giant file (474MB of C code) (I was originally going to train only on the kernel but that by itself is only ~16MB). Then I trained several as-large-as-fits-on-my-GPU 3-layer LSTMs over a period of a few days. These models have about 10 million parameters, which is still on the lower end for RNN models. The results are superfun:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * Increment the size file of the new incorrect UI_FILTER group information * of the size generatively. */</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">indicate_policy</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">error</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="n">MARN_EPT</span><span class="p">)</span> <span class="p">{</span> <span class="cm">/* * The kernel blank will coeld it to userspace. */</span> <span class="k">if</span> <span class="p">(</span><span class="n">ss</span><span class="o">-&gt;</span><span class="n">segment</span> <span class="o">&lt;</span> <span class="n">mem_total</span><span class="p">)</span> <span class="n">unblock_graph_and_set_blocked</span><span class="p">();</span> <span class="k">else</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="k">goto</span> <span class="n">bail</span><span class="p">;</span> <span class="p">}</span> <span class="n">segaddr</span> <span class="o">=</span> <span class="n">in_SB</span><span class="p">(</span><span class="n">in</span><span class="p">.</span><span class="n">addr</span><span class="p">);</span> <span class="n">selector</span> <span class="o">=</span> <span class="n">seg</span> <span class="o">/</span> <span class="mi">16</span><span class="p">;</span> <span class="n">setup_works</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">blocks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">seq</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">];</span> <span class="n">bpf</span> <span class="o">=</span> <span class="n">bd</span><span class="o">-&gt;</span><span class="n">bd</span><span class="p">.</span><span class="n">next</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">search</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">fd</span><span class="p">)</span> <span class="p">{</span> <span class="n">current</span> <span class="o">=</span> <span class="n">blocked</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="n">rw</span><span class="o">-&gt;</span><span class="n">name</span> <span class="o">=</span> <span class="s">"Getjbbregs"</span><span class="p">;</span> <span class="n">bprm_self_clearl</span><span class="p">(</span><span class="o">&amp;</span><span class="n">iv</span><span class="o">-&gt;</span><span class="n">version</span><span class="p">);</span> <span class="n">regs</span><span class="o">-&gt;</span><span class="n">new</span> <span class="o">=</span> <span class="n">blocks</span><span class="p">[(</span><span class="n">BPF_STATS</span> <span class="o">&lt;&lt;</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">historidac</span><span class="p">)]</span> <span class="o">|</span> <span class="n">PFMR_CLOBATHINC_SECONDS</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">;</span> <span class="k">return</span> <span class="n">segtable</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>The code looks really quite great overall. Of course, I don’t think it compiles but when you scroll through the generate code it feels very much like a giant C code base. Notice that the RNN peppers its code with comments here and there at random. It is also very good at making very few syntactic errors. For example, it uses strings properly, pointer notation, etc. It also opens and closes brackets <code class="language-plaintext highlighter-rouge">{[</code> correctly and learns to indent its code very well. A common error is that it can’t keep track of variable names: It often uses undefined variables (e.g. <code class="language-plaintext highlighter-rouge">rw</code> above), declares variables it never uses (e.g. <code class="language-plaintext highlighter-rouge">int error</code>), or returns non-existing variables. Lets see a few more examples. Here’s another snippet that shows a wider array of operations that the RNN learns:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * If this error is set, we will need anything right after that BSD. */</span> <span class="k">static</span> <span class="kt">void</span> <span class="nf">action_new_function</span><span class="p">(</span><span class="k">struct</span> <span class="n">s_stat_info</span> <span class="o">*</span><span class="n">wb</span><span class="p">)</span> <span class="p">{</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">;</span> <span class="kt">int</span> <span class="n">lel_idx_bit</span> <span class="o">=</span> <span class="n">e</span><span class="o">-&gt;</span><span class="n">edd</span><span class="p">,</span> <span class="o">*</span><span class="n">sys</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span> <span class="o">*</span><span class="n">FIRST_COMPAT</span><span class="p">);</span> <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xFFFFFFFF</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">bit</span> <span class="o">&lt;&lt;</span> <span class="mi">4</span><span class="p">);</span> <span class="n">min</span><span class="p">(</span><span class="n">inc</span><span class="p">,</span> <span class="n">slist</span><span class="o">-&gt;</span><span class="n">bytes</span><span class="p">);</span> <span class="n">printk</span><span class="p">(</span><span class="n">KERN_WARNING</span> <span class="s">"Memory allocated %02x/%02x, "</span> <span class="s">"original MLL instead</span><span class="se">\n</span><span class="s">"</span><span class="p">),</span> <span class="n">min</span><span class="p">(</span><span class="n">min</span><span class="p">(</span><span class="n">multi_run</span> <span class="o">-</span> <span class="n">s</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">,</span> <span class="n">max</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_data_in</span><span class="p">),</span> <span class="n">frame_pos</span><span class="p">,</span> <span class="n">sz</span> <span class="o">+</span> <span class="n">first_seg</span><span class="p">);</span> <span class="n">div_u64_w</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="n">inb_p</span><span class="p">);</span> <span class="n">spin_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">disk</span><span class="o">-&gt;</span><span class="n">queue_lock</span><span class="p">);</span> <span class="n">mutex_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="o">-&gt;</span><span class="n">sock</span><span class="o">-&gt;</span><span class="n">mutex</span><span class="p">);</span> <span class="n">mutex_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">func</span><span class="o">-&gt;</span><span class="n">mutex</span><span class="p">);</span> <span class="k">return</span> <span class="n">disassemble</span><span class="p">(</span><span class="n">info</span><span class="o">-&gt;</span><span class="n">pending_bh</span><span class="p">);</span> <span class="p">}</span> <span class="k">static</span> <span class="kt">void</span> <span class="nf">num_serial_settings</span><span class="p">(</span><span class="k">struct</span> <span class="n">tty_struct</span> <span class="o">*</span><span class="n">tty</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">tty</span> <span class="o">==</span> <span class="n">tty</span><span class="p">)</span> <span class="n">disable_single_st_p</span><span class="p">(</span><span class="n">dev</span><span class="p">);</span> <span class="n">pci_disable_spool</span><span class="p">(</span><span class="n">port</span><span class="p">);</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="p">}</span> <span class="k">static</span> <span class="kt">void</span> <span class="nf">do_command</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">column</span> <span class="o">=</span> <span class="mi">32</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">cmd</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x80</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="n">cmd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)(</span><span class="n">int_state</span> <span class="o">^</span> <span class="p">(</span><span class="n">in_8</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ch</span><span class="o">-&gt;</span><span class="n">ch_flags</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">Cmd</span><span class="p">)</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">);</span> <span class="k">else</span> <span class="n">seq</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">k</span> <span class="o">&amp;</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">))</span> <span class="n">pipe</span> <span class="o">=</span> <span class="p">(</span><span class="n">in_use</span> <span class="o">&amp;</span> <span class="n">UMXTHREAD_UNCCA</span><span class="p">)</span> <span class="o">+</span> <span class="p">((</span><span class="n">count</span> <span class="o">&amp;</span> <span class="mh">0x00000000fffffff8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x000000f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">count</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">sub</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="n">ppc_md</span><span class="p">.</span><span class="n">kexec_handle</span><span class="p">,</span> <span class="mh">0x20000000</span><span class="p">);</span> <span class="n">pipe_set_bytes</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="p">}</span> <span class="cm">/* Free our user pages pointer to place camera if all dash */</span> <span class="n">subsystem_info</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">of_changes</span><span class="p">[</span><span class="n">PAGE_SIZE</span><span class="p">];</span> <span class="n">rek_controls</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="n">idx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">soffset</span><span class="p">);</span> <span class="cm">/* Now we want to deliberately put it to device */</span> <span class="n">control_check_polarity</span><span class="p">(</span><span class="o">&amp;</span><span class="n">context</span><span class="p">,</span> <span class="n">val</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">COUNTER</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">seq_puts</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"policy "</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>Notice that in the second function the model compares <code class="language-plaintext highlighter-rouge">tty == tty</code>, which is vacuously true. On the other hand, at least the variable <code class="language-plaintext highlighter-rouge">tty</code> exists in the scope this time! In the last function, notice that the code does not return anything, which happens to be correct since the function signature is <code class="language-plaintext highlighter-rouge">void</code>. However, the first two functions were also declared <code class="language-plaintext highlighter-rouge">void</code> and did return values. This is again a form of a common mistake due to long-term interactions.</p> <p>Sometimes the model decides that it’s time to sample a new file. This is usually a very amusing part: The model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* * Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved. * * This program is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License version 2 as published by * the Free Software Foundation. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software Foundation, * Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */</span> <span class="cp">#include &lt;linux/kexec.h&gt; #include &lt;linux/errno.h&gt; #include &lt;linux/io.h&gt; #include &lt;linux/platform_device.h&gt; #include &lt;linux/multi.h&gt; #include &lt;linux/ckevent.h&gt; </span> <span class="cp">#include &lt;asm/io.h&gt; #include &lt;asm/prom.h&gt; #include &lt;asm/e820.h&gt; #include &lt;asm/system_info.h&gt; #include &lt;asm/setew.h&gt; #include &lt;asm/pgproto.h&gt; </span> <span class="cp">#define REG_PG vesa_slot_addr_pack #define PFM_NOCOMP AFSR(0, load) #define STACK_DDR(type) (func) </span> <span class="cp">#define SWAP_ALLOCATE(nr) (e) #define emulate_sigs() arch_get_unaligned_child() #define access_rw(TST) asm volatile("movd %%esp, %0, %3" : : "r" (0)); \ if (__type &amp; DO_READ) </span> <span class="k">static</span> <span class="kt">void</span> <span class="n">stat_PC_SEC</span> <span class="n">__read_mostly</span> <span class="nf">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_argsqueue</span><span class="p">,</span> \ <span class="n">pC</span><span class="o">&gt;</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span> <span class="k">static</span> <span class="kt">void</span> <span class="nf">os_prefix</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">sys</span><span class="p">)</span> <span class="p">{</span> <span class="cp">#ifdef CONFIG_PREEMPT </span> <span class="n">PUT_PARAM_RAID</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">sel</span><span class="p">)</span> <span class="o">=</span> <span class="n">get_state_state</span><span class="p">();</span> <span class="n">set_pid_sum</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">state</span><span class="p">,</span> <span class="n">current_state_str</span><span class="p">(),</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="o">-&gt;</span><span class="n">lr_full</span><span class="p">;</span> <span class="n">low</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>There are too many fun parts to cover- I could probably write an entire blog post on just this part. I’ll cut it short for now, but here is <a href="">1MB of sampled Linux code</a> for your viewing pleasure.</p> <h3 id="generating-baby-names">Generating Baby Names</h3> <p>Lets try one more for fun. Lets feed the RNN a large text file that contains 8000 baby names listed out, one per line (names obtained from <a href="">here</a>). We can feed this to the RNN and then generate new names! Here are some example names, only showing the ones that do not occur in the training data (90% don’t):</p> <p><em>Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey Castina</em></p> <p>You can see many more <a href="">here</a>. Some of my favorites include “Baby?(haha), “Killie? “Char? “R? “More? “Mars? “Hi? “Saddie? “With?and “Ahbort? Well that was fun.?Of course, you can imagine this being quite useful inspiration when writing a novel, or naming a new startup :)</p> <h2 id="understanding-whats-going-on">Understanding what’s going on</h2> <p>We saw that the results at the end of training can be impressive, but how does any of this work? Lets run two quick experiments to briefly peek under the hood.</p> <h3 id="the-evolution-of-samples-while-training">The evolution of samples while training</h3> <p>First, it’s fun to look at how the sampled text evolves while the model trains. For example, I trained an LSTM of Leo Tolstoy’s War and Peace and then generated samples every 100 iterations of training. At iteration 100 the model samples random jumbles:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng </code></pre></div></div> <p>However, notice that at least it is starting to get an idea about words separated by spaces. Except sometimes it inserts two spaces. It also doesn’t know that comma is amost always followed by a space. At 300 iterations we see that the model starts to get an idea about quotes and periods:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Tmont thithey" fomesscerliund Keushey. Thom here sheulke, anmerenith ol sivh I lalterthend Bleipile shuwy fil on aseterlome coaniogennc Phe lism thond hon at. MeiDimorotion in ther thize." </code></pre></div></div> <p>The words are now also separated with spaces and the model starts to get the idea about periods at the end of a sentence. At iteration 500:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>we counter. He stutn co des. His stanted out one ofler that concossions and was to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn </code></pre></div></div> <p>the model has now learned to spell the shortest and most common words such as “we? “He? “His? “Which? “and? etc. At iteration 700 we’re starting to see more and more English-like text emerge:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Aftair fall unsuch that the hall for Prince Velzonski's that me of her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort how, and Gogition is so overelical and ofter. </code></pre></div></div> <p>At iteration 1200 we’re now seeing use of quotations and question/exclamation marks. Longer words have now been learned as well:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Kite vouch!" he repeated by her door. "But I would be done and quarts, feeling, then, son is people...." </code></pre></div></div> <p>Until at last we start to get properly spelled words, quotations, names, and so on by about iteration 2000:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Why do what that day," replied Natasha, and wishing to himself the fact the princess, Princess Mary was easier, fed in had oftened him. Pierre aking his soul came to the packs and drove up his father-in-law women. </code></pre></div></div> <p>The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words; First starting with the short words and then eventually the longer ones. Topics and themes that span multiple words (and in general longer-term dependencies) start to emerge only much later.</p> <h3 id="visualizing-the-predictions-and-the-neuron-firings-in-the-rnn">Visualizing the predictions and the “neuron?firings in the RNN</h3> <p>Another fun visualization is to look at the predicted distributions over characters. In the visualizations below we feed a Wikipedia RNN model character data from the validation set (shown along the blue/green rows) and under every character we visualize (in red) the top 5 guesses that the model assigns for the next character. The guesses are colored by their probability (so dark red = judged as very likely, white = not very likely). For example, notice that there are stretches of characters where the model is extremely confident about the next letter (e.g., the model is very confident about characters during the <em>http://www.</em> sequence).</p> <p>The input character sequence (blue/green) is colored based on the <em>firing</em> of a randomly chosen neuron in the hidden representation of the RNN. Think about it as green = very excited and blue = not very excited (for those familiar with details of LSTMs, these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state). Intuitively, this is visualizing the firing rate of some neuron in the “brain?of the RNN while it reads the input sequence. Different neurons might be looking for different patterns; Below we’ll look at 4 different ones that I found and thought were interesting or interpretable (many also aren’t):</p> <div class="imgcap"> <img src="/assets/rnn/under1.jpeg" style="border:none;" /> <div class="thecap"> The neuron highlighted in this image seems to get very excited about URLs and turns off outside of the URLs. The LSTM is likely using this neuron to remember if it is inside a URL or not. </div> </div> <div class="imgcap"> <img src="/assets/rnn/under2.jpeg" style="border:none;" /> <div class="thecap"> The highlighted neuron here gets very excited when the RNN is inside the [[ ]] markdown environment and turns off outside of it. Interestingly, the neuron can't turn on right after it sees the character "[", it must wait for the second "[" and then activate. This task of counting whether the model has seen one or two "[" is likely done with a different neuron. </div> </div> <div class="imgcap"> <img src="/assets/rnn/under3.jpeg" style="border:none;" /> <div class="thecap"> Here we see a neuron that varies seemingly linearly across the [[ ]] environment. In other words its activation is giving the RNN a time-aligned coordinate system across the [[ ]] scope. The RNN can use this information to make different characters more or less likely depending on how early/late it is in the [[ ]] scope (perhaps?). </div> </div> <div class="imgcap"> <img src="/assets/rnn/under4.jpeg" style="border:none;" /> <div class="thecap"> Here is another neuron that has very local behavior: it is relatively silent but sharply turns off right after the first "w" in the "www" sequence. The RNN might be using this neuron to count up how far in the "www" sequence it is, so that it can know whether it should emit another "w", or if it should start the URL. </div> </div> <p>Of course, a lot of these conclusions are slightly hand-wavy as the hidden state of the RNN is a huge, high-dimensional and largely distributed representation. These visualizations were produced with custom HTML/CSS/Javascript, you can see a sketch of what’s involved <a href="">here</a> if you’d like to create something similar.</p> <p>We can also condense this visualization by excluding the most likely predictions and only visualize the text, colored by activations of a cell. We can see that in addition to a large portion of cells that do not do anything interpretible, about 5% of them turn out to have learned quite interesting and interpretible algorithms:</p> <div class="imgcap"> <img src="/assets/rnn/pane1.png" style="border:none;max-width:100%" /> <img src="/assets/rnn/pane2.png" style="border:none;max-width:100%" /> <div class="thecap"> </div> </div> <p>Again, what is beautiful about this is that we didn’t have to hardcode at any point that if you’re trying to predict the next character it might, for example, be useful to keep track of whether or not you are currently inside or outside of quote. We just trained the LSTM on raw data and it decided that this is a useful quantitity to keep track of. In other words one of its cells gradually tuned itself during training to become a quote detection cell, since this helps it better perform the final task. This is one of the cleanest and most compelling examples of where the power in Deep Learning models (and more generally end-to-end training) is coming from.</p> <h2 id="source-code">Source Code</h2> <p>I hope I’ve convinced you that training character-level language models is a very fun exercise. You can train your own models using the <a href="">char-rnn code</a> I released on Github (under MIT license). It takes one large text file and trains a character-level model that you can then sample from. Also, it helps if you have a GPU or otherwise training on CPU will be about a factor of 10x slower. In any case, if you end up training on some data and getting fun results let me know! And if you get lost in the Torch/Lua codebase remember that all it is is just a more fancy version of this <a href="">100-line gist</a>.</p> <p><em>Brief digression.</em> The code is written in <a href="">Torch 7</a>, which has recently become my favorite deep learning framework. I’ve only started working with Torch/LUA over the last few months and it hasn’t been easy (I spent a good amount of time digging through the raw Torch code on Github and asking questions on their <em>gitter</em> to get things done), but once you get a hang of things it offers a lot of flexibility and speed. I’ve also worked with Caffe and Theano in the past and I believe Torch, while not perfect, gets its levels of abstraction and philosophy right better than others. In my view the desirable features of an effective framework are:</p> <ol> <li>CPU/GPU transparent Tensor library with a lot of functionality (slicing, array/matrix operations, etc. )</li> <li>An entirely separate code base in a scripting language (ideally Python) that operates over Tensors and implements all Deep Learning stuff (forward/backward, computation graphs, etc)</li> <li>It should be possible to easily share pretrained models (Caffe does this well, others don’t), and crucially</li> <li>NO compilation step (or at least not as currently done in Theano). The trend in Deep Learning is towards larger, more complex networks that are are time-unrolled in complex graphs. It is critical that these do not compile for a long time or development time greatly suffers. Second, by compiling one gives up interpretability and the ability to log/debug effectively. If there is an <em>option</em> to compile the graph once it has been developed for efficiency in prod that’s fine.</li> </ol> <h2 id="further-reading">Further Reading</h2> <p>Before the end of the post I also wanted to position RNNs in a wider context and provide a sketch of the current research directions. RNNs have recently generated a significant amount of buzz and excitement in the field of Deep Learning. Similar to Convolutional Networks they have been around for decades but their full potential has only recently started to get widely recognized, in large part due to our growing computational resources. Here’s a brief sketch of a few recent developments (definitely not complete list, and a lot of this work draws from research back to 1990s, see related work sections):</p> <p>In the domain of <strong>NLP/Speech</strong>, RNNs <a href="">transcribe speech to text</a>, perform <a href="">machine translation</a>, <a href="">generate handwritten text</a>, and of course, they have been used as powerful language models <a href="">(Sutskever et al.)</a> <a href="">(Graves)</a> <a href="">(Mikolov et al.)</a> (both on the level of characters and words). Currently it seems that word-level models work better than character-level models, but this is surely a temporary thing.</p> <p><strong>Computer Vision.</strong> RNNs are also quickly becoming pervasive in Computer Vision. For example, we’re seeing RNNs in frame-level <a href="">video classification</a>, <a href="">image captioning</a> (also including my own work and many others), <a href="">video captioning</a> and very recently <a href="">visual question answering</a>. My personal favorite RNNs in Computer Vision paper is <a href="">Recurrent Models of Visual Attention</a>, both due to its high-level direction (sequential processing of images with glances) and the low-level modeling (REINFORCE learning rule that is a special case of policy gradient methods in Reinforcement Learning, which allows one to train models that perform non-differentiable computation (taking glances around the image in this case)). I’m confident that this type of hybrid model that consists of a blend of CNN for raw perception coupled with an RNN glance policy on top will become pervasive in perception, especially for more complex tasks that go beyond classifying some objects in plain view.</p> <p><strong>Inductive Reasoning, Memories and Attention.</strong> Another extremely exciting direction of research is oriented towards addressing the limitations of vanilla recurrent networks. One problem is that RNNs are not inductive: They memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the <em>correct</em> way (I’ll provide pointers in a bit that make this more concrete). A second issue is they unnecessarily couple their representation size to the amount of computation per step. For instance, if you double the size of the hidden state vector you’d quadruple the amount of FLOPS at each step due to the matrix multiplication. Ideally, we’d like to maintain a huge representation/memory (e.g. containing all of Wikipedia or many intermediate state variables), while maintaining the ability to keep computation per time step fixed.</p> <p>The first convincing example of moving towards these directions was developed in DeepMind’s <a href="">Neural Turing Machines</a> paper. This paper sketched a path towards models that can perform read/write operations between large, external memory arrays and a smaller set of memory registers (think of these as our working memory) where the computation happens. Crucially, the NTM paper also featured very interesting memory addressing mechanisms that were implemented with a (soft, and fully-differentiable) attention model. The concept of <strong>soft attention</strong> has turned out to be a powerful modeling feature and was also featured in <a href="">Neural Machine Translation by Jointly Learning to Align and Translate</a> for Machine Translation and <a href="">Memory Networks</a> for (toy) Question Answering. In fact, I’d go as far as to say that</p> <blockquote> <p>The concept of <strong>attention</strong> is the most interesting recent architectural innovation in neural networks.</p> </blockquote> <p>Now, I don’t want to dive into too many details but a soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly). Think of this as declaring a pointer in C that doesn’t point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!). This has motivated multiple authors to swap soft attention models for <strong>hard attention</strong> where one samples a particular chunk of memory to attend to (e.g. a read/write action for some memory cell instead of reading/writing from all cells to some degree). This model is significantly more philosophically appealing, scalable and efficient, but unfortunately it is also non-differentiable. This then calls for use of techniques from the Reinforcement Learning literature (e.g. REINFORCE) where people are perfectly used to the concept of non-differentiable interactions. This is very much ongoing work but these hard attention models have been explored, for example, in <a href="">Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets</a>, <a href="">Reinforcement Learning Neural Turing Machines</a>, and <a href="">Show Attend and Tell</a>.</p> <p><strong>People</strong>. If you’d like to read up on RNNs I recommend theses from <a href="">Alex Graves</a>, <a href="">Ilya Sutskever</a> and <a href="">Tomas Mikolov</a>. For more about REINFORCE and more generally Reinforcement Learning and policy gradient methods (which REINFORCE is a special case of) <a href="">David Silver</a>’s class, or one of <a href="">Pieter Abbeel</a>’s classes.</p> <p><strong>Code</strong>. If you’d like to play with training RNNs I hear good things about <a href="">keras</a> or <a href="">passage</a> for Theano, the <a href="">code</a> released with this post for Torch, or <a href="">this gist</a> for raw numpy code I wrote a while ago that implements an efficient, batched LSTM forward and backward pass. You can also have a look at my numpy-based <a href="">NeuralTalk</a> which uses an RNN/LSTM to caption images, or maybe this <a href="">Caffe</a> implementation by Jeff Donahue.</p> <h2 id="conclusion">Conclusion</h2> <p>We’ve learned about RNNs, how they work, why they have become a big deal, we’ve trained an RNN character-level language model on several fun datasets, and we’ve seen where RNNs are going. You can confidently expect a large amount of innovation in the space of RNNs, and I believe they will become a pervasive and critical component to intelligent systems.</p> <p>Lastly, to add some <strong>meta</strong> to this post, I trained an RNN on the source file of this blog post. Unfortunately, at about 46K characters I haven’t written enough data to properly feed the RNN, but the returned sample (generated with low temperature to get a more typical sample) is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I've the RNN with and works, but the computed with program of the RNN with and the computed of the RNN with with and the code </code></pre></div></div> <p>Yes, the post was about RNN and how well it works, so clearly this works :). See you next time!</p> <p><strong>EDIT (extra links):</strong></p> <p>Videos:</p> <ul> <li>I gave a talk on this work at the <a href="">London Deep Learning meetup (video)</a>.</li> </ul> <p>Discussions:</p> <ul> <li><a href="">HN discussion</a></li> <li>Reddit discussion on <a href="">r/machinelearning</a></li> <li>Reddit discussion on <a href="">r/programming</a></li> </ul> <p>Replies:</p> <ul> <li><a href="">Yoav Goldberg</a> compared these RNN results to <a href="">n-gram maximum likelihood (counting) baseline</a></li> <li><a href="">@nylk</a> trained char-rnn on <a href="">cooking recipes</a>. They look great!</li> <li><a href="">@MrChrisJohnson</a> trained char-rnn on Eminem lyrics and then synthesized a rap song with robotic voice reading it out. Hilarious :)</li> <li><a href="">@samim</a> trained char-rnn on <a href="">Obama Speeches</a>. They look fun!</li> <li><a href="">João Felipe</a> trained char-rnn irish folk music and <a href="">sampled music</a></li> <li><a href="">Bob Sturm</a> also trained char-rnn on <a href="">music in ABC notation</a></li> <li><a href="">RNN Bible bot</a> by <a href="">Maximilien</a></li> <li><a href="">Learning Holiness</a> learning the Bible</li> <li><a href=""> snapshot</a> that has char-rnn set up and ready to go in a browser-based virtual machine (thanks <a href="">@samim</a>)</li> </ul> Thu, 21 May 2015 11:00:00 +0000 Breaking Linear Classifiers on ImageNet <p>You’ve probably heard that Convolutional Networks work very well in practice and across a wide range of visual recognition problems. You may have also read articles and papers that claim to reach a near <em>“human-level performance?lt;/em>. There are all kinds of caveats to that (e.g. see my G+ post on <a href="">Human Accuracy is not a point, it lives on a tradeoff curve</a>), but that is not the point of this post. I do think that these systems now work extremely well across many visual recognition tasks, especially ones that can be posed as simple classification.</p> <p>Yet, a second group of seemingly baffling results has emerged that brings up an apparent contradiction. I’m referring to several people who have noticed that it is possible to take an image that a state-of-the-art Convolutional Network thinks is one class (e.g. “panda?, and it is possible to change it almost imperceptibly to the human eye in such a way that the Convolutional Network suddenly classifies the image as any other class of choice (e.g. “gibbon?. We say that we <em>break</em>, or <em>fool</em> ConvNets. See the image below for an illustration:</p> <div class="imgcap"> <img src="/assets/break/breakconv.png" /> <div class="thecap">Figure from <a href="">Explaining and Harnessing Adversarial Examples</a> by Goodfellow et al.</div> </div> <p>This topic has recently gained attention starting with <a href="">Intriguing properties of neural networks</a> by Szegedy et al. last year. They had a very similar set of images:</p> <div class="imgcap"> <img src="/assets/break/szegedy.jpeg" /> <div class="thecap"> Take a correctly classified image (left image in both columns), and add a tiny distortion (middle) to fool the ConvNet with the resulting image (right). </div> </div> <p>And a set of very closely related results was later followed by <a href="">Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images</a> by Nguyen et al. Instead of starting with correctly-classified images and fooling the ConvNet, they had many more examples of performing the same process starting from noise (and hence making the ConvNet confidently classify an incomprehensible noise pattern as some class), or evolving new funny-looking images that the ConvNet is slightly too certain about:</p> <div class="imgcap"> <img src="/assets/break/break1.jpeg" /> <img src="/assets/break/break2.jpeg" /> <div class="thecap"> These images are classified with &gt;99.6% confidence as the shown class by a Convolutional Network. </div> </div> <p>I should make the point quickly that these results are not completely new to Computer Vision, and that some have observed the same problems even with our older features, e.g. HOG features. See <a href=";arnumber=6130416&amp;contentType=Conference+Publications&amp;queryText%3Dexploring+representation+capabilities+of+HOG">Exploring the Representation Capabilities of the HOG Descriptor</a> for details.</p> <p>The conclusion seems to be that we can take any arbitrary image and classify it as whatever class we want by adding tiny, imperceptible noise patterns. Worse, it was found that a reasonable fraction of fooling images <strong>generalize</strong> across different Convolutional Networks, so this isn’t some kind of fragile property of the new image or some overfitting property of the model. There’s something more general about the type of introduced noise that seems to fool many other models. In some sense, it is much more accurate to speak about <em>fooling subspaces</em> rather than <em>fooling images</em>. The latter erroneously makes them seem like tiny points in the super-high-dimensional image space, perhaps similar to rational numbers along the real numbers, when instead they are better thought of as entire intervals. Of course, this work raises security concerns because an adversary could conceivably generate a fooling image of any class on their own computer and upload it to some service with a malicious intent, with a non-zero probability of it fooling the server-side model (e.g. circumventing racy filters).</p> <blockquote> <p>What is going on?</p> </blockquote> <p>These results are interesting and worrying, but they have also led to a good amount of confusion among laymen. The most important point of this entire post is the following:</p> <p><strong>These results are not specific to images, ConvNets, and they are also not a “flaw?in Deep Learning</strong>. A lot of these results were reported with ConvNets running on images because pictures are fun to look at and ConvNets are state-of-the-art, but in fact the core flaw extends to many other domains (e.g. speech recognition systems), and most importantly, also to simple, shallow, good old-fashioned Linear Classifiers (Softmax classifier, or Linear Support Vector Machines, etc.). This was pointed out and articulated in <a href="">Explaining and Harnessing Adversarial Examples</a> by Goodfellow et al. We’ll carry out a few experiments very similar to the ones presented in this paper, and see that it is in fact this <em>linear</em> nature that is problematic. And because Deep Learning models use linear functions to build up the architecture, they inherit their flaw. However, Deep Learning by itself is not the cause of the issue. In fact, Deep Learning offers tangible hope for a solution, since we can use all the wiggle of composed functions to design more resistant architectures or objectives.</p> <h3 id="how-fooling-methods-work">How fooling methods work</h3> <p>ConvNets express a differentiable function from the pixel values to class scores. For example, a ConvNet might take a 227x227 image and transforms these ~100,000 numbers through a wiggly function (parameterized by several million parameters) to 1000 numbers that we interpret as the confidences for 1000 classes (e.g. the classes of ImageNet).</p> <div class="imgcap"> <img src="/assets/break/banana.jpeg" /> <div class="thecap"> This ConvNet takes the image of a banana and applies a function to it to transform it to class scores (here 4 classes are shown). The function consists of several rounds of convolutions where the filter entries are parameters, and a few matrix multiplications, where the elements of the matrices are parameters. A typical ConvNet might have ~100 million parameters. </div> </div> <p>We train a ConvNet with repeated process of sampling data, calculating the parameter gradients and performing a parameter update. That is, suppose we feed the ConvNet an image of a banana and compute the 1000 scores for the classes that the ConvNet assigns to this image. We then and ask the following question for every single parameter in the model:</p> <blockquote> <p>Normal ConvNet training: “What happens to the score of the correct class when I wiggle this parameter??lt;/p> </blockquote> <p>This <em>wiggle influence</em>, of course, is just the gradient. For example, some parameter in some filter in some layer of the ConvNet might get the gradient of -3.0 computed during backpropagation. That means that increasing this parameter by a tiny amount, e.g. 0.0001, would have a <em>negative</em> influence on the banana score (due to the negative sign); In this case, we’d expect the banana score to <em>decrease</em> by approximately 0.0003. Normally we take this gradient and use it to perform a <strong>parameter update</strong>, which wiggles every parameter in the model a tiny amount in the <em>correct</em> direction, to increase the banana score. These parameter updates hence work in concert to slightly increase the score of the banana class for that one banana image (e.g. the banana score could go up from 30% to 34% or something). We then repeat this over and over on all images in the training data.</p> <p>Notice how this worked: we held the input image fixed, and we wiggled the model parameters to increase the score of whatever class we wanted (e.g. banana class). It turns out that we can easily flip this process around to create fooling images. (In practice in fact, absolutely no changes to a ConvNet code base are required.) That is, we will hold the model parameters fixed, and instead we’re computing the gradient of all pixels in the input image on any class we might desire. For example, we can ask:</p> <blockquote> <p>Creating fooling images: “What happens to the score of (whatever class you want) when I wiggle this pixel??lt;/p> </blockquote> <p>We compute the gradient just as before with backpropagation, and then we can perform an <strong>image update</strong> instead of a parameter update, with the end result being that we increase the score of whatever class we want. E.g. we can take the banana image and wiggle every pixel according to the gradient of that image on the cat class. This would change the image a tiny amount, but the score of <em>cat</em> would now increase. Somewhat unintuitively, it turns out that you don’t have to change the image too much to toggle the image from being classified correctly as a banana, to being classified as anything else (e.g. cat).</p> <p>In short, to create a fooling image we start from whatever image we want (an actual image, or even a noise pattern), and then use backpropagation to compute the gradient of the image pixels on any class score, and nudge it along. We may, but do not have to, repeat the process a few times. You can interpret backpropagation in this setting as using dynamic programming to compute the most damaging local perturbation to the input. Note that this process is very efficient and takes negligible time if you have access to the parameters of the ConvNet (backprop is fast), but it is possible to do this even if you do not have access to the parameters but only to the class scores at the end. In this case, it is possible to compute the data gradient numerically, or to to use other local stochastic search strategies, etc. Note that due to the latter approach, even non-differentiable classifiers (e.g. Random Forests) are not safe (but I haven’t seen anyone empirically confirm this yet).</p> <h3 id="fooling-a-linear-classifier-on-imagenet">Fooling a Linear Classifier on ImageNet</h3> <p>As I mentioned before (and as described in more detail in <a href="">Goodfellow et al.</a>), it is the use of linear functions that makes our models susceptible for an attack. ConvNets, of course, do not express a linear function from images to class scores; They are a complex Deep Learning model that expresses a highly non-linear function. However, the components that make up a ConvNet <em>are</em> linear: Convolution of a filter with its input is a linear operation (we are sliding a filter through the input and computing dot products - a linear operation), and matrix multiplications are also a linear function.</p> <p>So here’s a fun experiment we’ll do. Lets forget about ConvNets - they are a distracting overkill as far as the core flaw goes. Instead, lets fool a linear classifier and lets also keep with the theme of breaking models on images because they are fun to look at.</p> <p>Here is the setup:</p> <ul> <li>Take 1.2 million images in ImageNet</li> <li>Resize them to 64x64 (full-sized images would train longer)</li> <li>use <a href="">Caffe</a> to train a Linear Classifier (e.g. Softmax). In other words we’re going straight from data to the classifier with a single fully-connected layer.</li> </ul> <p><em>Digression: Technical fun parts.</em> The fun part in actually doing this is that the standard AlexNetty ConvNet hyperparameters are of course completely inadequate. For example, normally you’d use weight decay of 0.0005 or so and learning rate of 0.01, and gaussian initialization drawn from a gaussian of 0.01 std. If you’ve trained linear classifiers before on this type of high-dimensional input (64x64x3 ~= 12K numbers), you’ll know that your learning rate will probably have to be much lower, the regularization much larger, and initialization of 0.01 std will probably be inadequate. Indeed, starting Caffe training with default hyperparameters gives a starting loss of about 80, which right away tells you that the initialization is completely out of whack (initial ImageNet loss should be ballpark 7.0, which is -log(1/1000)). I scaled it down to 0.0001 std for Gaussian init which gives sensible starting loss. But then the loss right away explodes which tells you that the learning rate is way too high - I had to scale it all the way down to about 1e-7. Lastly, a weight decay of 0.0005 will give almost negligible regularization loss with 12K inputs - I had to scale it up to 100 to start getting reasonably-looking weights that aren’t super-overfitted noise blobs. It’s fun being a Neural Networks practitioner.</p> <p>A linear classifier over image pixels implies that every class score is computed as a dot product between all the image pixels (stretched as a large column) and a learnable weight vector, one for each class. With input images of size 64x64x3 and 1000 ImageNet classes we therefore have 64x64x3x1000 = 12.3 million weights (beefy linear model!), and 1000 biases. Training these parameters on ImageNet with a K40 GPU takes only a few tens of minutes. We can then visualize each of the learned weights by reshaping them as images:</p> <div class="imgcap"> <img src="/assets/break/templates.jpeg" /> <div class="thecap"> Example linear classifiers for a few ImageNet classes. Each class' score is computed by taking a dot product between the visualized weights and the image. Hence, the weights can be thought of as a template: the images show what the classifier is looking for. For example, Granny Smith apples are green, so the linear classifier has positive weights in the green color channel and negative weights in blue and red channels, across all spatial positions. It is hence effectively counting the amount of green stuff in the middle. You can also see the learned <a href="">templates for all imagenet classes for fun.</a> </div> </div> <p>By the way, I haven’t seen anyone report linear classification accuracy on ImageNet before, but it turns out to be about 3.0% top-1 accuracy (and about 10% top-5) on ImageNet. I haven’t done a completely exhaustive hyperparameter sweep but I did a few rounds of manual binary search.</p> <p>Now that we’ve trained the model parameters we can start to produce fooling images. This turns out to be quite trivial in the case of linear classifiers and no backpropagation is required. This is because when your score function is a dot product \(s = w^Tx\), then the gradient on the image \(x\) is simply \(\nabla_x s = w\). That is, we take an image we would like to start out with, and then if we wanted to fool the model into thinking that it is some other class (e.g. goldfish), we have to take the weights corresponding to the desired class, and add some fraction of those weights to the image:</p> <div class="imgcap"> <img src="/assets/break/fool2.jpeg" /> <img src="/assets/break/fool1.jpeg" /> <img src="/assets/break/fish.jpeg" /> <div class="thecap"> Fooled linear classifier: The starting image (left) is classified as a kit fox. That's incorrect, but then what can you expect from a linear classifier? However, if we add a small amount "goldfish" weights to the image (top row, middle), suddenly the classifier is convinced that it's looking at one with high confidence. We can distort it with the school bus template instead if we wanted to. Similar figures (but on the MNIST digits dataset) can be seen in Figure 2 of <a href="">Goodfellow et al.</a> </div> </div> <p>We can also start from random noise and achieve the same effect:</p> <div class="imgcap"> <img src="/assets/break/noise1.jpeg" style="width:30%;display:inline-block;" /> <img src="/assets/break/noise2.jpeg" style="width:30%;display:inline-block;" /> <div class="thecap"> Same process but starting with a random image. </div> </div> <p>Of course, these examples are not as impactful as the ones that use a ConvNet because the ConvNet gives state of the art performance while a linear classifier barely gets to 3% accuracy, but it illustrates the point that even with a simple, shallow function it is still possible to play around with the input in imperceptible ways and get almost arbitrary results.</p> <p><strong>Regularization</strong>. There is one subtle comment to make regarding regularization strength. In my experiments above, increasing the regularization strength gave nicer, smoother and more diffuse weights but generalized to validation data <em>worse</em> than some of my best classifiers that displayed more noisy patterns. For example, the nice and smooth templates I’ve shown only achieve 1.6% accuracy. My best model that achieves 3.0% accuracy has noisier weights (as seen in the middle column of the fooling images). Another model with very low regularization reaches 2.8% and its fooling images are virtually indistinguishable yet produce 100% confidences in the wrong class. In particular:</p> <ul> <li>High regularization gives smoother templates, but at some point starts to works worse. However, it is more resistant to fooling. (The fooling images look noticeably different from their original)</li> <li>Low regularization gives more noisy templates but seems to work better that all-smooth templates. It is less resistant to fooling.</li> </ul> <p>Intuitively, it seems that higher regularization leads to smaller weights, which means that one must change the image more dramatically to change the score by some amount. It’s not immediately obvious if and how this conclusion translates to deeper models.</p> <div class="imgcap"> <img src="/assets/break/rapeseed.jpeg" /> <img src="/assets/break/rapeseed2.jpeg" /> <div class="thecap"> Linear classifier with lower regularization (which leads to more noisy class weights) is easier to fool (top). Higher regularization produces more diffuse filters and is harder to fool (bottom). That is, it's harder to achieve very confident wrong answers (however, with weights so small it is hard to achieve very confident correct answers too). To flip the label to a wrong class, more visually obvious perturbations are also needed. Somewhat paradoxically, the model with the noisy weights (top) works quite a bit better on validation data (2.6% vs. 1.4% accuracy). </div> </div> <h3 id="toy-example">Toy Example</h3> <p>We can understand this process in even more detail by condensing the problem to the smallest toy example that displays the problem. Suppose we train a binary logistic regression, where we define the probability of class 1 as \(P(y = 1 \mid x; w,b) = \sigma(w^Tx + b)\), where \(\sigma(z) = 1/(1+e^{-z})\) is the sigmoid function that squashes the class 1 score \(s = w^Tx+b\) into range between 0 and 1, where 0 is mapped to 0.5. This classifier hence decides that the class of the input is 1 if \(s &gt; 0\), or equivalently if the class 1 probability is more than 50% (i.e. \(\sigma(s) &gt; 0.5\)). Suppose further that we had the following setup:</p> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">x</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="c1">// input</span> <span class="nx">w</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="c1">// weight vector</span> </code></pre></div></div> <p>If you do the dot product, you get <code class="language-plaintext highlighter-rouge">-3</code>. Hence, probability of class 1 is <code class="language-plaintext highlighter-rouge">1/(1+e^(-(-3))) = 0.0474</code>. In other words the classifier is 95% certain that this is example is class 0. We’re now going to try to fool the classifier. That is, we want to find a tiny change to <code class="language-plaintext highlighter-rouge">x</code> in such a way that the score comes out much higher. Since the score is computed with a dot product (multiply corresponding elements in <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">w</code> then add it all up), with a little bit of thought it’s clear what this change should be: In every dimension where the weight is positive, we want to slightly increase the input (to get slightly more score). Conversely, in every dimension where the weight is negative, we want the input to be slightly lower (again, to get slightly more score). In other words, an adversarial <code class="language-plaintext highlighter-rouge">xad</code> might be:</p> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// xad = x + 0.5w gives:</span> <span class="nx">xad</span> <span class="o">=</span> <span class="p">[</span><span class="mf">1.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.5</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.5</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">3.5</span><span class="p">,</span> <span class="mf">4.5</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">]</span> </code></pre></div></div> <p>Doing the dot product again we see that suddenly the score becomes 2. This is not surprising: There are 10 dimensions and we’ve tweaked the input by 0.5 in every dimension in such a way that we <em>gain</em> 0.5 in each one, adding up to a total of 5 additional score, rising it from -3 to 2. Now when we look at probability of class 1 we get <code class="language-plaintext highlighter-rouge">1/(1+e^(-2)) = 0.88</code>. That is, we tweaked the original <code class="language-plaintext highlighter-rouge">x</code> by a small amount and we improved the class 1 probability from 5% to 88%! Moreover, notice that in this case the input only had 10 dimensions, but an image might consist of many tens of thousands of dimensions, so you can afford to make tiny changes across all of them that all add up in concert in exactly the worst way to blow up the score of any class you wish.</p> <h3 id="conclusions">Conclusions</h3> <p>Several other related experiments can be found in <a href="">Explaining and Harnessing Adversarial Examples</a> by Goodfellow et al. This paper is a required reading on this topic. It was the first to articulate and point out the linear functions flaw, and more generally argued that there is a tension between models that are easy to train (e.g. models that use linear functions) and models that resist adversarial perturbations.</p> <p>As closing words for this post, the takeaway is that ConvNets still work very well in practice. Unfortunately, it seems that their competence is relatively limited to a small region around the data manifold that contains natural-looking images and distributions, and that once we artificially push images away from this manifold by computing noise patterns with backpropagation, we stumble into parts of image space where all bets are off, and where the linear functions in the network induce large subspaces of fooling inputs.</p> <p>With wishful thinking, one might hope that ConvNets would produce all-diffuse probabilities in regions outside the training data, but there is no part in an ordinary objective (e.g. mean cross-entropy loss) that explicitly enforces this constraint. Indeed, it seems that the class scores in these regions of space are all over the place, and worse, a straight-forward attempt to patch this up by introducing a background class and iteratively adding fooling images as a new <em>background</em> class during training are not effective in mitigating the problem.</p> <p>It seems that to fix this problem we need to change our objectives, our forward functional forms, or even the way we optimize our models. However, as far as I know we haven’t found very good candidates for either. To be continued.</p> <h4 id="further-reading">Further Reading</h4> <ul> <li>Ian Goodfellow gave a talk on this work at the <a href="">RE.WORK Deep Learning Summit 2015</a></li> <li>You can fool ConvNets as part of <a href="">CS231n Assignment #3 IPython Notebook</a>.</li> <li><a href="">IPython Notebook</a> for this experiment. Also my <a href="">Caffe linear classifier</a> protos if you like.</li> </ul> Mon, 30 Mar 2015 20:00:00 +0000 What I learned from competing against a ConvNet on ImageNet <p>The results of the 2014 <a href="">ImageNet Large Scale Visual Recognition Challenge</a> (ILSVRC) were <a href="">published</a> a few days ago. The New York Times <a href="">wrote about it</a> too. ILSVRC is one of the largest challenges in Computer Vision and every year teams compete to claim the state-of-the-art performance on the dataset. The challenge is based on a subset of the ImageNet dataset that was first collected by <a href="">Deng et al. 2009</a>, and has been organized by our lab here at Stanford since 2010. This year, the challenge saw record participation with 50% more participants than last year, and records were shattered with staggering improvements in both classification and detection tasks.</p> <blockquote> <p>(My personal) <strong>ILSVRC 2014 TLDR</strong>: 50% more teams. 50% improved classification and detection. ConvNet ensembles all over the place. Google team wins.</p> </blockquote> <p>Of course there’s much more to it, and all details and takeaways will be discussed at length in Zurich, at the upcoming <a href="">ECCV 2014 workshop</a> happening on September 12.</p> <p>Additionally, we just (September 2nd) published an arXiv preprint describing the entire history of ILSVRC and a large amount of associated analysis, <a href="">check it out on arXiv</a>. This post will zoom in on a portion of the paper that I contributed to (Section 6.4 Human accuracy on large-scale image classification) and describe some of its context.</p> <h4 id="ilsvrc-classification-task">ILSVRC Classification Task</h4> <p>For the purposes of this post, I would like to focus, in particular, on image classification because this task is the common denominator for many other Computer Vision tasks. The classification task is made up of 1.2 million images in the training set, each labeled with one of 1000 categories that cover a wide variety of objects, animals, scenes, and even some abstract geometric concepts such as <em>“hook?lt;/em>, or <em>“spiral?lt;/em>. The 100,000 test set images are released with the dataset, but the labels are withheld to prevent teams from overfitting on the test set. The teams have to predict 5 (out of 1000) classes and an image is considered to be correct if at least one of the predictions is the ground truth. The test set evaluation is carried out on our end by comparing the predictions to our own set of ground truth labels.</p> <div class="imgcap"> <img src="/assets/cnntsne.jpeg" /> <div class="thecap">Example images from the classification task. Find full-scale images <a href="">here</a>.</div> </div> <h4 id="googlenets-impressive-performance">GoogLeNet’s Impressive Performance</h4> <p>I was looking at the results about a week ago and became particularly intrigued by GoogleLeNet’s winning submission for the classification task, which achieved a Hit@5 error rate of only 6.7% on the ILSVRC test set. I was relatively familiar with the scope and difficulty of the classification task: these are unconstrained internet images. They are a jungle of viewpoints, lighting conditions, and variations of all imaginable types. This begged the question: <em>How do humans compare?</em></p> <p>There are now several tasks in Computer Vision where the performance of our models is close to human, or even <em>superhuman</em>. Examples of these tasks include face verification, various medical imaging tasks, Chinese character recognition, etc. However, many of these tasks are fairly constrained in that they assume input images from a very particular distribution. For example, face verification models might assume as input only aligned, centered, and normalized images. In many ways, ImageNet is harder since the images come directly from the “jungle of the interwebs? Is it possible that our models are reaching human performance on such an unconstrained task?</p> <h4 id="computing-human-accuracy">Computing Human Accuracy</h4> <p>In short, I thought that the impressive performance by the winning team would only make sense if it was put in perspective with human accuracy. I was also in the unique position of being able to evaluate it (given that I share office space with ILSVRC organizers), so I set out to quantify the human accuracy and characterize the differences between human predictions with those of the winning model.</p> <p><em>Wait, isn’t human accuracy 100%?</em> Thank you, good question. It’s not, because the ILSVRC dataset was not labeled in the same way we are classifying it here. For example, to collect the images for the class “Border Terrier?the organizers searched the query on internet and retrieved a large collection of images. These were then filtered a bit with humans by asking them a binary “Is this a Border Terrier or not?? Whatever made it through became the “Border Terrier?class, and similar for all the other 1000 images. Therefore, the data was not collected in a discriminative but a binary manner, and is also subject to mistakes and inaccuracies. Some images can sometimes also contain multiple of the ILSVRC classes, etc.</p> <p><em>CIFAR-10 digression.</em> It’s fun to note that about 4 years ago I performed a similar (but much quicker and less detailed) human classification accuracy analysis on CIFAR-10. This was back when the state of the art was at 77% by Adam Coates, and my own accuracy turned out to be 94%. I think the best ConvNets now get about 92%. The post about that can be found <a href="/2011/04/27/manually-classifying-cifar10/">here</a>. I never imagined I’d be doing the same for ImageNet a few years down the road :)</p> <p>There’s one issue to clarify on. You may ask: <em>But wait, the ImageNet test set labels were obtained from humans in the first place. Why go about re-labeling it all over again? Isn’t human performance 0% by definition?</em> Kind of, but not really. It is important to keep in mind that ImageNet was annotated as a binary ask. For example, to collect images of the dog class “Kelpie? the query was submitted to search engines and then humans on Amazon Mechanical Turk were used for the binary task of filtering out the noise. The ILSVRC classification task, on the other hand, is 1000-way classification. It’s not a binary task such as the one used to collect the data.</p> <h4 id="labeling-interface">Labeling Interface</h4> <p>I developed a labeling interface that would help us evaluate the human performance. It looked similar to, but not identical, to the screenshot below:</p> <div class="imgcap"> <img src="/assets/ilsvrc1.png" /> <div class="thecap">A crop of a screenshot of the <a href="">labeling interface</a> for the ILSVRC validation data. Try it out for yourself.</div> </div> <p>The interface consisted of the test image on the left, and 1000 classes listed on the right. Each class was followed by 13 example images from the training set so that the categories were easier for a human to scan visually. The categories were also sorted in the topological order of the ImageNet hierarchy, which places semantically similar concepts nearby in the list. For example, all motor vehicle-related classes are arranged contiguously in the list. Finally, the interface is web-based so it is easy to naturally scroll through the classes, or search for them by text.</p> <p><strong>Try it out!</strong> I’m making the <a href="">the labeling interface</a> available to everyone so that you can also try labeling ILSVRC yourselves and draw your own conclusions. There are a few modifications in this version from the one we used to collect the data. I added two buttons (Show answer, and Show google prediction), and of course, the images shown in this version are the <em>validation</em> images, not the test set images. The GoogLeNet validation set predictions were graciously provided by the Google team.</p> <h4 id="roadblocks-along-the-way">Roadblocks along the way</h4> <p><strong>It was hard.</strong> As I beta-tested the interface, the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on AMT. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15% error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself.</p> <p><strong>It took a while.</strong> I ended up training on 500 validation images and then switched to the test set of 1500 images. The labeling happened at a rate of about 1 per minute, but this decreased over time. I only enjoyed the first ~200, and the rest I only did <em>#forscience</em>. (In the end we convinced one more expert labeler to spend a few hours on the annotations, but they only got up to 280 images, with less training, and only got to about 12%). The labeling time distribution was strongly bimodal: Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs.</p> <p><strong>It was worth it.</strong> Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8% (the error on the full test set of 100,000 images is 6.7%). My own error in the end turned out to be <strong>5.1%</strong>, approximately 1.7% better. If you crunch through the statistical significance calculations (i.e. comparing the two proportions with a Z-test) under the null hypothesis of them being equal, you get a one-sided p-value of 0.022. In other words, the result is statistically significant based on a relatively commonly used threshold of 0.05. Lastly, I found the experience to be quite educational: After seeing so many images, issues, and ConvNet predictions you start to develop a really good model of the failure modes.</p> <blockquote> <p>My error turned out to be 5.1%, compared to GoogLeNet error of 6.8%. Still a bit of a gap to close (and more).</p> </blockquote> <div class="imgcap"> <img src="/assets/ilsvrc3.png" /> <div class="thecap">Representative example of practical frustrations of labeling ILSVRC classes. Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?</div> </div> <h4 id="analysis-of-errors">Analysis of errors</h4> <p>We inspected both human and GoogLeNet errors to gain an understanding of common error types and how they compare. The analysis and insights below were derived specifically from GoogLeNet predictions, but I suspect that many of the same errors may be present in other methods. Let me copy paste the analysis from our <a href="">ILSVRC paper</a>:</p> <p><strong>Types of error that both GoogLeNet human are susceptible to:</strong></p> <ol> <li> <p><strong>Multiple objects.</strong> Both GoogLeNet and humans struggle with images that contain multiple ILSVRC classes (usually many more than five), with little indication of which object is the focus of the image. This error is only present in the Classification setting, since every image is constrained to have exactly one correct label. In total, we attribute 24 (24%) of GoogLeNet errors and 12 (16%) of human errors to this category. It is worth noting that humans can have a slight advantage in this error type, since it can sometimes be easy to identify the most salient object in the image.</p> </li> <li> <p><strong>Incorrect annotations.</strong> We found that approximately 5 out of 1500 images (0.3%) were incorrectly annotated in the ground truth. This introduces an approximately equal number of errors for both humans and GoogLeNet.</p> </li> </ol> <p><strong>Types of error that GoogLeNet is more susceptible to than human:</strong></p> <ol> <li> <p><strong>Object small or thin.</strong> GoogLeNet struggles with recognizing objects that are very small or thin in the image, even if that object is the only object present. Examples of this include an image of a standing person wearing sunglasses, a person holding a quill in their hand, or a small ant on a stem of a flower. We estimate that approximately 22 (21%) of GoogLeNet errors fall into this category, while none of the human errors do. In other words, in our sample of images, no image was mislabeled by a human because they were unable to identify a very small or thin object. This discrepancy can be attributed to the fact that a human can very effectively leverage context and affordances to accurately infer the identity of small objects (for example, a few barely visible feathers near person’s hand as very likely belonging to a mostly occluded quill).</p> </li> <li> <p><strong>Image filters.</strong> Many people enhance their photos with filters that distort the contrast and color distributions of the image. We found that 13 (13%) of the images that GoogLeNet incorrectly classified contained a filter. Thus, we posit that GoogLeNet is not very robust to these distortions. In comparison, only one image among the human errors contained a filter, but we do not attribute the source of the error to the filter.</p> </li> <li> <p><strong>Abstract representations.</strong> We found that GoogLeNet struggles with images that depict objects of interest in an abstract form, such as 3D-rendered images, paintings, sketches, plush toys, or statues. An example is the abstract shape of a bow drawn with a light source in night photography, a 3D-rendered robotic scorpion, or a shadow on the ground, of a child on a swing. We attribute approximately 6 (6%) of GoogLeNet errors to this type of error and believe that humans are significantly more robust, with no such errors seen in our sample.</p> </li> <li> <p><strong>Miscellaneous sources.</strong> Additional sources of error that occur relatively infrequently include extreme closeups of parts of an object, unconventional viewpoints such as a rotated image, images that can significantly benefit from the ability to read text (e.g. a featureless container identifying itself as ?lt;em>face powder</em>?, objects with heavy occlusions, and images that depict a collage of multiple images. In general, we found that humans are more robust to all of these types of error.</p> </li> </ol> <div class="imgcap"> <img src="/assets/ilsvrc2.png" /> <div class="thecap">Representative validation images that highlight common sources of error. For each image, we display the ground truth in blue, and top 5 predictions from GoogLeNet follow (red = wrong, green = right). GoogLeNet predictions on the validation set images were graciously provided by members of the GoogLeNet team. From left to right: Images that contain multiple objects, images of extreme closeups and uncharacteristic views, images with filters, images that significantly benefit from the ability to read text, images that contain very small and thin objects, images with abstract representations, and example of a fine-grained image that GoogLeNet correctly identifies but a human would have significant difficulty with.</div> </div> <p><strong>Types of error that human is more susceptible to than GoogLeNet:</strong></p> <ol> <li> <p><strong>Fine-grained recognition.</strong> We found that humans are noticeably worse at fine-grained recognition (e.g. dogs, monkeys, snakes, birds), even when they are in clear view. To understand the difficulty, consider that there are more than 120 species of dogs in the dataset. We estimate that 28 (37%) of the human errors fall into this category, while only 7 (7%) of GoogLeNet erros do.</p> </li> <li> <p><strong>Class unawareness.</strong> The annotator may sometimes be unaware of the ground truth class present as a label option. When pointed out as an ILSVRC class, it is usually clear that the label applies to the image. These errors get progressively less frequent as the annotator becomes more familiar with ILSVRC classes. Approximately 18 (24%) of the human errors fall into this category.</p> </li> <li> <p><strong>Insufficient training data.</strong> Recall that the annotator is only presented with 13 examples of a class under every category name. However, 13 images are not always enough to adequately convey the allowed class variations. For example, a brown dog can be incorrectly dismissed as a ?lt;em>Kelpie</em>?if all examples of a ?lt;em>Kelpie</em>?feature a dog with black coat. However, if more than 13 images were listed it would have become clear that a ?lt;em>Kelpie</em>?may have a brown coat. Approximately 4 (5%) of human errors fall into this category.</p> </li> </ol> <h4 id="conclusions">Conclusions</h4> <p>We investigated the performance of trained human annotators on a sample of up to 1500 ILSVRC test set images. Our results indicate that a trained human annotator is capable of outperforming the best model (GoogLeNet) by approximately 1.7% (p = 0.022).</p> <p>We expect that some sources of error may be relatively easily eliminated (e.g. robustness to filters, rotations, collages, effectively reasoning over multiple scales), while others may prove more elusive (e.g. identifying abstract representations of objects). On the hand, a large majority of human errors come from fine-grained categories and class unawareness. We expect that the former can be significantly reduced with fine-grained expert annotators, while the latter could be reduced with more practice and greater familiarity with ILSVRC classes.</p> <p>It is clear that humans will soon only be able to outperform state of the art image classification models by use of significant effort, expertise, and time. One interesting follow-up question for future investigation is how computer-level accuracy compares with human-level accuracy on more complex image understanding tasks.</p> <blockquote> <p>“It is clear that humans will soon only be able to outperform state of the art image classification models by use of significant effort, expertise, and time.?lt;/p> </blockquote> <p>As for my personal take-away from this week-long exercise, I have to say that, qualitatively, I was very impressed with the ConvNet performance. Unless the image exhibits some irregularity or tricky parts, the ConvNet confidently and robustly predicts the correct label. If you’re feeling adventurous, try out <a href="">the labeling interface</a> for yourself and draw your own conclusions. I can promise that you’ll gain interesting qualitative insights into where state-of-the-art Computer Vision works, where it fails, and how.</p> <p>EDIT: additional discussions:</p> <ul> <li><a href="">Pierre’s Google+</a></li> <li><a href="">Reddit /r/MachineLearning</a></li> </ul> <p>UPDATE:</p> <ul> <li><a href="">ImageNet workshop page</a> now has links to many of the teams?slides and videos.</li> <li><a href="">GoogLeNet paper</a> on arXiv describes the details of their architecutre.</li> </ul> <p>UPDATE2 (14 Feb 2015):</p> <p>There have now been several reported results that surpass my 5.1% error on ImageNet. I’m astonished to see such rapid progress. At the same time, I think we should keep in mind the following:</p> <blockquote> <p>Human accuracy is not a point. It lives on a tradeoff curve.</p> </blockquote> <p>We trade off human effort and expertise with the error rate: I am one point on that curve with 5.1%. My labmates with almost no training and less patience are another point, with even up to 15% error. And based on some calculations that consider my exact error types and hypothesizing which ones may be easier to fix than others, it’s not unreasonable to suggest that an ensemble of very dedicated expert human labelers might push this down to 3%, with about 2% being an optimistic error rate lower bound. I know it’s not as exciting as having a single number, but it’s the right way of thinking about it. See more details in my recent <a href="">Google+ post</a>.</p> Tue, 02 Sep 2014 20:00:00 +0000 ρٷվ