Chapter 16. How to Feed and Care for Your Machine-Learning Experts

Pete Warden

Machine learning is a craft as well as a science, and for the best results you’ll often need to turn to experienced specialists. Not every team has enough interesting problems to justify a full-time machine-learning position, though. As the value of the approach becomes better-known, the demand for part-time or project-based machine-learning work has grown, but it’s often hard for a traditional engineering team to effectively work with outside experts in the field. I’m going to talk about some of the things I learned while running an outsourced project through Kaggle,[73] a community of thousands of researchers who participate in data competitions modeled on the Netflix Prize. This was an extreme example of outsourcing: we literally handed over a dataset, a short description, and a success metric to a large group of strangers. It had almost none of the traditional interactions you’d expect, but it did teach me valuable lessons that apply to any interactions with machine-learning specialists.

Define the Problem

My company Jetpac creates a travel magazine written by your friends, using vacation photos they’ve shared with you on Facebook and other social services. The average user has had over two hundred thousand pictures shared with them, so we have a lot to choose from. Unfortunately, many of them are not very good, at least for our purposes. When we showed people our prototype, most would be turned off by the poor quality of the images, so we knew we needed a solution that would help us pick out the most beautiful from the deep pool to which we had access.

This intention was too vague to build on, though. We had to decide exactly what we wanted our process to produce, and hence what question we wanted to answer about each photo. This is the crucial first step for any machine-learning application. Unless you know what your goal is, you’ll never be able to build a successful solution. One of the key qualities you need is some degree of objectivity and repeatability. If, like me, you’re trying to predict a human’s reaction, it’s no good if it varies unpredictably from person to person. Asking “Is this a good flavor of ice cream?” might be a good market research question, but the result says more about an individual’s personal taste than an objective “goodness” quality of the dessert.

That drove us to ask specifically, “Does this photo inspire you to travel to the place shown?” We were able to confirm that the answers for individual photos were quite consistent across the handful of people we ran initial tests on, so it appeared that there was some rough agreement on criteria. That proved we could use one person’s opinion as a good predictor of what other people would think. Therefore, if we could mimic a person with machine learning, we should be able to pick photos that many people would like.

As with most software engineering disciplines, the hardest part of a machine-learning expert’s job is defining the requirements in enough detail to implement a successful solution. You’re the domain expert, so in the end you’re the only one who can define exactly what problem you want to solve. Spend a lot of time on this stage; everything else depends on it.

Fake It Before You Make It

Because the problem definition is so crucial, you should prototype what would happen if you could get good answers to the question you’re asking. Does it actually achieve the goals you set when you try it within your application? In our case, we could determine that by having a human rate a handful of users’ photos and then presenting the end result within our interface to see how it changed the experience. The results were extremely positive: users went from a lukewarm reaction to enthusiasm when the first photos they saw were striking and inspiring. From a technical standpoint, it helped us, too, because it ensured we had set up the right modular interface so that the quality algorithm could be a true black box within our system, with completely defined inputs and outputs.

For almost any machine-learning problem, you can build a similar human-powered prototype at the cost of some sweat and time. It will teach you crucial lessons about what you need from the eventual algorithm, and help you to be much smarter about steering the rest of the process. A good analogy is the use of static “slideware” presentations of an application to get initial feedback from users. It gives people a chance to experience something close to the result for which you’re aiming, at a stage when it’s very easy to change.

In practice, the system we set up worked very simply. We’d ask a user to sign up on the site, and we’d send them an email telling them that their slideshow would be ready soon. We loaded information about all the photos in their social circle into our database, and then added them to a queue. We had a photo-rating page that team members could access, which displayed a single photo from the person at the head of the queue and had two buttons for indicating whether the photo was inspiring or not so much. Choosing one would refresh the page and show the next unrated photo, and if multiple people were rating, the workload would be spread out among them. This allowed us to rapidly work through the tens of thousands of photos shared with an average user, and we often had their results back within an hour. It meant that when we sat down with advisors or potential investors for a meeting, they could sign up at the start and we could show them a complete experience by the end of the discussion. That helped convince people that the final product would be fun and interesting if we could solve the quality issue, and enabled them to give us meaningful advice about the rest of the application before that was done.

Create a Training Set

Prototyping naturally led to this next stage. Before we could build any algorithm, we needed a good sample of how people reacted to different photos, both to teach the machine-learning system and to check its results. We also wanted to continue testing other parts of our application while we worked on the machine-learning solution, so we ended up building a human-powered module that both produced a training set, and manually rated the photos for our early users. This might sound like a classic Mechanical Turk[74] problem, but we were unable to use Amazon’s service, or other similar offerings, because the photos weren’t public. To protect people’s privacy, we needed to restrict access to a small group of vetted people who had signed nondisclosure agreements and would work as our employees. We did experiment with hiring a few people in the Philippines, but discovered that the cultural gap between them and our user population was too large. Situations like conferences and graduations were obviously inspiring to us and our target users in the US, but weren’t recognizable as such by our overseas hires. We could no doubt have fixed this with enough training and more detailed guidelines, but it proved easier to tackle it internally instead.

We already had millions of photos available to analyze, and we wanted to build as big a training set as possible, because we knew that would increase the accuracy of any solution. To help us rate as many photos as possible, I built a web interface that displayed a grid of nine photos from a single album, and let us mark the set as good or bad and advance to the next group with one keystroke. Initially these were y and n, but we found that switching to , and . helped improve our speed. The whole team joined in, and we spent hours each day on the task. We had all rated tens of thousands of photos by the time we stopped, but our community manager Cathrine Lindblom had a real gift for it, notching up multiple days of more than fifteen thousand votes, which works out to a sustained rate of one every two seconds!

One of the decisions that was implicit in our interface was that we’d be rating entire albums, rather than individual photos. This was an optimization that seemed acceptable because we’d noticed anecdotally that the quality of photos seemed pretty consistent within a single album, and that a selection of nine pictures at once was enough to get a good idea of that overall quality. It did constrain any algorithm that we produced to only rating entire albums, though.

To ensure that we were being consistent, we kept an eye on each person’s acceptance rate. On average, about thirty percent of all photos were rated as good, so if anyone’s personal average was more than a few percentage points above or below that, they knew they were being too strict or too lenient. I also considered collecting multiple ratings on some photos to make sure we were really being consistent, but when I did a manual inspection of our results, it seemed like we were doing a good enough job without that check.

In the end, we had over two hundred and fifty thousand votes, though at the time we started the competition we’d only accumulated about fifty thousand.

Pick the Features

We were starting to develop a good reference library of which photos were considered good and bad, but we hadn’t decided which attributes we’d try to base the predictions on. The pixel data seemed like the obvious choice, but because the pictures were externally hosted on services like Facebook, that would potentially mean fetching the data for hundreds of thousands of images for every user, and the processing and bandwidth requirements would be far too much for our budget. We were already relying heavily on the text captions for identifying which places and activities were in the photos, so I decided to test a hunch that particular words might be good indicators of quality. I didn’t know ahead of time which words might be important, so I set out to create a league table of how highly the most common words correlated with our ratings.

I used some simple Pig scripts to join the votes we’d stored in one table in our Cassandra database with the photos to which they were linked, held in a different table. I then found all the words used in the album title, description, and individual captions, and created overall frequency counts for each. I threw away any words that appeared fewer than a thousand times, to exclude rare words that could skew the results. I then used Pig to calculate the ratio of good to bad photos associated with each word, and output a CSV file that listed them in order.
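
The real scripts were written in Pig against our Cassandra tables, but the league-table computation itself is easy to sketch in plain Python. In this sketch the input file name, the column names, and the threshold are hypothetical stand-ins, not our actual schema:

    import csv
    from collections import Counter

    MIN_OCCURRENCES = 1000  # throw away rare words that could skew the ratios

    good_counts = Counter()   # word -> rated-good photos whose text contains it
    total_counts = Counter()  # word -> all rated photos whose text contains it

    # Hypothetical input: one row per rated photo, already joined with its vote,
    # with a "text" field combining the album title, description, and caption.
    with open("rated_photos.csv") as f:
        for row in csv.DictReader(f):
            words = set(row["text"].lower().split())
            is_good = row["rating"] == "good"
            for word in words:
                total_counts[word] += 1
                if is_good:
                    good_counts[word] += 1

    # Rank the common words by the fraction of their photos rated as good.
    ranked = sorted(
        (w for w in total_counts if total_counts[w] >= MIN_OCCURRENCES),
        key=lambda w: good_counts[w] / total_counts[w],
        reverse=True,
    )

    with open("word_ratios.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["word", "good_ratio", "occurrences"])
        for w in ranked:
            writer.writerow([w, round(good_counts[w] / total_counts[w], 3),
                             total_counts[w]])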

When I saw the results, I knew that my hunch was right. For certain words like Tombs and Trails, over ninety percent of the pictures were rated highly, while at the other end of the scale, fewer than three percent of photos with Mommy in the caption made people want to travel there. These extremely high- and low-ranking words weren’t numerous enough to offer a simple solution, since they still only occurred in a few percent of the captions, but they did give me confidence that the words would be important features for any machine-learning process. One thing I didn’t expect was that words like beautiful weren’t significant indicators of good pictures; it seems bad photographers are as likely to use that in their captions as good ones!

I also had a feeling that particular locations were going to be associated with clusters of good or bad photos, so I split the world up into one-degree boxes and used another Pig script to place our rated photos into them based on their rounded latitude and longitude coordinates. I then ordered the results by the ratio of good to bad photos in each box, excluding buckets with fewer than a thousand photos. Again, there seemed to be some strong correlations to work with: places like Peru had over ninety percent of their pictures highly rated. That gave me a good enough reason to put the location coordinates into the inputs we’d be feeding to the machine learning.
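
A matching sketch for the location buckets, again with hypothetical file and column names, bins each rated photo by the floor of its coordinates and ranks the boxes the same way:

    import csv
    import math
    from collections import Counter

    MIN_PHOTOS = 1000  # exclude sparsely populated boxes

    good = Counter()   # (lat_degree, lon_degree) -> photos rated good
    total = Counter()  # (lat_degree, lon_degree) -> all rated photos

    with open("rated_photos.csv") as f:
        for row in csv.DictReader(f):
            # Round the coordinates down into one-degree boxes.
            box = (math.floor(float(row["lat"])), math.floor(float(row["lon"])))
            total[box] += 1
            if row["rating"] == "good":
                good[box] += 1

    # Order the well-populated boxes by their ratio of good photos.
    for box in sorted((b for b in total if total[b] >= MIN_PHOTOS),
                      key=lambda b: good[b] / total[b], reverse=True):
        print(box, round(good[box] / total[box], 3), total[box])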

Along with word occurrences and position, I also added a few other simple metrics like the number of photos in an album and the photo dimensions. I had no evidence that they were correlated with quality, but they were at hand and seemed likely suspects. At this point I had quite a rich set of features, and I was concerned that providing too many might result in confusion or overfitting. The exact process of picking features and deciding when to stop definitely felt like a bit of an art, but doing some sanity testing to uncover correlations is a good way to have some confidence that you’re making reasonable choices.

Encode the Data

Machine-learning algorithms expect inputs and outputs that are essentially spreadsheets: tables of numbers with each row representing a single record, the columns containing particular attributes, and a special column containing the value you’re trying to predict. Some of my values were words, so I needed to convert those into numbers somehow. I was also concerned about ensuring that the albums had no identifiable information, since essentially anyone could download the datasets from the Kaggle website. To tackle both problems, I randomly sorted the most common words and assigned them numbers based on the order in which they appeared in the resulting list. The spreadsheet cells contained a space-separated list of the numbers of the words that appeared in the captions, description, and title.

In a similar way, I rounded down the latitude and longitude coordinates to the nearest degree, to prevent exact locations from being revealed. The other attributes were already integers, so I had everything converted into convenient numerical values. The final step was converting the predicted good or bad values into numbers: just 1 for good and 0 for bad.
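
As a rough illustration of the encoding, here is a small Python sketch. The field names, the fixed random seed, and the encode_album helper are all hypothetical; the point is the shuffled word-to-number mapping, the degree-rounded coordinates, and the 1/0 labels:

    import math
    import random

    def build_word_codes(common_words, seed=42):
        """Shuffle the common words and number them by their shuffled position,
        so the codes reveal nothing about the words themselves."""
        words = list(common_words)
        random.Random(seed).shuffle(words)
        return {word: i for i, word in enumerate(words)}

    def encode_album(album, word_codes):
        """Turn one album into the flat numeric row for the spreadsheet."""
        text = " ".join([album["title"], album["description"]] + album["captions"])
        codes = [str(word_codes[w]) for w in text.lower().split() if w in word_codes]
        return {
            "words": " ".join(codes),              # space-separated word numbers
            "lat": math.floor(album["lat"]),       # rounded down to whole degrees
            "lon": math.floor(album["lon"]),
            "num_photos": album["num_photos"],
            "label": 1 if album["rating"] == "good" else 0,  # 1 = good, 0 = bad
        }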

Split Into Training, Test, and Solution Sets

Now I had the fifty thousand albums we’d voted on so far described in a single spreadsheet, but for the machine-learning process, I needed to split them up into three different files. The first was the training set, a list of albums’ attributes together with the human rating of each one. As the name suggests, this was used to build a predictive model: it was fed into the learning algorithm so that correlations between particular attributes and the results could be spotted and used. Next was the test set, which contained the attributes of a different list of albums, but without the human ratings of their photos. This would be used to gauge how well the model built from the training set was working, by feeding in the attributes and outputting predictions for each one.

These first two spreadsheets would be available for all competitors to download, but the third, the solution set, was kept private on the Kaggle servers. It contained just a single column with the actual human ratings for every album in the test set. To enter the contest, the competitors would upload their own model’s predictions for the test set, and behind the scenes Kaggle would automatically compare those against the true solution, assigning each entry a score based on how close the predicted values were to the human ratings.

One wrinkle was that to assign a score, we had to decide how to measure the overall error between the predictions and the true values. Initially I picked a measure I understood, root-mean-square error, but I was persuaded by Jeremy Howard to use capped binomial deviance instead. I still don’t understand the metric, despite having stared at the code implementing it, but it produced great results in the end, so he obviously knew what he was talking about!
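
For what it’s worth, a common formulation of that metric caps each prediction away from 0 and 1 before taking the log-likelihood, so a single overconfident wrong answer can’t blow up the score. The sketch below is my reading of that general idea, not Kaggle’s exact implementation, and the cap value is an assumption:

    import math

    def capped_binomial_deviance(y_true, y_pred, cap=0.01):
        """Average negative log10-likelihood, with each prediction clamped
        to [cap, 1 - cap] so extreme values can't dominate the score."""
        total = 0.0
        for y, p in zip(y_true, y_pred):
            p = min(max(p, cap), 1.0 - cap)
            total += -(y * math.log10(p) + (1 - y) * math.log10(1 - p))
        return total / len(y_true)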

We had to pick how to split the albums between the training and test sets, and a rule of thumb seemed to be to divide them in roughly a three-to-one ratio. We had just over fifty thousand albums rated at that point, so I put forty thousand in the training set and twelve thousand in the test set. I used standard Unix tools like tail to split up the full CSV file, after doing a random sort to make sure that no selection biases caused by ordering crept into the data. After that, I was able to upload the three files to Kaggle’s servers, and the technical side of the competition was ready.
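
In Python terms, the shuffle-and-split step looks something like the sketch below. The file names and the label column are hypothetical, and the real split was done with sort and tail on the command line rather than like this:

    import csv
    import random

    # Read the full encoded spreadsheet and shuffle it so any ordering bias
    # (for example, albums grouped by user) can't leak into the split.
    with open("albums_encoded.csv") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    random.Random(0).shuffle(rows)

    split = int(len(rows) * 0.75)  # roughly a three-to-one ratio
    train, test = rows[:split], rows[split:]
    label_col = header.index("label")

    def write_csv(path, columns, records):
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(columns)
            writer.writerows(records)

    # The training set keeps the human labels, the test set has them stripped
    # out, and the solution file holds just the held-back labels.
    write_csv("train.csv", header, train)
    write_csv("test.csv", [c for c in header if c != "label"],
              [[v for i, v in enumerate(r) if i != label_col] for r in test])
    write_csv("solution.csv", ["label"], [[r[label_col]] for r in test])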

These same files were also perfect inputs to try with an off-the-shelf machine-learning framework. I spent a few hours building an example using the scikit-learn[75] Python package and its default support vector machine implementation, but it didn’t produce very effective results. I obviously needed the expertise of a specialist, so for lack of an in-house genius, I went ahead with the Kaggle competition. It’s a good idea to do something similar before you call in outside expertise, though, if only to sanity-check the data you’ve prepared and get a sense of what results a naive approach will achieve.
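
A baseline along those lines might look something like the following sketch. It assumes the features have already been flattened into purely numeric columns with the 0/1 label last; the file name and column layout are hypothetical, and the real data needed more massaging than this:

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Hypothetical input: a numeric CSV with the 0/1 label in the last column.
    data = np.loadtxt("train_numeric.csv", delimiter=",", skiprows=1)
    X, y = data[:, :-1], data[:, -1]

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Scale the features, then fit a support vector classifier that can output
    # probabilities, so albums can be ranked rather than just thresholded.
    scaler = StandardScaler().fit(X_train)
    model = SVC(probability=True).fit(scaler.transform(X_train), y_train)

    probs = model.predict_proba(scaler.transform(X_val))[:, 1]
    print("validation AUC:", roc_auc_score(y_val, probs))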

Describe the Problem

Beyond just providing files, I needed to explain what the competition was for, and why it was worth entering. As a starving startup, we could only afford five thousand dollars for prize money, and we needed the results in just three weeks, so I wanted to motivate contestants as much as I could with the description! As it turned out, people seemed to enjoy the unusually short length of the contest. The description was also useful in explaining what the data meant. In theory, machine learning doesn’t need to know anything about what the tables of numbers represent; but in practice, there’s an art to choosing which techniques to apply and which patterns to focus on, and knowing what the data is actually about can be helpful for that. In the end, I wrote a few paragraphs discussing the problem we were encountering, and that seemed quite effective. The inclusion of a small picture of a tropical beach didn’t seem to hurt, either!

In theory, a machine-learning expert can deal with your problem as a pure numerical exercise. In practice, building good algorithms requires a lot of judgment and intuition, so it’s well worth giving them a brief education on the domain in which you’re working. Most datasets have a lot of different features, and it helps to know which attributes they represent in the real world so you can tune your algorithms to focus on those with the greatest predictive power.

Respond to Questions

Once we’d launched, I kept an eye on the Kaggle forums and tried to answer questions as they came up. This was actually fairly tough, mostly because other community members or Kaggle staff would often get to them before me! It was very heartening to see how enthused everybody was about the problem. In the end, the three-week contest involved very little work for me, though I know the contestants were extremely busy.

This was one of the areas where our data contest approach led to quite a different experience than you’d find with a more traditionally outsourced project. All of the previous preparation steps had created an extremely well-defined problem for the contestants to tackle, and on our end we couldn’t update any data or change the rules part-way through. Working with a consultant or part-time employee is a much more iterative process, because you can revise your requirements and inputs as you go. Because those changes are often costly in terms of time and resources, up-front preparation is still extremely effective.

Several teams apparently tried to use external data sources to help improve their results, without much success. Their hope was that by adding in extra information about things like the geographic location where a photo was taken, they could produce better guesses about its quality. In fact, it appeared that the signals in the training set were better than any that could be gleaned from outside information. We would have been equally happy to use a data source from elsewhere as part of our rating algorithm if it had proven useful, but we were also a little relieved to confirm our own intuition that there weren’t preexisting datasets holding this type of information!

Initially, we were quite concerned about the risk of leaking private information. Even though our competition required entrants to agree not to redistribute or otherwise misuse the dataset we were providing, other contests have been won by teams who managed to de-anonymize the datasets. That is, those teams used the attributes in the test set to connect the entities with external information about them, and in turn to discover the right answers directly rather than predict them. In our case, we had to make sure that no contestant could find out which albums were being referred to in our test set, lest they look at the photos directly and make a human evaluation of their quality. The one-way nature of our encoding process, Facebook’s privacy controls, and the lack of search capabilities for photos on the site kept the original images safe from the teams.

Integrate the Solutions

We had an amazing number of teams enter, over four hundred, and the top ten entries were so close as to be almost identical, but we’d promised to split the prize money among the top three. That also gave us a nonexclusive license to the code they’d used, and this was the payoff of the whole process for us. My colleague Chris Raynor evaluated the entries, and chose the code from the second-place entrant, Jason Tigg. The scores were very close, but Tigg’s Java code was the easiest to add to our pipeline. We were running the evaluation as a batch process, so we ended up compiling the Java into a command-line app that communicated with our Ruby code via sockets. It wasn’t the most elegant solution, but it proved robust and effective for our needs.

One unexpected benefit of the machine-learning algorithm was that it produced a real number expressing the probability that an album was good, rather than the binary up-or-down vote the humans had given. That let us move high-probability photos to the front of slideshows, rather than just sorting them into two categories.

Conclusion

We’re still using the results of the competition very successfully in our product today. I hope this walk-through gave you an idea of how to work effectively with outside machine-learning experts, whether through a contest like Kaggle or through a more traditional arrangement.



[73] “Kaggle: making data science a sport.” (http://www.kaggle.com/)

[74] Amazon Mechanical Turk: “Artificial Artificial Intelligence.” (https://www.mturk.com/)

[75] scikit-learn: machine learning in Python (http://scikit-learn.org/)
