Pete Warden writes convincingly about computer scientists’ focus on improving machine learning algorithms, to the exclusion of improving the training data that the algorithms interpret, and how that focus has slowed the progress of machine learning. The problem is as old as data-processing itself: garbage in, garbage out.

Assembling the large, well-labeled datasets needed to train machine learning systems is a tedious job (indeed, the whole point and promise of machine learning is to teach computers to do this work, which humans are generally not good at and do not enjoy). The shortcuts we take to produce datasets come with steep costs that are not well-understood by the industry.

For example, in order to teach a model to recognize attractive travel photos, Jetpac paid low-waged Southeast Asian workers to label pictures. These workers had a very different idea of a nice holiday than the wealthy people who would use the service they were helping to create: for them, conference reception photos of people in suits drinking wine in air-conditioned international hotels were an aspirational ideal — I imagine that for some of these people, the beach and sea connoted grueling work fishing or clearing brush, rather than relaxing on a sun-lounger.

Warden says that people who are trying to improve vision systems for drones and other robots run into problems using the industry standard Imagenet dataset, because those images were taken by humans, not drones, and humans take pictures in ways that are significanty different from the way that machines do — different lenses, framing, subjects, vantage-points, etc. Warden’s advice is for machine learning researchers to sit with their training data: sift through it, hand-code it, review it and review it again.

Do the hard, boring work of making sure that PNGs aren’t labeled as JPGs, retrieve the audio samples that were classified as “other” and listen to them to see why the classifier barfed on them. It’s an important lesson for product design, but even more important when considering machine learning’s increasing role in adversarial uses like predictive policing, sentencing recommendations, parole decisions, lending decisions, hiring decisions, etc. Read more from…

thumbnail courtesy of