The Affirmative Action of Vocabulary

Modern humans use their computers to distinguish dogs from muffins, bagels, and croissants. Good job, humanity.

Word analogies

A frequent target of criticism about bias in training data is Word2vec, a technique for representing words as vectors. Its best-known pretrained model covers roughly 3M words and phrases learned from Google News articles. Vectors like these feed algorithms for things like sentiment analysis, predictive typing, automatic proofreading, and so on. Researchers have examined the relationships between words, and as you might expect, some words go together more often than others. You can ask the model to associate terms and it will show that “Man is to King as Woman is to Queen.”

Global vectors for word representation showing some examples of word associations along a gender vector.
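To make the analogy concrete, here is a minimal sketch of the vector arithmetic behind “Man is to King as Woman is to Queen.” The three-dimensional vectors are hypothetical values picked by hand for illustration, not numbers from Word2vec or GloVe, and the function names `analogy` and `gender_projection` are my own. Real embeddings have hundreds of dimensions learned from text, but the idea is the same: subtract one word, add another, and look for the nearest remaining word.

```python
import numpy as np

# Hypothetical toy embeddings, hand-picked for illustration only.
# Axis 0: "is a person", axis 1: "female", axis 2: "royal rank".
EMBEDDINGS = {
    "man":      np.array([1.0, 0.0, 0.1]),
    "woman":    np.array([1.0, 1.0, 0.1]),
    "king":     np.array([1.0, 0.0, 0.9]),
    "queen":    np.array([1.0, 1.0, 0.9]),
    "prince":   np.array([1.0, 0.0, 0.6]),
    "princess": np.array([1.0, 1.0, 0.6]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the three query words so the answer is a new word.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


def gender_projection(word, embeddings=EMBEDDINGS):
    """Project a word onto the (woman - man) direction."""
    direction = embeddings["woman"] - embeddings["man"]
    direction = direction / np.linalg.norm(direction)
    return float(np.dot(embeddings[word], direction))


if __name__ == "__main__":
    print(analogy("man", "king", "woman"))
    print(gender_projection("king"), gender_projection("queen"))
```

The projection function is the same trick run in reverse: instead of completing an analogy, it measures how far any word sits along the gender direction. That measurement is what surfaces associations a corpus has absorbed from its sources.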

Often true, seldom aspirational

Often, the algorithms’ predictions are factually correct. In an analysis of 150 years of British periodicals, researchers were able to accurately detect changes in society: when electricity replaced steam, when trains replaced horses, and when epidemics and wars struck. But the key here is history: the data on which the algorithms were trained comes from the past. The same study was also able to detect the under-representation of women in the media.

A very badly done trendline to make my point.

Don’t change the future by reinforcing the past

Data ethics is complicated stuff. The problem isn’t that the historical data used for training (like Word2vec, or 150 years of British periodicals, or the entire body of Enron emails) is wrong. It’s actually frighteningly right. Rather, the issue is that we shouldn’t build its assumptions into algorithms in ways that change the future by reinforcing the past.

Are these suggestions helpful?



Alistair Croll

Writer, speaker, accelerant. Intersection of tech & society. Strata, Startupfest, Bitnorth, FWD50. Lean Analytics, Tilt the Windmill, HBS, Just Evil Enough.