In doing that, I wanted to really open up this understanding of AI as neither artificial nor intelligent. It’s the opposite of artificial. It comes from the most material parts of the Earth’s crust and from human bodies laboring, and from all of the artifacts that we produce and say and photograph every day. Neither is it intelligent. I think there’s this great original sin in the field, where people assumed that computers are somehow like human brains and if we just train them like children, they will slowly grow into these supernatural beings.
That’s something that I think is really problematic—that we’ve bought this idea of intelligence when in actual fact, we’re just looking at forms of statistical analysis at scale that have as many problems as the data they’re given.
Was it immediately obvious to you that this is how people should be thinking about AI? Or was it a journey?
It’s absolutely been a journey. I’d say one of the turning points for me was back in 2016, when I started a project called “Anatomy of an AI system” with Vladan Joler. We met at a conference specifically about voice-enabled AI, and we were trying to effectively draw what it takes to make an Amazon Echo work. What are the components? How does it extract data? What are the layers in the data pipeline?
We realized, well—actually, to understand that, you have to understand where the components come from. Where did the chips get produced? Where are the mines? Where does it get smelted? Where are the logistical and supply chain paths?
Finally, how do we trace the end of life of these devices? How do we look at where the e-waste tips are located in places like Malaysia and Ghana and Pakistan? What we ended up with was this very time-consuming two-year research project to really trace those material supply chains from cradle to grave.
When you start looking at AI systems on that bigger scale, and on that longer time horizon, you shift away from these very narrow accounts of “AI fairness” and “ethics” to saying: these are systems that produce profound and lasting geomorphic changes to our planet, as well as increase the forms of labor inequality that we already have in the world.
So that made me realize that I had to shift from an analysis of just one device, the Amazon Echo, to applying this sort of analytic to the entire industry. That to me was the big task, and that’s why Atlas of AI took five years to write. There’s such a need to actually see what these systems really cost us, because we so rarely do the work of actually understanding their true planetary implications.
The other thing I would say that’s been a real inspiration is the growing field of scholars who are asking these bigger questions around labor, data, and inequality. Here I’m thinking of Ruha Benjamin, Safiya Noble, Mar Hicks, Julie Cohen, Meredith Broussard, Simone Browne—the list goes on. I see this as a contribution to that body of knowledge by bringing in perspectives that connect the environment, labor rights, and data protection.
You travel a lot throughout the book. Almost every chapter starts with you actually looking around at your surroundings. Why was this important to you?
It was a very conscious choice to ground an analysis of AI in specific places, to move away from these abstract “nowheres” of algorithmic space, where so many of the debates around machine learning happen. And hopefully it highlights the fact that when we don’t do that, when we just talk about these “nowhere spaces” of algorithmic objectivity, that is also a political choice, and it has ramifications.
In terms of threading the locations together, this is really why I started thinking about this metaphor of an atlas, because atlases are unusual books. They’re books that you can open up and look at the scale of an entire continent, or you can zoom in and look at a mountain range or a city. They give you these shifts in perspective and shifts in scale.
There’s this lovely line that I use in the book from the physicist Ursula Franklin. She writes about how maps join together the known and the unknown in these methods of collective insight. So for me, it was really drawing on the knowledge that I had, but also thinking about the actual locations where AI is being constructed very literally from rocks and sand and oil.
What kind of feedback has the book received?
One of the things that I’ve been surprised by in the early responses is that people really feel like this kind of perspective was overdue. There’s a moment of recognition that we need to have a different sort of conversation than the ones that we’ve been having over the last few years.
We’ve spent far too much time focusing on narrow tech fixes for AI systems and always centering technical responses and technical answers. Now we have to contend with the environmental footprint of the systems. We have to contend with the very real forms of labor exploitation that have been happening in the construction of these systems.
And we also are now starting to see the toxic legacy of what happens when you just rip as much data off the internet as you can and call it ground truth. That kind of problematic framing of the world has produced so many harms, and as always, those harms have been felt most of all by communities who were already marginalized and not experiencing the benefits of those systems.
What do you hope people will start to do differently?
I hope it’s going to be a lot harder to have these cul-de-sac conversations where terms like “ethics” and “AI for good” have been so completely denatured of any actual meaning. I hope it pulls aside the curtain and says, let’s actually look at who’s running the levers of these systems. That means shifting away from just focusing on things like ethical principles to talking about power.
How do we move away from this ethics framing?
Anti-vaxxers are weaponizing Yelp to punish bars that require vaccine proof
Smith’s Yelp reviews were shut down after the sudden flurry of activity on its page, which the company labels “unusual activity alerts,” a stopgap measure for both the business and Yelp to filter through a flood of reviews and pick out which are spam and which aren’t. Noorie Malik, Yelp’s vice president of user operations, said Yelp has a “team of moderators” that investigate pages that get an unusual amount of traffic. “After we’ve seen activity dramatically decrease or stop, we will then clean up the page so that only firsthand consumer experiences are reflected,” she said in a statement.
It’s a practice that Yelp has had to deploy more often over the course of the pandemic: According to Yelp’s 2020 Trust & Safety Report, the company saw a 206% increase over 2019 levels in unusual activity alerts. “Since January 2021, we’ve placed more than 15 unusual activity alerts on business pages related to a business’s stance on covid-19 vaccinations,” said Malik.
The majority of those cases have been since May, like the gay bar C.C. Attles in Seattle, which got an alert from Yelp after it made patrons show proof of vaccination at the door. Earlier this month, Moe’s Cantina in Chicago’s River North neighborhood got spammed after it attempted to isolate vaccinated customers from unvaccinated ones.
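The “unusual activity” trigger Yelp describes can be pictured as a simple rate comparison: flag a page when its recent review volume far outstrips its own baseline. A minimal sketch, with the window sizes and threshold entirely made up for illustration:

```python
def unusual_activity(daily_counts, window=7, baseline=30, threshold=5.0):
    """Flag a page when its recent review rate far exceeds its baseline.

    daily_counts: review counts per day, oldest first.
    Returns True when the mean of the last `window` days exceeds
    `threshold` times the mean of the preceding `baseline` days.
    """
    if len(daily_counts) < window + baseline:
        return False  # not enough history to judge
    recent = daily_counts[-window:]
    history = daily_counts[-(window + baseline):-window]
    recent_rate = sum(recent) / window
    base_rate = max(sum(history) / baseline, 0.1)  # avoid divide-by-zero
    return recent_rate >= threshold * base_rate

# A quiet page that suddenly receives a flood of reviews:
quiet = [1, 0, 2, 1, 0, 1] * 5          # 30 days of baseline traffic
flood = [40, 55, 38, 60, 47, 52, 49]    # a 7-day spike
print(unusual_activity(quiet + flood))  # -> True
```

A real system would weigh many more signals (reviewer account age, IP addresses, text similarity), but the gating idea is the same: freeze first, then let moderators sort spam from firsthand experiences.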
Spamming a business with one-star reviews is not a new tactic. In fact, perhaps the best-known case is Colorado’s Masterpiece Cakeshop, which won a 2018 Supreme Court battle for refusing to make a wedding cake for a same-sex couple, after which it got pummeled by one-star reviews. “People are still writing fake reviews. People will always write fake reviews,” Liu says.
But he adds that today’s online audiences know that platforms use algorithms to detect and flag problematic words, so bad actors can mask their grievances by citing things like poor restaurant service, as in a more typical negative review, to ensure the rating stays up—and counts.
That seems to have been the case with Knapp’s bar. His bar’s Yelp reviews included comments like “There was hair in my food” and alleged cockroach sightings. “Really ridiculous, fantastic shit,” Knapp says. “If you looked at previous reviews, you would understand immediately that this doesn’t make sense.”
Liu also says there is a limit to how much Yelp can improve its spam detection, since natural language—or the way we speak, read, and write—“is very tough for computer systems to detect.”
But Liu doesn’t think putting a human being in charge of figuring out which reviews are spam or not will solve the problem. “Human beings can’t do it,” he says. “Some people might get it right, some people might get it wrong. I have fake reviews on my webpage and even I can’t tell which are real or not.”
You might notice that I’ve only mentioned Yelp reviews thus far, despite the fact that Google reviews—which appear in the business description box on the right side of the Google search results page under “reviews”—are arguably more influential. That’s because Google’s review operations are, frankly, even more mysterious.
While businesses I spoke to said Yelp worked with them on identifying spam reviews, none of them had any luck with contacting Google’s team. “You would think Google would say, ‘Something is fucked up here,’” Knapp says. “These are IP addresses from overseas. It really undermines the review platform when things like this are allowed to happen.”
These creepy fake humans herald a new age in AI
Once viewed as less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it hard to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, say, of different ages, shapes, and ethnicities to build a face-detection system that works across populations.
But synthetic data has its limitations. If it fails to reflect reality, it could end up producing even worse AI than messy, biased real-world data—or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this paradigm and say, ‘Oh, this will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it will also ignore a lot of things.”
Realistic, not real
Deep learning has always been about data. But in the last few years, the AI community has learned that good data is more important than big data. Even small amounts of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times the amount of uncurated data, or even a more advanced algorithm.
That changes the way companies should approach developing their AI models, says Datagen’s CEO and cofounder, Ofir Chakon. Today, teams start by acquiring as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should do the opposite: use the same algorithm while improving the composition of their data.
But collecting real-world data to perform this kind of iterative experimentation is too costly and time intensive. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age bracket, BMI range, and ethnicity, as well as a set list of actions for them to perform, like walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data sometimes goes through another round of checks: fake faces are compared against real faces, for example, to see whether they look realistic.
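The expansion of a handful of scans into “hundreds of thousands of combinations” is, at its core, a combinatorial sweep over attribute axes. A toy sketch with hypothetical attribute lists (the real pipeline varies far more dimensions, such as lighting angles, camera positions, and backgrounds):

```python
from itertools import product

# Hypothetical attribute axes, for illustration only.
ages = ["18-25", "26-40", "41-60", "60+"]
lighting = ["daylight", "indoor", "low-light"]
poses = ["frontal", "profile", "three-quarter"]
actions = ["walking", "sitting", "drinking"]

# Every combination of attributes becomes one synthetic variant
# derived from a single scanned subject.
variants = [
    {"age": a, "lighting": l, "pose": p, "action": act}
    for a, l, p, act in product(ages, lighting, poses, actions)
]
print(len(variants))  # -> 108 (4 * 3 * 3 * 3) combinations per subject
```

With a few dozen axes, each holding a handful of values, the product easily reaches the hundreds of thousands of variants described above.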
Datagen is now generating facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashier-free stores, and irises and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer-vision systems serving tens of millions of users.
It’s not just synthetic humans that are being mass-manufactured. Click-Ins is a startup that uses synthetic data to perform automated vehicle inspections. Using design software, it re-creates all car makes and models that its AI needs to recognize and then renders them with different colors, damages, and deformations under different lighting conditions, against different backgrounds. This lets the company update its AI when automakers put out new models, and helps it avoid data privacy violations in countries where license plates are considered private information and thus cannot be present in photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake client data that let companies share their customer database with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data that the company doesn’t yet have, including a more diverse client population or scenarios like fraudulent activity.
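The idea of fake records that “share the same statistical properties” as the real ones can be illustrated with a deliberately simple sketch: fit per-column statistics on real numeric data, then sample fresh rows from those fits. (Products in this space use learned generative models that also preserve correlations and categorical structure; this toy does not, and the column names are invented.)

```python
import random
import statistics

random.seed(0)

# Toy "real" numeric client data: (age, account balance) pairs.
real = [(random.gauss(40, 10), random.gauss(12_000, 3_000)) for _ in range(5_000)]

# Fit simple per-column statistics from the real data...
cols = list(zip(*real))
fits = [(statistics.mean(c), statistics.stdev(c)) for c in cols]

# ...and sample fresh rows that share those statistics
# but correspond to no real client.
synthetic = [tuple(random.gauss(mu, sd) for mu, sd in fits) for _ in range(5_000)]

syn_cols = list(zip(*synthetic))
for (mu, sd), c in zip(fits, syn_cols):
    print(round(statistics.mean(c) - mu, 1))  # close to 0 for each column
```

The same machinery, pointed at a shifted distribution, is what lets these tools simulate data a company doesn’t yet have, such as a more diverse client population.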
Proponents of synthetic data say that it can help evaluate AI as well. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her coauthors demonstrated how data-generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s more youthful population but wanted to understand how its AI performs on an aging population with higher prevalence of diabetes. She’s now starting her own company, Bayesian Health, which will use this technique to help test medical AI systems.
The limits of faking it
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.
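One crude way to probe the regurgitation risk Roth describes is to measure how close each synthetic record sits to its nearest training record; a near-zero distance is a red flag that the generator has memorized rather than generalized. A minimal sketch with made-up two-column records:

```python
def min_distance(record, training_set):
    """Euclidean distance from a synthetic record to its nearest training record."""
    return min(
        sum((a - b) ** 2 for a, b in zip(record, t)) ** 0.5
        for t in training_set
    )

def memorized(synthetic, training_set, epsilon=1e-6):
    """Return synthetic records that sit (near-)exactly on a training record."""
    return [s for s in synthetic if min_distance(s, training_set) < epsilon]

training = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
generated = [(1.0, 2.0), (2.1, 3.4), (4.9, 6.2)]  # first row copies a training row

print(memorized(generated, training))  # -> [(1.0, 2.0)]
```

Real privacy audits go further (membership-inference and reconstruction attacks, differential-privacy accounting), but even this naive check catches verbatim copies.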
This might be fine for a firm like Datagen, whose synthetic data isn’t meant to conceal the identity of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.
Clinical trials are better, faster, cheaper with big data
“One of the most difficult parts of my job is enrolling patients into studies,” says Nicholas Borys, chief medical officer for Lawrenceville, N.J., biotechnology company Celsion, which develops next-generation chemotherapy and immunotherapy agents for liver and ovarian cancers and certain types of brain tumors. Borys estimates that fewer than 10% of cancer patients are enrolled in clinical trials. “If we could get that up to 20% or 30%, we probably could have had several cancers conquered by now.”
Clinical trials test new drugs, devices, and procedures to determine whether they’re safe and effective before they’re approved for general use. But the path from study design to approval is long, winding, and expensive. Today, researchers are using artificial intelligence and advanced data analytics to speed up the process, reduce costs, and get effective treatments more swiftly to those who need them. And they’re tapping into an underused but rapidly growing resource: data on patients from past trials.
Building external controls
Clinical trials usually involve at least two groups, or “arms”: a test or experimental arm that receives the treatment under investigation, and a control arm that doesn’t. A control arm may receive no treatment at all, a placebo, or the current standard of care for the disease being treated, depending on what type of treatment is being studied and what it’s being compared with under the study protocol. It’s easy to see the recruitment problem for investigators studying therapies for cancer and other deadly diseases: patients with a life-threatening condition need help now. While they might be willing to take a risk on a new treatment, “the last thing they want is to be randomized to a control arm,” Borys says. Combine that reluctance with the need to recruit patients who have relatively rare diseases—for example, a form of breast cancer characterized by a specific genetic marker—and the time to recruit enough people can stretch out for months, or even years. Nine out of 10 clinical trials worldwide—not just for cancer but for all types of conditions—can’t recruit enough people within their target timeframes. Some trials fail altogether for lack of enough participants.
What if researchers didn’t need to recruit a control group at all and could offer the experimental treatment to everyone who agreed to be in the study? Celsion is exploring such an approach with New York-headquartered Medidata, which provides management software and electronic data capture for more than half of the world’s clinical trials, serving most major pharmaceutical and medical device companies, as well as academic medical centers. Acquired by French software company Dassault Systèmes in 2019, Medidata has compiled an enormous “big data” resource: detailed information from more than 23,000 trials and nearly 7 million patients going back about 10 years.
The idea is to reuse data from patients in past trials to create “external control arms.” These groups serve the same function as traditional control arms, but they can be used in settings where a control group is difficult to recruit: for extremely rare diseases, for example, or conditions such as cancer, which are imminently life-threatening. They can also be used for “single-arm” trials, in which a control group is impractical: for example, to measure the effectiveness of an implanted device or a surgical procedure. Perhaps their most valuable immediate use is for doing rapid preliminary trials, to evaluate whether a treatment is worth pursuing to the point of a full clinical trial.
Medidata uses artificial intelligence to plumb its database and find patients who served as controls in past trials of treatments for a certain condition to create its proprietary version of external control arms. “We can carefully select these historical patients and match the current-day experimental arm with the historical trial data,” says Arnaub Chatterjee, senior vice president for products, Acorn AI at Medidata. (Acorn AI is Medidata’s data and analytics division.) The trials and the patients are matched for the objectives of the study—the so-called endpoints, such as reduced mortality or how long patients remain cancer-free—and for other aspects of the study designs, such as the type of data collected at the beginning of the study and along the way.
When creating an external control arm, “We do everything we can to mimic an ideal randomized controlled trial,” says Ruthie Davi, vice president of data science, Acorn AI at Medidata. The first step is to search the database for possible control arm candidates using the key eligibility criteria from the investigational trial: for example, the type of cancer, the key features of the disease and how advanced it is, and whether it’s the patient’s first time being treated. It’s essentially the same process used to select control patients in a standard clinical trial—except data recorded at the beginning of the past trial, rather than the current one, is used to determine eligibility, Davi says. “We are finding historical patients who would qualify for the trial if they existed today.”
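The candidate search Davi describes boils down to filtering historical records against the investigational trial’s baseline eligibility criteria. A highly simplified sketch, with hypothetical fields and criteria (the real matching also balances endpoints, study design, and many more baseline variables):

```python
# Hypothetical historical control records, for illustration only.
historical_controls = [
    {"id": "p1", "cancer": "ovarian", "stage": 3, "prior_treatment": False},
    {"id": "p2", "cancer": "ovarian", "stage": 1, "prior_treatment": False},
    {"id": "p3", "cancer": "liver",   "stage": 3, "prior_treatment": False},
    {"id": "p4", "cancer": "ovarian", "stage": 3, "prior_treatment": True},
]

def eligible(patient, criteria):
    """A patient qualifies when every baseline field meets the trial's criteria."""
    return (
        patient["cancer"] == criteria["cancer"]
        and patient["stage"] >= criteria["min_stage"]
        and patient["prior_treatment"] == criteria["prior_treatment"]
    )

# Key eligibility criteria of the (hypothetical) investigational trial.
trial_criteria = {"cancer": "ovarian", "min_stage": 2, "prior_treatment": False}

external_arm = [p["id"] for p in historical_controls if eligible(p, trial_criteria)]
print(external_arm)  # -> ['p1']
```

Crucially, eligibility is judged on data recorded at the start of the past trial, mimicking how controls would be screened if those patients walked in today.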
This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.