The Atlanta Journal-Constitution last week rolled out a remarkable, ambitious investigation into sexual assault and misconduct by doctors. The stories of women being abused by their physicians that the AJC uncovered are horrifying, and the impunity often enjoyed by perpetrators, many of whom are allowed to keep practicing with the details of their offenses kept confidential, is galling. Powerfully told and creatively presented, the project quickly won praise from journalists and others around the country.
The investigation stands out for two reasons in particular: First, it’s national in scope (it even made national network news). Second, in order to take the story national, the AJC had to deploy some programming skills that are rare to see at a regional paper.
The project got started when reporter Danny Robbins, while reviewing orders issued by Georgia’s medical board, discovered that many doctors in the state were allowed to continue practicing even after a finding that they had sexually violated patients.
After some further research, the paper suspected that Georgia wasn’t an outlier. So the AJC filed requests for discipline information with the equivalent boards or regulatory agencies in other states.
Those orders are often posted online, and the AJC sought the information as data sets. But the requests weren’t fruitful.
“We were disillusioned with that approach,” said Jeff Ernsthausen, a data reporter for the Journal-Constitution. “We told them that we wanted a copy of their websites and were told that such things do not exist. That’s not my understanding of how the internet works.”
And Ernsthausen, a former analyst for the Federal Reserve who interviewed with the AJC while attending the NICAR conference in 2013, understands the internet pretty well. So he wrote programs to scrape the public websites of those boards and agencies, retrieving the discipline information. Each state required a new program, though he was able to reuse some code over and over again.
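The pattern described here, one scraper per state with a shared core, can be illustrated with a minimal sketch. This is not the AJC's code: the function names, the `STATE_CONFIGS` table, the `board.example.org` address and the sample page are all hypothetical, and a real scraper would also need to fetch pages, follow pagination and handle each site's quirks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_document_urls(html, base_url, suffixes=(".pdf",)):
    """Shared step: return absolute URLs of links that look like documents."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # Case-insensitive suffix check; skip duplicates but keep page order.
        if absolute.lower().endswith(suffixes) and absolute not in urls:
            urls.append(absolute)
    return urls


# Hypothetical per-state settings: only these values change from one
# board's website to the next, while the function above is reused.
STATE_CONFIGS = {
    "example_state": {
        "base_url": "https://board.example.org/orders/",
        "suffixes": (".pdf",),
    },
}

# Invented listing page standing in for a board's public orders index.
SAMPLE_PAGE = """
<ul>
  <li><a href="2015/order-001.pdf">Order 001</a></li>
  <li><a href="/orders/2015/order-002.PDF">Order 002</a></li>
  <li><a href="about.html">About the board</a></li>
</ul>
"""

config = STATE_CONFIGS["example_state"]
document_urls = extract_document_urls(SAMPLE_PAGE, **config)
```

Keeping the link extraction separate from the per-state settings is one plausible way to get the kind of code reuse Ernsthausen describes, though the article does not say how his programs were organized.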
Ernsthausen used DocumentCloud to host the 100,000 documents the scrapers found, and DocumentCloud’s optical character recognition to make the text searchable. The next step is described on the AJC’s page about how the investigation was done:
To assist us in identifying those involving sexual misconduct, we then created a computer program based on machine learning to read each case and, based on key words and their relationship to each other as well as other factors, give each a probability rating that it was related to a case of physician sexual misconduct.
That process flagged about 6,000 cases—still a lot to read through, but something the AJC team could handle. The information in those records is not a comprehensive accounting of sexual misconduct by doctors, as the paper explains. But the records provided a foundation for the detailed reporting in the series.
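The AJC does not say which machine-learning model it used. As a rough illustration of the general idea (learn from hand-labeled examples, then give every unread document a probability score and review only the high scorers), here is a minimal sketch using a naive Bayes classifier, a common baseline for text. The class name, the 0.5 cutoff and the four training sentences are invented for the example; a real system would train on thousands of labeled cases and use richer features than single words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Two-class naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def probability(self, text):
        """Return the estimated probability that the label is True.

        Assumes at least one training example exists for each label.
        """
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in vocab:  # words never seen in training are ignored
                    count = self.word_counts[label][word] + 1
                    score += math.log(count / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into a probability without overflow.
        top = max(log_scores.values())
        exp_true = math.exp(log_scores[True] - top)
        exp_false = math.exp(log_scores[False] - top)
        return exp_true / (exp_true + exp_false)


# Invented training examples; True marks a case a reader judged relevant.
model = NaiveBayes()
model.train("board finds physician engaged in sexual misconduct with a patient", True)
model.train("physician made inappropriate sexual contact during an examination", True)
model.train("physician failed to maintain adequate patient records", False)
model.train("license suspended for improper prescribing of controlled substances", False)

unread = [
    "order regarding sexual contact with a patient",
    "order regarding prescribing records",
]
scores = {doc: model.probability(doc) for doc in unread}
# Hypothetical cutoff: only documents scoring above it go to a human reader.
flagged = [doc for doc, p in scores.items() if p > 0.5]
```

The cutoff is the practical lever: lowering it sends more documents to reporters and misses fewer cases, while raising it shrinks the reading pile at the cost of overlooking some.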
Derek Willis, a news application developer at ProPublica, said he was most impressed by the AJC’s use of machine learning to sift through the documents. That aspect of the project acted as a force multiplier while ensuring the kind of consistency that only a computer can apply to a massive amount of data.
“There are only a handful of people who do this that I’m aware of in newsrooms,” Willis said. “It’s super clever applying that to this kind of project. It allows news organizations to punch above their weight. The thing we’ve lost the most in the industry is staff. In certain situations, this is a replacement for resources.”
From Ernsthausen’s perspective, no single step along the way was particularly extraordinary. But stringing it all together—scraping the records, writing the program to sift through them, designing a database to make sense of the findings—was a big task. “This is the first thing of this scale that I’ve been involved with,” he said.
Of course, even with a technical assist, pulling off the investigation still took a lot of traditional resources. The AJC lists 44 people who had a role in the project, including a core team of seven.
Kevin Riley, the editor-in-chief, acknowledged that taking it on was a risk.
“Any time a regional paper decides to take on something this big, you have to worry, ‘Wow, are we going to be overwhelmed by this, and is it going to pay off?’” he said.
But a regional story, said Riley, “is not going to bring about the kind of change that’s needed”—and the paper is straightforward about the fact that it’s trying to force changes.
New installments in the series will continue to appear through the end of 2016. “It’s going to take a drumbeat,” said Riley. He added: “I’m not sure we’ve told a more important story at the AJC than this one. I really hope the system changes.”