Caroline Chen covers healthcare for ProPublica, and while she doesn’t consider herself a “data journalist,” her reporting frequently draws from and analyzes large datasets. Since early March, she has been publishing columns about how reporters can responsibly use data in their writing during the coronavirus pandemic.
The key for journalists, Chen says, is to recognize that data collection is a way to understand what’s happening to people. This interview has been edited for clarity and brevity.
How did the ProPublica column start?
I was about to go to NICAR, the data journalism conference, to do a panel on covering the coronavirus. One of my editors said, “You should write that up.” The headline was, “I lived through SARS and reported on Ebola. These are the questions we should be asking about Coronavirus.”
I thought it was going to be limited to a very wonky audience. I did math in the post! I intended it as a guide for reporters. Then we got a ton of traffic. So many people wrote to me. That made me realize that there is a hunger for this. Smart readers want a clear explanation of all the numbers that are being thrown around in the news. First-person allows me to acknowledge parts that are confusing or contradictory.
Right now, the public wants to know, “Hey, if I get infected, how likely is it that I’m going to die?” That is a super reasonable question. My job as a reporter is to explain, “It’s very hard for me to give you the answer.” And that is what I’m trying to convey as I write these columns—not only what available data there is, but also the process through which scientists or researchers or doctors have gotten to that number and what more work needs to be done to come to a better answer in the future.
How do you walk the line between communicating what you know and pointing out the unknowable?
I’m learning along with everybody else. To give you another very specific example: the question of how many people are infected. You have options, as a reporter. You could just record a number. You could say, “There are X people infected in Y location.” You could also say, “There are fifty people infected, but that’s likely an undercount right now, because there are not enough diagnostic tests available, so we’re only testing the sickest people.” You’re giving some context around that number; that’s already more helpful. What would be even better is to be able to say, “And we’ve got another type of test coming along: antibody tests. We’re going to be able to start doing randomized testing in our population to see who has had past infections, which will allow us to estimate what percentage of the population in this city or in this state has been infected.”
And then we really have to break that down into the difference between the diagnostic test and the antibody tests. Readers are smart enough to be able to understand and they’re actually hungry for that.
If you don’t give them that context, one day, they’re going to see a very small number of cases. Then, a month later, when you’ve done the antibody survey, they’re going to see a very jarring headline that says, “There are way more infections than we thought!” That’s when people can come out and say, “We’ve been lied to,” when that’s not actually the case. We’re just able to measure something different now.
What are some of the most common pitfalls in data interpretation, and how do you avoid them?
I think the first step is to get to know the data inside out. If it comes in a spreadsheet, I always try to make sure I know exactly what every row stands for, and what every column means, and I want to make sure I know what is in the universe of the dataset, and what’s been left out, before I even get started messing with any analysis. I can’t say anything smart about a dataset if I don’t really understand it properly in the first place.
How do you train your brain to get beyond the obvious question to the question that is really going to give you a clear picture of what the numbers represent?
My first question is, What does the average person really want to know? And what are they going to hear?
The public can feel when there are contradictions. And they might not be able to put their finger on what it is, but they can sense it. There was this back and forth on masks. People could feel that contradiction—why are you telling us that masks don’t work at all? But they work for health care workers?
When we don’t acknowledge that things are confusing or contradictory, or maybe hard, we’re doing a disservice to readers. When I start to sense that there is a contradiction in messaging, that’s when I think, I need to write about this. And maybe I can’t give answers, but I want to at least call it out.
Can numbers tell a story? How should journalists who are accustomed to narrative forms report on data?
I think numbers can do a lot to strengthen a story and give weight to a narrative. That said, I think numbers alone aren’t enough. The reason I’m pursuing a story in the first place is because of the humans that the data represents.
TOP IMAGE: Photo via subject