Tow Center

Who’s behind this website? A checklist.

Hana Joy


By Priyanjana Bengani (@acookiecrumbles) and Jon Keegan (@jonkeegan). IRE NICAR Conference – March 4, 2022. Slides: English | Russian

The Tow Center would like to thank Dr. Svetlana Borodina and the Harriman Institute for translating this presentation into Russian. 

 

What is this?

This checklist is a reporting tool to help journalists and researchers find out who published a website. It is meant to be used in conjunction with offline reporting techniques.

Following this checklist does not guarantee that you can unmask an owner of a website who does not want to be found, but it can help surface crucial clues and connections that can act as leads for further reporting.

🌟 Strong recommendation: while running through this checklist, create a data diary—it can be a TextEdit doc, a Google Doc, just the Notes app, whatever. It is important to be able to retrace your steps.

 

Site Content

Text
  •  ✍️ Are there any authors listed?

  • 📫 Are there any email addresses or contact information?

  •  🕑 What’s the server’s local time?

    • Look at the datetime attribute of <time> elements on WordPress sites. The UTC (GMT) offset at the end of the timestamp can reveal the time zone the site is configured for: <time class="updated" datetime="2022-03-04T10:21:40+06:00">March 4, 2022</time>
  • 🕶 Does the website have a privacy policy or terms and conditions that name an LLC or state which regional laws apply?

  •  📡 Does the website have an RSS feed?

    • Does the RSS feed give any additional information about authors or stories that isn’t visible on the site?
    • You can pull RSS article links into Google Sheets using the IMPORTFEED function.
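The server-time check above can be sketched in Python. This is a minimal illustration, assuming the page marks up its dates with <time datetime="..."> elements in ISO 8601 format; the function name utc_offsets is hypothetical. An offset narrows down the possible time zones, it does not identify a location on its own.

```python
import re
from datetime import datetime

# Matches the datetime attribute of <time> elements (assumes double-quoted values).
TIME_ATTR = re.compile(r'<time[^>]*\bdatetime="([^"]+)"', re.IGNORECASE)


def utc_offsets(html: str) -> list[float]:
    """Return the UTC offset, in hours, of every <time datetime=...> value found."""
    offsets = []
    for value in TIME_ATTR.findall(html):
        try:
            parsed = datetime.fromisoformat(value)
        except ValueError:
            continue  # skip values this Python version cannot parse as ISO 8601
        offset = parsed.utcoffset()
        if offset is None:
            continue  # timestamp carries no offset, so it tells us nothing
        offsets.append(offset.total_seconds() / 3600)
    return offsets


snippet = '<time class="updated" datetime="2022-03-04T10:21:40+06:00">March 4, 2022</time>'
print(utc_offsets(snippet))
```

If most timestamps on a site share one offset, note it in your data diary alongside the pages you sampled.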
Features and functionality
  •  🗞 Does the website have a newsletter?
    • Check for the physical postal address—required by the CAN-SPAM Act in the US
  •  💸 Does the website collect donations?
  •  🛒 Does the website have an e-commerce store? Or, does it sell products?
    • Try walking through the checkout process (without paying). Sometimes the real payee name is revealed just before you confirm the payment.
Links
  •  🔗 What domains does the website link to most? (Requires scraping)
  •  ❤️ Who links to the domain most often?
    • Google search operator: “link:yourwebsite.com” (Google has deprecated this operator, so treat the results as incomplete)
    • Check backlinks on ahrefs.com 💵
  •  Do the links have UTM codes (tracking parameters such as utm_source or utm_campaign)?
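As a rough sketch of the link checks above, the following Python tallies which domains a page links to and which UTM parameters appear in those links. It assumes you have already saved the page's HTML; the names LinkCollector and summarize_links are made up for this example, and a real scrape would run it over many pages.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def summarize_links(html):
    """Return (domain counts, UTM parameter counts) for the links in one page."""
    collector = LinkCollector()
    collector.feed(html)
    domains = Counter()
    utm_values = Counter()
    for href in collector.hrefs:
        parts = urlparse(href)
        if parts.hostname:  # relative links have no hostname and are skipped
            domains[parts.hostname] += 1
        for key, values in parse_qs(parts.query).items():
            if key.startswith("utm_"):
                for value in values:
                    utm_values[(key, value)] += 1
    return domains, utm_values


page = (
    '<a href="https://example.org/a?utm_source=newsletter&amp;utm_medium=email">A</a>'
    '<a href="https://example.org/b">B</a>'
    '<a href="/about">About</a>'
)
domains, utm_values = summarize_links(page)
print(domains.most_common(5))
print(utm_values.most_common(5))
```

Recurring utm_source or utm_campaign values can hint at which campaigns or partner sites drive traffic between domains.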
Photos, images, and documents
  •  📸 Are there author photos?
    • Use reverse image search to see if the same images appear elsewhere
    • Check sensity.ai to see if the image is GAN-generated
    • Read more about spotting GAN-generated images here.
  •  🔎 Do the images have EXIF data?
    • Instructions here.
  •  👀 Do the images have any other identifying information?
    • Run through the list here
  •  🪣 Where are the images hosted?
    • If on AWS S3, the bucket name can be revealing—or you might find the bucket isn’t secure.
  •  📄 Are there PDFs hosted on the site?
    • On a search engine, try “filetype:pdf site:<yourwebsite.com>”
    • If you find some, check the metadata with “Get Info” in your PDF viewer.
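For the image-hosting check above, a bucket name can often be read straight out of an image URL. The sketch below is an approximation that assumes the two common AWS S3 URL layouts (bucket in the hostname, or bucket as the first path segment); the function name s3_bucket_name and the example bucket are hypothetical, and sites that put a CDN or custom domain in front of S3 will not match.

```python
import re
from urllib.parse import urlparse

# Virtual-hosted style: <bucket>.s3.amazonaws.com or <bucket>.s3.<region>.amazonaws.com
VIRTUAL_HOSTED = re.compile(r"^(?P<bucket>.+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")
# Path style: s3.amazonaws.com/<bucket>/... or s3.<region>.amazonaws.com/<bucket>/...
PATH_STYLE = re.compile(r"^s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")


def s3_bucket_name(url):
    """Return the S3 bucket name embedded in a URL, or None if it is not an S3 URL."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    match = VIRTUAL_HOSTED.match(host)
    if match:
        return match.group("bucket")
    if PATH_STYLE.match(host):
        first_segment = parts.path.lstrip("/").split("/", 1)[0]
        return first_segment or None
    return None


print(s3_bucket_name("https://newsroom-assets.s3.amazonaws.com/img/author.jpg"))
print(s3_bucket_name("https://s3.us-west-2.amazonaws.com/newsroom-assets/img/author.jpg"))
print(s3_bucket_name("https://example.com/img/author.jpg"))
```

Record any bucket names you find; the same bucket serving images for several sites is a lead worth following up.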

 

Social Media

If there are any social media profiles mentioned on the site, they are worth investigating.

  •  👤 Are there any social media accounts in the <meta> tags of the page’s HTML?
  •  📅 When were the individual accounts created? Do the dates line up with the site history?
  •  📊 What platform has the biggest reach?
  •  📣 Is the messaging different across platforms?
  • 📇 Do they have completely distinct account names across social media platforms or are they more or less the same?
    • Note: just because you find the same account name across platforms doesn’t necessarily mean they belong to the same person!
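To illustrate the <meta> check above, here is a small Python sketch that pulls social account references out of a page's meta tags. The set of keys is an assumption based on commonly used Twitter Card and Open Graph tags and is not exhaustive; SocialMetaParser and social_accounts are names invented for this example.

```python
from html.parser import HTMLParser

# Meta keys that commonly carry social accounts (an assumed, non-exhaustive list).
SOCIAL_KEYS = {
    "twitter:site",
    "twitter:creator",
    "article:publisher",
    "article:author",
    "fb:app_id",
    "fb:pages",
}


class SocialMetaParser(HTMLParser):
    """Collects the content of <meta> tags whose name/property is in SOCIAL_KEYS."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("name") or attr.get("property") or "").lower()
        content = attr.get("content")
        if key in SOCIAL_KEYS and content:
            self.found.setdefault(key, []).append(content)


def social_accounts(html):
    """Return a dict mapping each social meta key to the values found in the page."""
    parser = SocialMetaParser()
    parser.feed(html)
    return parser.found


page = (
    '<head><meta name="twitter:site" content="@example_news">'
    '<meta property="article:publisher" content="https://www.facebook.com/examplenews" />'
    '<meta name="description" content="News"></head>'
)
print(social_accounts(page))
```

Accounts found this way may differ from the ones the site links to visibly, which is itself worth noting.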
Facebook

On the Facebook profile, go to Page Transparency:

  •  ☎️ Is there an address and phone number for the page?
  •  ⏪ Does the page history reveal a different name?
    • Has the page shifted topics?
  •  🐣 When was the Facebook page created?
  •  Is the page running any groups?
  •  🗳 Has the page run any ads? Has the page run political ads?
  • 🤖 Does Facebook flag any “related pages” for the given page? Rely on Facebook’s algorithms to find connections!
Twitter

On Twitter, the account might be part of a pod or network that boosts it. Using en.whotwi.com, it’s worth checking:

  • 👯‍♀️ Who is the account engaging with?
  •  🐦 What are the account’s tweeting patterns?
  •  #️⃣ What hashtags are associated with the account?
  • Who were the account’s first follows / followers?
Other platforms

Don’t forget to check whether the site has accounts on YouTube, Instagram, Reddit, GitHub…

 

Infrastructure

  •  🗄 Have you archived the website? (You always should!)

    • You can do this on archive.org or use their browser extension.
    • You can grab the whole website in Terminal with wget: wget -mpEk <yourwebsite.com>
  •  🖥 What is the website using?

    • Is it using WordPress, Squarespace, something else?
  •  ☁️ Where is it hosted?

    • Is it on Google Cloud, AWS, Cloudflare, something else?
  •  🪳 Are there any trackers present?

  • 🛍 How is the site monetized?

    • Are there any affiliate links (Amazon, etc.)?
  •  🧬 What are the various tracking identifiers, and are those shared with other domains?

    • Check Google Analytics, Facebook Pixel, Quantcast, NewRelic, etc.
    • Use tools like builtwith, RiskIQ, or Dnslytics to see if other domains share the same ID.
  •  Are there any relevant subdomains?

  •  📜 Are there historic WHOIS records?

  •  ⌛️ Has the site changed over time?

    • Look at archive.org to see whether the site’s content or focus changed significantly, and if so, when.
  •  🗑 Did the earlier version of the site have more information?

    • People can remove info when a site’s been up for a while.
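The tracking-identifier check above can be roughed out in Python before turning to the lookup tools. This sketch only covers Google Analytics and Facebook Pixel, and the regular expressions are approximations of how those IDs usually appear in page source, not official formats; tracking_ids and shared_ids are hypothetical helper names.

```python
import re

# Approximate patterns for IDs as they commonly appear in page source (assumptions).
PATTERNS = {
    "google_analytics_ua": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "google_analytics_ga4": re.compile(r"\bG-[A-Z0-9]{8,12}\b"),
    "facebook_pixel": re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"](\d+)['\"]"),
}


def tracking_ids(html):
    """Return a dict of tracker label -> sorted list of unique IDs found in the HTML."""
    found = {}
    for label, pattern in PATTERNS.items():
        ids = sorted(set(pattern.findall(html)))
        if ids:
            found[label] = ids
    return found


def shared_ids(html_a, html_b):
    """Return the IDs that appear in both pages, grouped by tracker label."""
    ids_a, ids_b = tracking_ids(html_a), tracking_ids(html_b)
    shared = {}
    for label in ids_a.keys() & ids_b.keys():
        common = sorted(set(ids_a[label]) & set(ids_b[label]))
        if common:
            shared[label] = common
    return shared


site_a = "<script>ga('create', 'UA-12345678-1', 'auto'); fbq('init', '1234567890123456');</script>"
site_b = "<script>ga('create', 'UA-12345678-1', 'auto');</script>"
print(tracking_ids(site_a))
print(shared_ids(site_a, site_b))
```

A shared ID is a lead, not proof of common ownership: templates and third-party developers sometimes reuse identifiers across unrelated sites.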

 

Resources & Tools

Books

Open Source Intelligence Techniques – Michael Bazzell https://inteltechniques.com/book1.html

Verification Handbook – edited by Craig Silverman https://datajournalism.com/read/handbook/verification-3

Website Infrastructure
  • Blacklight: The Markup’s real-time website privacy inspector.
  • builtwith.com: gives you the infrastructure of the site, including IP addresses, analytics codes, tech stack, etc. Freemium model.
  • DNSDBScout: offers standard and “flexible” searches over passive DNS records, including IP <-> domain mapping.
  • Dnslytics: offers a range of tools including reverse Analytics and reverse DNS lookups, as well as WHOIS data. Freemium.
  • RiskIQ: a “threat intelligence” tool that allows you to get reverse IP, reverse analytics, WHOIS, SSL, subdomains, etc.
  • Whoxy: a tool that lets you see historical WHOIS registrations. Free.
  • The Internet Archive browser extension.
Social Media Accounts
  • Sensity AI: check if an image is GAN-generated or not. Freemium.
  • whotwi.com: create a profile-at-a-glance for any account on Twitter. Free.

View this checklist on GitHub.


About the Tow Center

The Tow Center for Digital Journalism at Columbia's Graduate School of Journalism, a partner of CJR, is a research center exploring the ways in which technology is changing journalism, its practice and its consumption — as we seek new ways to judge the reliability, standards, and credibility of information online.
