The Audit

Business Insider and Over-Aggregation

September 30, 2011

Henry Blodget has a long and detailed response to Marco Arment, which is fascinating to anybody interested in the nuts and bolts behind a modern for-profit blog.

If you boil Blodget’s 4,000 words down to a single idea, it’s basically this: over-aggregation.

Now the concept of over-aggregation is not well defined, and means different things to different people. To Ryan McCarthy, who used to work at the Huffington Post and is acutely attuned to such things, over-aggregation is what happens when Outlet A writes a story and then Outlet B basically rewrites or copies the story so that there’s no reason to click through to A any more. HuffPo and Business Insider have both been accused of this, as have sites like Newser.

But that’s clearly not what was happening with Marco’s posts, so let’s put that kind of over-aggregation to one side for the moment. The dispute between Marco and Business Insider relates to something different: what happens when TBI (The Business Insider) links out directly to other people’s blog posts.

Now I’m a great believer in linking out directly to other people’s blog posts: I’ve built an entire website which does nothing else. And Counterparties.com doesn’t just have external links, either: each link also comes with a dedicated permalink, like this one.

But here’s the thing: we build Counterparties.com by hand, we write every headline on the site, we add a tag to it, and so on. What you see on Counterparties is our unique content. It links to other sites, but it doesn’t copy anything from those sites. And we link out maybe 20 or 30 times a day, tops. This is not some kind of copy-and-linking robot algorithm; it’s a hand-built list of artfully curated links.


At TBI, by contrast, the areas of the site with nothing but external links work very differently. There are two such areas: one’s a column called “Read Me” which appears on the right-hand side of the page if you scroll down a bit, and the other is a dedicated section called “The Tape”. For readers navigating the site, both of them work as they should: you see the headline, you click on the link, you go straight to the other website.

But behind each of those links is a huge CMS (content management system) architecture, whereby every external link is generated from a dedicated permalink page which people navigating the website are never supposed to see.

If you go to Yahoo Site Explorer, it’ll tell you that TBI has — get this — 465,825 separate pages. Now the likes of Henry Blodget and Joe Weisenthal are undeniably prolific, but there’s no way you get to 465,825 pages manually. TBI is about four years old, if you go back to its first incarnation as Silicon Alley Insider; 465,825 stories over four years works out at well over 300 stories per day.
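That per-day figure is easy to sanity-check. A quick back-of-the-envelope calculation (taking the 465,825 figure from Yahoo Site Explorer and treating the site's age as exactly four years, which is of course approximate):

```python
# Rough check on the page-count arithmetic above.
total_pages = 465_825          # pages reported by Yahoo Site Explorer
years = 4                      # approximate age of TBI / Silicon Alley Insider
pages_per_day = total_pages / (years * 365)
print(round(pages_per_day))    # about 319 -- "well over 300 stories per day"
```

Roughly 319 pages a day, every day, for four years straight: well beyond what any human editorial staff produces by hand.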

So most of those pages, it turns out, were generated by robots without any human input at all: they look like this, or like this, and they’re just pages which copy-and-paste the headline, the author, and some of the content from third-party websites.

According to Blodget, this huge mass of robo-pages at TBI has an entirely innocent explanation. “To put something into the ReadMe box,” says Blodget, “we need to have a page with the headline and sub-head and author on our site, even if the page will never be seen by our readers.” It’s just a technical necessity! Nothing nefarious about it!

To be honest, it’s not a technical necessity. Other sites which link out a lot — Drudge, say — don’t have millions of hidden permalink pages generating every link on the home page. And Blodget protests a bit too much, I think, when he says he gets no googlejuice from these pages:

In the past, these pages have been indexed by Google, but because they include a link back to the originating site and page, they do not generate much (if any) SEO value for us. They exist only because it was easier for our developers to use the existing post-headline-author metaphor in our publishing system than to create the Tape entirely from scratch…

We always include a link to the original post on this stub page, so Google won’t conclude that we produced the original story.

I don’t think that Blodget is trying to get Google to link prominently to his stub permalink pages; nor is he trying to fool Google that those pages constitute original TBI content.

But those pages can do wonders for his googlejuice even if Google never links to them at all. The main reason is that those pages are being created every minute of the day, so Google is forced to spider TBI on a real-time basis, just to keep up with all that new content. Google likes those kinds of sites, because it considers them to have lots of very fresh content — the more frequently you update your site, the more prominently your pages tend to rank.

And of course since Google is spidering TBI on a real-time basis, it picks up TBI’s home-made stories the minute they appear. So if TBI writes a story about Fred Bloggs, and then someone searches Google for Fred Bloggs one minute later, the TBI story will come up at the top of the search results. Conversely, if I put a story about Fred Bloggs up on felixsalmon.com, which is almost never updated, it could take days to appear on Google. Having lots of robo-pages, then, helps boost the search prominence of TBI’s non-robo-pages.

Blodget does seem to have taken this criticism to heart:

We’re going to see if we can add “no follow” links to the stub pages to make sure that Google doesn’t index them. If we can’t do that, we’ll eventually redesign The Tape, so it doesn’t create stub pages at all.

I suspect that “no follow” links aren’t the best way to do this — rel=”nofollow” tells Google not to pass any link value through a link, but it doesn’t stop the stub pages themselves from being crawled and indexed. I’d suggest instead that Henry put all the stub permalinks on a separate subdomain like articles.businessinsider.com, and then use the robots.txt file to tell Google not to index anything on that subdomain.
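For what it’s worth, the robots.txt approach is about two lines of configuration. A sketch of what that file might look like (the articles.businessinsider.com subdomain is my hypothetical, not anything TBI actually runs; the directives themselves are standard robots.txt syntax):

```text
# robots.txt served at articles.businessinsider.com/robots.txt
# Tells all well-behaved crawlers to stay out of everything on this subdomain.
User-agent: *
Disallow: /
```

Because robots.txt rules apply per-hostname, moving the stubs onto their own subdomain lets you block them wholesale without touching the crawlability of the main businessinsider.com site.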

But more conceptually, the TBI over-aggregation problem will still exist, in the form of The Tape running huge numbers of other sites’ headlines on an indiscriminate basis. (I think that ReadMe, at least, is more curated, although I’m not sure about that.) Henry says that “we created the Tape because we didn’t want to bother with RSS readers anymore”, but the fact is that The Tape is a really bad RSS reader. Building a good web-based RSS reader is hard: just ask Nick Denton, who put a huge amount of effort into building Kinja before abandoning it as a consumer product.

Instead, I think that the driving impetus behind The Tape was the more-is-more approach to web publishing: it has been clearly demonstrated again and again that the more content you put up, and the more frequently you update, the more pageviews and unique visitors you end up getting. That explains not only The Tape, of course, but also the large number of one- and two-paragraph stories on TBI. It’s good for business, but it’s not necessarily good for readers who want less sensationalism and more insight.

Felix Salmon is a financial writer, editor, and podcaster. A former finance blogger for Reuters and Condé Nast Portfolio, his work can be found at publications including Slate and Wired, as well as his own Substack newsletter.