Technorati Blog


D'Technology Blog posted today


So you’ve installed WordPress 2.5, now you want to show Technorati links on the dashboard. Here’s the code...



If you're stuck on an old release because you didn't want to lose those inbound links in the administrative console, you're now free to move up to 2.5. Because of the widespread hacking of legacy WordPress installations, we strongly urge you to upgrade ASAP.

We're seeing thousands of blogs per day that we're not indexing because they're bearing symptoms of being compromised (see the previous post on the matter). If you're not using versions 2.3.3 or 2.5, you must upgrade to protect yourself (perhaps 2.0.11 and 2.1.3 each fixes this issue too, I'm looking for confirmation on that).

This is a follow-up to our post regarding a problem affecting thousands of WordPress blogs, Patch or Upgrade Your Wordpress Installation, Now. WordPress has since released version 2.5. However, we've noticed that a large number of blogs remain vulnerable to the security issue addressed by the 2.3.3 release.

Blogs that have been compromised by this security vulnerability are typified by having links to spam destinations inserted onto the blog page. These link insertions may be invisible to casual observations; the links are often obscured by style attributes that render them invisible. These links are still seen by crawlers such as Technorati's, Google's and Yahoo's. You can find these links by viewing the source of the blog pages or, when using Firefox, looking under "Tools" -> "Page Info" -> "Links". Blogs hosted on wordpress.com are not affected by this issue; only blogs hosted on their own installations of WordPress from wordpress.org require concern.
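For anyone who would rather script the check than read page source by hand, here is a minimal sketch of the idea. It flags links that sit inside an element whose inline style hides it. The list of "hiding" style patterns and the names HiddenLinkFinder and find_hidden_links are illustrative assumptions, not a description of what Technorati's crawler does, and the sketch will miss links hidden through external stylesheets or scripts.

```python
from html.parser import HTMLParser
import re

# Inline-style patterns that commonly hide an element from human readers.
# This list is an illustrative assumption, not an exhaustive catalogue.
HIDING_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0|"
    r"(?:left|top|text-indent)\s*:\s*-\d{3,}",
    re.IGNORECASE,
)


class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags that are hidden, or nested in a hidden element."""

    # Void elements never get a closing tag, so they are not pushed on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link", "area", "base", "col",
            "embed", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []          # one (tag, hidden?) entry per open element
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        parent_hidden = bool(self.stack) and self.stack[-1][1]
        hidden = parent_hidden or bool(HIDING_STYLE.search(style))
        if tag == "a" and hidden and attrs.get("href"):
            self.hidden_links.append(attrs["href"])
        if tag not in self.VOID:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_hidden_links(html):
    """Return the hrefs of all links hidden by an inline style."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    return finder.hidden_links
```

Feed it the saved source of a blog page; any URL it returns that you did not put there yourself is worth a closer look.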

Because of this ongoing problem, we're discontinuing processing crawls of blogs that exhibit common symptoms of being compromised. We strongly recommend upgrading your WordPress installation even if you don't believe you've been compromised: by the time you become aware of a compromise, a number of negative consequences may already have occurred (for instance, being flagged as spam by Technorati, Google or Yahoo!). Many WordPress users have reported exactly this.

If you have questions about installing WordPress or maintaining a WordPress installation, please refer to the WordPress Documentation or the WordPress Forums. If you feel that your blog is not vulnerable to this hack but your WordPress blog is not being updated, please contact Technorati support staff.

Technorati has seen a number of blogs exploited by a recently announced WordPress vulnerability. The fix for it is simple: upgrade your installation or patch it. If you're running a WordPress installation, please read about the WordPress 2.3.3 release to review your options.

Sorry about the goofy title, I'm in grave need of levity now due to some indexing troubles we had this past week and the ensuing recovery effort. We're currently in the midst of repairing most of the affected data but I wanted to share what's going on with it.

Technorati's spiders were shut down for several hours on Thursday and at various intervals since then while we investigated a number of anomalies that were appearing in our data; essentially, a small percentage of recently created blogs were having their data scrambled. An example of this appears in this blog post. The spidering outages allowed us time to investigate, diagnose and make corrections that prevented further data corruption. We started running some corrective measures on Friday but found over the weekend that they were only partially effective. Technorati handles a large volume of data every day; isolating and devising remedies for these kinds of issues that affect a small percentage of the data flow is tricky. However, we think we're recovering now and the backlog of data processing is getting worked through.

Just to peek into the works a little bit, many distributed data systems rely on centrally dispensing identifiers for data elements and Technorati has such a beast. What was found were cases of blogs new to our system (from within the last 3 weeks) losing their identifiers and those identifiers getting re-associated to other new blogs. No blogs that existed in our system before Dec. 18th (the vast majority) were impacted at all. The outward manifestations visible were posts for blogs with a shared ID mingled (a mashup the authors naturally were unhappy with) and mis-associated blog claims ("And you may tell yourself, this is not my beautiful blog").
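To make the failure mode concrete, here is a small sketch of the kind of consistency check that would expose it: given a list of (blog URL, identifier) assignments, report every identifier that ended up attached to more than one blog. The function name and the data shape are hypothetical, chosen only for illustration; this is not Technorati's actual tooling.

```python
from collections import defaultdict


def find_shared_ids(assignments):
    """Return {identifier: sorted blog URLs} for identifiers held by more than one blog.

    `assignments` is an iterable of (blog_url, identifier) pairs; that shape is
    an assumption made for this sketch.
    """
    blogs_by_id = defaultdict(set)
    for blog_url, identifier in assignments:
        blogs_by_id[identifier].add(blog_url)
    # A healthy dispenser hands each identifier to exactly one blog.
    return {i: sorted(urls) for i, urls in blogs_by_id.items() if len(urls) > 1}
```

An empty result means every identifier maps to a single blog; anything else is a candidate for the mingled posts and mis-associated claims described above.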

This was an unprecedented case for us; while it had been occurring in about 8% of those blogs (created on or after December 18) for about 2 days (beginning on Tuesday, January 8th) we had until that time never encountered this phenomenon. An intensive investigation was launched, reconstructing operational timelines and correlating facts. What we found was that this stemmed from a failure incident with the primary system for identifier dispensing, another failure in the secondary system that took its place and then a corrupted data set mistakenly taking over that one, ouch! The first two blows appeared to be handled routinely but the third time was cursed; propagation of corrupted data was not detected for about 48 hours between Tuesday when it started and Thursday when we pulled the emergency brakes on the spiders.

So we're recovering now, most of the data is being restored to its previous state and we have had a number of internal postmortem discussions about earlier fault detection and recovery. If your blog was created in our system within the prior three weeks (since December 18th) and you're seeing aberrant data associated with it or it's no longer there (try http://technorati.com/blogs/YOUR_BLOG_URL to check), please visit the support request page. A selection for 'The January 8th System Outage' will be available this month while we shake out any remaining issues that aren't covered by the remedial action under way now.

Over the holiday break we found and fixed a bug that inflated authority counts for certain blogs. The blogs affected were those on domains that also have linked-to sub-domains. The links to the sub-domains were erroneously counting toward the blog authority of the blog on the parent domain. Since Technorati Authority is a calculation of how much attention is being paid to a blog and the posts beneath it, we do not include sub-domains. Sub-domains are treated as separate entities and often are references to tools, utilities, features, and other non-blog resources.

Examples:

http://chinese.engadget.com
http://desktops.engadget.com
http://hdtv.engadget.com
http://storage.engadget.com
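The difference between the buggy and the corrected count can be sketched in a few lines. This assumes, for illustration only, that authority is a count of distinct blogs linking to a blog; the function names, the include_subdomains switch and the sample data are hypothetical and are not Technorati's implementation.

```python
from urllib.parse import urlsplit


def host(url):
    """Lower-cased hostname of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def count_authority(blog_url, links, include_subdomains=False):
    """Count distinct source hosts linking to the blog.

    `links` is an iterable of (source_url, target_url) pairs.
    include_subdomains=True mimics the bug: links to sub-domains of the
    blog's host are credited to the blog on the parent domain.
    """
    blog_host = host(blog_url)
    sources = set()
    for source, target in links:
        target_host = host(target)
        exact = target_host == blog_host
        sub = target_host.endswith("." + blog_host)
        if exact or (include_subdomains and sub):
            sources.add(host(source))
    return len(sources)


# Hypothetical link data: one blog links to the blog itself (twice),
# another links only to a tool hosted on a sub-domain.
links = [
    ("http://a.example/post1", "http://example.com/2008/01/story"),
    ("http://a.example/post2", "http://example.com/"),
    ("http://b.example/post", "http://tools.example.com/widget"),
]
fixed = count_authority("http://example.com", links)
buggy = count_authority("http://example.com", links, include_subdomains=True)
```

With the sub-domain link excluded the blog is credited with one linking source; with the bug it was credited with two.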

Well, we fixed the bug yesterday. The impact of this change is mostly limited to the Top 100 and the overwhelming majority of the blogosphere is unaffected. Thanks for bearing with us while the Top 100 experiences some turbulence.

We're always thinking about how to improve and develop new meaningful metrics for the blogosphere and we welcome your feedback on these issues.

If you ping Technorati directly via our web form, it reduces the number of moving parts required to process the ping. It also offers a crawl-time advantage to Technorati members who have claimed the URL that they are pinging for; those pings go into a higher priority queue. However, we realize that most bloggers rely on the XML-RPC ping capabilities of their blog content management systems (CMS) and, much of the time, that works just fine. This past week, though, we isolated a distinct but minor drop-off in the update flow to us via Ping-o-Matic's XML-RPC interface. The web-form pings on Ping-o-Matic appear to have been flowing to us uninterrupted but not the XML-RPC pings. The problem was resolved last night, all of the Ping-o-Matic pings are flowing in again and we'd like to thank the folks at Ping-o-Matic for addressing this issue promptly.
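For readers curious what an XML-RPC ping looks like on the wire, here is a minimal sketch using Python's standard xmlrpc.client module. It assumes the widely used weblogUpdates.ping convention (blog name, blog URL); this post doesn't name the method, so treat that as an assumption, and the endpoint passed to send_ping is whatever address your ping service documents. Both function names are hypothetical.

```python
import xmlrpc.client


def build_ping_request(blog_name, blog_url):
    """Serialize a weblogUpdates.ping call to XML without sending it."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def send_ping(endpoint, blog_name, blog_url):
    """Send the ping to `endpoint` and return whatever the server replies.

    `endpoint` is supplied by the caller; no particular service URL is
    assumed here.
    """
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


# Build (but do not send) a ping for a hypothetical blog.
payload = build_ping_request("My Blog", "http://blog.example/")
```

A blog CMS does essentially this on your behalf each time you publish, which is why a hiccup anywhere along the XML-RPC path can delay a crawl without the blogger noticing.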

The significance of this is that the Ping-O-Matic XML-RPC interface is the default ping destination used by WordPress installations, as well as some other blog CMSs. If that's the case for your blog and it was not crawled this week to pick up a posting you've made, please ping us directly ("When in doubt, ping the direct route!").

Here's a tip: when you see the link to ping your claimed blogs on the Technorati home page or on the ping page itself, drag that link to your browser's bookmarks and put it on the browser toolbar. Then, whenever you post to your blog you can conveniently hit that bookmark. That ping will come in to us directly and, as long as you're logged into Technorati, be given high priority in our crawl queue.

Do you hate having to remember passwords, logging into lots of different services, each with a different password? User-centric identity is a fancy way to describe putting you in control of the logins and passwords required to authenticate your identity on different services; this is an idea that Technorati is fully behind. Technorati launched OpenID support in October 2007 for blog claiming and followed up with identity provider support two months later. This enables you to comment on LiveJournal blogs and log in to other services supporting OpenID as clients simply by being logged in to Technorati.

Since that time AOL, WordPress, Vox and other great services have released their support for OpenID as well. Yesterday, I was very excited to see Eric Case post that Blogger is now supporting OpenID commenting! The more we can reduce the password-overload and identity fragmentation with all of these services we use, the better. I thought it might be helpful to show you how you can use your Technorati profile to authenticate for commenting on Blogger blogs.

First, you must be logged into Technorati. When you're reading a Blogger blog and want to submit a comment, look where the form says "Sign-in using" and select "Any OpenID".



The URL box that opens up just needs the URL for your Technorati profile, which has the form http://technorati.com/people/technorati/USERNAME. Put that URL in the "URL" box and hit the "Publish Your Comment" button.


The final step is to tell Technorati to permit Blogger to know that you're logged in to Technorati. You can grant that permission for the future so that Blogger can always get that confirmation from Technorati or make it a one-time access. Click "Set Permission" and you're done!


There are a lot of great resources and services available for learning about OpenID. I'm expecting 2008 to finally usher in the time when this stuff becomes more manageable. If there are other identity services that you would value as a Technorati user, please let us know!

Those who have been following Technorati over the years may remember the basic proposition emblazoned on the web site, What's Happening on the Web Right Now. One of the exciting things about working at Technorati (particularly if you're a data geek like me) is that the web is changing in real time. Searching it in real time and discovering the significant happenings realizes the promise of the web to catalyze and connect us. While blog search continues to be a core focus of what we offer, today we're releasing discovery features that have been tooled up to bring you the Technorati Percolator.

Search & Discover

Our search application surfaces keyword, tag and link query results as they unfold. Our discovery features will tell you what's going on, not requiring that you know what to ask. In the 1.5M blog posts passing through our turnstiles every day, some of them have to be good and we want to help you find them.

The Technorati Percolator combs the sea of posts and other media flowing through our systems to find the ones that are emerging as significant at any given time. Finding the needles in a fast-moving haystack and organizing them into topical groupings isn't easy. Items in the Percolator are sampled from our update stream, primarily ranked by the age of the item, the authority of its source, the authority of the referring blogs and the density of recent links to it. We found that by taking all of these factors into account, an effective algorithmic filter and magnifier emerges. A lot of great applications have already appeared on the landscape that try to solve this kind of problem. From what we can tell, those applications started with a small corpus of blogs and grew their coverage from there. Technorati has come at the problem from the perspective of starting with broad coverage, sampling it and winnowing it down to the good conversations. Of course, if you want to explore the social connectivity, Technorati's search systems are there to help.
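As a rough illustration of how the four ranking factors named above could be combined, here is a toy scoring sketch. The formula, the half-life and window parameters, and every name in it are assumptions invented for this example; the post does not disclose the Percolator's actual weighting.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    url: str
    age_hours: float            # time since the item was published
    source_authority: int       # authority of the blog or site that published it
    referrer_authorities: list  # authority of each blog linking to the item
    recent_links: int           # links received within the recent window


def percolator_score(item, half_life_hours=12.0, window_hours=6.0):
    """Toy score: fresher, better-sourced, more densely linked items rank higher."""
    freshness = 0.5 ** (item.age_hours / half_life_hours)   # decays with age
    source = math.log1p(item.source_authority)              # damp huge authorities
    referrers = sum(math.log1p(a) for a in item.referrer_authorities)
    density = item.recent_links / window_hours              # links per hour
    return freshness * (source + referrers) * (1.0 + density)


def rank(items):
    """Return items ordered from highest to lowest score."""
    return sorted(items, key=percolator_score, reverse=True)
```

Because the authority terms are logarithmic and freshness decays exponentially, a two-hour-old post from a modest blog that is collecting links right now can outrank a two-day-old post from a very high-authority source, which is the behavior the paragraph above is after.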

Our primary goal with the Percolator is to highlight the significant things grabbing the blogosphere's attention regardless of the blogger's "A-list" or "Z-list" status. Our broader coverage should enable us to better serve the broader blogosphere. Yes, we have stories and sources from the mainstream press as well as the "A list" bloggers we're all familiar with but we're also striving to provide more comprehensive coverage by going deeper into our data set than "page one". By exploiting our broader coverage, we're seeking to move meme-emergence applications further along the long tail.

While we're very proud of this release, the Percolator is a work in progress, so please keep your feedback coming. We're going to continue iterating on our technologies to better serve the blogosphere and help those navigating it search and discover what's happening.

In a few short years, weblogs have come to represent the fundamental connected conversation of our public lives. The numbers involved have become very large: at Technorati we index over 65,000 blog posts an hour along with 2,800 fresh links a minute. Worldwide, we index over 100 million blogs. We built Technorati on blog search, helping bloggers, readers, journalists and brands understand the online conversation on a topic: who's talking, who's influential, and what's being said about me. And in a very rudimentary way, we've always shown what’s hot right now based on top searches and top tags.

Finding the good stuff in a world of 100 million sources

But the incredible growth in user generated content means we need to go beyond the "search" paradigm and find a better way to highlight and present the best of what's happening now, and moreover what's gaining attention in topical areas of interest to our audience. It’s a rich problem to solve. We're very good at showing all the latest posts on a topic, but the latest is different than posts that are actually rising in attention. It takes a few hours to a couple of days for an interesting post way down the long tail to build a string of inbound links and start rising in attention. By which time it's probably lost on Technorati search result page seven.

I suppose we could build a system that looks at all the posts on a topic from a tightly prescribed white list (there are many services that do this) but that loses much of the emergent serendipity of the blogosphere and excludes everyone except for a prescribed static elite. That’s neither fair nor interesting nor scalable. We're interested in what the whole blogosphere has to say on a topic. But to find the interesting relevant stuff we want to give a little more weight to bloggers that are revealed to be authoritative in a subject (not just because they say so in a tag, but because we observe many other topical bloggers linking to them in a democratic vote of editorial goodness).

Social, meet mainstream media

But bloggers don’t just link to other blog posts; one of the most powerful things bloggers do is collectively vote on the relevance of mainstream media articles at any moment based on linking behavior. That too is a key measure of the blogosphere's attention. It's also a key measure of a news article's relevance, because blog links are essentially editorial selection. And since blogs and mainstream media are, via links, reciprocal participants in a global conversation, we knew we had to put both on the same page to do each justice.

So, we incorporated all this thinking (and more) into designing the new Technorati.com.

My first two months at Technorati are in the books, and in that time, a number of you have asked me “what is Technorati up to?” or “now that you’re on board, what are your plans?” Today, I’m happy to introduce a new Technorati.com that answers both those questions. And it really is new: it leverages the best of Technorati’s definitive resources to offer an evolved, discovery-driven experience that is more powerful, more relevant, and more instant than ever before.

We have harnessed our best-on-the-Web blog index and search to power a rich, topical experience that’s complemented by new and improved search. And this is just a step, one of many more to come towards leveraging our blog search heritage to provide richer, more relevant, algorithmically-driven search and discovery that connects our Web 2.0 startup roots with our blossoming business.

But this relaunch isn’t just about new pages. In parallel to public-facing events like today’s launch, we’re also working hard behind the curtains to build our infrastructure and business, with initiatives like improved search/site performance, and streamlined monetization. This release is momentous, but it’s also multi-faceted, and to do the nuances justice I’ve asked a couple of veteran team members to post some rationale and descriptions of the new features. We’ll be listening for your feedback as we progress from this step forward, and we hope you enjoy the new discovery.
