Back on crack
Apr 23, 2004
Reading the silly things Dave Winer writes is a very complicated experience. First you laugh, sometimes even out loud, despite the fact that nobody really does that. Unfortunately, you very quickly realize that it's not a joke. Too quickly. You choke on your laugh, taste of vomit ensues, and already it's become a rather disgusting experience.
I'm not saying that everything Dave writes is silly, stupid, confusing and kind of disgusting to read, just that it does happen every so often. This latest one is no exception.
The backstory is that Google has started to look for atom.xml and index.rdf on websites, even if neither of those files is linked. Google is checking to see if there are any feeds of these types that exist in spite of not being linked. You know, just in case. Traditionally, Google does not do that. Traditionally, Google only crawls pages and files that are linked from other pages. Looking for un-linked index.rdf and atom.xml is an exception.
This, as you can or cannot imagine, depending on how familiar you are with Dave Winer's sick mind, is appalling to him. In fact, "appalling" doesn't begin to describe how extremely awful this is to Dave Winer. Here, lemme quote him:
"If this is what it appears to be, it's both tying in an anti-trust sense, and PR disaster in the making. I never in a million years thought Google would stoop this low, even Microsoft on its worst day never played this dirty." (Dave Winer)
I bet you're really confused right now. What about that could possibly be worse than Microsoft's infamous predatory business practices? What about Google looking for a couple of files could be so low, so as to suggest that doing so is far worse than anything Microsoft would do, on its "worst day" even? I mean, Microsoft has done some shady shit, right? What, pray tell, is so awful, so nasty, so blow-below-the-belt, so unethical, about this?
The fact that Google isn't looking for non-linked files by the name of index.xml, or rss.xml, or index.rss, or feed.rss, or any other filename which might just be a version of RSS that Dave supports (as opposed to index.rdf which, most of the time, is RSS 1.0, which Dave thinks is a bastard child).
Doesn't it all make perfect sense now? No, it doesn't? Well, congratufuckinglations, you have a healthy mind! In all fairness, perhaps you've got to "be there", in Dave Winer's place I mean, to appreciate just how awful a transgression this is on Google's part. And, by "Dave Winer's place" I actually don't mean the Bizarro universe. Please remember that RSS, his versions, is very dear to Dave. Dave Winer has spent a lot of time and energy in stealing it, evolving it, and condemning others for doing the exact same thing.
So, in light of the fact that RSS is really dear to Dave, perhaps this makes sense:
"Developers, no matter what format they prefer, are going to be outraged that Google, which is a search engine, is trying to control and define publishing. This should be illegal, although of course I am not a lawyer." (Dave Winer)
No, you're right. It doesn't make any sense at all. Control and define publishing? What? Why would Google be interested in "controlling" the use of RSS, any of its versions, or Atom?
"That's pretty brilliant. I wish I'd have thought of it. Of course, since there is no clear business reason for Google to get people to switch between various flavors of open(ish) XML formats, and a clear disadvantage to limiting themselves to them, in terms of comprehensiveness, that would seem like a strange thing to do." (Evan Williams)
Never mind that, why should it be illegal for Google to look for specific unlinked files, unless they look for other specific unlinked files as well? Wouldn't that be, like, communism or something?
More interestingly, one might wonder what Google is planning with that syndication data they're crawling, and why they're not as excited about RSS, Dave's versions, as Dave is? All we can do is speculate about it. I guess it might have something to do with the difficulty of parsing and/or making sense of Dave's plethora of different and incompatible versions of RSS. And perhaps Google isn't looking for unlinked RSS, Dave's versions, because they're mostly not unlinked.
I dunno what Google's planning, few people do. But certainly, there is no reason to think that what they're doing right now has anything to do with "controlling and defining publishing".
Related posts:
- We've been botspotted! - Google employee (and founder of Blogger) Evan Williams explains.
- Googlebot and RSS
- What is Google cooking?
- No Fishing! - Why fishing for any unlinked files is bad, for entirely different reasons.
Comments
There's an issue here worth screaming about, but it's not what Dave is screaming about. The fact that Google appears to be "fishing" for unlinked resources is *extremely bad practice*, enough so that it makes me wonder whether the entire report is based on mistaken information.
More info: http://bitworking.org/news/No_Fishing
Comment by Mark at 16:01, 23 Apr, 2004 #
Mark: It's shady, I agree, but isn't it "okay" as long as they obey robots.txt? (Later: Okay, you're right, it's not.)
Comment by Tomas at 16:03, 23 Apr, 2004 #
What is "okay"? In my probably naive opinion, if someone (or some business) has posted any form of content on a webserver and it's not restricted by any passwords etc., then why can't it be crawled? What is the purpose of a webserver if no one knows about the content?
Comment by cyberhill at 16:05, 23 Apr, 2004 #
Cyberhill: I have absolutely no problem with Google (or anyone else) crawling my feeds. I have both RSS feeds and Atom feeds, they are not disallowed by robots.txt, and Google is as welcome to them as Feedster or anyone else.
But they should be crawling and following links to find them. I popularized the autodiscovery mechanism that provides a standard way to link to RSS and Atom feeds via LINK tags in the HEAD of a document. Many sites also use normal [A HREF] links to point to their feeds, with cute little icons or whatever. Google knows all of this, they are smart enough to auto-detect that the resource that they've found through their usual crawling algorithms is a feed, and I hope they do something interesting from there.
The key point here is "... that they've found through their usual crawling algorithms ..." Fishing for URLs is not Google's normal algorithm. It's extremely rude. There are a few legacy "fishing"-based standards (robots.txt being the big one) that people tolerate, but we shouldn't be creating new ones, and we shouldn't tolerate such behavior on a massive scale.
Anyway, Google is smart; they know all this. Which is what makes me wonder whether the entire report is based on misinformation.
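A minimal sketch of the link-based discovery described above, as opposed to fishing: read a page that was already crawled and collect the feeds it advertises through LINK tags. It assumes Python's standard html.parser module and the commonly used feed MIME types; the names FeedLinkParser, discover_feeds and SAMPLE_PAGE are made up for illustration and say nothing about how any real crawler is written. For brevity it does not check that the LINK tags sit inside HEAD.

```python
from html.parser import HTMLParser

# Assumption: these are the MIME types a page uses to advertise its feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point to feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None (e.g. a bare attribute), so normalise them.
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        is_feed = attributes.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attributes.get("href"):
            self.feeds.append(attributes["href"])


def discover_feeds(html):
    """Return the feed URLs a page links to, in document order."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<title>Example weblog</title>
<link rel="alternate" type="application/rss+xml" title="RSS" href="/rss.xml">
<link rel="alternate" type="application/atom+xml" title="Atom" href="/atom.xml">
<link rel="stylesheet" type="text/css" href="/style.css">
</head><body></body></html>"""

print(discover_feeds(SAMPLE_PAGE))  # ['/rss.xml', '/atom.xml']
```

A page without such LINK tags simply yields an empty list, so nothing is requested that the site did not point to itself.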
Comment by Mark at 17:28, 23 Apr, 2004 #
The guy, certainly, has issues. But we do more, for paying attention to him :-)
Comment by David Collantes at 21:14, 23 Apr, 2004 #
I think Dave Winer is to you what Jeffrey Veen is to me, except I don't pay as much attention to my blowhard as you do to yours.
Comment by Cheshire at 21:45, 23 Apr, 2004 #
Mark,
Google is definitely fishing for index.rdf on my site. Never had a link there myself.
Comment by Darryl at 23:13, 23 Apr, 2004 #
Winer is both paranoid and self-aggrandizing. Somehow the little XML-based format he created is, in his opinion, a work of genius and the next big thing, although a high school sophomore with some XML knowledge could've arrived at the RSS format on his own.
I don't think he realizes that few people care about his pet project. Yeah, I broadcast my blog in RSS today, but if someone asks me for atom, how long would it take me to switch? As long as helpful folks from MovableType provide proper support, less than 5 minutes. Maybe an hour if I have to do it by myself.
That's the nature of XML-based formats, they can be interchanged and associated with one another, if possible. It's like getting into a discussion on which kind of shoelaces you prefer on your running shoes. The answer for 99% of people in this world would be "Who the hell cares?"
Comment by Alex Moskalyuk at 00:22, 24 Apr, 2004 #
$ grep index.rdf < frozenskies.net-access | wc -l
13
Yup, already crawled. I don't like this method of crawling at all. It's one thing if I get a 404 from a removed file, but when my log is filled by 404s from files that never existed, I get annoyed. Same deal with favicon.ico, which I rewrite to the actual site icon for browsers who are too stupid for their own good.
I could always rewrite /index.rdf and /atom.xml to my real feeds, but I shouldn't have to just because Googlebot can't behave properly.
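For anyone who wants more detail than a grep count, here is a rough sketch that tallies only the 404s a given bot caused by asking for the two fished filenames. It assumes the access log is in the Apache "combined" format; the sample lines, the function name count_fished_404s and the other identifiers are invented for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# Assumption: Apache/NCSA "combined" log format, one request per line.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# The two unlinked filenames discussed in the post.
FISHED_PATHS = {"/index.rdf", "/atom.xml"}


def count_fished_404s(lines, agent_substring="Googlebot"):
    """Count 404 responses per fished path for requests from the given agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if (
            match.group("status") == "404"
            and match.group("path") in FISHED_PATHS
            and agent_substring in match.group("agent")
        ):
            counts[match.group("path")] += 1
    return counts


# Made-up sample lines, only to show the shape of the input.
SAMPLE_LOG = [
    '192.0.2.1 - - [24/Apr/2004:00:40:01 +0200] "GET /index.rdf HTTP/1.0" 404 209 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [24/Apr/2004:00:40:05 +0200] "GET /atom.xml HTTP/1.0" 404 208 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [24/Apr/2004:00:41:12 +0200] "GET /index.rdf HTTP/1.0" 404 209 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [24/Apr/2004:00:41:30 +0200] "GET /rss.xml HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.7 - - [24/Apr/2004:00:42:00 +0200] "GET /index.rdf HTTP/1.1" 404 209 "-" "Mozilla/4.0 (compatible)"',
]

counts = count_fished_404s(SAMPLE_LOG)
print(counts["/index.rdf"], counts["/atom.xml"])  # 2 1
```

Matching on the exact path and status avoids counting requests for real feeds or hits from ordinary browsers, which a plain substring grep would include.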
Comment by Johan Svensson at 00:42, 24 Apr, 2004 #
Well, I tried not to say anything, but I succumbed!
Comment by anu at 02:55, 24 Apr, 2004 #
Johan, you do not even have a dot on your website! What do you worry so much about? Your feeds are empty, from what I can see.
I really do not understand the hype of all this. I see nothing wrong, nothing "rude" in Googlebot's behaviour. If you do not want your apples seen, do not put them on display (not even covered with a napkin).
Comment by David Collantes at 04:00, 24 Apr, 2004 #
David: I guess you managed to visit my site during those 20 seconds that I broke Textpattern horribly so it didn't generate any output. :-)
Also, the point is that Google is going around poking people's napkins, whether or not there is any chance of there being any apples under them.
Yeah, I know, it's really a very small thing to be annoyed at. But we're bloggers, we're supposed to be anal-retentive and complain about the smallest things.
Comment by Johan Svensson at 08:00, 24 Apr, 2004 #
I want to have your babby. Smoochy smoochy.
Comment by Mrs Melones at 08:19, 24 Apr, 2004 #
I think it's pretty apparent what Google is doing. Not to be outdone by all the other feed search sites, Google is going to introduce a feed searcher, and by God they're going to find every feed whether it's linked or not!
Or it could be that they're populating a test database with feed information and thought it would be nice and easy to just see if a site has one of the standard feed formats in the standard place with the standard name, instead of parsing out the home page and looking for links. Lazy? Sure. Effective? Totally. The Right Thing to do? Probably not. At least it's checking robots.txt first.
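The "lazy" approach guessed at above could look roughly like this: build the standard names at the site root and keep only the ones robots.txt permits. This is purely a sketch of that guess, assuming Python's standard urllib.robotparser; the bot name ExampleBot, the function allowed_guesses and the sample robots.txt are hypothetical, and nothing here reflects Googlebot's actual code. It only decides which URLs would be requested and makes no network calls.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# The two filenames the post says are being requested without links.
GUESSED_NAMES = ["atom.xml", "index.rdf"]


def allowed_guesses(site_root, robots_txt, user_agent="ExampleBot"):
    """Return the guessed feed URLs that the given robots.txt text permits."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    guesses = [urljoin(site_root, name) for name in GUESSED_NAMES]
    return [url for url in guesses if rules.can_fetch(user_agent, url)]


# Hypothetical robots.txt that opts out of one of the guessed names.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /index.rdf\n"

print(allowed_guesses("http://example.org/", SAMPLE_ROBOTS))
# ['http://example.org/atom.xml']
```

Note that obeying robots.txt only limits which guesses are made; a site owner who never had these files still sees the 404s unless they add a Disallow line for names they never used.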
Comment by Tom Werner at 02:33, 25 Apr, 2004 #
Tomas,
The key issue here is that Google is requesting atom.xml and index.rdf - files that do not exist! Meanwhile, it is *not* requesting rss.xml, a file which exists and is linked to. I'm not religious about syndication formats but I think this is extremely bad practice on Google's part.
I seem to be in the minority who think this is underhanded.
The validity/authenticity of the log records seems to be doubted. You can see the relevant server log snippets at
http://www.xanadb.com/archive/about/20040426
Walter
Comment by Walter at 12:55, 26 Apr, 2004 #
Walter: Actually, that isn't true. Google does crawl linked RSS files as well. Perhaps not for you, I wouldn't know, but you're lying if you claim that it doesn't do that for anyone. It does.
The issue is that it crawls for unlinked atom.xml and index.rdf, everything else is just speculation, and while speculation can be fun at times, I don't think it's fair to blame Google for your own imagination.
Comment by Tomas at 16:03, 26 Apr, 2004 #
Tomas,
I can only recount Googlebot's behaviour on my own websites - the internet overlords not seeing fit to grant me access to others' webserver logs - yet ;)
[but you're lying if you claim that it doesn't do that for anyone]
- Tomas, if I didn't know better I'd say you were trolling - I don't think I ever made claims for anyone other than myself.
If google requests rss.xml in the next month then we can all move on to the next storm-in-a-teacup.
You're right - speculation is fun. Don't lose sight of that.
Walter
Comment by Walter at 17:40, 26 Apr, 2004 #
Walter: You sure made it sound like Googlebot did this but didn't do that, in broad general terms. Whatever problems you have with Googlebot aren't related to its general behaviour. It does crawl RSS feeds, regardless of whether it crawls yours or not.
This post isn't about your site, it isn't about your personal problems with Googlebot, it's about the general behaviour as described in the post.
Comment by Tomas at 17:45, 26 Apr, 2004 #
Tomas,
I was wrong. Digging around an archived server log I see Google last requested rss.xml from my site on 25th March.
Walter
Comment by walter at 19:28, 26 Apr, 2004 #
The discussion has been closed on this entry. Thanks to everybody who participated.