Statistics

  Jun 15, 2004

Finding a decent host proved to be far more work than I first anticipated. I thought I was done two weeks ago when, after rigorous research, I signed up with UnitedHosting. I was not.

It's not that UnitedHosting isn't a serious business; they very much are, and they're real professionals. However, it seems like they've gotten more customers than they can handle, and they just don't have the time or capacity to manage us all. Besides that, the connection speed to the server varied quite a bit, from very fast to very slow.

So I did them and myself a favor and moved to a Swedish hosting provider; equally serious, not nearly as big, but one that has time and capacity to handle their customers (knock on wood). So far they've been quite flexible and very helpful. They even tried their best to get my installation of AwStats running properly, even though that's somewhat outside the scope of their services.

I almost gave up on AwStats and tried long and hard to find alternative free statistics software, but there was none. None that I found to be as good as AwStats, at least. In fact, I did not get AwStats running in CGI mode at my new host, but instead settled for having it generate static pages. Doesn't matter really.

As I was poking around with AwStats though, changing settings back and forth, etc, I realized that my website statistics have become more and more ambiguous as more and more people subscribe to the XML feeds. Crawlers and aggregators are pounding on my feeds, making it harder and harder to read anything out of the statistics.

Before I started syndicating content, it was my opinion that hits were insignificant; in terms of numbers, what mattered was page downloads, visits, and unique visitors. But what is a hit on an XML feed? Is it just a hit, or is it a page download? It's a page download, because it's content being downloaded, not graphics or CSS that -- roughly speaking -- only serve to make the content prettier. But it's also just a hit: the fact that the XML feed was requested doesn't mean anyone read it. Perhaps it was only checked by the aggregator software.

The first and absolutely necessary step is to differentiate between a web browser and a crawler. By default, AwStats considers NetNewsWire to be a browser, not a robot. I don't agree with that definition. It might be semantically correct, but it makes my statistics less meaningful. So I removed NetNewsWire from the list of browsers (browsers.pm in the lib directory), and instead added it, and all other aggregators that showed up in the list of "unknown" user agents, to robots.pm, also in the lib directory.
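
As a rough illustration of that reclassification (this is not how AwStats implements browsers.pm and robots.pm, just a minimal Python sketch of the idea), a user agent can be matched against a list of known aggregator signatures. The signature list and the function name are assumptions made up for the example:

```python
# Hypothetical signature list: lowercase substrings that mark a user agent
# as a feed aggregator or crawler. Illustrative only, and never complete.
AGGREGATOR_SIGNATURES = ("netnewswire", "bloglines")


def classify_user_agent(user_agent: str) -> str:
    """Return "robot" for known aggregators and crawlers, else "browser"."""
    lowered = user_agent.lower()
    if any(signature in lowered for signature in AGGREGATOR_SIGNATURES):
        return "robot"
    return "browser"


if __name__ == "__main__":
    # Made-up user agent strings, for demonstration only.
    for agent in ("NetNewsWire/1.0 (Mac OS X)", "Mozilla/5.0 (Windows; U)"):
        print(agent, "->", classify_user_agent(agent))
```

Anything not on the list falls through as a browser, which is exactly the weakness of this approach: an unlisted aggregator still counts as a visitor.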

AwStats handles robots differently than it handles web browsers. For instance, a visit by a robot does not count as a visit, only as a hit. I'm not sure whether AwStats considers it a page download, but whether it does or not, this is still not enough to make sure that aggregators aren't polluting the statistics with their regular, and oftentimes fanatical, polling of feeds, because the list of aggregators in robots.pm is not, and never will be, complete.

So what I had to do was to isolate the requests of the XML feeds from the page downloads, because otherwise the page downloads statistics just didn't make any sense at all. I did that by changing what AwStats considers to be a "page", by adding RDF, XML and RSS files to the "NotPageList" list:

NotPageList="css js class gif jpg jpeg png bmp ico xml rdf rss"
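
To show what that setting amounts to, here is a small Python sketch of an extension-based page test. It only mimics the idea behind the list and is not AwStats' own code; the function name `is_page` is made up for the example:

```python
import posixpath
from urllib.parse import urlsplit

# Same extensions as in the NotPageList setting above.
NOT_PAGE_LIST = frozenset(
    "css js class gif jpg jpeg png bmp ico xml rdf rss".split()
)


def is_page(url: str) -> bool:
    """Return True if the URL should count as a page download."""
    path = urlsplit(url).path  # ignore any query string
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension not in NOT_PAGE_LIST


if __name__ == "__main__":
    for url in ("/archives/", "/index.html", "/index.xml", "/style.css"):
        print(url, "page" if is_page(url) else "not a page")
```

With the three feed extensions on the list, a request for a feed is treated the same way as a request for a stylesheet or an image: a hit, but not a page.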

Great, now my page statistics aren't polluted by aggregators and crawlers pounding on my feeds every hour, or minute. My page statistics will show only page downloads, not downloads of XML feeds.

Super, but that doesn't bring any clarity to how often my feeds are downloaded, nor by how many people. At this point, I'm glad I stuck with AwStats instead of finding an alternative log analyzer, because AwStats has this neat feature called "Extra Sections", or "Marketing Sections", which allows you to make an additional customized chart of traffic regarding a specific page, user agent, host or referrer.

Having added the following lines to the configuration file, AwStats presents me with a chart of any (existing) feed being requested, how many times each was downloaded, and how much data was downloaded:

# Report of requests of xml/rdf/rss feeds
ExtraSectionName1="Feed Requests"
ExtraSectionCodeFilter1="200 304"
ExtraSectionCondition1="URL,(\.xml)$|URL,(\.rss)$|URL,(\.rdf)$"
ExtraSectionFirstColumnTitle1="Feed"
ExtraSectionFirstColumnValues1="URL,(.+)"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionStatTypes1=HBL
ExtraSectionAddAverageRow1=1
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=20
MinHitExtra1=1
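
For anyone curious what such a section boils down to, the same tally can be approximated by hand. The Python sketch below assumes an access log in the common/combined log format; the sample lines, the regular expressions and the function name are made up for the example, and this is not how AwStats computes the chart:

```python
import re
from collections import defaultdict

# host ident user [time] "METHOD url PROTOCOL" status bytes ...
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
FEED_URL = re.compile(r"\.(xml|rss|rdf)$")

# Made-up sample log lines, for demonstration only.
SAMPLE = [
    '192.0.2.1 - - [15/Jun/2004:20:14:00 +0200] "GET /index.xml HTTP/1.1" 200 5120',
    '192.0.2.2 - - [15/Jun/2004:20:15:00 +0200] "GET /index.xml HTTP/1.1" 304 -',
    '192.0.2.3 - - [15/Jun/2004:20:16:00 +0200] "GET /index.html HTTP/1.1" 200 9000',
    '192.0.2.4 - - [15/Jun/2004:20:17:00 +0200] "GET /index.rdf HTTP/1.1" 404 300',
]


def feed_requests(lines, statuses=("200", "304")):
    """Tally hits and bytes per feed URL for the given HTTP status codes."""
    totals = defaultdict(lambda: {"hits": 0, "bytes": 0})
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        url = match["url"].split("?", 1)[0]  # drop any query string
        if match["status"] not in statuses or not FEED_URL.search(url):
            continue
        totals[url]["hits"] += 1
        if match["size"] != "-":  # "-" means no body was sent
            totals[url]["bytes"] += int(match["size"])
    return dict(totals)


if __name__ == "__main__":
    print(feed_requests(SAMPLE))
    print(feed_requests(SAMPLE, statuses=("200",)))
```

Passing `statuses=("200",)` gives the downloads-only variant, since 304 responses are left out of the count.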

In this particular case, I chose to include requests which were answered with HTTP Status 304, "Not Modified", but it makes just as much sense to only include those responded to with HTTP Status 200, "OK", which should roughly equate to the number of times the feed content was actually downloaded, and presumably read, by a person using his/her aggregator.

I've chosen to add charts for feed requests (the one above), feed downloads (the one above, except only for 200 responses), as well as top aggregators by host (i.e. crawlers), and top aggregators by user agent.

Oh, and I thought I'd hook up one of my regulars with Gmail (it was three invitations at first, then two); give a shout if you still haven't got one. Sorry, I'm all out now. We'll do this again if/when Google hands me more invitations.

Update: If you're interested in getting a Gmail invitation, read the comments to this post; my pal cyberhill has a few left.

Comments

  1. a silent regular without a gmail account? indeed. if you feel so inclined, feel free to send some help my way.

    Comment by Jblount at 20:14, 15 Jun, 2004 #

  2. I have recently been investigating stats packages and haven't really ever used one before. I am just getting my new site together on a new host and I think I will try AwStats, thanks for the nice tips.

    I still have no gMail account yet either, but would be willing to accept one as a gift ;)

    Comment by acidmike at 20:19, 15 Jun, 2004 #

  3. I'm also using AwStats for my sites, but I had totally missed those "Extra sections". Guess I should take a peek in the manual again. And if there's still a GMail account up for grabs I'm interested.

    Comment by Peter at 20:34, 15 Jun, 2004 #

  4. The Gmail invitations are now used up. Enjoy it, guys.

    Peter: Remember to also upgrade to a recent version of AwStats, it can now also log screen resolutions.

    Comment by T. Jogin at 20:41, 15 Jun, 2004 #

  5. To further complicate matters, there is the Bloglines crawler -- as a centralized aggregator, it will hit your feeds the same (very large) number of times whether there is one Bloglines user subscribed to your site or a hundred. From an engineering perspective, it's great -- it reduces the bandwidth load on large sites to one crawler instead of a thousand separate users. But it does throw off your statistics.

    In my case, awstats shows 1573 hits from crawler01.bloglines.com so far this month.

    Comment by Aaron at 21:18, 15 Jun, 2004 #

  6. Aaron: Exactly, with crawlers like that, site statistics don't mean shite if you don't isolate the feeds from the rest of the site.

    Comment by T. Jogin at 21:20, 15 Jun, 2004 #

  7. G'day everyone.

    I've just got 6 more invites lol. Too many for me. So if you want one give me an email and I'll hook you up.

    Comment by cyberhill at 03:30, 16 Jun, 2004 #

  8. Update: 2 left. Now you have to give me a reason why I should give you the invite above everyone else.

    Comment by cyberhill at 13:20, 16 Jun, 2004 #

  9. Update2: No invites left :)

    Comment by cyberhill at 08:46, 17 Jun, 2004 #

  10. Tomas, I am getting a server ready at serverbeach.com, colocated that is. I am tired of sharing a server with others.

    On the Gmail invites... I got 13 left... details at collantes.US (not a plug, just related)

    Comment by David Collantes at 03:01, 18 Jun, 2004 #

The discussion has been closed on this entry. Thanks to everybody who participated.