I’ve been working on a new APL workspace to read the access logs for this site. Most web hosts make such a log available. Network Solutions, our host, also provides web pages with fancy graphs displaying the number and kind of web pages requested by users.
These logs can be a valuable tool in promoting your web site. Once you decipher the file you can determine how many requests you’ve gotten for each post to your blog; you can determine how many users you have; and you may determine what posts were worth your time and effort.
The new workspace is called Web Logs and you can download it with APL-Library at https://sourceforge.net/projects/apl-library.
It took me a while to get to something useful. I kept finding requests that I didn’t think should be counted toward basic numbers like the total number of hits. If you look at the top of this page you’ll see the Daly Web and Edit logo, an image. Each time your browser requests a page, it can potentially request that JPEG file as well. That second request is in the access file along with the first one for the HTML. We only want to count one.
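As a rough illustration of what reading such a log involves, here is a minimal sketch in Python. It is not the APL in the Web Logs workspace. It assumes the log uses the common Apache "combined" layout, which may not match what a given host produces, and the two sample lines (addresses, dates, paths) are made up for the example.

```python
import re

# Assumption: Apache/NCSA Combined Log Format. A host's real format may differ.
LOG_PATTERN = re.compile(
    r'(?P<address>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry

# Two made-up entries: one page request and the logo image it pulls in.
SAMPLE = [
    '203.0.113.7 - - [12/Mar/2018:10:15:32 -0400] '
    '"GET /wordpress1/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [12/Mar/2018:10:15:33 -0400] '
    '"GET /wordpress1/logo.jpg HTTP/1.1" 200 20480 '
    '"http://example.com/wordpress1/" "Mozilla/5.0"',
]

entries = [parse_line(line) for line in SAMPLE]
for entry in entries:
    print(entry["address"], entry["status"], entry["path"])
```

Both lines come from one visit to one page, which is why a raw count of log entries overstates the number of hits.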
I kept writing predicates, which is what computer scientists call a function that returns true or false. The truth that I was seeking feels more like an opinion than a matter of fact. Computer programs deal only in facts.
One day last week we posted ‘Clerk of Bucks Quarterly Meeting’. I wrote the post and returned to edit it several times. Then I turned it over to Kate, who edited the post again, saving it several times. I could guess that many of the requests logged that day were actually Kate and me preparing the post for publication. How to exclude them? I noticed that all of those requests came from the same web address. After a little looking I concluded that the address must be our router, so I excluded them.
On the same day, at about the same time, there were requests I just didn’t understand, all from the same web address and all calling a program, ‘/wordpress1/wp-cron.php’. I did some research on WordPress. Apparently WordPress needs to do housekeeping on occasion, and it triggers this housekeeping when actual requests are made. I excluded these based on the web address.
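Spotting addresses like these is mostly a matter of counting. The sketch below, in Python rather than APL and not taken from the Web Logs workspace, tallies requests per address and lists what each address asked for. The addresses and the ‘/wordpress1/some-post/’ path are invented for the example; only ‘/wordpress1/wp-cron.php’ comes from the real log.

```python
from collections import Counter, defaultdict

# Made-up (address, path) pairs standing in for parsed log entries.
requests = [
    ("198.51.100.20", "/wordpress1/some-post/"),
    ("198.51.100.20", "/wordpress1/some-post/"),
    ("198.51.100.20", "/wordpress1/some-post/"),
    ("192.0.2.44", "/wordpress1/wp-cron.php"),
    ("192.0.2.44", "/wordpress1/wp-cron.php"),
    ("203.0.113.7", "/wordpress1/some-post/"),
]

# How many requests came from each address?
hits_by_address = Counter(address for address, _ in requests)

# Which distinct paths did each address request?
paths_by_address = defaultdict(set)
for address, path in requests:
    paths_by_address[address].add(path)

# Busiest addresses first, with the paths they requested.
for address, hits in hits_by_address.most_common():
    print(address, hits, sorted(paths_by_address[address]))
```

An address that sits at the top of the list and requests only one post, or only wp-cron.php, is a candidate for exclusion. Whether to exclude it is still a judgment call.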
I came up with four rules.
1. An actual web page was returned. Errors occur and there is a code for success.
2. The request is not from a search engine spider. This predicate is beyond the scope of this blog. It’s probably not right either, but I haven’t caught it in any errors.
3. Exclude style sheets and graphics.
4. Exclude requests from the Dalys and those created by WordPress itself.
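The four rules can be sketched as four small predicates combined into one. This is Python, not the APL in the workspace, and every constant is a placeholder: the excluded addresses, the spider markers, and the file suffixes are assumptions made for the example, and the spider test in particular is a crude stand-in for the real one. The sample entries are invented.

```python
# Placeholders only, not the lists used in the Web Logs workspace.
EXCLUDED_ADDRESSES = {"198.51.100.20", "192.0.2.44"}  # e.g. our router, WordPress housekeeping
SPIDER_MARKERS = ("bot", "spider", "crawl")           # crude guess at search engine agents
ASSET_SUFFIXES = (".css", ".jpg", ".jpeg", ".png", ".gif", ".ico")

def is_success(entry):
    """Rule 1: an actual page was returned (HTTP status 200)."""
    return entry["status"] == 200

def is_spider(entry):
    """Rule 2: the agent looks like a search engine spider."""
    agent = entry["agent"].lower()
    return any(marker in agent for marker in SPIDER_MARKERS)

def is_asset(entry):
    """Rule 3: the request is for a style sheet or a graphic."""
    path = entry["path"].split("?", 1)[0].lower()
    return path.endswith(ASSET_SUFFIXES)

def is_internal(entry):
    """Rule 4: the request came from us or from WordPress itself."""
    return entry["address"] in EXCLUDED_ADDRESSES

def is_hit(entry):
    """True only when all four rules agree the request should be counted."""
    return (is_success(entry) and not is_spider(entry)
            and not is_asset(entry) and not is_internal(entry))

# Invented entries, one for each way a request can be rejected, plus one real hit.
entries = [
    {"address": "203.0.113.7", "path": "/wordpress1/some-post/", "status": 200, "agent": "Mozilla/5.0"},
    {"address": "203.0.113.7", "path": "/wordpress1/logo.jpg", "status": 200, "agent": "Mozilla/5.0"},
    {"address": "203.0.113.9", "path": "/wordpress1/missing/", "status": 404, "agent": "Mozilla/5.0"},
    {"address": "203.0.113.10", "path": "/wordpress1/some-post/", "status": 200, "agent": "ExampleBot/1.0"},
    {"address": "198.51.100.20", "path": "/wordpress1/some-post/", "status": 200, "agent": "Mozilla/5.0"},
    {"address": "192.0.2.44", "path": "/wordpress1/wp-cron.php", "status": 200, "agent": "WordPress"},
]

hits = [entry for entry in entries if is_hit(entry)]
print(len(hits), "of", len(entries), "requests counted")
```

Keeping each rule as its own predicate makes it easy to change one opinion, say what counts as a spider, without touching the others.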
I get a result which I like to believe. I still think the whole thing is a matter of opinion while all I have is facts. Some of those facts are beyond opaque. One datum delivered with each entry is the agent. As best I can tell, this is the name of the program that actually made the request. It’s how I identify the search engine requests. Looking at the actual data, especially the entries that survive my four tests, there is much more to this field, if only obfuscation.
If I look at the fancy colored graphs that Network Solutions provides, I get results comparable to those from my own algorithms. I’m settling for that.