Last night I started looking into the “Analytics” of this very blog. It’s something I had often promised myself I would do, but it never happened. Looking at the graphs prepared by the free online tool supplied by my provider was, and still is, an unrewarding and dry experience. The fact is, the information I am truly looking for is simply not there.
On the other hand, the original log files (CSV) are available, but they are really long, and manipulating them with Excel is a tedious and (very) error-prone process, even for a “Spreadsheet Master” like me (I have been in marketing for almost 15 years now…).
Enter the Pandas… no, not those in the picture! I am referring to the Python Data Analysis Library, a tool I had heard of many times in the past but always ignored, as I considered it a thing for web jockeys… (scoff)!
Turns out I was so wrong! I took the 10 Minutes Intro and… well, 10 minutes later I was looking at the data I wanted, or rather the data I had dreamed of for so long!
To be fully honest, four hours later I was still there fiddling with that same data, but that was simply because I could not stop playing and refining my views!
Turns out all I needed were quite literally 3 lines of (Pandas) code. Here is the first one, where I import the CSV log file:
import sys
import pandas as pd

# The log is space-separated and has no header row, so the column names are supplied here.
log = pd.read_csv(sys.argv[1], sep=' ', header=None,
                  names=['ip', 'B', 'C', 'DTime', 'E', 'Request',
                         'G', 'H', 'From', 'L', 'M', 'N'])
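To try that call without a real log at hand, here is a minimal sketch. The sample lines are made up (not my real traffic) and padded to 12 space-separated fields so they line up with the 12 column names; the exact layout of a provider's log may well differ.

```python
import io
import pandas as pd

# Made-up sample lines (not real traffic), padded to 12 space-separated fields
# so they line up with the 12 column names used in the post.
SAMPLE_LOG = '''\
10.0.0.1 - - [12/Oct/2013:08:15:02 +0200] "GET /2013/10/pandas.html HTTP/1.1" 200 5120 "http://example.com/" "Mozilla/5.0 (X11; Linux)" - -
10.0.0.2 - - [12/Oct/2013:08:16:40 +0200] "GET /favicon.ico HTTP/1.1" 404 209 "-" "Mozilla/5.0 (Windows NT 6.1)" - -
10.0.0.1 - - [12/Oct/2013:08:17:11 +0200] "GET /2012/03/hello.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux)" - -
'''

COLUMNS = ['ip', 'B', 'C', 'DTime', 'E', 'Request', 'G', 'H', 'From', 'L', 'M', 'N']

# read_csv accepts any file-like object, so StringIO stands in for the log file.
# Double-quoted fields keep their inner spaces because '"' is the default quote character.
log = pd.read_csv(io.StringIO(SAMPLE_LOG), sep=' ', header=None, names=COLUMNS)

print(log[['ip', 'DTime', 'Request']])
```

Note that the timestamp is split across two columns (DTime and E) because of the space before the time zone, while the quoted request stays in one piece.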
Next, filtering the rows I want:
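# Keep only GET requests whose path starts with /201; na=False treats missing requests as non-matches.
dff = log[log.Request.str.startswith('GET /201', na=False)]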
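What makes this filter work is that str.startswith returns one True/False per row, and that boolean column is then used to pick rows. A small sketch with hypothetical request strings shows the mask on its own; the None entry and the na=False argument are my assumptions about how a malformed line might be handled.

```python
import pandas as pd

# Hypothetical request strings; None stands in for a malformed log line.
requests = pd.Series([
    'GET /2013/10/pandas.html HTTP/1.1',
    'GET /favicon.ico HTTP/1.1',
    None,
    'POST /2013/10/pandas.html HTTP/1.1',
    'GET /2012/03/hello.html HTTP/1.1',
])

# str.startswith returns one boolean per row; na=False maps missing values to False
# so the mask can be used directly for row selection.
mask = requests.str.startswith('GET /201', na=False)

# Indexing with the mask keeps only the rows where it is True.
posts = requests[mask]
print(posts.tolist())
```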
Finally, grouping, counting and sorting the data:
# Count the hits per request and list the 25 most requested first.
dfgo = dff[['Request', 'ip']].groupby('Request').count().sort_values('ip', ascending=False).head(25)
print(dfgo)
That’s it!!
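For anyone who wants to see the filter and the ranking work together, here is a minimal sketch on a handful of hypothetical rows. It assumes a recent pandas, where sorting a table by a column is spelled sort_values, and it assumes that post URLs start with the year, which is what the 'GET /201' prefix suggests.

```python
import pandas as pd

# Hypothetical rows with only the two columns the ranking needs.
log = pd.DataFrame({
    'ip': ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1', '10.0.0.4', '10.0.0.2'],
    'Request': [
        'GET /2013/10/pandas.html HTTP/1.1',
        'GET /2013/10/pandas.html HTTP/1.1',
        'GET /2012/03/hello.html HTTP/1.1',
        'GET /favicon.ico HTTP/1.1',
        'GET /2013/10/pandas.html HTTP/1.1',
        'GET /2012/03/hello.html HTTP/1.1',
    ],
})

# Step 1: keep only the requests whose path starts with /201.
dff = log[log.Request.str.startswith('GET /201', na=False)]

# Step 2: count rows per request, then sort so the most requested comes first.
hits = (dff[['Request', 'ip']]
        .groupby('Request')
        .count()
        .sort_values('ip', ascending=False)
        .head(25))

print(hits)
```

After count(), the 'ip' column holds the number of hits per request, not the number of distinct visitors. Swapping count() for nunique() would count each address once per request instead.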
As a bonus, I got to play with the Python plotting library (Matplotlib), which is also well integrated with Pandas. Here are a couple more lines to get quite a refined bar chart to replace the crude printout:
import matplotlib.pyplot as plt

# `name` is a label for the title, defined elsewhere in the script (for example the log file name).
dfgo.plot(kind='barh', legend=False, left=0.65)
plt.title('Top Requests:' + name)
plt.show()