Screen Scraping with python.

A simple script to screen scrape data from the CricInfo StatsGuru engine can be written in python using the lxml and requests modules. The code looks like

 
from lxml import html
import requests

data={}
for i in range(100):
    page = requests.get('http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;filter=advanced;groupby=overall;orderby=runs;runsmax1=%d;runsmin1=%d;runsval1=runs;template=results;type=batting'%(i,i))
    tree = html.fromstring(page.content)
    data[i] = [x.text for x in tree.xpath('//tr[@class="data1"]')[0].getchildren()]
    print "%3d : %5s %4s"%( i, data[i][4], data[i][5])

This outputs a list of scores by batsmen in Tests, the number of innings achieving that score and the number of those innings which were undefeated (i.e. with the batsman not out). As an exercise, lets graph the results:

import matplotlib
matplotlib.rcParams['text.usetex'] = False
import pylab
pylab.xkcd()
f = lambda x : eval(data[x][4])-eval(data[x][5])
x=range(100)
pylab.plot(x,map(f,x))
pylab.xlabel('Score')
pylab.ylabel('log of frequency')
pylab.title('logarithmic frequency of Test scores')

alt text

tags: python

James Percival

My Github Pages material

Screen Scraping with python.