February 26, 2013


Recently I have been working on Python web scrapers. The main tools used in this project are urllib2 and BeautifulSoup 3.

Web scraping: a software technique for extracting information from websites. A computer program extracts the required data from a web page and organizes it into a structured database for analysis and processing.
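As a minimal sketch of that idea — using a made-up HTML fragment and a plain regular expression instead of a full parser — extracting data and arranging it into structured records might look like:

```python
import re

# A hypothetical page fragment; a real scraper would download this.
html = '<span class="headings">Python</span><span class="headings">Django</span>'

# Extract the text of every span with class "headings" and
# organize the results into a structured list of records.
records = [{'heading': text}
           for text in re.findall(r'<span class="headings">(.*?)</span>', html)]

print(records)
```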

urllib2: a Python library for handling HTTP requests and responses. urllib2 is easy to use and supports several protocols, such as FTP and HTTPS. Cookies are handled by the API.
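For example, cookie handling can be wired in through a cookie processor. A minimal sketch follows (the compatibility imports are only there so the snippet also runs on Python 3, where urllib2 became urllib.request):

```python
try:
    import urllib2                    # Python 2
    import cookielib
except ImportError:
    import urllib.request as urllib2  # Python 3 renamed the modules
    import http.cookiejar as cookielib

# A cookie jar stores cookies set by the server; the opener
# sends them back automatically on subsequent requests.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Every request made through this opener now handles cookies, e.g.:
# response = opener.open('http://coregurukul.com/')
print(len(cookie_jar))  # no requests made yet, so the jar is empty
```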

BeautifulSoup 3: a third-party Python module for extracting information from HTML pages.

Let's take a simple example of how to scrape a web page.

# used to handle the request and response
import urllib2
# used to parse the HTML pages
from BeautifulSoup import BeautifulSoup

response = urllib2.urlopen('http://coregurukul.com/')
html = response.read()
soup = BeautifulSoup(html)
span_tag_list = soup.findAll('span', {'class': 'headings'})
for span in span_tag_list:
    print span.text

A sample POST request

# passing values with a request
import urllib
import urllib2

url = 'http://www.examplexyz.com'
values = {'user_name': 'test', 'password': 'test_admin'}
post_data = urllib.urlencode(values)
request = urllib2.Request(url, post_data)
response = urllib2.urlopen(request)
html = response.read()

Headers
Headers matter when making a request to a web server. Many websites do not like it when a bot tries to access their information, and in some cases you may get blocked by the server. To avoid this, we set headers (in particular the User-Agent) on the request to imitate a browser.

Different browsers identify themselves differently, and servers may respond differently depending on which browser appears to be asking. The code below pretends to be Mozilla Firefox.

import urllib
import urllib2

url = 'http://www.examplexyz.com'
values = {'user_name': 'test', 'password': 'test_admin'}
post_data = urllib.urlencode(values)
user_agent = 'Mozilla/5.0 (Windows NT 6.0; rv:13.0) Gecko/20100101 Firefox/13.0.1'
headers = {'User-Agent': user_agent}
req = urllib2.Request(url, post_data, headers)
response = urllib2.urlopen(req)
html = response.read()
