Sunday, February 10, 2013

Link Scraper using Python: article 201301

This simple script, written as part of the SecurityTube Python Scripting Expert course, extracts the absolute links (those whose href starts with "http") from a given web page.

It is written in Python 2.7.2 and uses urllib, re, and Beautiful Soup 4 with the lxml parser.

Here is a screenshot of an example run:

And here is the code:


import re
import urllib
from bs4 import BeautifulSoup

print "#" * 50
print "#    Enter a url in the format http://site.domain"
print "#    i.e"
url = raw_input("#    Enter a URL: ")
print "#" * 50
print "\n"
print ">>>>  Retrieving and parsing the page. This could take several seconds. <<<<"
print "\n"
htmlPage = urllib.urlopen(url)

soup = BeautifulSoup(htmlPage, 'lxml')

allLinks = soup.find_all('a')

for i in allLinks:
    link = i.get('href')
    if link:
        # Keep only links that start with "http" (case-insensitive), which are the absolute URLs
        matchobj = re.match('HTTP', link, re.I)
        if matchobj:
            print link

print "\n"
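
For readers on Python 3, the filtering step above can be sketched with only the standard library. This is a rough illustration rather than a port of the script: it swaps Beautiful Soup for html.parser, and the names AnchorCollector, extract_absolute_links, and SAMPLE_HTML are made up for the example. It parses a small hard-coded page instead of downloading one, so it runs without a network connection.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None.
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_absolute_links(html):
    """Return the hrefs that start with 'http', ignoring case."""
    collector = AnchorCollector()
    collector.feed(html)
    # Same test as the script: re.match anchors at the start of the string.
    return [h for h in collector.hrefs if re.match("http", h, re.I)]


# Hypothetical sample page, used here instead of a live download.
SAMPLE_HTML = """
<a href="http://example.com/">absolute</a>
<a href="HTTPS://example.org/docs">absolute, upper case</a>
<a href="/about">relative</a>
<a href="mailto:someone@example.com">mail</a>
<a name="top">no href</a>
"""

for link in extract_absolute_links(SAMPLE_HTML):
    print(link)
```

With the sample page, only the two http/https links are printed; the relative link, the mailto link, and the anchor without an href are skipped. To run it against a live page in Python 3, the HTML could be fetched with urllib.request.urlopen(url).read() and decoded before being passed to extract_absolute_links.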

