Sunday, February 10, 2013

Link Scraper using Python: article 201301


As part of the SecurityTube Python Scripting Expert course the below is a simple script written to extract the absolute paths from a provided webpage.

Written in Python 2.7.2 using urllib, re, and Beautiful Soup 4 using the LXML parser.

Here is screen shot of an example:




And here is the code:

#!/usr/bin/python

import re
import urllib
from bs4 import BeautifulSoup

print "#" * 50
print "#    Enter a url in the format http://site.domain"
print "#    i.e http://whyjoseph.com"
url = raw_input("#    Enter a URL: ")
print "#" *50
print "\n"
print ">>>>  Retrieving and parsing the page. This could take several seconds. <<<<"
print "\n"
htmlPage = urllib.urlopen(url)

soup = BeautifulSoup(htmlPage, 'lxml')

allLinks = soup.find_all('a')

for i in allLinks:
link = (i.get('href'))
if link:
matchobj = re.search('HTTP', link, re.I)
if matchobj:
print link

print "\n"


.




No comments:

Post a Comment