Why Joseph: Link Scraper using Python: article 201301

Sunday, February 10, 2013

Link Scraper using Python: article 201301

As part of the SecurityTube Python Scripting Expert course, below is a simple script that extracts the absolute links (those starting with http:// or https://) from a given web page.

It is written in Python 2.7.2 and uses urllib, re, and Beautiful Soup 4 with the lxml parser.

Here is a screenshot of an example run:

And here is the code:

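#!/usr/bin/python
# Python 2 script: prints every absolute (http/https) link found on a page.

import re
import urllib
from bs4 import BeautifulSoup

print "#" * 50
print "# Enter a url in the format http://site.domain"
print "# e.g. http://whyjoseph.com"
url = raw_input("# Enter a URL: ")
print "#" * 50
print "\n"
print ">>>> Retrieving and parsing the page. This could take several seconds. <<<<"
print "\n"

# Fetch the page and parse it with the lxml parser.
htmlPage = urllib.urlopen(url)
soup = BeautifulSoup(htmlPage, 'lxml')

# Collect every anchor tag on the page.
allLinks = soup.find_all('a')

for i in allLinks:
    # Anchors without an href return None and are skipped.
    link = i.get('href')
    if link:
        # Keep only hrefs that start with http:// or https:// (case-insensitive),
        # so relative links that merely contain "http" are not printed.
        matchobj = re.match(r'https?://', link.strip(), re.I)
        if matchobj:
            print link

print "\n"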
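
For anyone on Python 3, the same idea can be sketched with only the standard library. This is not the course's method or a port of the script above: it swaps Beautiful Soup for html.parser, and the names AnchorCollector and extract_absolute_links are made up for illustration. It takes the HTML as a string, so it can be tried without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_absolute_links(html):
    """Return the hrefs whose URL scheme is http or https, in page order."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [
        href for href in collector.hrefs
        # Compare the parsed scheme rather than searching the whole string.
        if urlsplit(href.strip()).scheme.lower() in ("http", "https")
    ]


if __name__ == "__main__":
    sample = (
        '<a href="http://whyjoseph.com">home</a>'
        '<a href="/about">about</a>'
        '<a href="mailto:someone@example.com">mail</a>'
    )
    for link in extract_absolute_links(sample):
        print(link)
```

To run it against a live page, the HTML could first be fetched with urllib.request.urlopen(url).read() and decoded to a string. Checking the parsed scheme means a relative link such as /go?url=http://example.com is not reported as absolute.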
