As part of the SecurityTube Python Scripting Expert course, I wrote the simple script below to extract the absolute links from a given web page.
It is written in Python 2.7.2 and uses urllib, re, and Beautiful Soup 4 with the lxml parser.
Here is a screenshot of an example run:
And here is the code:
#!/usr/bin/python
import re
import urllib
from bs4 import BeautifulSoup

print "#" * 50
print "# Enter a url in the format http://site.domain"
print "# e.g. http://whyjoseph.com"
url = raw_input("# Enter a URL: ")
print "#" * 50
print "\n"
print ">>>> Retrieving and parsing the page. This could take several seconds. <<<<"
print "\n"

# Fetch the page and parse it with the lxml parser.
htmlPage = urllib.urlopen(url)
soup = BeautifulSoup(htmlPage, 'lxml')

# Collect every anchor tag, then print each href that contains "http" (case-insensitive).
allLinks = soup.find_all('a')
for i in allLinks:
    link = i.get('href')
    if link:
        matchobj = re.search('HTTP', link, re.I)
        if matchobj:
            print link
print "\n"
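One thing to note: re.search('HTTP', link, re.I) matches "http" anywhere in the href, so a relative link such as /redirect?to=http://example.com would also be printed. Below is a minimal sketch of a stricter check, not part of the original script. The helper name is_absolute_http and the sample links are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used in this post

# Anchored pattern: the link must *start* with http:// or https://.
ABSOLUTE_HTTP = re.compile(r'^https?://', re.I)


def is_absolute_http(link):
    """Hypothetical helper: True if link is an http(s) URL with a host."""
    if not link:
        return False
    link = link.strip()
    if not ABSOLUTE_HTTP.match(link):
        return False
    # Reject degenerate values such as "http://" that have no host part.
    return bool(urlparse(link).netloc)


# Illustrative href values, standing in for what i.get('href') might return.
samples = [
    'http://whyjoseph.com/about',
    'HTTPS://example.com',
    '/redirect?to=http://example.com',
    'mailto:someone@example.com',
    None,
]
absolute = [s for s in samples if is_absolute_http(s)]
```

In the loop above, the re.search call could be swapped for is_absolute_http(link) if only links that begin with http:// or https:// are wanted.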