Python - Practice scraping
(Python 2.7.4)
I want to print each URL I have collected only if it contains the word 'watch'. I have tried trial and error, to no avail. I would also like to know whether it is possible to capture the name of each video (from the HTML) and assign it to the corresponding video URL. Any pointers are appreciated.
The link I'm using is 'http://thenewboston.org/list.php?cat=36'
import urllib2
import re

def open_url(url):
    # Request the page with a browser-like User-Agent header.
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
    response = urllib2.urlopen(req)
    link = response.read()
    response.close()
    return link

link = open_url('http://thenewboston.org/list.php?cat=36')
# Collect the value of every href attribute on the page.
match = re.compile('href="(.+?)"').findall(link)
for url in match:
    url = 'http://thenewboston.org/' + url
    print url
You can use an HTML parser such as Beautiful Soup to handle this quite easily.
To check substring membership, you can use the in operator:
'watch.php' in url
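As a minimal sketch of how that check could slot into the loop from the question: the code below keeps the question's regex but filters the matches with in. To stay runnable offline it works on a small sample string rather than the live page; SAMPLE_HTML, BASE_URL, the index.php link, and the watch_urls helper are assumptions made up for illustration.

```python
import re

# Sample markup standing in for the downloaded page (assumed for illustration;
# the index.php link is hypothetical, the watch.php link is from the page snippet).
SAMPLE_HTML = '''
<a href="index.php">Home</a>
<li class="contentlist"> <a href="watch.php?cat=36&number=11">11 - editing sequences</a> </li>
'''

BASE_URL = 'http://thenewboston.org/'


def watch_urls(html):
    """Return absolute URLs for every href that contains 'watch.php'."""
    hrefs = re.findall(r'href="(.+?)"', html)
    # The substring test drops every link that is not a video page.
    return [BASE_URL + href for href in hrefs if 'watch.php' in href]


for url in watch_urls(SAMPLE_HTML):
    print(url)
```

With the real page, the string returned by open_url would be passed to watch_urls in place of SAMPLE_HTML.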
Also, Beautiful Soup or another HTML parser will let you match more exactly what you parse:
<li class="contentlist"> <a href="watch.php?cat=36&number=11">11 - editing sequences</a> </li>
Instead of all links, it looks like you want only the links inside the contentlist <li>s? Those can be queried using XPath or BeautifulSoup, which is difficult to do using regexes.
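One way this could look without installing anything is sketched below. It swaps in the standard library's HTMLParser class instead of Beautiful Soup or XPath, and collects each link inside a contentlist item together with its text, which also covers pairing every video name with its URL. The ContentListParser class, PAGE_SNIPPET, and the index.php link are assumptions for illustration, not part of the original page or answers.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in the question

# Sample markup (assumed): one link outside the list, one inside it.
PAGE_SNIPPET = '''
<a href="index.php">Home</a>
<li class="contentlist"> <a href="watch.php?cat=36&number=11">11 - editing sequences</a> </li>
'''


class ContentListParser(HTMLParser):
    """Collect (href, title) pairs for links inside <li class="contentlist">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.videos = []        # finished (href, title) pairs
        self._in_item = False   # True while inside a contentlist <li>
        self._href = None       # href of the <a> currently being read
        self._text = []         # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and attrs.get('class') == 'contentlist':
            self._in_item = True
        elif tag == 'a' and self._in_item and 'href' in attrs:
            self._href = attrs['href']
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.videos.append((self._href, ''.join(self._text).strip()))
            self._href = None
        elif tag == 'li':
            self._in_item = False


parser = ContentListParser()
parser.feed(PAGE_SNIPPET)
for href, title in parser.videos:
    print(title + ' -> http://thenewboston.org/' + href)
```

Because the parser tracks whether it is inside a contentlist item, the index.php link is skipped without any substring test on the URL. The same selection is shorter with Beautiful Soup or XPath if those libraries are available.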