python - Practice scraping


(Python 2.7.4)

I want to print only the URLs that contain the word 'watch'. I have tried by trial and error, to no avail. I would also like to know whether it is possible to capture the name of each video (from the HTML) and assign it to the corresponding video URL. Any pointers appreciated.

The link I'm using is 'http://thenewboston.org/list.php?cat=36'

    import urllib2
    import re

    def open_url(url):
        # Send a browser-like User-Agent header with the request.
        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        link = response.read()
        response.close()
        return link

    link = open_url('http://thenewboston.org/list.php?cat=36')
    # Collect every href value found on the page.
    match = re.compile('href="(.+?)"').findall(link)
    for url in match:
        url = 'http://thenewboston.org/' + url
        print url

You can use an HTML parser such as Beautiful Soup to handle this quite easily.

To check substring membership you can use the in operator:

'watch.php' in url
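As a rough sketch of how that membership test could slot into the loop from the question: the snippet below runs the same regex over a small made-up HTML string (standing in for the downloaded page, so no network access is needed) and keeps only the links containing 'watch.php'. The names sample_html, BASE_URL and watch_urls are invented for this illustration, and the second list item is invented sample data.

```python
import re

# Made-up markup standing in for the downloaded page (an assumption for illustration).
sample_html = '''
<a href="index.php">Home</a>
<li class="contentlist"><a href="watch.php?cat=36&amp;number=11">11 - editing sequences</a></li>
<li class="contentlist"><a href="watch.php?cat=36&amp;number=12">12 - another video</a></li>
'''

BASE_URL = 'http://thenewboston.org/'

def watch_urls(html):
    """Return absolute URLs for every href that contains 'watch.php'."""
    hrefs = re.findall(r'href="(.+?)"', html)
    # The substring test drops links such as index.php.
    return [BASE_URL + href for href in hrefs if 'watch.php' in href]

urls = watch_urls(sample_html)
for url in urls:
    print(url)
```

Note that the regex returns the attribute text exactly as written in the page, so an escaped &amp; stays escaped in the result.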

Also, Beautiful Soup or another HTML parser would allow a more exact match against markup like this:

    <li class="contentlist">
      <a href="watch.php?cat=36&amp;number=11">11 - editing sequences</a>
    </li>

Instead of all links, it looks like you want only the links inside the contentlist items? That can be queried using XPath or BeautifulSoup, but it is difficult using regexes.
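To illustrate the idea without installing anything, here is a minimal sketch that swaps in the standard library's HTMLParser in place of Beautiful Soup or XPath. It collects the href and the link text of each anchor inside a contentlist item, which also gives the video name asked about in the question. The class name ContentListParser and the second list item in the sample markup are made up for this example, and the handling of nested lists is deliberately simplistic.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7

class ContentListParser(HTMLParser):
    """Collect (href, link text) pairs for <a> tags inside <li class="contentlist">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_item = False       # True while inside <li class="contentlist">
        self.current_href = None   # href of the <a> currently open, if any
        self.text_parts = []       # text chunks seen inside the open <a>
        self.videos = []           # collected (href, title) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            classes = (attrs.get('class') or '').split()
            self.in_item = 'contentlist' in classes
        elif tag == 'a' and self.in_item and attrs.get('href'):
            self.current_href = attrs['href']
            self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            title = ''.join(self.text_parts).strip()
            self.videos.append((self.current_href, title))
            self.current_href = None
        elif tag == 'li':
            self.in_item = False

# Made-up sample; the second item is there only to show that it gets skipped.
sample_html = '''<li class="contentlist">
  <a href="watch.php?cat=36&amp;number=11">11 - editing sequences</a>
</li>
<li><a href="other.php">not a video</a></li>'''

parser = ContentListParser()
parser.feed(sample_html)
parser.close()
videos = parser.videos
for href, title in videos:
    print('http://thenewboston.org/' + href + ' -> ' + title)
```

Unlike the regex, the parser unescapes attribute values, so &amp; in the href comes back as a plain &.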

