python - How to extract information from ODP accurately?
I am building a search engine in Python.
I have heard that Google fetches the description of pages from the ODP (Open Directory Project) in case Google can't figure out the description using the meta data from the page... I wanted to do something similar.
The ODP is an online directory from Mozilla that has descriptions of pages on the net, so I wanted to fetch the descriptions for my search results from the ODP. How do I get the accurate description of a particular URL from the ODP, and return the Python value None if I couldn't find it (which means the ODP has no idea what page I am looking for)?
PS. There is a URL, http://dmoz.org/search?q=your+search+params, but I don't know how to extract information from there.
To use ODP data, you'd download the RDF data dump. RDF is an XML format; you'd index that dump to map URLs to descriptions. I'd use a SQL database for this.
Note that URLs can be present in multiple locations in the dump. Stack Overflow is listed twice, for example. Google uses the text from one entry as the site description; Bing uses the other one instead.
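If you want to keep every listing of a URL rather than only one, the table cannot use the URL as a unique key. The following is a minimal sketch of that idea, assuming an in-memory SQLite database; the table name `odp_entries`, the `topic` column, the helper `descriptions_for` and the sample rows are all hypothetical and only serve as illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# No primary key on url, so the same URL may appear in several rows.
conn.execute('CREATE TABLE odp_entries (url TEXT, topic TEXT, description TEXT)')
conn.execute('CREATE INDEX odp_entries_url ON odp_entries (url)')

# Hypothetical rows: the same URL filed under two categories.
rows = [
    ('http://example.com/', 'Top/Computers/Programming', 'First description.'),
    ('http://example.com/', 'Top/Reference', 'Second description.'),
]
with conn:
    conn.executemany('INSERT INTO odp_entries VALUES (?, ?, ?)', rows)


def descriptions_for(conn, url):
    """Return every stored description for url, in insertion order."""
    cur = conn.execute(
        'SELECT description FROM odp_entries WHERE url = ? ORDER BY rowid',
        (url,))
    return [row[0] for row in cur]


print(descriptions_for(conn, 'http://example.com/'))
```

Which of the returned descriptions to show is then a choice your search engine has to make, much like Google and Bing pick different entries.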
The data dump is of course rather large. Use sensible tools such as the ElementTree iterparse() method to parse the data set iteratively as you add entries to your database. You only need to look for the <ExternalPage> elements, taking the <d:Title> and <d:Description> entries underneath.
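To make the element layout concrete, here is a small sketch that runs the standard library's iterparse() over a tiny in-memory sample. The sample document is hypothetical and only imitates the layout of the dump as I understand it (an `about` attribute on each `ExternalPage`, Dublin Core title and description children); the helper name `iter_pages` is likewise made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-entry sample imitating the dump's layout.
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <d:Description>An example page.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://example.org/">
    <d:Title>No description here</d:Title>
  </ExternalPage>
</RDF>'''

PAGE = '{http://dmoz.org/rdf/}ExternalPage'
DC = '{http://purl.org/dc/elements/1.0/}'


def iter_pages(fileobj):
    """Yield (url, title, description) for each ExternalPage element."""
    for event, element in ET.iterparse(fileobj, events=('end',)):
        if element.tag != PAGE:
            continue
        url = element.get('about')
        title = element.findtext(DC + 'Title', default='')
        description = element.findtext(DC + 'Description', default='')
        yield url, title, description
        # Drop the element's children and text; the emptied element
        # itself stays attached to the root.
        element.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page in pages:
    print(page)
```

A missing description comes back as an empty string here, so the caller can decide how to treat entries without one.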
Using lxml (a faster and more complete ElementTree implementation) that'd look like:
    from lxml import etree as ET
    import gzip
    import sqlite3

    conn = sqlite3.connect('/path/to/database')

    # create the table
    with conn:
        cursor = conn.cursor()
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS odp_urls
            (url text primary key, title text, description text)''')

    count = 0
    nsmap = {'d': 'http://purl.org/dc/elements/1.0/'}
    with gzip.open('content.rdf.u8.gz', 'rb') as content, conn:
        cursor = conn.cursor()
        for event, element in ET.iterparse(content, tag='{http://dmoz.org/rdf/}ExternalPage'):
            url = element.attrib['about']
            title = element.xpath('d:Title/text()', namespaces=nsmap)
            description = element.xpath('d:Description/text()', namespaces=nsmap)
            # xpath() returns a list; take the first text node or fall back to ''
            title = title[0] if title else ''
            description = description[0] if description else ''

            # no longer need this element, remove it from memory again,
            # together with any preceding siblings
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]

            cursor.execute('INSERT OR REPLACE INTO odp_urls VALUES (?, ?, ?)',
                           (url, title, description))
            count += 1
            if count % 1000 == 0:
                print('Processed {} items'.format(count))
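Once the table is filled, looking up a description and returning None for unknown URLs could be sketched as below. This assumes the `odp_urls` table from the code above; the function name `lookup_description` and the in-memory demo row are hypothetical. The match is on the exact URL string, so a URL that differs from the dump's form (for example by a trailing slash) would also come back as None.

```python
import sqlite3


def lookup_description(conn, url):
    """Return the stored description for url, or None if it is not listed."""
    row = conn.execute(
        'SELECT description FROM odp_urls WHERE url = ?', (url,)).fetchone()
    return row[0] if row is not None else None


# Demonstration against an in-memory database with one hypothetical row.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE odp_urls (url text primary key, title text, description text)')
conn.execute('INSERT INTO odp_urls VALUES (?, ?, ?)',
             ('http://example.com/', 'Example', 'An example page.'))

print(lookup_description(conn, 'http://example.com/'))
print(lookup_description(conn, 'http://not-in-odp.invalid/'))
```

Because the loader stores an empty string when an entry has no description, you may also want to treat '' as "no description" in your search results.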