ruby - How do I parse this HTML code with Nokogiri?

I have the following HTML:

<h3><strong>adresse:</strong></h3>     <p> hochschule darmstadt<br> technologietransfercentrum<br> d19, raum 221, 222<br> schöfferstraße 10<br> <b>64295 darmstadt</b><p> <h3>kommunikationsdaten: </h3>  <p>

But the <p> and <br> tags are not closed.

How can I extract the address information:

hochschule darmstadt technologietransfercentrum d19, raum 221, 222 schöfferstraße 10 64295 darmstadt

This is my starting point:

# encoding: utf-8
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<h3><strong>adresse:</strong></h3>
    <p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br>
<b>64295 darmstadt</b><p>
<h3>kommunikationsdaten: </h3>
<p>
EOT

puts doc.errors
puts doc.to_html

When I run that code, I get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h3><strong>adresse:</strong></h3>
    <p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br><b>64295 darmstadt</b></p>
<p>
</p>
<h3>kommunikationsdaten: </h3>
<p></p>
</body></html>

Notice that Nokogiri has added <html> and <body> tags. It has also closed the <p> tags by adding </p>. You can tell it to parse the HTML as a fragment, without adding the wrapping tags, by using this instead:

Nokogiri::HTML::DocumentFragment.parse

Which generates:

<h3><strong>adresse:</strong></h3>     <p> hochschule darmstadt<br> technologietransfercentrum<br> d19, raum 221, 222<br> schöfferstraße 10<br><b>64295 darmstadt</b></p><p> </p><h3>kommunikationsdaten: </h3> <p></p>

There is still some fixup of the HTML happening, but the result is basically the HTML that was passed in. Either way, the resulting HTML is technically correct.

On to finding the text in question. If there is only one <p> tag, or the one you want is the first one:

doc.at('p').text
=> "\nhochschule darmstadt\ntechnologietransfercentrum\nd19, raum 221, 222\nschöfferstraße 1064295 darmstadt"

Or:

doc.at('h3').next_sibling.next_sibling.text
=> "\nhochschule darmstadt\ntechnologietransfercentrum\nd19, raum 221, 222\nschöfferstraße 1064295 darmstadt"

The second form needs two next_sibling calls. The first one finds the text node that follows the end of the <h3> node:

doc.at('h3').next_sibling
=> #<Nokogiri::XML::Text:0x3fef59dedfb8 "\n    ">
