regex - ColdFusion - Simple HTML Parsing
We have articles posted onto our site. The following types of HTML can appear:
<p>this article<br> <img src="someimage"> </p>
or
<p><img src="someimage"> article<br> </p>
Some other HTML tags may be inside sometimes. I can't get my head around how to scrape the page using ColdFusion to achieve this.
Essentially, I need to grab hold of the first paragraph's text and its image, and be able to rearrange them.
Is this possible using ColdFusion 8? Is anyone able to point me in the right direction on how to learn this?
100% possible!
Now, don't be put off by what I'm going to suggest; it's an easy way of going about this.
Download a library called jsoup. Its sole purpose is scraping the contents of the DOM in a web page.
You can use the Java class by doing something like:
<!--- Get the page. --->
<cfhttp method="get" url="http://example.com/" resolveurl="true" useragent="#cgi.http_user_agent#" result="mypage" timeout="10" charset="utf-8">
    <cfhttpparam type="header" name="accept-encoding" value="*" />
    <cfhttpparam type="header" name="te" value="deflate;q=0" />
</cfhttp>
<!--- Load jsoup and parse the document with it. The Java class name is case-sensitive. --->
<cfset jsoup = createobject("java", "org.jsoup.Jsoup") />
<cfset document = jsoup.parse(mypage.filecontent) />
<!--- Search the parsed document for the contents of the title tag. --->
<cfset title = document.select("title").first() />
<!--- Let's see what we got. --->
<cfdump var="#title#" />
This example is pretty simple, but it shows how easy jsoup is to work with. Scraping images and whatever else is easy if you check out the docs on jsoup.
There are examples on this page, and you can use CSS-style selectors:
http://jsoup.org/cookbook/extracting-data/selector-syntax
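For the question as asked (first paragraph text plus its image), a rough sketch along these lines might work. It assumes the jsoup jar is already on the ColdFusion classpath. The sample HTML is the first snippet from the question, and the variable names (sampleHtml, paragraphs, articleText, imageSrc) are made up for illustration:

```cfml
<!--- Sample markup taken from the question; in practice this would be mypage.filecontent. --->
<cfsavecontent variable="sampleHtml"><p>this article<br> <img src="someimage"> </p></cfsavecontent>

<cfset jsoup = createObject("java", "org.jsoup.Jsoup") />
<cfset document = jsoup.parse(sampleHtml) />

<!--- CSS-style selector: every <p> element in the document. --->
<cfset paragraphs = document.select("p") />

<cfif NOT paragraphs.isEmpty()>
    <cfset firstParagraph = paragraphs.first() />

    <!--- text() returns the text of the element with the tags stripped out. --->
    <cfset articleText = firstParagraph.text() />

    <!--- Look for an image with a src attribute inside that paragraph only. --->
    <cfset images = firstParagraph.select("img[src]") />
    <cfset imageSrc = "" />
    <cfif NOT images.isEmpty()>
        <cfset imageSrc = images.first().attr("src") />
    </cfif>

    <!--- Output the two pieces in whatever order suits the layout. --->
    <cfoutput>
        <cfif len(imageSrc)>
            <img src="#htmlEditFormat(imageSrc)#" alt="" />
        </cfif>
        <p>#htmlEditFormat(articleText)#</p>
    </cfoutput>
</cfif>
```

Because the image lookup runs against the first paragraph rather than the whole document, it should not matter whether the image comes before or after the text in the markup. Checking isEmpty() before calling first() avoids dealing with a null result when nothing matches.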
Try to avoid using regex for this task. Believe me, I've tried, and it's an absolute can of worms!
Hope this helps. Mikey.