Extract a number from an HTML page using Ant and regex




I have to extract a number from a web page using Ant. I have already downloaded the page using an Ant task. My page is:

    <!doctype html public "-//w3c//dtd html 3.2 final//en">
    <html>
    <head>
    <title>index of .......</title>
    </head>
    <body>
    <h1>index of .....</h1>
    <pre><img src="/icons/blank.gif" alt=" "> <a href="?n=a">name</a> <a href="?m=d">last modified</a> <a href="?s=a">size</a> <a href="?d=a">description</a>
    <hr>
    <img src="/icons/back.gif" alt="[dir]"> <a href="/projects/i/">parent directory</a> 19-dec-2012 11:39 -
    <img src="/icons/folder.gif" alt="[dir]"> <a href="20120114-1731/">20120114-1731/</a> 14-feb-2012 17:40 -
    <img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-1055/">20120115-1055/</a> 15-feb-2012 11:04 -
    <img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-1336/">20120115-1336/</a> 15-feb-2012 13:44 -
    <img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-1656/">20120115-1656/</a> 15-feb-2012 17:05 -
    <img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-2157/">20120115-2157/</a> 15-feb-2012 22:06 -
    </pre><hr>
    <address>apache/1.3.41 server @ romgsa.ibm.com port 443</address>
    </body></html>

From this line:

    <img src="/icons/folder.gif" alt="[dir]"> <a href="20120114-1731/">20120114-1731/</a>

I have to extract "20120114-1731".

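The following example embeds a Groovy script. Groovy has a useful @Grab annotation that can be used to download Java libraries such as HtmlCleaner, which enables the HTML page to be parsed as XML.

Example

The bootstrap target downloads the Groovy and Ivy jars into ${user.home}/.ant/lib, which installs Groovy for Ant.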

$ ant bootstrap

Running the build produces the following expected output:

    $ ant
    ..
    parse:
       [groovy] 20120114-1731/
       [groovy] 20120115-1055/
       [groovy] 20120115-1336/
       [groovy] 20120115-1656/
       [groovy] 20120115-2157/

build.xml

    <project name="demo" default="parse">

        <target name="bootstrap">
            <mkdir dir="${user.home}/.ant/lib"/>
            <get dest="${user.home}/.ant/lib/groovy-all.jar" src="http://search.maven.org/remotecontent?filepath=org/codehaus/groovy/groovy-all/2.1.1/groovy-all-2.1.1.jar"/>
            <get dest="${user.home}/.ant/lib/ivy.jar" src="http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar"/>
        </target>

        <target name="parse">
            <taskdef name="groovy" classname="org.codehaus.groovy.ant.Groovy"/>

            <groovy>
            import org.htmlcleaner.HtmlCleaner
            import org.htmlcleaner.SimpleXmlSerializer

            @Grab(group='net.sourceforge.htmlcleaner', module='htmlcleaner', version='2.2.1')

            // HTML page to parse
            def address = 'file:///path/to/example/page.html'

            // Clean any messy HTML
            def cleaner = new HtmlCleaner()
            def node = cleaner.clean(address.toURL())

            // Convert from HTML to XML
            def serializer = new SimpleXmlSerializer(cleaner.getProperties())
            def xml = serializer.getXmlAsString(node)

            // Parse the XML into a document we can work with
            def page = new XmlSlurper(false, false).parseText(xml)

            // Retrieve the anchor tag values matching the pattern
            def numbers = page.body.pre.a.findAll { it.toString().startsWith("2012") }

            numbers.each {
                println it
            }
            </groovy>
        </target>

    </project>
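Since the question asks for a regex, here is a rough sketch of an alternative that uses only core Ant tasks and no Groovy. It is not part of the answer above. It assumes the downloaded page is saved as page.html next to the build file (a hypothetical file name) and that every directory entry sits on its own line, as Apache normally writes its index listings. The project name regex-demo, the target name extract and the property name dirs are likewise made up for illustration.

```xml
<project name="regex-demo" default="extract">

    <target name="extract">
        <!-- Read the saved page line by line (page.html is an assumed file name) -->
        <loadfile property="dirs" srcFile="page.html">
            <filterchain>
                <tokenfilter>
                    <!-- Keep only lines whose href looks like YYYYMMDD-HHMM/ and
                         replace the whole line with the captured group, so the
                         trailing slash is dropped -->
                    <containsregex pattern='.*href="(\d{8}-\d{4})/".*' replace="\1"/>
                </tokenfilter>
            </filterchain>
        </loadfile>

        <!-- The property holds one matching value per line -->
        <echo message="${dirs}"/>
    </target>

</project>
```

With the listing shown in the question, the dirs property should end up holding the five directory names without the trailing slash, one per line. A regex over raw HTML is brittle, so if the page layout changes, the HtmlCleaner approach above is likely the more robust choice.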

