Extract number from HTML page using Ant and regex

June 15, 2015

I have to extract a number from a web page using Ant. I have downloaded the page using an Ant task. My page is:

<!doctype html public "-//w3c//dtd html 3.2 final//en">
<html>
 <head>
  <title>index of .......</title>
 </head>
 <body>
<h1>index of .....</h1>
<pre><img src="/icons/blank.gif" alt="     "> <a href="?n=a">name</a>                    <a href="?m=d">last modified</a>       <a href="?s=a">size</a>  <a href="?d=a">description</a>
<hr>
<img src="/icons/back.gif" alt="[dir]"> <a href="/projects/i/">parent directory</a>            19-dec-2012 11:39       -
<img src="/icons/folder.gif" alt="[dir]"> <a href="20120114-1731/">20120114-1731/</a>          14-feb-2012 17:40       -
<img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-1055/">20120115-1055/</a>          15-feb-2012 11:04       -
<img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-1336/">20120115-1336/</a>          15-feb-2012 13:44       -
<img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-1656/">20120115-1656/</a>          15-feb-2012 17:05       -
<img src="/icons/folder.gif" alt="[dir]"> <a href="20120115-2157/">20120115-2157/</a>          15-feb-2012 22:06       -
</pre><hr>
<address>apache/1.3.41 server @ romgsa.ibm.com port 443</address>
</body></html>

From this line:

<img src="/icons/folder.gif" alt="[dir]"> <a href="20120114-1731/">20120114-1731/</a>

I have to extract "20120114-1731".

The following example embeds a Groovy script in the build. Groovy has a useful Grab annotation that can be used to download Java libraries such as HtmlCleaner, which enables an HTML page to be parsed as XML.

Example

The bootstrap target downloads and installs Groovy.

$ ant bootstrap

Running the build produces the following expected output:

$ ant
..
parse:
   [groovy] 20120114-1731/
   [groovy] 20120115-1055/
   [groovy] 20120115-1336/
   [groovy] 20120115-1656/
   [groovy] 20120115-2157/

build.xml

<project name="demo" default="parse">

    <!-- One-time setup: put Groovy and Ivy (needed by @Grab) on Ant's classpath -->
    <target name="bootstrap">
        <mkdir dir="${user.home}/.ant/lib"/>
        <get dest="${user.home}/.ant/lib/groovy-all.jar" src="http://search.maven.org/remotecontent?filepath=org/codehaus/groovy/groovy-all/2.1.1/groovy-all-2.1.1.jar"/>
        <get dest="${user.home}/.ant/lib/ivy.jar" src="http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar"/>
    </target>

    <target name="parse">
        <taskdef name="groovy" classname="org.codehaus.groovy.ant.Groovy"/>

        <groovy>
            import org.htmlcleaner.HtmlCleaner;
            import org.htmlcleaner.SimpleXmlSerializer;

            @Grab(group='net.sourceforge.htmlcleaner', module='htmlcleaner', version='2.2.1')

            // HTML page to parse
            def address = 'file:///path/to/example/page.html'

            // Clean up the messy HTML
            def cleaner = new HtmlCleaner()
            def node = cleaner.clean(address.toURL())

            // Convert the HTML to XML
            def serializer = new SimpleXmlSerializer(cleaner.getProperties())
            def xml = serializer.getXmlAsString(node)

            // Parse the XML into a document we can work with
            def page = new XmlSlurper(false,false).parseText(xml)

            // Retrieve the anchor tag values matching the pattern
            def numbers = page.body.pre.a.findAll { it.toString().startsWith("2012") }
            numbers.each {
                println it
            }
        </groovy>
    </target>

</project>
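For comparison with the HtmlCleaner approach above, the directory names can also be matched with a plain regular expression, which is what the question's title asks about. The following is only a minimal sketch in plain Java, not part of the original answer: the class name DirNameExtractor, the method extract, and the pattern (eight digits, a dash, four digits, inside an href) are assumptions based on the sample listing, and a regex like this will break if the listing format changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
public class DirNameExtractor {

    // Assumed pattern: href="20120114-1731/" with the name captured
    // without its trailing slash.
    private static final Pattern DIR_LINK =
            Pattern.compile("href=\"(\\d{8}-\\d{4})/\"");

    // Returns every directory name found in the given HTML, in document order.
    public static List<String> extract(String html) {
        List<String> names = new ArrayList<>();
        Matcher matcher = DIR_LINK.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Two rows copied from the sample listing in the question.
        String html =
                "<img src=\"/icons/back.gif\" alt=\"[dir]\"> "
              + "<a href=\"/projects/i/\">parent directory</a> "
              + "<img src=\"/icons/folder.gif\" alt=\"[dir]\"> "
              + "<a href=\"20120114-1731/\">20120114-1731/</a>";

        // Prints 20120114-1731; the parent directory link does not match.
        for (String name : extract(html)) {
            System.out.println(name);
        }
    }
}
```

Unlike the Groovy script, this returns the names without the trailing slash, because the capture group stops before it.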
