bash - how to parse html by sed - extract two strings delimited by two strings - on different lines, sequentially -




I have this bash script:

v1='value="'
v2='" type'

do_parse_html_file() {
    sed -n "s/.*${v1}//;s/${v2}.*//p" "${_script_path}/iblocklistlists.html"|egrep '^http' >${_tmp_file}
}

... which extracts the URLs from an HTML file. What I want to have on output:

somename url
somename url

--- An illustration of the input HTML file follows:

</tr>
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>bluetack</td>
<td><img style="border:0;" src="i-blocklist%20%7c%20lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="bcoepfyewziejvcqyhqo" readonly="readonly" onclick="select_text('bcoepfyewziejvcqyhqo');" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
<tr class="alt02">
<td><b><a href="http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib">iana-private</a></b></td>
<td>bluetack</td>
<td><img style="border:0;" src="i-blocklist%20%7c%20lists_files/star_4.png" alt="" height="15" width="75"></td>
<td><input style="width:200px; outline:none; border-style:solid; border-width:1px; border-color:#ccc;" id="cslpybexmxyuacbyuvib" readonly="readonly" onclick="select_text('cslpybexmxyuacbyuvib');" value="http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>

--- The result should be the following:

iana-reserved http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&fileformat=p2p&archiveformat=gz
iana-private http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&fileformat=p2p&archiveformat=gz

--- Is it possible to do this with sed as a one-line command? If so, please help.

The first part of each output line, "somename", comes first in the file; the URL that belongs to it sits on a following line (it does not have to be the very next one).

>somename ... delimited by 'href="url">' and '</a>' on one line
>url ... delimited by 'value="' and '" type' on a following line

Thank you, kind regards. M.

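With the CLI HTML parser xidel this is a single line (note that this expression pairs each link text with the link's own href attribute, not with the value attribute of the input element):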

xidel "${_script_path}/iblocklistlists.html" -e '//a/concat(., " ", @href)'
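For a sed-only route, something along these lines might work. It is only a sketch, and it assumes what the question describes: the link name and the value="..." URL are on separate lines, with the name appearing first. The name is stashed in sed's hold space and joined with the URL once that line shows up. The function name extract_pairs and the shortened sample file are made up for the illustration (the style/onclick attributes are dropped for brevity); they are not part of the original script.

```shell
# Hypothetical sample input, abbreviated from the HTML shown in the question.
html_file=$(mktemp)
cat >"$html_file" <<'EOF'
<tr class="alt01">
<td><b><a href="http://www.iblocklist.com/list.php?list=bcoepfyewziejvcqyhqo">iana-reserved</a></b></td>
<td>bluetack</td>
<td><input id="bcoepfyewziejvcqyhqo" readonly="readonly" value="http://list.iblocklist.com/?list=bcoepfyewziejvcqyhqo&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
<tr class="alt02">
<td><b><a href="http://www.iblocklist.com/list.php?list=cslpybexmxyuacbyuvib">iana-private</a></b></td>
<td>bluetack</td>
<td><input id="cslpybexmxyuacbyuvib" readonly="readonly" value="http://list.iblocklist.com/?list=cslpybexmxyuacbyuvib&amp;fileformat=p2p&amp;archiveformat=gz" type="text"></td>
</tr>
EOF

# extract_pairs FILE (hypothetical helper name)
# Prints "name url" for every <a>...</a> line followed later by a value="http..." line.
extract_pairs() {
    sed -n '
        # Line with the link: reduce it to the link text and save it in the hold space.
        /<a href=/{
            s/.*<a href="[^"]*">\([^<]*\)<\/a>.*/\1/
            h
        }
        # Line with the input field: reduce it to the URL, decode &amp;,
        # append it to the saved name, then print "name url".
        /value="http/{
            s/.*value="\([^"]*\)" type.*/\1/
            s/&amp;/\&/g
            H
            x
            s/\n/ /
            p
        }
    ' "$1"
}

result=$(extract_pairs "$html_file")
printf '%s\n' "$result"
rm -f "$html_file"
```

The commands can also be joined with semicolons into one -e expression if a literal one-liner is wanted. Since the pairing relies on line layout, a real HTML parser is likely the more robust choice when the page format changes.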

string bash sed extract
