java - Getting distinct elements from HTML -

- February 15, 2013

I want to get the distinct elements that contain text from an HTML page, such that redundancy is minimal. For example:

<div class="business_card">
    <p><span id="title"><b><a href="board" target="_self">john abc</a></b></span>
    <br>
    director <br>
    123 456 78<br>
    <span class="email">
        <a href="mailto:john.abc@example.com">send me email &raquo;</a> </span></p>
</div>

For the above HTML, I would like to have these elements as items:

<a href="board" target="_self">john abc</a>
<a href="mailto:john.abc@example.com">send me email »</a>
<p>director<br>123 456 78</p>

Here is the code I have written. So far it's working quite well, except that on the above example the text "director 123 456 78" is not collected. I tried to add || element.ownText() != "" after !element.isBlock(), but that causes many duplications.

private static def collectChildren(Element element) {
    if (element.children().size() > 0) {
        element.children().collect { it ->
            if (!element.isBlock())
                [element, collectChildren(it)]
            else collectChildren(it)
        }
    } else if (element.hasText() || element.attr("alt") != ""
            || element.attr("title") != "" || element.attr("href") != "") {
        element
    } else {
        []
    }
}

What the code below does: it iterates over the children, and those containing interesting nodes are added to the result. Those children are then removed from the current node. If the "remaining" node hasText (which drops whitespace-only nodes) and isBlock (optional), that element is added to the result too.

This works at least for the given example. If you want to have the <p> node without the other valid elements inside it, you either have to build it or have to create it by removing the unwanted ones. You might still need additional filtering on the remaining node (e.g. removing any remaining children that are block elements). I hope this gives you some inspiration:

@Grab('org.jsoup:jsoup:1.8.1')
import org.jsoup.*
import org.jsoup.nodes.*

def doc = Jsoup.parse('''\
<div class="business_card">
<p>
<span id="title"><b><a href="board" target="_self">john abc</a></b></span>
<br>
director <br>
123 456 78<br>
<span class="email"><a href="mailto:john.abc@example.com">send me email &raquo;</a></span>
</p>
</div>''')

def collectChildren(Element element) {
    if (element.children().size() > 0) {
        def found = []
        // Collect matches from each child; children that yielded matches are removed afterwards
        element.children().findAll{
            def c = collectChildren(it).flatten()
            if (c) {
                found.addAll(c)
            }

            return c
        }*.remove()
        // What is left of this element counts only if it is a block that still has text
        if (element.hasText() && element.isBlock()) {
            found << element
        }
        found
    } else if (element.hasText() || element.attr("alt") || element.attr("title") || element.attr("href")) {
        [element]
    } else {
        []
    }
}

println collectChildren(doc.body()).flatten().join("\n")
// <a href="board" target="_self">john abc</a>
// <a href="mailto:john.abc@example.com">send me email »</a>
// <p>  <br> director <br> 123 456 78<br>  </p>
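For readers coming from plain Java, the same collect-then-detach idea can be sketched without jsoup. This is only an illustration under assumptions: Node and DistinctCollector are hypothetical names, not jsoup classes, the tree is built by hand instead of being parsed, and the leaf test only looks at text (the alt, title and href checks are left out).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DistinctCollector {
    // Hypothetical stand-in for a DOM element; not a jsoup class.
    static class Node {
        final String tag;
        final boolean block;   // true for block-level tags such as <div> or <p>
        final String ownText;  // text directly inside this node, excluding descendants
        final List<Node> children = new ArrayList<>();

        Node(String tag, boolean block, String ownText, Node... kids) {
            this.tag = tag;
            this.block = block;
            this.ownText = ownText;
            children.addAll(Arrays.asList(kids));
        }

        // True if this node, or any descendant still attached to it, has non-blank text.
        boolean hasText() {
            if (!ownText.trim().isEmpty()) return true;
            for (Node child : children) {
                if (child.hasText()) return true;
            }
            return false;
        }

        @Override
        public String toString() {
            return "<" + tag + ">" + ownText.trim() + "</" + tag + ">";
        }
    }

    static List<Node> collect(Node node) {
        List<Node> found = new ArrayList<>();
        if (node.children.isEmpty()) {
            // Leaf: interesting only if it carries text.
            if (node.hasText()) found.add(node);
            return found;
        }
        Iterator<Node> it = node.children.iterator();
        while (it.hasNext()) {
            List<Node> fromChild = collect(it.next());
            if (!fromChild.isEmpty()) {
                found.addAll(fromChild);
                it.remove(); // detach the child so its text is not counted again
            }
        }
        // The remainder counts only if it is a block that still has text.
        if (node.block && node.hasText()) found.add(node);
        return found;
    }

    // Builds a hand-made tree resembling the business card and joins the results.
    static String demo() {
        Node card = new Node("div", true, "",
            new Node("p", true, "director 123 456 78",
                new Node("span", false, "",
                    new Node("b", false, "", new Node("a", false, "john abc"))),
                new Node("br", false, ""),
                new Node("span", false, "", new Node("a", false, "send me email"))));
        StringBuilder out = new StringBuilder();
        for (Node n : collect(card)) {
            if (out.length() > 0) out.append('\n');
            out.append(n);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running main prints the two <a> nodes followed by the <p> node with only its own text, which mirrors the three items asked for in the question. Detaching a child once it has produced results is what keeps the parent from reporting the same text a second time.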
