java - Getting distinct elements from HTML
I want to get the distinct elements that contain text from an HTML page, such that redundancy is minimal. Example:
    <div class="business_card">
      <p><span id="title"><b><a href="board" target="_self">john abc</a></b></span> <br>
        director <br>
        123 456 78<br>
        <span class="email"> <a href="mailto:john.abc@example.com">send me email »</a> </span></p>
    </div>
For the above HTML I would like to have these items as elements:
<a href="board" target="_self">john abc</a>
<a href="mailto:john.abc@example.com">send me email »</a>
<p>director<br>123 456 78</p>
Here is the code I have written. So far it's working quite well, except that on the above example the text "director 123 456 78" is not collected. I tried to add || element.ownText() != "" after !element.isBlock(), but that causes many duplications.
    private static def collectChildren(Element element) {
        if (element.children().size() > 0) {
            element.children().collect { it ->
                if (!element.isBlock())
                    [element, collectChildren(it)]
                else
                    collectChildren(it)
            }
        } else if (element.hasText() || element.attr("alt") != "" || element.attr("title") != "" || element.attr("href") != "") {
            element
        } else {
            []
        }
    }
What about this: iterate over the children, and add the results of those containing interesting nodes to the result. Those children are then removed from the current node. If the "remaining" node hasText() (which drops whitespace-only nodes) and isBlock() (optional), add the element to the result too.
This works at least for the given example. If you want to have the <p> node without the other valid elements inside, you either have to build it yourself or create it by removing the unwanted parts. You might still need additional filtering in the remaining node (e.g. remove the remaining children that are not block elements). Hope this gives you some inspiration:
    @Grab('org.jsoup:jsoup:1.8.1')
    import org.jsoup.*
    import org.jsoup.nodes.*

    def doc = Jsoup.parse('''\
    <div class="business_card">
      <p>
        <span id="title"><b><a href="board" target="_self">john abc</a></b></span> <br>
        director <br>
        123 456 78<br>
        <span class="email"><a href="mailto:john.abc@example.com">send me email »</a></span>
      </p>
    </div>''')

    def collectChildren(Element element) {
        if (element.children().size() > 0) {
            def found = []
            // collect from each child; children that yielded results are removed from this node
            element.children().findAll {
                def c = collectChildren(it).flatten()
                if (c) {
                    found.addAll(c)
                }
                return c
            }*.remove()
            // whatever text is left belongs to this node itself
            if (element.hasText() && element.isBlock()) {
                found << element
            }
            found
        } else if (element.hasText() || element.attr("alt") || element.attr("title") || element.attr("href")) {
            [element]
        } else {
            []
        }
    }

    println collectChildren(doc.body()).flatten().join("\n")
    // <a href="board" target="_self">john abc</a>
    // <a href="mailto:john.abc@example.com">send me email »</a>
    // <p> <br> director <br> 123 456 78<br> </p>
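Since the question is tagged java, here is a rough sketch of the same collect-then-remove idea in plain Java. It does not use jsoup: the Node class, its fields and the sample() tree are hypothetical stand-ins made up for illustration, attributes such as alt, title and href are ignored, and a node's own text is simply printed before its children. It is only meant to show why detaching the already collected children keeps their text from being reported a second time through the parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DistinctCollector {

    /** Hypothetical toy element; a stand-in for jsoup's Element, not its API. */
    static final class Node {
        final String tag;
        final boolean block;      // true for <div>, <p>; false for <span>, <a>, <br>
        final String ownText;     // text directly inside this node
        final List<Node> children;

        Node(String tag, boolean block, String ownText, Node... kids) {
            this.tag = tag;
            this.block = block;
            this.ownText = ownText;
            this.children = new ArrayList<>(Arrays.asList(kids));
        }

        /** True if this node or any remaining descendant has non-blank text. */
        boolean hasText() {
            if (!ownText.trim().isEmpty()) return true;
            for (Node c : children) if (c.hasText()) return true;
            return false;
        }

        @Override
        public String toString() {
            if (ownText.isEmpty() && children.isEmpty()) return "<" + tag + ">";
            StringBuilder sb = new StringBuilder("<" + tag + ">").append(ownText);
            for (Node c : children) sb.append(c);
            return sb.append("</").append(tag).append(">").toString();
        }
    }

    /** Collects text-bearing leaves first, then the block nodes that still have text left. */
    static List<Node> collect(Node el) {
        List<Node> found = new ArrayList<>();
        if (el.children.isEmpty()) {
            if (el.hasText()) found.add(el);
            return found;
        }
        Iterator<Node> it = el.children.iterator();
        while (it.hasNext()) {
            List<Node> sub = collect(it.next());
            if (!sub.isEmpty()) {
                found.addAll(sub);
                it.remove();      // detach, so this text is not repeated via the parent
            }
        }
        if (el.block && el.hasText()) found.add(el);
        return found;
    }

    /** Toy model of the business card markup from the question. */
    static Node sample() {
        Node a1 = new Node("a", false, "john abc");
        Node a2 = new Node("a", false, "send me email");
        Node p = new Node("p", true, "director 123 456 78",
                new Node("span", false, "", new Node("b", false, "", a1)),
                new Node("br", false, ""),
                new Node("br", false, ""),
                new Node("br", false, ""),
                new Node("span", false, "", a2));
        return new Node("div", true, "", p);
    }

    public static void main(String[] args) {
        for (Node n : collect(sample())) System.out.println(n);
        // <a>john abc</a>
        // <a>send me email</a>
        // <p>director 123 456 78<br><br><br></p>
    }
}
```

The empty <br> nodes stay inside the <p>, which is the leftover that the extra filtering mentioned above would have to deal with.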