java - Getting distinct elements from HTML
I want to get the distinct elements that contain text from an HTML page, such that redundancy is minimal. For example:
    <div class="business_card">
    <p><span id="title"><b><a href="board" target="_self">john abc</a></b></span>
    <br> director <br> 123 456 78<br>
    <span class="email"> <a href="mailto:john.abc@example.com">send me email »</a> </span></p>
    </div>

For the above HTML I would like to get these items, each as an element:
    <a href="board" target="_self">john abc</a>
    <a href="mailto:john.abc@example.com">send me email »</a>
    <p>director<br>123 456 78</p>
Here is the code I have written. So far it is working quite well, except that on the above example the text "director 123 456 78" is not collected. I tried adding || element.ownText() != "" after !element.isBlock(), but that causes many duplications.
    private static def collectChildren(Element element) {
        if (element.children().size() > 0) {
            element.children().collect { it ->
                if (!element.isBlock()) [element, collectChildren(it)]
                else collectChildren(it)
            }
        } else if (element.hasText() || element.attr("alt") != "" || element.attr("title") != "" || element.attr("href") != "") {
            element
        } else {
            []
        }
    }
What this does: iterate over the children, and add those that contain interesting nodes to the result. These children are then removed from the current node. If the "remaining" node hasText (which drops whitespace-only nodes) and isBlock (optional), add that element to the result too.
This works at least for the given example. If you want the <p> node without the other valid elements inside it, you either have to build it yourself or create it by removing the unwanted ones. You might still need additional filtering in the remaining node (e.g. removing remaining children that are block elements). Hope this gives some inspiration:
    @Grab('org.jsoup:jsoup:1.8.1')
    import org.jsoup.*
    import org.jsoup.nodes.*

    def doc = Jsoup.parse('''\
    <div class="business_card">
    <p>
    <span id="title"><b><a href="board" target="_self">john abc</a></b></span>
    <br> director <br> 123 456 78<br>
    <span class="email"><a href="mailto:john.abc@example.com">send me email »</a></span>
    </p>
    </div>''')

    def collectChildren(Element element) {
        if (element.children().size() > 0) {
            def found = []
            // keep the children that yielded results, then detach them from this element
            element.children().findAll {
                def c = collectChildren(it).flatten()
                if (c) {
                    found.addAll(c)
                }
                return c // a non-empty list is truthy in Groovy
            }*.remove()
            // whatever text is left now belongs to this element alone
            if (element.hasText() && element.isBlock()) {
                found << element
            }
            found
        } else if (element.hasText() || element.attr("alt") || element.attr("title") || element.attr("href")) {
            [element]
        } else {
            []
        }
    }

    println collectChildren(doc.body()).flatten().join("\n")
    // <a href="board" target="_self">john abc</a>
    // <a href="mailto:john.abc@example.com">send me email »</a>
    // <p> <br> director <br> 123 456 78<br> </p>
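For readers who want to see the prune-then-collect idea without a jsoup dependency, here is a minimal sketch in plain Java. The `Node` class, its `block` and `hasHref` fields, and the `PruneAndCollect` helper are all hypothetical stand-ins invented for illustration; they are not part of jsoup, and `hasHref` loosely stands in for the alt/title/href attribute checks above. The sample tree is hand-built to mirror the business card markup.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an HTML element; NOT jsoup's Element class.
class Node {
    final String tag;
    final boolean block;    // true for block-level tags such as <p> or <div>
    final boolean hasHref;  // stands in for the alt/title/href attribute checks
    final String ownText;   // text directly inside this node, excluding children
    final List<Node> children = new ArrayList<>();

    Node(String tag, boolean block, String ownText, boolean hasHref) {
        this.tag = tag;
        this.block = block;
        this.ownText = ownText;
        this.hasHref = hasHref;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }

    // True if this node or any remaining descendant carries non-blank text.
    boolean hasText() {
        if (!ownText.trim().isEmpty()) return true;
        for (Node c : children) {
            if (c.hasText()) return true;
        }
        return false;
    }
}

class PruneAndCollect {
    // Collects interesting nodes bottom-up and detaches every child that
    // contributed, so a parent only counts the text that is left over.
    static List<Node> collect(Node node) {
        List<Node> found = new ArrayList<>();
        if (node.children.isEmpty()) {
            if (node.hasText() || node.hasHref) found.add(node);
            return found;
        }
        Iterator<Node> it = node.children.iterator();
        while (it.hasNext()) {
            List<Node> sub = collect(it.next());
            if (!sub.isEmpty()) {
                found.addAll(sub);
                it.remove(); // prune the child that already produced results
            }
        }
        if (node.block && node.hasText()) found.add(node);
        return found;
    }

    // Hand-built tree mirroring the business card example.
    static Node sample() {
        Node nameLink = new Node("a", false, "john abc", true);
        Node mailLink = new Node("a", false, "send me email »", true);
        Node title = new Node("span", false, "", false)
                .add(new Node("b", false, "", false).add(nameLink));
        Node email = new Node("span", false, "", false).add(mailLink);
        Node p = new Node("p", true, "director 123 456 78", false)
                .add(title)
                .add(new Node("br", false, "", false))
                .add(new Node("br", false, "", false))
                .add(new Node("br", false, "", false))
                .add(email);
        return new Node("div", true, "", false).add(p);
    }

    public static void main(String[] args) {
        for (Node n : collect(sample())) {
            System.out.println(n.tag + ": " + n.ownText);
        }
    }
}
```

Under these assumptions the sample yields the two links followed by the paragraph, and the paragraph keeps only its `br` children. Because each contributing child is detached before its parent is examined, no text is reported twice, which is what adding an `ownText()` check to the original code could not guarantee.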