python - Calculating coverage of a Scrapy web spider -


I am writing web spiders that scrape products from websites using the Scrapy framework in Python. I am wondering what the best practices are for calculating the coverage and missing items of the written spiders.

What I'm doing right now is logging the cases where I am unable to parse something or where an exception is raised. For example: when I expect a specific format for the price of a product or the address of a place and the regular expressions I have written don't match the scraped strings, or when XPath selectors for specific data return nothing.

Sometimes, when products are listed on one page or spread across multiple ones, I use curl and grep to roughly count the number of products. I am wondering if there are better practices to handle this.

The common approach is, yes, to use logging to log the error and exit the callback by returning nothing.

Example (product price is required):

loader = ProductLoader(ProductItem(), response=response)
loader.add_xpath('price', '//span[@class="price"]/text()')
if not loader.get_output_value('price'):
    log.msg("Error fetching product price", level=log.ERROR)
    return
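The scrapy.log module used above comes from older Scrapy versions; in current releases the same idea is usually written with the spider's built-in self.logger. A minimal sketch, where ProductItem, ProductLoader, the spider name and the URL are assumptions made purely for illustration:

import scrapy
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    # only the field needed for the example
    price = scrapy.Field()


class ProductLoader(ItemLoader):
    default_item_class = ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        loader = ProductLoader(ProductItem(), response=response)
        loader.add_xpath('price', '//span[@class="price"]/text()')
        if not loader.get_output_value('price'):
            # log the problem and leave the callback without yielding an item
            self.logger.error("Error fetching product price: %s", response.url)
            return
        yield loader.load_item()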

You can also use signals to catch and log any kind of exception that happens while crawling; see the Scrapy signals documentation.

This follows the "easier to ask forgiveness than permission" principle: you let the spider fail, then catch and process the error in a single, particular place - the signal handler.
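As a minimal sketch of that approach, assuming a recent Scrapy version and its built-in spider_error signal (the spider name, URL and handler name below are made up):

import logging

import scrapy
from scrapy import signals

logger = logging.getLogger(__name__)


class SignalAwareSpider(scrapy.Spider):
    name = "signal_aware_products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires whenever a spider callback raises an exception
        crawler.signals.connect(spider.handle_spider_error, signal=signals.spider_error)
        return spider

    def handle_spider_error(self, failure, response, spider):
        # the single place where every failed parse is recorded: URL plus traceback
        logger.error("Parsing failed for %s\n%s", response.url, failure.getTraceback())

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').get()
        if price is None:
            # just fail - the signal handler above will log it
            raise ValueError("missing product price on %s" % response.url)
        yield {"price": price}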


Other thoughts:

  • you can place response URLs and error tracebacks into a database for later review - it is still "logging", but in a structured manner that can be more convenient to go through afterwards
  • a good idea might be to create custom exceptions that represent different crawling errors, for instance MissingRequiredFieldError and InvalidFieldFormatError, which you can raise when crawled fields haven't passed validation (see the sketch after this list)
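To illustrate both points together, here is a sketch that raises custom validation exceptions from the callback and stores each failure in a database via the same spider_error signal as above. The exception names come from the list; everything else (SQLite as the store, the table layout, spider name, URL and XPath) is an assumption for illustration:

import sqlite3

import scrapy
from scrapy import signals


class CrawlValidationError(Exception):
    """Base class for errors raised when a crawled page fails validation."""


class MissingRequiredFieldError(CrawlValidationError):
    pass


class InvalidFieldFormatError(CrawlValidationError):
    pass


class ValidatedProductSpider(scrapy.Spider):
    name = "validated_products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.record_error, signal=signals.spider_error)
        return spider

    def record_error(self, failure, response, spider):
        # structured "logging": keep the URL, the exception and the traceback in SQLite
        conn = sqlite3.connect("crawl_errors.db")
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS errors (url TEXT, error TEXT, traceback TEXT)"
            )
            conn.execute(
                "INSERT INTO errors VALUES (?, ?, ?)",
                (response.url, repr(failure.value), failure.getTraceback()),
            )
        conn.close()

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').get()
        if price is None:
            raise MissingRequiredFieldError("price missing on %s" % response.url)
        if not price.strip().startswith("$"):
            raise InvalidFieldFormatError("unexpected price format: %r" % price)
        yield {"price": price}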
