Gain Improvements - Ludaro

**re.findall issue**

I reviewed the tests in this project after experiencing issues with my regex also catching some html as part of the process.

So I reviewed this test file: https://github.com/gaojiuli/gain/blob/master/tests/test_parse_multiple_items.py and catched the response of [abstract_url.py ](https://github.com/gaojiuli/gain/blob/2c8160c92943837613a773f681fb190a8c434bb2/gain/parser.py#L79)

Version 0.1.4 of this project catches this as response:
```
URLS we found: ['/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/']
```
`re.findall` returns what is requested by your regex but not what is matched!

**Test incorrect**

The base url  http://quotes.toscrape.com/ and http://quotes.toscrape.com/page/1 are the same page and if you look into the html you shall only find a reference to "/page/2" but not to "/page/1". For this reason the test seems to work but it was actually flawed from the start.

![image](https://user-images.githubusercontent.com/6716689/45589440-c849ac00-b925-11e8-83ee-3c7fbe977189.png)

**re.match**

I rewrote function abstract_url to:
```python
    def abstract_urls(self, html, base_url):
        _urls = []

        try:
            document = lxml.html.fromstring(html)
            document_domain = urlparse.urlparse(base_url).netloc
            
            for (al, attr, link, pos) in document.iterlinks():
                link = re.sub("#.*", "", link or "")

                if not link:
                    continue

                _urls.append(link)
        except (etree.XMLSyntaxError, etree.ParserError) as e:
            logger.error("While parsing the html for {} we received the following error {}.".format(base_url, e))

        # Cleanup urls
        r = re.compile(self.rule)
        urls = list(filter(r.match, _urls))

        return urls
```

and now this is the result of abstract_url:
```
['/static/bootstrap.min.css', '/static/main.css', '/', '/login', '/author/Albert-Einstein', '/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/author/J-K-Rowling', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/author/Albert-Einstein', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/author/Jane-Austen', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/author/Marilyn-Monroe', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/author/Albert-Einstein', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/author/Andre-Gide', '/tag/life/page/1/', '/tag/love/page/1/', '/author/Thomas-A-Edison', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/author/Eleanor-Roosevelt', '/tag/misattributed-eleanor-roosevelt/page/1/', '/author/Steve-Martin', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/page/2/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/', 'https://www.goodreads.com/quotes', 'https://scrapinghub.com']
```

**This test:** tests/test_parse_multiple_items.py now fails as it should.





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gain Improvements - Ludaro #46

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development