+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar -*title=* -*Category:* -*Org:* -*Meta:* -*Talk:* -*User:* -*Special:* -*File:* -*action=* -*section=* -*Dev:* -*Help:* -*Template:*
It's important to make sure your scan rules don't include "+*.png" or any other image type! Instead, use the "Get non-html files related to a link" option in the Links tab to fetch images. Wikis are known to use a fake image link that actually points to an HTML page, which confuses the spider.
Edit: since we don't download the intermediate HTML file, a regex like this can be used to strip the broken link around every image:
<a href="http://.*?>(<img.*?)</a>
replace with $1
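If your editor can't do a regex replace across the whole mirror, the same cleanup can be scripted. Below is a rough Python sketch, not a tested part of the original setup. The function names `unwrap_linked_images` and `clean_mirror`, the `*.html` glob, and the UTF-8 encoding are all my assumptions. It also uses a slightly stricter pattern (`[^>]*` in place of the first `.*?`), because the lazy version can start matching at an earlier text link on the same line and swallow it. Python writes the replacement as `\1` where the rule above says `$1`.

```python
import re
from pathlib import Path

# Stricter variant of the regex above (an assumption, not the original rule):
# [^>]* keeps the match inside a single <a ...> opening tag.
LINKED_IMG = re.compile(r'<a href="http://[^>]*>(<img.*?)</a>')


def unwrap_linked_images(html: str) -> str:
    """Replace every <a href="http://...">-wrapped <img> with the bare <img> tag."""
    return LINKED_IMG.sub(r"\1", html)


def clean_mirror(root: str) -> int:
    """Rewrite every *.html file under root; return how many files changed.

    Assumes the mirrored pages are UTF-8 (hypothetical; adjust as needed).
    """
    changed = 0
    for page in Path(root).rglob("*.html"):
        original = page.read_text(encoding="utf-8")
        cleaned = unwrap_linked_images(original)
        if cleaned != original:
            page.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Run `clean_mirror()` on a copy of the mirror folder first, since it overwrites files in place.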