RE: Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1 by zerocoolrocker
Viewing a response to: @zerocoolrocker/re-scipio-learn-python-series-13-mini-project-developing-a-web-crawler-part-1-20180322t042417781z
utopian-io · @scipio
Thanks for your in-depth reply! I know Scrapy, but I don't see much added value in it compared to just using BS4 + Requests (I've appended a minimal sketch of that combo at the end of this reply). Selenium is one option for client-side automation; spawning a nodeJS subprocess running nightmareJS instead is another. The pyvirtualdisplay + xvfb option to "give a head" to a headless browser is indeed possible (also sketched below), but what's the core purpose of a headless browser for "client-side event automation"? Automating things without the need to do them manually and/or render the DOM. Nothing wrong with a few `print()` statements while developing / debugging! Works just fine! ;-)

PS1: This was just part 1 of the web crawler mini-series. More will come very soon, be sure to follow along!

PS2: Since I'm treating my total `Learn Python Series` as an interactive Python Book in the making (publishing episodes as tutorial parts via the Steem blockchain as we go), I must consider the **sorted order** of subjects already discussed very carefully. Therefore I might not use technical mechanisms I would normally use in a real-life software development situation. For example, I haven't yet explained setting up a mongoDB instance and interfacing with it using pymongo, which would of course be my go-to tool when developing a real-life web crawler. Instead, because I just explained "handling files" (earlier this week, 2 episodes ago), I will, for now, in part 2 of the web crawler mini-series, just use plain .txt files for intermediate data storage, since I haven't discussed CSV or JSON either (a small sketch of that approach is appended below as well).

See you around! @scipio
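To give a concrete taste of the BS4 + Requests combo mentioned above, here is a minimal, hypothetical fetch-and-parse sketch. The URL and the tags being extracted are placeholders for illustration only, not code from the series:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL, just for illustration
url = 'https://example.com/'

# Fetch the raw HTML with Requests
response = requests.get(url)
response.raise_for_status()

# Parse it with BeautifulSoup (BS4)
soup = BeautifulSoup(response.text, 'html.parser')

# As a simple example: print the page title and all link targets
print(soup.title.string)
for a_tag in soup.find_all('a', href=True):
    print(a_tag['href'])
```

For simple crawls like the one in this mini-series, a handful of lines like these cover the fetching and parsing that a full Scrapy project would otherwise scaffold around.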
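And here is roughly what the pyvirtualdisplay + xvfb approach could look like with Selenium. This is only a sketch under the assumption that Xvfb, pyvirtualdisplay, Selenium and a Firefox webdriver (geckodriver) are installed; the URL is again a placeholder:

```python
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual X display (backed by Xvfb) to "give a head" to the browser
display = Display(visible=0, size=(1024, 768))
display.start()

# The browser now runs inside the virtual display instead of on a real screen
driver = webdriver.Firefox()
driver.get('https://example.com/')  # placeholder URL
print(driver.title)

# Clean up the browser and the virtual display
driver.quit()
display.stop()
```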
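Finally, a rough idea of what plain .txt intermediate storage could look like, using only the file handling techniques from two episodes ago. The file name and the stored values here are placeholders; the actual part 2 code will differ:

```python
# Append crawled URLs to a plain text file, one per line
found_urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder data

with open('crawled_urls.txt', 'a') as f:
    for url in found_urls:
        f.write(url + '\n')

# Later, read the intermediate results back in
with open('crawled_urls.txt', 'r') as f:
    stored_urls = [line.strip() for line in f if line.strip()]

print(stored_urls)
```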