Scraping the Web With Python
programming · @rabm
Good day. So today I was working on a project where I had to do some [web scraping](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Web_scraping.html). It was, basically, about extracting remote data from web pages: just [HTML](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/HTML.html) code that you have to [parse](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Parsing.html) to extract what you need. With that HTML code parsed, I had to find and extract some information contained in table rows, information that was going to be used later for data analysis / [data science](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Data_science.html). And I thought that maybe somebody could benefit from the simple and easy-to-use [Python](https://www.python.org/) code that I made, well, "part of it", to achieve something similar in the same or different scenarios.

Let's set up a simple scenario. Let's say I have a big bunch of images I composed with [Gimp](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/GIMP.html) and used in [my Busy blog](https://busy.org/@rabm). The cat pulled the power cord, my laptop fell and smashed into the ground, and now my mechanical hard drive is dead. Yeah, I know we have SSDs now, but anyway, imagine the situation :-). Oh no! I just remembered that I didn't back up these images! What do I do, what do I do!... Oh, I have these images in my blog. Hmmm, I have two options: either copy the photos from their [IPFS](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/InterPlanetary_File_System.html) links and spend a few minutes saving each photo one by one ("remember, I had and used a big bunch of images"), or write a super easy [Python](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Python_(programming_language).html) script that will pull the photos for me.
Hmmm, yeah, I think it is better to use [Python](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Python_(programming_language).html), right? So let's do it!

Now, which [Python packages](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Python_Package_Index.html) should I use? Well, I will need to pull remote data from the web, so the best option is definitely [Requests](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Requests_(software).html). Then I will need to [parse](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Parsing.html) that information in a way that makes it easy to extract what I need; the best option there is [BeautifulSoup](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Beautiful_Soup_(HTML_parser).html). I will also need to read the image data, which will be binary, and work with the images, so I will use [Pillow](https://en.wikipedia.org/wiki/Python_Imaging_Library) to work with the images and [IO](https://docs.python.org/3/library/io.html) to read the data easily into Pillow. Alright, that's basically it. Now let's code!

Let's import the [packages](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Python_Package_Index.html) first. OK, we have our packages in place. Now let's define a [string variable](https://en.wikibooks.org/wiki/Python_Programming/Variables_and_Strings#String) with the [URL](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/URL.html) I want to use to pull information. In this case, I will use [my blog post](https://busy.org/@rabm/the-mate-amargo) URL. Perfect. Next, a very complicated piece of code!?
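A minimal sketch of this setup, the imports plus the URL variable, might look like this:

```python
# Requests pulls the remote data, BeautifulSoup parses the HTML,
# Pillow works with the images, and io feeds the binary data
# into Pillow through an in-memory stream.
import requests
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO

# The blog post we want to pull the images from.
url = "https://busy.org/@rabm/the-mate-amargo"
```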
Let's pull all the HTML code from that [URL](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/URL.html) into a [string variable](https://en.wikibooks.org/wiki/Python_Programming/Variables_and_Strings#String) we will call "html", using Requests! Amazing, now we have our HTML code! I had to write so much, I need a coffee! But there's still work to do: we need to parse our HTML so we can pull all the [img tags](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/HTML_element.html#mwBUg) into a [Python list](https://en.wikibooks.org/wiki/Python_Programming/Lists). Done! We used the amazing [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package to parse and extract all the img tags into a Python list, in just two lines! Alright. Now, having all the tags in our "images" list variable, we need to pull the URL out of each src attribute. So let's declare a new list that will contain all the URLs, then use a [for loop](https://en.wikibooks.org/wiki/Python_Programming/Loops#For_Loops) to walk through all the items in the list and pull the src attribute's content using BeautifulSoup's awesome syntax. Great! At this point we have all of our IPFS URLs in a Python list variable.

Now, for the final part, we will need to walk that new list, again with a [for loop](https://en.wikibooks.org/wiki/Python_Programming/Loops#For_Loops), and for each URL make a request with Requests to get its content, then create a Pillow image object from the binary data our request caught. Because the argument to [Pillow's image open method](https://pillow.readthedocs.io/en/3.1.x/reference/Image.html#PIL.Image.open) can be a file object in binary mode, we will make use of an in-memory binary stream, easily, thanks to [Python's IO package](https://docs.python.org/3/library/io.html#binary-i-o).
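The fetch-and-parse steps above could be sketched as follows. In the real script the HTML comes from the live page with `requests.get(url).text`; here a tiny made-up fragment (the `QmHashOne`/`QmHashTwo` sources are placeholders, not real hashes) stands in so the sketch runs without a network connection:

```python
from bs4 import BeautifulSoup

# Live version: html = requests.get(url).text
# Stand-in fragment so this sketch runs offline:
html = """
<html><body>
  <img src="https://ipfs.io/ipfs/QmHashOne" />
  <img src="https://ipfs.io/ipfs/QmHashTwo" />
</body></html>
"""

# Parse the HTML and collect every img tag into a Python list.
soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img")

# Walk the list and pull each tag's src attribute.
image_urls = []
for image in images:
    image_urls.append(image["src"])
```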
Then we will extract the "hash" from the URL, simply using split, and lastly we will save the image as [JPEG](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/JPEG.html), keeping the quality at 100%... Let's do it!

And that's it, guys! This is the basic code required to [scrape](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Web_scraping.html) my web site, extract the images, and save them locally. If everything goes well (do you have an Internet connection? ;-), we will end up with a set of images named "<hash>.jpg", where <hash> is the IPFS hash. Isn't that amazing? How easy it is to do this in Python with the help of all these awesome packages? And you can reuse this over and over... Of course, this is a limited script and you might run into some issues, for example when Pillow is not able to recognize or load the image format, in which case you will need to handle those cases with [if statements](https://en.wikibooks.org/wiki/Python_Programming/Conditional_Statements#If_statements), but nothing crazy. There is also an [IPFS package for Python](https://github.com/ipfs/py-ipfs-api) that you can use to figure out the real names of the files, and things like that. If you want to try this code, I wrote it and shared it publicly on [my Repl.it account](https://repl.it/@iBobX/scrapingtewebwithpython).

Peace and love to everybody, and thanks for reading.

Roberto
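As a recap, the final download-and-save loop described above could be sketched like this. The helper names `ipfs_hash` and `save_images` are hypothetical, chosen just for this sketch, and the `convert("RGB")` call is one way to sidestep the format issues mentioned (JPEG has no alpha channel):

```python
import requests
from PIL import Image
from io import BytesIO

def ipfs_hash(image_url):
    # The IPFS hash is the last path segment of the URL,
    # e.g. ".../ipfs/QmAbc" -> "QmAbc".
    return image_url.split("/")[-1]

def save_images(image_urls):
    for image_url in image_urls:
        # Download the raw binary content of the image.
        response = requests.get(image_url)
        # Wrap the bytes in an in-memory binary stream so
        # Pillow can open it like a file.
        image = Image.open(BytesIO(response.content))
        # Drop any alpha channel, then save as JPEG at
        # maximum quality, named after the hash.
        image.convert("RGB").save(ipfs_hash(image_url) + ".jpg",
                                  "JPEG", quality=100)
```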