urllib is a Python library for accessing the internet and getting information from websites, including their publicly available source code. Data obtained through the library is generally in JSON, HTML, or XML format. In this tutorial, you will see how to get data from a website using the urllib library. By the end of this tutorial, you will know:
- How to send a request to a URL
- How to read HTML files from a URL
- How to get response headers from a URL
Let’s jump into it.
How to Send a Request to a URL
You can send a request to a URL using the urlopen() function from the urllib.request module. Let us walk through the code.
```python
# import the request module from urllib
from urllib import request

# send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')

# print the result code
print('The result code is', url.getcode())

# print the status
print('The status is', url.status)
```
Output:

```
The result code is 200
The status is 200
```
Let’s unpack the code above. We begin by importing the request module from the urllib library. Next, we open the URL we wish to access with the urlopen() function. Finally, we check whether the request was successful or not by printing the result code or status.
In both cases, the number 200 was returned. 200 is an HTTP status code indicating that the request was processed successfully. Codes in the 300 range, such as 301 (a permanent redirect), also mean the request was handled, whereas codes such as 404 or 500 indicate errors.
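Note that urlopen() does not hand these error codes back directly; it raises an exception instead. Here is a minimal sketch of handling failures with urllib's built-in HTTPError and URLError exceptions, using the same URL as above:

```python
# a minimal sketch of handling failed requests
from urllib import request, error

try:
    url = request.urlopen('https://www.h2kinfosys.com/blog/')
    print('The status is', url.status)
except error.HTTPError as e:
    # raised for HTTP error codes such as 404 or 500
    print('The server returned an error:', e.code)
except error.URLError as e:
    # raised when the server could not be reached at all
    print('Failed to reach the server:', e.reason)
```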
How to read HTML files from a URL
You can read the HTML of a page by calling the read() method on the response object returned by urlopen(). The code below is a minimal sketch that reads the HTML of the same website used above; note that read() returns bytes, so we decode them to a string.
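```python
# import the request module from urllib
from urllib import request

# send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')

# read() returns the page as bytes; decode it to a string
# (UTF-8 is assumed here, matching the site's Content-Type header)
html = url.read().decode('utf-8')
print(html)
```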
Output:
The output is the raw HTML source of the page.
How to get response headers from a URL
You can get a website's response headers using the getheaders() method. If you don’t know what a header is, HTTP headers are simply metadata the server sends back alongside the page, such as its content type, caching rules, and the date. The code below gets the headers of the URL passed in.
```python
# import the request module from urllib
from urllib import request

# send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')

# print the headers
print(url.getheaders())
```
Output:
```
[('Date', 'Sun, 07 Feb 2021 14:32:30 GMT'), ('Server', 'Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.4.5'), ('X-Powered-By', 'PHP/7.4.5'), ('Link', '<https://www.h2kinfosys.com/blog/wp-json/>; rel="https://api.w.org/"'), ('Link', '<https://www.h2kinfosys.com/blog/>; rel=shortlink'), ('Vary', 'Accept-Encoding'), ('Cache-Control', 'max-age=172800'), ('Expires', 'Tue, 09 Feb 2021 14:32:30 GMT'), ('Strict-Transport-Security', 'max-age=31536000'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked'), ('Content-Type', 'text/html; charset=UTF-8')]
```
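If you only need a single header, the response object also provides a getheader() method, which returns one value (or None if the header is absent). A quick sketch:

```python
from urllib import request

url = request.urlopen('https://www.h2kinfosys.com/blog/')

# getheader() returns a single header value, or None if it is missing
print('Content-Type:', url.getheader('Content-Type'))
```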
Note that there is a cleaner way of scraping data from a website: using Beautiful Soup together with the requests library. You may still decide to use the urllib library, however, to avoid external dependencies.
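For comparison, here is a minimal sketch of the same request with those third-party packages (this assumes you have installed them, for example with pip install requests beautifulsoup4):

```python
# a sketch using the third-party requests and Beautiful Soup packages
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.h2kinfosys.com/blog/')
soup = BeautifulSoup(response.text, 'html.parser')

# Beautiful Soup parses the HTML so you can query it, e.g. the page title
print(soup.title.get_text())
```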
One more thing. It is crucial to point out that many popular websites, such as Google, Twitter, Facebook, Amazon, and Wikipedia, do not support scraping their pages directly. They would rather have you use their APIs to access data, as it is cleaner and reduces the traffic hitting their servers. Scraping pages over a period of time may trigger their defenses and get your IP blocked, especially if you send too many requests in a short window.
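If you do request pages directly, you can at least identify your script and space out your requests. The sketch below uses urllib's Request class to set a User-Agent header and time.sleep() to pause between requests; the second URL and the 5-second delay are purely illustrative values, not values from any particular site's policy:

```python
# a sketch of polite scraping; the URL list and delay are illustrative
import time
from urllib import request

urls = [
    'https://www.h2kinfosys.com/blog/',
    'https://www.h2kinfosys.com/blog/page/2/',  # hypothetical second page
]

for page in urls:
    # identify the script via a custom User-Agent header
    req = request.Request(page, headers={'User-Agent': 'my-tutorial-script'})
    with request.urlopen(req) as response:
        print(page, response.status)
    time.sleep(5)  # wait between requests to avoid flooding the server
```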
If you have any questions, feel free to leave them in the comment section, and I’ll do my best to answer them.