What is a Web Crawler?
A web crawler is an internet bot used to index the World Wide Web. Search engines rely on web crawlers to build the index from which they serve results. A crawler collects all or some specific hyperlinks and HTML content from other websites and presents them in a suitable form. When the number of links to crawl is huge, even the largest crawler cannot cover all of them. This is one reason search engines in the early 2000s were often bad at returning relevant results, but the process has improved a lot since then, and proper results are now returned almost instantly.
Python Web Crawler
The web crawler here is written in Python 3. Python is a high-level programming language that supports object-oriented, imperative, and functional programming and ships with a large standard library.
The web crawler uses two third-party libraries, requests and BeautifulSoup4 (both can be installed with pip). requests provides an easy way to fetch pages over HTTP, and BeautifulSoup4 parses the returned HTML so that specific tags and attributes can be pulled out of it.
Example Code
import requests
from bs4 import BeautifulSoup


def web(page, WebUrl):
    if page > 0:
        url = WebUrl
        # Download the page and keep its HTML as plain text
        code = requests.get(url)
        plain = code.text
        # Parse the HTML so that tags can be searched
        s = BeautifulSoup(plain, "html.parser")
        # Each product link is an <a> tag with this class
        for link in s.find_all('a', {'class': 's-access-detail-page'}):
            tet = link.get('title')
            print(tet)
            tet_2 = link.get('href')
            print(tet_2)


web(1, 'http://www.amazon.in/s/ref=s9_acss_bw_cts_VodooFS_T4_w?rh=i%3Aelectronics%2Cn%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031%2Cp_98%3A10440597031%2Cp_36%3A1500000-99999999&bbn=1805560031&rw_html_to_wsrp=1&pf_rd_m=A1K21FY43GMZF8&pf_rd_s=merchandised-search-3&pf_rd_r=2EKZMFFDEXJ5HE8RVV6E&pf_rd_t=101&pf_rd_p=c92c2f88-469b-4b56-936e-0e65f92eebac&pf_rd_i=1389432031')
Output:
C:\Python34\python.exe C:/Users/Babuya/PycharmProjects/Youtube/web_cr.py
Apple iPhone 6 (Gold, 32GB)
http://www.amazon.in/Apple-iPhone-6-Gold-32GB/dp/B0725RBY9V
OnePlus 5 (Slate Gray 6GB RAM + 64GB memory)
http://www.amazon.in/OnePlus-Slate-Gray-64GB-memory/dp/B01NAKTR2H
OnePlus 5 (Midnight Black 8GB RAM + 128GB memory)
http://www.amazon.in/OnePlus-Midnight-Black-128GB-memory/dp/B01MXZW51M
Apple iPhone 6 (Space Grey, 32GB)
http://www.amazon.in/Apple-iPhone-Space-Grey-32GB/dp/B01NCN4ICO
OnePlus 5 (Soft Gold, 6GB RAM + 64GB memory)
http://www.amazon.in/OnePlus-Soft-Gold-64GB-memory/dp/B01N1TYZR2
Mi Max 2 (Black, 64 GB)
http://www.amazon.in/Mi-Max-Black-64-GB/dp/B073VLGL5Y
Moto G5 Plus (32GB, Fine Gold)
http://www.amazon.in/Moto-Plus-32GB-Fine-Gold/dp/B071ZZ8N5Y
Apple iPhone SE (Space Grey, 32GB)
http://www.amazon.in/Apple-iPhone-SE-Space-Grey/dp/B071DF166C
Honor 8 Pro (Blue, 6GB RAM + 128GB Memory)
http://www.amazon.in/Honor-Pro-Blue-128GB-Memory/dp/B01N4FMUFH
Apple iPhone 7 (Black, 32GB)
http://www.amazon.in/Apple-iPhone-7-Black-32GB/dp/B01LZKSVRB
BlackBerry KEYone (LIMITED EDITION BLACK)
http://www.amazon.in/BlackBerry-KEYone-LIMITED-EDITION-BLACK/dp/B073ZLLVQ9
Apple iPhone SE (Gold, 32GB)
http://www.amazon.in/Apple-iPhone-SE-Gold-32GB/dp/B071RC52N6
Apple iPhone SE (Rose Gold, 32GB)
http://www.amazon.in/Apple-iPhone-SE-Rose-Gold/dp/B06ZXWWD6R
Apple iPhone 6s (Space Grey, 32GB)
http://www.amazon.in/Apple-iPhone-Space-Grey-32GB/dp/B01LX3A7CC
Samsung Galaxy J7 Max (Gold, 32GB)
http://www.amazon.in/Samsung-Galaxy-J7-Max-Gold/dp/B073PWKTRS
Honor 8 Pro (Black, 6GB RAM + 128GB Memory)
http://www.amazon.in/Honor-Pro-Black-128GB-Memory/dp/B01MQXNY1L
Samsung Galaxy J7 Max (Black, 32GB)
http://www.amazon.in/Samsung-Galaxy-J7-Max-Black/dp/B073PWDMHD
OnePlus 3T (Soft Gold, 6GB RAM + 64GB memory)
http://www.amazon.in/OnePlus-3T-Soft-Gold-memory/dp/B01FM7J3NA
Apple iPhone 6s (Gold, 32GB)
http://www.amazon.in/Apple-iPhone-6s-Gold-32GB/dp/B01M0CJNVL
Apple iPhone 6s (Rose Gold, 32GB)
http://www.amazon.in/Apple-iPhone-Rose-Gold-32GB/dp/B01LXF3SP9
Samsung Galaxy C7 Pro (Navy Blue, 64GB)
http://www.amazon.in/Samsung-Galaxy-Navy-Blue-64GB/dp/B01LXMHNMQ
Samsung J7 Prime 32GB ( Gold ) 4G VoLTE
http://www.amazon.in/Samsung-J7-Prime-32GB-VoLTE/dp/B06Y3HFZBQ
Vivo V5s (Matte Black) with Offers
http://www.amazon.in/Vivo-V5s-Matte-Black-Offers/dp/B071P2FNF2
Vivo V5s (Crown Gold) with Offers
http://www.amazon.in/Vivo-V5s-Crown-Gold-Offers/dp/B071VT6RG2
This crawler collects the product titles and the links to the corresponding product pages from a single page of amazon.in. The user only needs to specify what kind of data or links should be crawled. Although web crawlers are mainly used by search engines, the same approach can be used to collect other useful information.
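To make the "specify what to crawl" idea concrete, here is a minimal sketch that separates the parsing step from the download and takes the CSS class as a parameter. The function name extract_links and the SAMPLE_HTML string are hypothetical and not part of the original script; the sample markup only stands in for a downloaded page so the sketch runs without network access.

```python
from bs4 import BeautifulSoup


def extract_links(html, css_class):
    """Return (title, href) pairs for every <a> tag carrying css_class."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", class_=css_class):
        results.append((link.get("title"), link.get("href")))
    return results


# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<a class="s-access-detail-page" title="Phone A" href="http://example.com/a">Phone A</a>
<a class="other" title="Ad" href="http://example.com/ad">Ad</a>
<a class="s-access-detail-page" title="Phone B" href="http://example.com/b">Phone B</a>
"""

for title, href in extract_links(SAMPLE_HTML, "s-access-detail-page"):
    print(title)
    print(href)
```

Passing a different class name selects a different set of links, and the text returned by requests.get(url).text could be passed in place of SAMPLE_HTML.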
In the example code, requests fetches the full HTML of the page as plain text, which is then converted into a BeautifulSoup object. From that object, the title and href attributes of every link with the class s-access-detail-page are read and printed. That is how this basic web crawler works.
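The script in this post reads a single page. A crawler in the search-engine sense also follows the links it finds, so the sketch below shows one possible way to do that with a breadth-first queue. It is an illustration under stated assumptions, not the method used above: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages, and the names LinkParser, crawl, fetch, max_pages, and FAKE_SITE are all hypothetical. FAKE_SITE is an in-memory stand-in for real pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be fetched
    visited = []                # URLs fetched successfully, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory (assumption, not real data).
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

print(crawl("http://example.com/", FAKE_SITE.get))
```

For real pages, fetch could be a small function that wraps requests.get and returns the response text. The max_pages limit keeps the crawl bounded, which matters because the number of reachable links grows quickly.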
Top comments (3)
In addition to scraping, you do need a proxy in most of the cases. Smartproxy seems to have the best cost to quality ratio at the moment. Are you covering your proxies?
Any of the residential services does work alright for scraping. Have tried Smartproxy and Luminati - both are quality. However, Smartproxy is a lot cheaper, Luminati has a higher IP pool.
Great, scraping is so great with python. Have you ever been wondering about using something like scrapy from here