@@ -6,13 +6,13 @@ Web Scraping
 
 Web sites are written using HTML, which means that each web page is a
 structured document. Sometimes it would be great to obtain some data from
-them and preserve the structure while we're at it, but this isn't always easy.
-It's not often that web sites provide their data in comfortable formats
-such as ``.csv``.
+them and preserve the structure while we're at it. Web sites
+don't always provide their data in comfortable formats such as ``.csv``.
 
-This is where web scraping comes in. Web scraping is the practice of using
+This is where web scraping comes in. Web scraping is the practice of using a
 computer program to sift through a web page and gather the data that you need
-in a format most useful to you.
+in a format most useful to you while at the same time preserving the structure
+of the data.
 
 lxml and Requests
 -----------------
@@ -43,12 +43,12 @@ we can go over two different ways: XPath and CSSSelect. In this example, I
 will focus on the former.
 
 XPath is a way of locating information in structured documents such as
-HTML or XML pages. A good introduction to XPath is `here <http://www.w3schools.com/xpath/default.asp>`_.
+HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_.
 
-One can also use various tools for obtaining the XPath of elements such as
-FireBug for Firefox or in Chrome you can right click an element, choose
-'Inspect element', highlight the code and the right click again and choose
-'Copy XPath'.
+There are also various tools for obtaining the XPath of elements such as
+FireBug for Firefox or if you're using Chrome you can right click an
+element, choose 'Inspect element', highlight the code and then right
+click again and choose 'Copy XPath'.
 
 After a quick analysis, we see that in our page the data is contained in
 two elements - one is a div with title 'buyer-name' and the other is a
@@ -90,10 +90,10 @@ Lets see what we got exactly:
  '$15.00', '$114.07', '$10.09']
 
 Congratulations! We have successfully scraped all the data we wanted from
-a web page using lxml and we have it stored in memory as two lists. Now we
-can either continue our work on it, analyzing it using python or we can
-export it to a file and share it with friends.
+a web page using lxml and Requests. We have it stored in memory as two
+lists. Now we can do all sorts of cool stuff with it: we can analyze it
+using Python or we can save it to a file and share it with the world.
 
-A cool idea to think about is writing a script to iterate through the rest
-of the pages of this example data set or making this application use
-threads to improve its speed.
+A cool idea to think about is modifying this script to iterate through
+the rest of the pages of this example dataset or rewriting this
+application to use threads for improved speed.
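
The closing suggestion about threads could be sketched roughly as follows. This is a minimal illustration, not code from the guide: `scrape_pages`, `fake_scrape`, and the example.com URL pattern are hypothetical names introduced here, and `fake_scrape` stands in for the real per-page work (fetching with Requests and running the XPath queries) so that the snippet runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_pages(urls, scrape_one, max_workers=4):
    """Run scrape_one(url) for every URL on a small thread pool.

    Results come back in the same order as urls, because
    ThreadPoolExecutor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_one, urls))


def fake_scrape(url):
    # Hypothetical stand-in for the real per-page work (an HTTP request
    # followed by XPath queries); it only echoes the page number.
    return url.rsplit("=", 1)[-1]


# Hypothetical URL pattern, for illustration only.
urls = ["http://example.com/data?page=%d" % n for n in range(1, 4)]
results = scrape_pages(urls, fake_scrape)
```

Swapping `fake_scrape` for a function that actually downloads and parses one page should be the only change needed. Threads help here because the work is dominated by waiting on the network, and a small `max_workers` keeps the load on the target site modest.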