Networked Programs Chapter 12 Python for Everybody www.py4e.com
A Free Book on Network Architecture • If you find this topic area interesting and/or need more detail • www.net-intro.com
Transmission Control Protocol (TCP) • Built on top of IP (Internet Protocol) • Assumes IP might lose some data - stores and retransmits data if it seems to be lost • Handles “flow control” using a transmit window • Provides a nice reliable pipe Source: http://en.wikipedia.org/wiki/Internet_Protocol_Suite
http://www.flickr.com/photos/kitcowan/2103850699/ http://en.wikipedia.org/wiki/Tin_can_telephone
TCP Connections / Sockets http://en.wikipedia.org/wiki/Internet_socket “In computer networking, an Internet socket or network socket is an endpoint of a bidirectional inter-process communication flow across an Internet Protocol-based computer network, such as the Internet.” [Diagram: Process <-> Internet <-> Process]
TCP Port Numbers • A port is an application-specific or process-specific software communications endpoint • It allows multiple networked applications to coexist on the same server • There is a list of well-known TCP port numbers http://en.wikipedia.org/wiki/TCP_and_UDP_port
[Diagram: www.umich.edu (74.208.28.177) with one TCP port per service: Incoming E-Mail 25, Login 23, Web Server 80 and 443, Personal Mail Box 109 and 110. Clipart: http://www.clker.com/search/networksym/1]
Common TCP Ports http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
Sometimes we see the port number in the URL if the web server is running on a “non-standard” port.
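As a small illustration of where the port number sits in a URL, here is a sketch using the standard library's `urllib.parse.urlsplit`. The helper name `effective_port`, the `DEFAULT_PORTS` table, and the `www.example.com:8080` address are assumptions made up for this example; they are not from the slides.

```python
from urllib.parse import urlsplit

# Well-known ports for the two web schemes (illustrative table, not exhaustive).
DEFAULT_PORTS = {'http': 80, 'https': 443}

def effective_port(url):
    """Return the TCP port a client would connect to for this URL."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly.
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

# Hypothetical server running on a "non-standard" port.
print(effective_port('http://www.example.com:8080/login'))
# No port in the URL, so the scheme's default applies.
print(effective_port('http://data.pr4e.org/romeo.txt'))
```

When the port is omitted, the client falls back to the default for the scheme, which is why most URLs never show one.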
Sockets in Python
Python has built-in support for TCP Sockets

    import socket
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))   # (Host, Port)

http://docs.python.org/library/socket.html
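To see both ends of a TCP connection without needing a remote server, the following sketch opens a listening socket on the local machine and connects to it from the same program. The function name `loopback_demo` and the choice to echo the message back in upper case are assumptions for illustration; only the `socket.socket(...)` and `connect(...)` calls come from the slide.

```python
import socket

def loopback_demo(message=b'hello'):
    """Send bytes from a client socket to a server socket on this machine."""
    # Server side: port 0 asks the operating system to pick any free port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', 0))
    server.listen(1)
    host, port = server.getsockname()

    # Client side: the same two calls shown on the slide.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((host, port))
    conn, _addr = server.accept()    # server-side endpoint of the connection

    client.sendall(message)          # client -> server
    received = conn.recv(1024)
    conn.sendall(received.upper())   # server -> client: the pipe is two-way
    reply = client.recv(1024)

    for s in (conn, client, server):
        s.close()
    return received, reply

print(loopback_demo())
```

Each socket is one endpoint; the pair `conn` and `client` together form the bidirectional pipe that TCP provides.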
http://xkcd.com/353/
Application Protocols
Application Protocol • Since TCP (and Python) gives us a reliable socket, what do we want to do with the socket? What problem do we want to solve? • Application Protocols - Mail - World Wide Web Source: http://en.wikipedia.org/wiki/Internet_Protocol_Suite
HTTP - Hypertext Transfer Protocol • The dominant Application Layer Protocol on the Internet • Invented for the Web - to Retrieve HTML, Images, Documents, etc. • Extended to retrieve data in addition to documents - RSS, Web Services, etc. Basic Concept - Make a Connection - Request a document - Retrieve the Document - Close the Connection http://en.wikipedia.org/wiki/Http
HTTP The HyperText Transfer Protocol is the set of rules to allow browsers to retrieve web documents from servers over the Internet
What is a Protocol? • A set of rules that all parties follow so we can predict each other’s behavior • And not bump into each other - On two-way roads in USA, drive on the right- hand side of the road - On two-way roads in the UK, drive on the left-hand side of the road
http://www.dr-chuck.com/page1.htm protocol host document Robert Cailliau CERN http://www.youtube.com/watch?v=x2GylLq59rI 1:17 - 2:19
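The three labelled pieces of the URL (protocol, host, document) can be pulled apart with the standard library. This is a minimal sketch; the helper name `url_parts` is an assumption for illustration.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Return (protocol, host, document) for a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

print(url_parts('http://www.dr-chuck.com/page1.htm'))
# ('http', 'www.dr-chuck.com', '/page1.htm')
```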
Getting Data From The Server • Each time the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a “GET” request - to GET the content of the page at the specified URL • The server returns the HTML document to the browser, which formats and displays the document to the user
[Diagram, built up over several slides: Browser and Web Server (port 80). Click: the user clicks a link. Request: the browser sends GET http://www.dr-chuck.com/page2.htm. Response: the server returns <h1>The Second Page</h1><p>If you like, you can switch back to the <a href="page1.htm">First Page</a>.</p>. Parse/Render: the browser formats and displays the page.]
Internet Standards • The standards for all of the Internet protocols (inner workings) are developed by an organization • Internet Engineering Task Force (IETF) • www.ietf.org • Standards are called “RFCs” - “Request for Comments” Source: http://tools.ietf.org/html/rfc791
http://www.w3.org/Protocols/rfc2616/rfc2616.txt
Making an HTTP request • Connect to the server like www.dr-chuck.com • Request a document (or the default document) • GET http://www.dr-chuck.com/page1.htm HTTP/1.0 • GET http://www.mlive.com/ann-arbor/ HTTP/1.0 • GET http://www.facebook.com HTTP/1.0
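The request lines above can be turned into the bytes that actually travel over the socket. This is a minimal sketch in the same form the slides use (a full URL on the GET line); the helper name `build_get_request` is an assumption for illustration.

```python
def build_get_request(url):
    """Return the bytes of a minimal HTTP/1.0 GET request for url."""
    # The request line ends with CRLF, and an empty line (a second CRLF)
    # tells the server that the request is complete.
    request = 'GET ' + url + ' HTTP/1.0\r\n\r\n'
    return request.encode()

print(build_get_request('http://www.dr-chuck.com/page1.htm'))
```

The result is a `bytes` object, which is what `send()` on a socket expects.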
Browser / Web Server
Note: Many servers do not support HTTP 1.0

    $ telnet data.pr4e.org 80
    Trying 74.208.28.177...
    Connected to data.pr4e.org.
    Escape character is '^]'.
    GET http://data.pr4e.org/page1.htm HTTP/1.0

    HTTP/1.1 200 OK
    Date: Tue, 30 Jan 2024 15:30:13 GMT
    Server: Apache/2.4.18 (Ubuntu)
    Last-Modified: Mon, 15 May 2017 11:11:47 GMT
    Content-Length: 128
    Content-Type: text/html

    <h1>The First Page</h1>
    <p>If you like, you can switch to the <a href="http://data.pr4e.org/page2.htm">Second Page</a>.</p>
    Connection closed by foreign host.
Accurate Hacking in the Movies • Matrix Reloaded • Bourne Ultimatum • Die Hard 4 • ... http://nmap.org/movies.html
Let’s Write a Web Browser!
An HTTP Request in Python

    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))
    # The request ends with a blank line: \r\n\r\n
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode(), end='')

    mysock.close()
HTTP Header

    HTTP/1.1 200 OK
    Date: Sun, 14 Mar 2010 23:52:41 GMT
    Server: Apache
    Last-Modified: Tue, 29 Dec 2009 01:31:22 GMT
    ETag: "143c1b33-a7-4b395bea"
    Accept-Ranges: bytes
    Content-Length: 167
    Connection: close
    Content-Type: text/plain

HTTP Body

    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief

The loop that printed this output:

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode())
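The header and the body are separated by a single blank line, so a program can split them apart at the first `\r\n\r\n`. The sketch below assumes the whole response has already been collected into one `bytes` value; the helper name `split_response` and the shortened `sample` response are made up for illustration.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    # Everything before the first blank line is the header block.
    head, _sep, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(':')
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Shortened, hand-written response used only for this example.
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 11\r\n'
          b'\r\n'
          b'But soft...')

status, headers, body = split_response(sample)
print(status)
print(headers['content-type'])
print(body.decode())
```

The body is left as bytes on purpose: how to decode it depends on the content type the header announces.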
About Characters and Strings…
https://en.wikipedia.org/wiki/ASCII http://www.catonmat.net/download/ascii-cheat-sheet.png ASCII American Standard Code for Information Interchange
Representing Simple Strings
• Each character is represented by a number between 0 and 255 stored in 8 bits of memory
• We refer to 8 bits of memory as a "byte" of memory (e.g. my disk drive contains 3 Terabytes of memory)
• The ord() function tells us the numeric value of a simple ASCII character

    >>> print(ord('H'))
    72
    >>> print(ord('e'))
    101
    >>> print(ord('\n'))
    10
    >>>
ASCII

    >>> print(ord('H'))
    72
    >>> print(ord('e'))
    101
    >>> print(ord('\n'))
    10
    >>>

In the 1960s and 1970s, we just assumed that one byte was one character
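`ord()` has an inverse, `chr()`, which turns a number back into a character. The short sketch below applies both to a whole string; the helper name `code_points` is an assumption for illustration.

```python
def code_points(text):
    """Return the numeric value of each character in text."""
    return [ord(ch) for ch in text]

print(code_points('He\n'))     # [72, 101, 10]
print(chr(72) + chr(101))      # He
```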
http://unicode.org/charts/
Multi-Byte Characters To represent the wide range of characters computers must handle we represent characters with more than one byte • UTF-16 - Two or four bytes per character • UTF-32 - Fixed length - Four bytes • UTF-8 - 1-4 bytes - Upwards compatible with ASCII - Automatic detection between ASCII and UTF-8 - UTF-8 is recommended practice for encoding data to be exchanged between systems https://en.wikipedia.org/wiki/UTF-8
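The byte counts for each encoding can be checked directly with `str.encode()`. In this sketch the helper name `encoded_sizes` is an assumption, and the `-le` codec variants are used so that Python does not prepend a byte-order mark, which would inflate the counts. Note that a character outside the Basic Multilingual Plane (such as an emoji) takes four bytes in UTF-16, not two.

```python
def encoded_sizes(ch):
    """Return how many bytes one character takes in each encoding."""
    encodings = ('utf-8', 'utf-16-le', 'utf-32-le')
    return {enc: len(ch.encode(enc)) for enc in encodings}

print(encoded_sizes('H'))            # plain ASCII letter
print(encoded_sizes('이'))            # Korean syllable
print(encoded_sizes('\U0001F600'))   # emoji outside the Basic Multilingual Plane

# UTF-8 is upwards compatible with ASCII: the bytes are identical.
print('Hello'.encode('utf-8') == 'Hello'.encode('ascii'))
```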
Two Kinds of Strings in Python

Python 3.5.1

    >>> x = '이광춘'
    >>> type(x)
    <class 'str'>
    >>> x = u'이광춘'
    >>> type(x)
    <class 'str'>
    >>>

Python 2.7.10

    >>> x = '이광춘'
    >>> type(x)
    <type 'str'>
    >>> x = u'이광춘'
    >>> type(x)
    <type 'unicode'>
    >>>

In Python 3, all strings are Unicode
Python 2 versus Python 3

Python 3.5.1

    >>> x = b'abc'
    >>> type(x)
    <class 'bytes'>
    >>> x = '이광춘'
    >>> type(x)
    <class 'str'>
    >>> x = u'이광춘'
    >>> type(x)
    <class 'str'>

Python 2.7.10

    >>> x = b'abc'
    >>> type(x)
    <type 'str'>
    >>> x = '이광춘'
    >>> type(x)
    <type 'str'>
    >>> x = u'이광춘'
    >>> type(x)
    <type 'unicode'>
Python 3 and Unicode
• In Python 3, all strings internally are UNICODE
• Working with string variables in Python programs and reading data from files usually "just works"
• When we talk to a network resource using sockets or talk to a database we have to encode and decode data (usually to UTF-8)

Python 3.5.1

    >>> x = b'abc'
    >>> type(x)
    <class 'bytes'>
    >>> x = '이광춘'
    >>> type(x)
    <class 'str'>
    >>> x = u'이광춘'
    >>> type(x)
    <class 'str'>
Python Strings to Bytes
• When we talk to an external resource like a network socket we send bytes, so we need to encode Python 3 strings into a given character encoding
• When we read data from an external resource, we must decode it based on the character set so it is properly represented in Python 3 as a string

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        mystring = data.decode()
        print(mystring)

socket1.py
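One detail worth knowing when decoding in fixed-size chunks: a multi-byte UTF-8 character can be cut in half at a chunk boundary, and calling `decode()` on such a chunk raises `UnicodeDecodeError`. The sketch below uses the standard library's incremental decoder, which holds back an incomplete character until the rest arrives. The helper name `decode_chunks` and the two-byte chunk size are assumptions chosen to force the problem; a real socket read would use a larger size such as 512.

```python
import codecs

def decode_chunks(chunks, encoding='utf-8'):
    """Decode a sequence of byte chunks that may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for chunk in chunks:
        # Incomplete trailing bytes are buffered inside the decoder.
        pieces.append(decoder.decode(chunk))
    # final=True reports an error if bytes are still left over at the end.
    pieces.append(decoder.decode(b'', final=True))
    return ''.join(pieces)

data = '이광춘'.encode()   # each of these characters takes 3 bytes in UTF-8
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]
print(decode_chunks(chunks))
```

For plain ASCII text, as in romeo.txt, every character is one byte, so the simpler per-chunk `decode()` in the slide works without this extra step.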
An HTTP Request in Python

    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))
    # encode() turns the request string into bytes before sending
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        # decode() turns the received bytes back into a string
        print(data.decode())

    mysock.close()

socket1.py
https://docs.python.org/3/library/stdtypes.html#bytes.decode https://docs.python.org/3/library/stdtypes.html#str.encode
[Diagram: Receiving: Network -> Socket -> recv() -> Bytes (UTF-8) -> decode() -> String (Unicode). Sending: String (Unicode) -> encode() -> Bytes (UTF-8) -> send() -> Socket -> Network.]

    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode())

    mysock.close()

socket1.py
Making HTTP Easier With urllib
Using urllib in Python
Since HTTP is so common, we have a library that does all the socket work for us and makes web pages look like a file

    import urllib.request, urllib.parse, urllib.error

    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    for line in fhand:
        print(line.decode().strip())

urllib1.py
    import urllib.request, urllib.parse, urllib.error

    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    for line in fhand:
        print(line.decode().strip())

Output:

    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief

urllib1.py
Like a File...

    import urllib.request, urllib.parse, urllib.error

    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

    counts = dict()
    for line in fhand:
        words = line.decode().split()
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    print(counts)

urlwords.py
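Because the counting loop only needs something that yields lines of bytes, it can be tried without a network connection by substituting an in-memory file. In this sketch the function name `count_words` and the two-line `fake_response` are assumptions for illustration; `io.BytesIO` stands in for the object that `urlopen()` returns.

```python
import io

def count_words(fhand):
    """Count words in any file-like object that yields lines of bytes."""
    counts = dict()
    for line in fhand:
        for word in line.decode().split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Made-up stand-in for a web response, so no network is needed.
fake_response = io.BytesIO(b'the sun and the moon\nthe east\n')
print(count_words(fake_response))
# {'the': 3, 'sun': 1, 'and': 1, 'moon': 1, 'east': 1}
```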
Reading Web Pages

    import urllib.request, urllib.parse, urllib.error

    fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
    for line in fhand:
        print(line.decode().strip())

Output:

    <h1>The First Page</h1>
    <p>If you like, you can switch to the <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.
    </p>

urllib2.py

Following Links
The same program and output as above: the href value in the output, http://www.dr-chuck.com/page2.htm, is the link a program could follow next.

urllib2.py

The First Lines of Code @ Google?

    import urllib.request, urllib.parse, urllib.error

    fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
    for line in fhand:
        print(line.decode().strip())

urllib2.py
Parsing HTML (a.k.a. Web Scraping)
What is Web Scraping? • When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages • Search engines scrape web pages - we call this “spidering the web” or “web crawling” http://en.wikipedia.org/wiki/Web_scraping http://en.wikipedia.org/wiki/Web_crawler
Why Scrape? • Pull data - particularly social data - who links to who? • Get your own data back out of some system that has no “export capability” • Monitor a site for new information • Spider the web to make a database for a search engine
Scraping Web Pages • There is some controversy about web page scraping and some sites are a bit snippy about it. • Republishing copyrighted information is not allowed • Violating terms of service is not allowed
The Easy Way - Beautiful Soup • You could do string searches the hard way • Or use the free software library called BeautifulSoup from www.crummy.com https://www.crummy.com/software/BeautifulSoup/
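For comparison, here is roughly what "string searches the hard way" can look like using a regular expression from the standard library. This is a sketch under assumptions: the helper name `find_links` and the exact pattern are illustrative, and the pattern only matches double-quoted, absolute http or https links, which is one reason a real HTML parser is easier to live with.

```python
import re

def find_links(html):
    """Return absolute http(s) URLs found in double-quoted href attributes."""
    return re.findall(r'href="(https?://[^"]+)"', html)

page = ('<h1>The First Page</h1> <p>If you like, you can switch to the '
        '<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>')
print(find_links(page))
# ['http://www.dr-chuck.com/page2.htm']
```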
BeautifulSoup Installation

    # To run this, you can install BeautifulSoup
    # https://pypi.python.org/pypi/beautifulsoup4

    # Or download the file
    # http://www.py4e.com/code3/bs4.zip
    # and unzip it in the same directory as this file

    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    ...

urllinks.py
    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup

    url = input('Enter - ')
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        print(tag.get('href', None))

Sample run:

    python urllinks.py
    Enter - http://www.dr-chuck.com/page1.htm
    http://www.dr-chuck.com/page2.htm
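An href value is not always a full URL. The second page shown earlier links back with the relative value `page1.htm`, which has to be combined with the address of the page it came from before it can be retrieved. A minimal sketch using the standard library's `urljoin`; the helper name `absolute_links` is an assumption for illustration.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs):
    """Resolve href values against the URL of the page they came from."""
    # Skip None, which tag.get('href', None) returns for anchors without href.
    return [urljoin(base_url, href) for href in hrefs if href]

print(absolute_links('http://www.dr-chuck.com/page2.htm',
                     ['page1.htm', None, 'http://www.py4e.com/']))
# ['http://www.dr-chuck.com/page1.htm', 'http://www.py4e.com/']
```

Links that are already absolute pass through unchanged, so the same helper can be applied to every href a page contains.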
Summary • TCP/IP gives us pipes / sockets between applications • We designed application protocols to make use of these pipes • HyperText Transfer Protocol (HTTP) is a simple yet powerful protocol • Python has good support for sockets, HTTP, and HTML parsing
Acknowledgements / Contributions These slides are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and open.umich.edu and made available under a Creative Commons Attribution 4.0 License. Please maintain this last slide in all copies of the document to comply with the attribution requirements of the license. If you make a change, feel free to add your name and organization to the list of contributors on this page as you republish the materials. Initial Development: Charles Severance, University of Michigan School of Information … Insert new Contributors here ...
