University of Science, VNU-HCM
Faculty of Information Technology
 Introduction to Data Science
 Course
 Data Collection
 Le Ngoc Thanh
 lnthanh@fit.hcmus.edu.vn
 Department of Computer Science
 Ho Chi Minh City
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection
 2
Data Science Process
◎ Pose the question to answer
◎ Collect data
◎ Explore and preprocess the data to obtain data that can be analyzed
◎ Analyze the data (statistics, visualization, machine learning)
 ○ → answers (hypotheses) to the question
◎ Evaluation
◎ Decision making
Note: from any step you will probably need to go back to previous steps and readjust, and you will probably need to do this a number of times.
Required attitude: calm, intuitive.
Tools to know how to use: Python and its libraries, Jupyter Notebook.
 3
 Collecting data
 ◎ General notes when collecting data
 ○ Is the data correct and sufficient to answer the question?
 ○ Garbage in → garbage out
 ○ Is it legitimate to collect this data? Does collecting it affect others?
 ◎ Ways to collect Data
 ○ Data is already available within the company or organization: fine, use it
 ○ Data is available but out there (online) (this is the scope of the course)
 ◉ Pre-packaged data (CSV, Excel files, ...): download it
 ◉ Data provided through the website's API: use the API
 ◉ Data is on the site but there is no API: parse the HTML
 ○ Data is not yet available: create it yourself, for example by
 conducting surveys, using sensor devices, ...
 4
Ask Question
 What is the current state of data science recruitment in Vietnam?
 ○ Initially, the question is often broad and vague
 ○ Later, we will come back to this step a number of times to make
 the question clearer and more specific.
 5
Collecting data: Plan
Q: Where to collect data?
A: On recruitment sites in Vietnam
Q: What are the recruitment sites in Vietnam?
A: Ask Google ... → http://www.vietnamworks.com/, http://careerbuilder.vn/, ...
Q: On each job site, which keywords should we search with?
A: “Khoa học dữ liệu”, “data science”, “data scientist”, ...
Q: On each job site, after searching with a given keyword, how do we get the job posting information?
A: For each posting, copy and paste the information into a file ☹
Q: After collecting data from different job sites, or from the same site with different
keywords, how do we merge the data?
A: ...
 6
Collecting data: Plan
Q: Where to collect data?
A: On recruitment sites in Vietnam
Q: What are the recruitment sites in Vietnam?
A: Ask Google ... → http://www.vietnamworks.com/, http://careerbuilder.vn/, ...
Q: On each job site, which keywords should we search with?
A: “Khoa học dữ liệu”, “data science”, “data scientist”, ...
Q: On each job site, after searching with a given keyword, how do we get the job posting information?
A: Write a program that automatically parses the HTML, extracts the required information, and writes it to a file ☺
Q: After collecting data from different job sites, or from the same site with different keywords, how
do we merge the data?
A: ...
 7
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection
 8
Collecting data from the CareerBuilder site with the keyword
"data scientist"
 9
Collecting data from the CareerBuilder site with the keyword
"data scientist"
◎ For each job posting, extract the following information:
 ○ Title
 ○ Employer
 ○ Location
 ○ Salary
 ○ Posting date
 ○ Link to detailed content
 ○ Detailed content
◎ Save to a CSV file (one posting per line)
 10
 Collecting data from the CareerBuilder site with the keyword
 "data scientist"
◎ For each job posting, extract the following information:
 ○ Title
 ○ Employer
 ○ Location
 ○ Salary
 ○ Posting date
 ○ Link to detailed content
 ○ Detailed content
◎ Save to a CSV file (one posting per line)
◎ The steps taken?
 1. Get the website's HTML content
 2. Parse the HTML to retrieve the data needed
 3. Write the data to a CSV file
 11
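The last step (writing one posting per line to CSV) can be sketched with Python's built-in csv module. This is only an illustration: the field names, the helper write_postings and the sample posting are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical column names, one per piece of information listed on the slide.
FIELDS = ["title", "employer", "location", "salary", "posted", "link", "content"]


def write_postings(postings, fileobj):
    """Write a header row, then one CSV row per posting (a dict keyed by FIELDS)."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(postings)


# Made-up posting, used only to show the output format.
sample = [{
    "title": "Data Scientist",
    "employer": "Example Co.",
    "location": "Ho Chi Minh City",
    "salary": "1,000 - 2,000 USD",
    "posted": "2024-01-01",
    "link": "https://example.com/jobs/1",
    "content": "Requires Python, R.",
}]

buffer = io.StringIO()          # stands in for a file opened for writing
write_postings(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with newline="" and an explicit encoding such as utf-8 so that Vietnamese text and line endings are stored correctly. Values containing commas (like the salary above) are quoted automatically by the csv module.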
HTML Code of a Web page
◎ HTML code is made up of tags arranged in a tree, with the <html> tag as the root
 node
◎ Some common tags:
 ○ <head>...</head>: contains meta information about the page
 ○ <body>...</body>: contains the content that will be displayed
 ○ <h1>...</h1>: defines a level-1 heading
 ○ <p>...</p>: defines a paragraph
 ○ ...
◎ Tags can have attributes that provide more information about the tag
 ○ <a href="https://www.google.com/" class="link">google link</a>: a tag
 containing a link
 ○ <h1 id="myHeader">my header</h1>
 ○ ...
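To make the tree structure and the attributes concrete, here is a small sketch that uses only the standard library's html.parser (not the library used later in the course). The class TagLister and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag with its depth in the tree and its attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag name, attribute dict)

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Made-up page that reuses the tags from the slide.
page = """<html>
<head><title>Demo</title></head>
<body>
<h1 id="myHeader">my header</h1>
<p>See <a href="https://www.google.com/" class="link">google link</a></p>
</body>
</html>"""

lister = TagLister()
lister.feed(page)
for depth, tag, attrs in lister.tags:
    # Indentation mirrors the nesting: <html> is the root at depth 0.
    print("  " * depth + tag, attrs)
```

The depth counter assumes every tag is closed explicitly, which holds for this sample but not for void tags such as <br> or <img>.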
 12
Retrieving and parsing the HTML of a website using Python
◎ Use the requests-HTML library
◎ Install: in PowerShell / cmd, type
 ○ pip install requests-html
 13
Basic use of the requests-HTML library
(look up the documentation as needed)
◎ Import the library
 ○ from requests_html import HTMLSession
◎ Get the website's HTML code
 ○ session = HTMLSession()
 ○ r = session.get('web address')
 ○ # r contains all the data sent from the site's server, including the HTML of the page
◎ Parse the HTML and search for tags
 ○ tag = r.html.find(selectors, first=True)
 ○ # selectors is written as a CSS selector (for example, '#about' finds the
 tag with the ID about); to work out the selector, use the inspect function of the web
 browser
 ○ # first=True returns only the first tag found; first=False returns a list
 containing all the tags found
 ○ # A found tag also has .find(...), which searches inside that tag
◎ Retrieving parts of a tag
 ○ tag.html: the tag's HTML string
 ○ tag.text: the tag's text string
 ○ tag.attrs: dictionary containing the tag's attributes
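The calls above can be combined into one small helper. This is a minimal sketch, assuming requests-html is installed and the machine has network access; the function name first_match is hypothetical, and the URL and selector in the commented call are placeholders.

```python
def first_match(url, selector):
    """Return (text, attributes) of the first tag matching a CSS selector, or None."""
    # Imported inside the function so this module still loads where
    # requests-html is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)                      # download the page
    tag = r.html.find(selector, first=True)   # None when nothing matches
    if tag is None:
        return None
    return tag.text, dict(tag.attrs)


# Example call (placeholder URL and selector, requires network access):
# print(first_match("https://example.com/", "h1"))
```

Returning None for a missing tag makes it easy to skip postings whose layout differs from the one inspected in the browser.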
 14
Demo
◎ Try it yourself
 15
Notes on Data Privacy and Copyright
Note: avoid doing bad things
 ○ Check the website's "robots.txt" file to see which data may be
 collected and which may not
 ◉ For example: https://careerbuilder.vn/robots.txt
 ○ Do not send too many requests to the site in a short time
 (for example, have the program sleep briefly between
 requests)
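Sleeping between requests can be sketched as follows. The helper fetch_politely and the stand-in fetch function are hypothetical; in practice fetch would be whatever function downloads one page, and the delay would be chosen to suit the site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # give the server a short rest
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so the sketch runs offline.
def fake_fetch(url):
    return "<html>content of " + url + "</html>"


pages = fetch_politely(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.01)
print(len(pages))
```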
 16
Notes on Data Privacy and Copyright
◎ Check the site's "robots.txt" file (for example,
 https://careerbuilder.vn/robots.txt)
◎ The following Python code can be used to check
 automatically
 ○ import urllib.robotparser
 ○ rp = urllib.robotparser.RobotFileParser()
 ○ rp.set_url('https://careerbuilder.vn/robots.txt')
 ○ rp.read()
 ○ rp.can_fetch('*', 'https://careerbuilder.vn/viec-lam/data-science-k-vi.html')
 ○ # The result will be True or False
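To see how the True/False answer follows from the rules, the same parser can be fed a robots.txt given as text instead of downloading one. The rules and the example.com URLs below are made up for illustration and are not CareerBuilder's actual rules.

```python
import urllib.robotparser

# Made-up robots.txt content: every crawler is barred from /private/ only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse text directly, no network needed

allowed = rp.can_fetch("*", "https://example.com/jobs/data-science.html")
blocked = rp.can_fetch("*", "https://example.com/private/report.html")
print(allowed, blocked)
```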
 17
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection
 18
Ask Question
What is the current state of data science recruitment in Vietnam?
 A more specific question is
 ↓
Which programming languages are most often required in
data science job postings in Vietnam today?
Assumption: for this demo we focus only on CareerBuilder with the keyword
"data scientist"
 19
How can we get answers?
◎ Go to the detailed content of each job posting, see which
 programming languages are required, and update the
 corresponding counters
◎ How do we write a program to do that automatically?
 ○ We need a list of programming languages to count
 ◉ Where do we get this list?
 ○ Then, for the detailed content of each posting and for each language
 in the list, check whether the language appears in the content; if so, update the
 corresponding counter
 ◉ One option is to convert the string into a set of words and then check membership
 ◉ Example: 'Proficiency requirements in python, R.'
 ◉ → {'Proficiency', 'requirements', 'in', 'python', 'R'}
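A first attempt at turning the example string into a set of words might use str.split. The sketch below shows why that alone is not enough: punctuation stays attached to the words. The variable names are made up for illustration.

```python
text = 'Proficiency requirements in python, R.'

# Splitting on whitespace keeps the punctuation: 'python,' and 'R.'
naive_words = set(text.split())

# Stripping common punctuation from both ends of each word fixes this case.
cleaned_words = {word.strip('.,;:!?') for word in text.split()}

print(sorted(naive_words))
print(sorted(cleaned_words))
print('python' in naive_words, 'python' in cleaned_words)
```

Stripping punctuation by hand quickly becomes fragile for real job descriptions, which is one reason to look for a more general tool.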
 20
Content → Set of words: how?
◎ One way is to use regular expressions
◎ Regular expressions let us perform complex searches
 on strings
 21
How to use Regular Expression
Example 1
s = ('An has a student ID number 1612345 and email '
     'an@gmail.com\nHà has a student ID number 1654321 and email '
     '1654321@hcmus.edu.vn')
# Request: find the string 'hcmus' in s
import re
results = re.findall(r'hcmus', s)
# results: ['hcmus']
# r'...' is a raw string: backslashes in it are kept as written.
# An ordinary string also works, but in some cases it is more troublesome than a
# raw string.
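One case where an ordinary string is more troublesome: \b means "word boundary" to the re module, but in an ordinary Python string '\b' is the backspace character. The short sketch below compares the three spellings on a made-up string.

```python
import re

s = 'email 1654321@hcmus.edu.vn'

# Ordinary string: '\b' is a backspace character, so nothing matches.
plain = re.findall('\bhcmus\b', s)

# Raw string: the backslash reaches re unchanged and means "word boundary".
raw = re.findall(r'\bhcmus\b', s)

# Ordinary string with doubled backslashes: works, but is harder to read.
escaped = re.findall('\\bhcmus\\b', s)

print(plain, raw, escaped)
```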
 22
How to use Regular Expression
Example 2
s = ('An has a student ID number 1612345 and email '
     'an@gmail.com\nHà has a student ID number 1654321 and email '
     '1654321@hcmus.edu.vn')
# Request: find the student IDs (7 digits) in s
import re
results = re.findall(r'\d{7}', s)
# results: ['1612345', '1654321', '1654321']
# The result can be cast to a set to remove duplicates
 Find substrings that:
 • consist of digits (0 to 9)
 • and are exactly 7 characters long
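Two small follow-ups to this example, as a sketch: casting the result to a set, and a caveat that \d{7} also matches the first 7 digits of a longer number. The phone-number string is made up, and the \b anchors are one possible way to require exactly 7 digits.

```python
import re

s = ('An has a student ID number 1612345 and email '
     'an@gmail.com\nHà has a student ID number 1654321 and email '
     '1654321@hcmus.edu.vn')

# Casting to a set removes the duplicated ID.
unique_ids = set(re.findall(r'\d{7}', s))
print(unique_ids)

# Caveat: in a longer run of digits, \d{7} matches the first 7 of them.
loose = re.findall(r'\d{7}', 'phone 0901234567')

# Word boundaries on both sides require the run to be exactly 7 digits.
strict = re.findall(r'\b\d{7}\b', 'phone 0901234567')
print(loose, strict)
```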
 23
How to use Regular Expression
Example 3
s = ('An has a student ID number 1612345 and email '
     'an@gmail.com\nHà has a student ID number 1654321 and email '
     '1654321@hcmus.edu.vn')
# Request: find the email addresses in s
import re
results = re.findall(r'\w+@[\w.]+', s)
# results:
# ['an@gmail.com', '1654321@hcmus.edu.vn']
 Find substrings that consist of:
 • one or more word characters (letters, digits, underscore)
 • then the character @
 • then one or more characters from the set made up of word characters
 and the character .
 24
How to use Regular Expression
Example 4
s = 'Required to know c, c++, c#, r, python.'
# Request: find the words in s
import re
results = re.findall(r'[\w+#]+', s)
# results:
# ['Required', 'to', 'know', 'c', 'c++', 'c#', 'r',
# 'python']
 25
Content → set of words (using re)
and count the number of occurrences of each language
Demo...
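A possible shape for this demo, as a sketch: each description is converted to a set of lowercase words with the pattern from Example 4, and a counter is increased for every listed language that appears. The language list, the function name count_languages and the two descriptions are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical list of languages to count (lowercase).
LANGUAGES = {"python", "r", "java", "c", "c++", "c#", "sql", "scala"}


def count_languages(descriptions, languages=LANGUAGES):
    """Count in how many descriptions each language is mentioned."""
    counts = Counter()
    for text in descriptions:
        # Same pattern as Example 4: word characters plus '+' and '#'.
        words = set(re.findall(r'[\w+#]+', text.lower()))
        counts.update(words & languages)  # each language counted once per description
    return counts


# Made-up job descriptions.
descriptions = [
    'Required to know c, c++, c#, r, python.',
    'Proficiency requirements in Python, R.',
]
result = count_languages(descriptions)
print(result.most_common())
```

Using a set per description means a posting that mentions Python five times still counts once, which matches the question of how many postings require each language.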
 26
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection
 27
What is the problem with JavaScript?
◎ Example: get the string "Yay! Supports javascript" from
 http://avi.im/stuff/js-or-no-js.html
◎ Using the inspect function of the web browser, you should
 see that the string is in the tag with ID "intro-text"
◎ Use Requests-HTML to retrieve ...
◎ The result is the string "No javascript support"
 ○ Cause: the HTML obtained by Requests-HTML is the original
 content sent from the server. If that content contains JavaScript,
 the HTML shown by the browser's inspect function is the content
 after the JavaScript has run on the client, so the two can differ.
 28
How to solve the problem of a website with JavaScript?
◎ According to the Requests-HTML documentation: "Full JavaScript support" ☺
 ○ session = HTMLSession()
 ○ r = session.get('...')
 ○ r.html.render()
◎ The .render() method runs a browser (without an interface) to
 obtain the HTML content after the JavaScript has run, and then
 replaces the existing content (before JavaScript) with this content (after
 JavaScript)
◎ The .render() method currently does not run in Jupyter Notebook
 because the two clash with each other
◎ One workaround is to write the code in a *.py file and run the file in
 PowerShell/cmd by typing:
 ○ python file-name.py
 29
Selenium Library
◎ Rather than using the .render() method of Requests-HTML,
 we can programmatically control a web browser and retrieve
 the HTML content after the JavaScript has run.
◎ In Python, the Selenium library does this
 ○ Selenium doesn't clash with Jupyter Notebook ☺
 ○ Selenium lets the programmer interact with the web browser (fill in information, select,
 check, press buttons, ...) ☺ (Requests-HTML can't do this)
 ○ Selenium can do everything from A to Z, but things usually run faster if Selenium
 does only what Requests-HTML cannot do and leaves the rest to Requests-HTML
 30
Trying with Selenium?
◎ Which Vietjet flight from Ho Chi Minh City to Da Nang is
 the cheapest in the next 5 days (not including today)?
 31
How to use Selenium?
◎ Which Vietjet flight from Ho Chi Minh City to Da Nang is the
 cheapest in the next 5 days (not including today)?
◎ Steps:
 1. Use Selenium to open a web browser and go to
 https://www.vietjetair.com/Sites/Web/vi-VN/Home
 2. Use Selenium to choose the origin "Ho Chi Minh City (SGN)" and the destination
 "Da Nang (DAD)", select "One Way", set the departure date to tomorrow, then
 press the "Find flights" button
 3. After the results page has loaded, use Selenium to obtain the HTML content, and
 then hand it to Requests-HTML to handle the rest (parse the HTML
 and search for the data you need)
 4. Repeat steps 1 to 3 with the next departure date, looping until all 5 days are covered
 5. From the data collected, find the cheapest flight
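Steps 4 and 5 (looping over the next 5 days and picking the cheapest flight) do not depend on the browser and can be sketched on their own. In this sketch the browser work of steps 1 to 3 is represented by a function passed in as fetch_flights; that function, the helper names and the prices are all hypothetical.

```python
from datetime import date, timedelta


def next_days(today, n=5):
    """Return the n dates after today (today itself is excluded)."""
    return [today + timedelta(days=offset) for offset in range(1, n + 1)]


def cheapest_flight(today, fetch_flights, n=5):
    """Collect flights for each of the next n days and return the cheapest one."""
    flights = []
    for day in next_days(today, n):
        # fetch_flights(day) stands for steps 1 to 3 (Selenium + HTML parsing).
        flights.extend(fetch_flights(day))
    if not flights:
        return None
    return min(flights, key=lambda flight: flight["price"])


# Made-up prices (VND), one flight per day, so the sketch runs offline.
start = date(2024, 1, 1)
FAKE_PRICES = [1_200_000, 950_000, 1_050_000, 990_000, 1_300_000]
fake_table = dict(zip(next_days(start), FAKE_PRICES))


def fake_fetch(day):
    return [{"date": day, "price": fake_table[day]}]


best = cheapest_flight(start, fake_fetch)
print(best)
```

Keeping the browser automation behind a single function also makes it easy to swap Selenium for another approach without touching the comparison logic.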
 32
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection
 33
Collecting data using Web APIs
◎ Some websites offer an API (Application Programming Interface)
 to make it easier for external apps to retrieve data
◎ Using the web API is "more official" than parsing HTML
 ○ It is the door that the "host" opens for "guests" to come in and get the data
 → If the site has an API, use it first.
◎ You need to read the host's documentation to know what data can be retrieved,
 how to retrieve it, ...
◎ An (incomplete) list of sites providing APIs
 ○ https://github.com/public-apis/public-apis
 ○ Large sites like Twitter, Facebook, Google, ... often provide APIs
 ○ Some sites require registration to use the API (charges may apply)
 34
Example: Get information about current weather in Ho Chi Minh
City
 Parse HTML
 35
Example: Get information about current weather in Ho Chi Minh
City
 Use API: receive the data almost immediately ☺
 This is the XML (eXtensible Markup Language) format,
 which is similar to HTML
 - HTML is used to display data to viewers
 - XML is used to exchange data between computer
 applications over a network
 - XML is easier to parse than HTML
 36
Example: Get information about current weather in Ho Chi Minh
City
 Use API: receive the data almost immediately ☺
 Another format used by APIs is JSON (JavaScript Object Notation)
 • JSON is simpler and easier to parse than XML (however, it is
 less expressive than XML)
 • The simplicity of JSON is sufficient for many cases in practice →
 JSON is more common than XML
 • In this course, we will focus on JSON
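For comparison, here is a sketch that reads the same temperature from an XML payload and from a JSON payload using only the standard library. Both payloads are made up for illustration and do not reproduce the response of any particular weather API.

```python
import json
import xml.etree.ElementTree as ET

# Made-up payloads carrying the same information in the two formats.
xml_text = ('<current><city name="Ho Chi Minh City"/>'
            '<temperature value="31.5" unit="celsius"/></current>')
json_text = ('{"city": "Ho Chi Minh City", '
             '"temperature": {"value": 31.5, "unit": "celsius"}}')

# XML: walk the element tree and read attributes (always strings).
root = ET.fromstring(xml_text)
xml_city = root.find("city").get("name")
xml_temp = float(root.find("temperature").get("value"))

# JSON: the parsed result is already made of dicts, strings and numbers.
data = json.loads(json_text)
json_city = data["city"]
json_temp = data["temperature"]["value"]

print(xml_city, xml_temp)
print(json_city, json_temp)
```

Note that the XML value arrives as a string and has to be converted, while the JSON number is already a Python float.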
 37
Source: http://www.json.org/
 JSON
 “JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy
 for humans to read and write. It is easy for machines to parse and generate. It is based
 on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition
 - December 1999. JSON is a text format that is completely language independent but
 uses conventions that are familiar to programmers of the C-family of languages,
 including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These
 properties make JSON an ideal data-interchange language.
 JSON is built on two structures:
 ◎ A collection of name/value pairs. In various languages, this is realized as an object,
 record, struct, dictionary, hash table, keyed list, or associative array.
 ◎ An ordered list of values. In most languages, this is realized as an array, vector, list,
 or sequence.”
 38
Example: JSON
 Example 1:
 {"employees": [
 { "firstName":"John", "lastName":"Doe" },
 { "firstName":"Anna", "lastName":"Smith" },
 { "firstName":"Peter", "lastName":"Jones" }
 ]}
 Example 2:
 [
 { "id" : 1,
 "name" : "Hoa",
 "student": true,
 "email": null
 },
 { "id" : 2,
 "name" : "Mai",
 "student": true,
 "email": null
 }
 ]
 Example 3: File *.ipynb
 { "cells": [
 { "cell_type": "markdown",
 "metadata": {},
 "source": ["# Continue "]
 },
 ...
 39
How to use Web API in Python?
Q: How do we get the JSON content that the site returns through the
API?
A: Use the Requests library
Q: How do we parse JSON (convert a JSON string into a Python data
structure)?
A: Use the json library
 40
Requests Library
◎ It has the same author as the Requests-HTML library
 ○ to only get site content: use Requests
 ○ to get site content and parse HTML: use Requests-HTML
◎ It is installed along with Requests-HTML. Otherwise:
 pip install requests
◎ Basic usage:
 ○ import requests
 ○ r = requests.get('site path')
 ○ r.text # content string (HTML/XML/JSON) sent from the server
 41
JSON Library
◎ It is a built-in Python library
◎ Basic usage:
 ○ import json
 ○ # JSON string → Python data structure (parse JSON):
 ○ json_pydata = json.loads(json_str)
 ○ # Python data structure → JSON string:
 ○ json_str = json.dumps(json_pydata)
 ○ # JSON file → Python data structure:
 ○ json_pydata = json.load(json_fileobj)
 ○ # Python data structure → JSON file:
 ○ json.dump(json_pydata, json_fileobj)
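The four calls can be tried end to end on a small JSON string. In this sketch the sample data is made up, and an in-memory io.StringIO object stands in for a real file.

```python
import io
import json

json_str = '[{"id": 1, "name": "Hoa", "student": true, "email": null}]'

# JSON string -> Python data: true becomes True, null becomes None.
json_pydata = json.loads(json_str)
print(json_pydata[0]["student"], json_pydata[0]["email"])

# Python data -> JSON string.
back_to_str = json.dumps(json_pydata)

# Python data -> JSON "file" -> Python data (StringIO stands in for a file).
json_fileobj = io.StringIO()
json.dump(json_pydata, json_fileobj)
json_fileobj.seek(0)  # rewind before reading what was just written
reloaded = json.load(json_fileobj)
print(reloaded == json_pydata)
```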
 42
43
References
◎ Slides from Tran Trung Kien
 44