Data Engineering Concepts #2 —
Sending Data Using an API
Bar Dadon · Published in Dev Genius · 7 min read · Jul 17
Photo by Myriam Jessier on Unsplash
Introduction
One of the main responsibilities of data engineers is to transfer data between
a source and a destination, and there are many different ways to do it.
Depending on the problem, this job often requires a data engineer to build
and maintain a complex data pipeline. However, data pipelines are not the
only way to move data between machines or services.
In many cases, we can accomplish this task by building a simple API that
allows authorized users to request data from our services.
What is an API?
An API is simply an interface that allows users to send HTTP requests over
the internet to a server. Using these HTTP requests, a user can interact with
various services on the server, such as querying a database or executing a
function.
The developers who create the API control which operations users can
activate when they send HTTP requests.
For example, we can create an API that, given the correct request, runs a
function that executes a query retrieving the five most active customer IDs
over the last month from a table called “customers”.
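As a rough sketch, the function behind such an endpoint could look something
like this; the “customers” table, its last_active and activity_count columns,
and the warehouse.db file are all hypothetical and only illustrate the idea:

import sqlite3

def top_customer_ids(db_path="warehouse.db"):
    # Hypothetical query: the five most active customer IDs over the last month
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT customer_id
        FROM customers
        WHERE last_active >= date('now', '-1 month')
        ORDER BY activity_count DESC
        LIMIT 5
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]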
When to use an API instead of a data pipeline
APIs can be a great replacement for pipelines, but we should know when to
use them.
First, because APIs are used to send data over the internet, we can only send
relatively small amounts of data in each request. Also, if there’s a need for
highly complex processing of the data, then the API will be slow and
inefficient. In those cases, we should create a data pipeline instead.
However, APIs can replace a pipeline when the data needed is lightweight
and there’s no need for scheduling.
APIs also allow users to pull the data on their own. Users can interact with a
service whenever they choose, without having to request a data engineer to
execute a certain pipeline.
Of course, we can always use a hybrid approach: build a data pipeline that
transfers and processes large amounts of data into a repository of our
choice, then create an API that serves small amounts of that processed data
to users.
Example
To make this more concrete, let’s build a simple API using Flask. This API
will allow users to send a GET request to our service. If the request is valid,
the API will scrape the website example.com and return the requested
number of letters from it.
http://example.com/
1. Setting the environment
To get started, let’s create a virtual environment:
root@DESKTOP-3U7IV4I:/projects# python3 -m venv api_example
Then activate it:
root@DESKTOP-3U7IV4I:/projects/api_example# source bin/activate
To verify that we are currently in the virtual environment, the prompt should
look like this:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example#
Next, we need to pip install the libraries flask, bs4, and requests:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example# pip install flask bs4 requests
Next, create a folder called “app” and a file app.py:
app/app.py
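For example, from the project root:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example# mkdir app && touch app/app.py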
Great. Now we can build the API.
2. Building the API
First, let’s write the function for scraping the website example.com and
retrieving all the text we can find.
from bs4 import BeautifulSoup
import requests


def scrape_data(url="http://example.com/"):
    '''
    1. Send a GET request to http://example.com/.
    2. Parse the response.
    3. Return all the text in the website.

    Args:
        - url(str), default "http://example.com/"
    Returns:
        - text(str)
    '''
    def extract():
        # Request the page and make sure the connection succeeded
        response = requests.get(url)
        if response.status_code == 200:
            print("Connection Successful")
        else:
            raise ConnectionError("Something Went Wrong!")
        return response

    def transform(response):
        # Parse the HTML and concatenate the text of every <p> element
        text = ''
        soup = BeautifulSoup(response.text, 'html.parser')
        elements = soup.find_all(name='p')
        for ele in elements:
            text += ele.text
        return text

    return transform(extract())


if __name__ == "__main__":
    data = scrape_data()
    print(data)
Output:
Connection Successful
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.More information...
The function seems to be working properly. Let’s start building the API now. I
will use Flask to create a local app that listens on port 5000.
Any user who sends a GET request to the URL localhost:5000/ will activate
the above function and receive the text that we just scraped.
from flask import Flask
from bs4 import BeautifulSoup
import requests


def scrape_data(url="http://example.com/"):
    '''
    1. Send a GET request to http://example.com/.
    2. Parse the response.
    3. Return all the text in the website.

    Args:
        - url(str), default "http://example.com/"
    Returns:
        - text(str)
    '''
    def extract():
        response = requests.get(url)
        if response.status_code == 200:
            print("Connection Successful")
        else:
            raise ConnectionError("Something Went Wrong!")
        return response

    def transform(response):
        text = ''
        soup = BeautifulSoup(response.text, 'html.parser')
        elements = soup.find_all(name='p')
        for ele in elements:
            text += ele.text
        return text

    return transform(extract())


# Create a flask app
app = Flask(__name__)


# Implement a route to scrape data
@app.route('/')
def get_data():
    data = scrape_data()
    return data


# Run the app
if __name__ == "__main__":
    app.run(debug=True, host="localhost", port=5000)
To run the app, go to the folder “app” and run:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example# cd app
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example/app# flask run
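Alternatively, since the script calls app.run() under if __name__ == "__main__",
we can also start it directly with the Python interpreter:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example/app# python app.py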
If we go to localhost:5000/ we will see the scraped text in our simple app:
Our app at: localhost:5000
3. Using the API
Now, let’s say that we are users who need this data and want to use the API
that the developers built. To access this data, we need to send a GET request
to localhost:5000/.
We can do that in many different ways. There are tons of tools for the job;
the simplest is the Linux command “curl”.
Let’s use a curl command to grab this data and store it in a text file called
“scraped_data.txt”:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example# curl -o scraped_data.txt localhost:5000/
Output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   175  100   175    0     0    634      0 --:--:-- --:--:-- --:--:--   636
We should now have all the scraped text in the text file:
scraped_data.txt
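curl is not the only option. Any HTTP client works; for instance, here is a
minimal sketch that pulls the same data with Python’s requests library
(assuming the Flask app is still running locally on port 5000):

import requests

# Send a GET request to our local API and write the response body to a file
response = requests.get("http://localhost:5000/")
response.raise_for_status()  # raise an error on a non-2xx response
with open("scraped_data.txt", "w") as f:
    f.write(response.text)
print(response.text)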
4. Improving the API
Let’s go back to playing the developers. As the developers who built this
API, we are also tasked with adding a layer of security. We can’t allow
anyone who sends a simple GET request to grab our data.
a. Adding an API key
A very common way of adding a layer of security is by adding an API key.
For this simple example, let’s say that the API key is 12345. We want to
modify the code so that only requests to the URL
localhost:5000/api_key=12345 will be granted data. All other requests will
fail.
This will ensure that only users who know the API key we chose are
authorized to send GET requests.
from flask import Flask
from bs4 import BeautifulSoup
import requests


def scrape_data(url="http://example.com/"):
    '''
    1. Send a GET request to http://example.com/.
    2. Parse the response.
    3. Return all the text in the website.

    Args:
        - url(str), default "http://example.com/"
    Returns:
        - text(str)
    '''
    def extract():
        response = requests.get(url)
        if response.status_code == 200:
            print("Connection Successful")
        else:
            raise ConnectionError("Something Went Wrong!")
        return response

    def transform(response):
        text = ''
        soup = BeautifulSoup(response.text, 'html.parser')
        elements = soup.find_all(name='p')
        for ele in elements:
            text += ele.text
        return text

    return transform(extract())


# Create a flask app
app = Flask(__name__)
API_KEY = '12345'


# Implement a route to scrape data
@app.route('/api_key=<api_key>')
def get_data(api_key):
    if api_key != API_KEY:
        raise ConnectionRefusedError("Wrong API key!")
    else:
        data = scrape_data()
        return data


# Run the app
if __name__ == "__main__":
    app.run(debug=True, host="localhost", port=5000)
Now, let’s send a GET request, but this time with the API key 12345:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example# curl -o scraped_data.txt localhost:5000/api_key=12345
Output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   175  100   175    0     0    653      0 --:--:-- --:--:-- --:--:--   655
Great. Now, only authorized users who know that the API key is 12345 can
scrape our data.
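A quick note on the design: putting the key in the URL path works for a toy
example, but a more common pattern is to pass it as a query parameter (or a
request header) and return a proper 401 status when it is wrong. Here is a
minimal sketch of the query-parameter variant, reusing the scrape_data
function from above; the /data route and the api_key parameter name are
just illustrative:

from flask import Flask, request, abort

app = Flask(__name__)
API_KEY = '12345'

@app.route('/data')
def get_data():
    # Expect the key as a query parameter, e.g. localhost:5000/data?api_key=12345
    if request.args.get('api_key') != API_KEY:
        abort(401, description="Wrong API key!")
    # scrape_data() is the same scraping function defined earlier
    return scrape_data()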
b. Controlling the amount of data
Next, let’s allow users to control the amount of data they receive. Instead of
receiving all the data, users will be able to choose how many letters they
want. The code can look like this:
from flask import Flask
from bs4 import BeautifulSoup
import requests


def scrape_data(url="http://example.com/"):
    '''
    1. Send a GET request to http://example.com/.
    2. Parse the response.
    3. Return all the text in the website.

    Args:
        - url(str), default "http://example.com/"
    Returns:
        - text(str)
    '''
    def extract():
        response = requests.get(url)
        if response.status_code == 200:
            print("Connection Successful")
        else:
            raise ConnectionError("Something Went Wrong!")
        return response

    def transform(response):
        text = ''
        soup = BeautifulSoup(response.text, 'html.parser')
        elements = soup.find_all(name='p')
        for ele in elements:
            text += ele.text
        return text

    return transform(extract())


# Create a flask app
app = Flask(__name__)
API_KEY = '12345'


# Implement a route to scrape data
@app.route('/api_key=<api_key>/number_of_letters=<number_of_letters>')
def get_data(api_key, number_of_letters):
    if api_key != API_KEY:
        raise ConnectionRefusedError("Wrong API key!")
    else:
        data = scrape_data()
        return data[0:int(number_of_letters)]


# Run the app
if __name__ == "__main__":
    app.run(debug=True, host="localhost", port=5000)
Now let’s say that we want only the first 100 letters. We can send a GET
request like this:
(api_example) root@DESKTOP-3U7IV4I:/projects/api_example# curl -o scraped_data.txt localhost:5000/api_key=12345/number_of_letters=100
And the result is a text file with only the first 100 letters:
scraped_data.txt — only the first 100 letters
As we can see, APIs are a useful way to send small amounts of data online
and enable users to access services that developers provide.
This concludes the article. Hope you had a good read and learned something
new. If there are any questions, please don’t hesitate to ask in the comment
section.