ArcticDB is a modern DataFrame database built for Pandas. It can handle billions of rows at scale, which makes it a good fit for quantitative analysis, so I decided to give it a spin in my data scraping project.
The Serverless Framework has also been my top choice for developing Lambda functions and deploying them to AWS. In this project, I wrote a data scraping function that gets triggered every minute to scrape news and store it in ArcticDB.
First I created an S3 bucket named `devto-arctic`, then connected to it from a local Jupyter Notebook to set up a library. I opted for the AWS access key method to authenticate against the bucket.
```python
# Jupyter Notebook
import os

import arcticdb as adb
import dotenv
import pandas as pd

dotenv.load_dotenv()

# Connect to the S3 bucket with access keys
ac = adb.Arctic(
    f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2"
    f"&access={os.getenv('AWS_ACCESS_KEY_ID')}&secret={os.getenv('AWS_SECRET_ACCESS_KEY')}"
)

lib = ac.create_library('intro')   # create the library and keep a handle to it
ac.list_libraries()                # output the list of libraries in the db

df = pd.DataFrame()
lib.write('news_frame', df)        # write an empty df to a symbol (a "table")
```
You will notice that objects prefixed with the library name now appear inside your S3 bucket.
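As a quick sanity check, you can round-trip a small DataFrame from the same notebook session. This is a minimal sketch, continuing from the `ac` connection above; the `sanity_check` symbol name is just a hypothetical throwaway:

```python
# Round-trip a tiny DataFrame through the library to confirm the connection works
import pandas as pd

lib = ac.get_library('intro')
sample = pd.DataFrame({'title': ['hello arctic'], 'source': ['devto']})

lib.write('sanity_check', sample)     # throwaway symbol, kept separate from news_frame
print(lib.read('sanity_check').data)  # should print the sample DataFrame back
print(lib.list_symbols())             # symbols behave like tables in a library

lib.delete('sanity_check')            # clean up the throwaway symbol
```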
Next, let's set up the Lambda function with the Serverless Framework. After `npm install serverless`, we can initialize a Python project. Run `serverless login` to log in to your Serverless account before initialization, then execute `serverless` and choose the scheduled task Python template as the starter.
Once initialized, you'll get a Python project folder with all the necessary files. `handler.py` should contain your function code that connects to ArcticDB and performs the data reads and writes.
```python
# handler.py
import datetime
import json
import logging
import os

import arcticdb as adb
import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()

ac = adb.Arctic(
    f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2"
    f"&access={os.environ['AWS_ACCESS_KEY_ENV']}&secret={os.environ['AWS_SECRET_ACCESS_KEY_ENV']}"
)
lib = ac.get_library('intro', create_if_missing=True)
ac.list_libraries()
lib.list_symbols()  # symbols are equivalent to tables in a library

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def fetch_news():
    url = "https://news.endpoint.com/api?limit=500"  # dummy endpoint
    try:
        response = requests.get(url)
        response.raise_for_status()  # raise an exception for bad status codes
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error fetching news: {str(e)}")
        return None


def run(event, context):
    symbol = 'news_frame'
    current_time = datetime.datetime.now().timestamp() * 1000
    logger.info("Your cron function ran at " + str(datetime.datetime.now().time()))

    # Fetch news data
    news_data = fetch_news()
    if news_data is None:
        return {
            'statusCode': 500,
            'body': json.dumps('Failed to fetch news data')
        }

    df = pd.DataFrame([{
        'time': datetime.datetime.fromtimestamp(int(news['time']) / 1000),  # convert ms to datetime
        'title': str(news.get('title', '')),
        'source': str(news.get('source', '')),
        'news_id': str(news.get('news_id', '')),
        'url': str(news.get('url', '')),
        'icon': str(news.get('icon', '')),
        'image': str(news.get('image', ''))
    } for news in news_data])

    try:
        print(f"\nWriting DataFrame for {symbol}:")
        lib.append(symbol, df)  # use append so it doesn't overwrite old data
        print(f"Successfully wrote {symbol} to ArcticDB")
    except Exception as e:
        print(f"Error writing {symbol} to ArcticDB: {str(e)}")

    logger.info("Successfully processed news articles")
    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Successfully processed news data',
            'time': str(current_time)
        })
    }
```
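One thing to keep in mind with `append`: the endpoint returns the latest 500 items on every run, so consecutive runs will overlap and the same articles can get appended repeatedly. Below is a minimal sketch of one way to filter duplicates before appending; it is a hypothetical helper, not part of the handler above, and it assumes `news_id` uniquely identifies an article:

```python
# Hypothetical dedup step to call before lib.append(symbol, df):
# drop rows whose news_id is already stored in the symbol.
def drop_existing(lib, symbol, df):
    if not lib.has_symbol(symbol):
        return df
    existing = lib.read(symbol, columns=['news_id']).data
    return df[~df['news_id'].isin(existing['news_id'])]
```

Reading the full `news_id` column on every invocation is fine at small volumes; for larger data you would want a cheaper watermark, such as only keeping items newer than the last stored timestamp.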
Now we can deploy the Lambda function, but first make sure `requirements.txt` has all the dependencies:
```text
# requirements.txt
arcticdb; sys_platform != "darwin"
requests
pandas
numpy
python-dotenv
```
Note that we skip arcticdb in the pip install because, at the time of writing, a pip binary for macOS is not yet available. Running `pip install` locally would fail without the `sys_platform != "darwin"` environment marker; it is a workaround so that macOS skips installing arcticdb via pip. You don't need the marker on Windows or Linux.
If you are on a Mac and want to test the code locally, activate a Python virtual environment, install arcticdb with `conda install -c conda-forge arcticdb`, and run `serverless invoke local` to execute the function.
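If `serverless invoke local` gives you trouble, you can also exercise the handler directly from Python. A minimal sketch, assuming the handler module above and a local .env with the access keys; the file name `invoke_local.py` is just a hypothetical example:

```python
# invoke_local.py - run the handler without the Serverless CLI
from handler import run

# Lambda passes an event dict and a context object; this handler uses neither,
# so empty placeholders are enough for a local smoke test.
response = run(event={}, context=None)
print(response)
```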
In the project's package.json, I made sure the `serverless-python-requirements` plugin is included so that, during deployment, the Python dependencies in requirements.txt are packaged as a Layer from which the Lambda function can import its modules.
Next, if you are on Windows or Linux, you can deploy straight from your machine by running `serverless deploy`. Deploying from a Mac can fail because, as mentioned, arcticdb errors out when it can't find a binary distribution. The workaround is to package and deploy the Lambda from a cloud CI/CD pipeline.
The `install-plugin` and `deploy` scripts in package.json will be used in CI/CD. In this case, let's use GitHub Actions as the deployment tool, with the workflow as follows:
```yaml
# deploy.yml
name: deploy serverless
on:
  push:
    branches:
      - main
jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    permissions:
      contents: read
      deployments: write
    strategy:
      matrix:
        node-version: [18.x]
        python-version: [3.9]
    steps:
      - uses: actions/checkout@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
      - run: npm ci --include=dev
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Install Plugin and Deploy
        run: npm run install-plugin && npm run deploy
        env:
          SERVERLESS_ACCESS_KEY: ${{ secrets.SERVERLESS_ACCESS_KEY }}
```
The step that configures your AWS credentials allows Serverless to deploy into your AWS environment. Make sure the IAM user behind that access key has administrative permissions for Lambda and S3.
The GitHub Action above is triggered on every push to the main branch; you can configure the trigger however you like.
After deployment, you can see that an EventBridge rule is automatically set up as the scheduler, and a Layer is uploaded and attached to the Lambda.
Hooray, there we go: a serverless approach to scraping data and saving it into ArcticDB! You can then use a Jupyter Notebook to read and analyze the data locally while the Lambda does its thing in the background.
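For example, here is a minimal read-back sketch from the notebook, assuming the same connection string and library as earlier; the `value_counts` summary is just an illustrative analysis:

```python
# Jupyter Notebook: read the scraped news back for analysis
import os

import arcticdb as adb
import dotenv

dotenv.load_dotenv()
ac = adb.Arctic(
    f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2"
    f"&access={os.getenv('AWS_ACCESS_KEY_ID')}&secret={os.getenv('AWS_SECRET_ACCESS_KEY')}"
)
lib = ac.get_library('intro')

news = lib.read('news_frame').data              # full history appended by the Lambda
print(news.shape)
print(news['source'].value_counts().head())     # e.g. which sources appear most often
print(lib.tail('news_frame', 5).data)           # the five most recently appended rows
```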