ArcticDB is a modern DataFrame database built for Pandas. It can handle billions of rows at scale, which makes it a good fit for quantitative analysis, so I decided to give it a spin in my data scraping project.
The Serverless Framework has also been my top choice for developing Lambda functions and deploying them to AWS. In this project, I wrote a data scraping function that gets triggered every minute to scrape news and store it in ArcticDB.
First I created an S3 bucket named `devto-arctic`, then connected to it from a local Jupyter Notebook to set up a library. I opted for the AWS access key method to authenticate against the bucket.
```python
# Jupyter Notebook
import os

import arcticdb as adb
import dotenv
import pandas as pd

dotenv.load_dotenv()

# Connect to the S3 bucket with access keys
ac = adb.Arctic(
    f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2"
    f"&access={os.getenv('AWS_ACCESS_KEY_ID')}&secret={os.getenv('AWS_SECRET_ACCESS_KEY')}"
)

lib = ac.create_library('intro')   # create the library and keep a handle to it
ac.list_libraries()                # output the list of libraries in the db

df = pd.DataFrame()
lib.write('news_frame', df)        # write an empty df to a symbol (a "table")
```
You will notice that objects prefixed with the library name now appear inside your S3 bucket.
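As a quick sanity check, you can round-trip a small DataFrame from the same notebook session. This is a minimal sketch, continuing from the `ac` connection above; the `sanity_check` symbol name is just a hypothetical throwaway:

```python
# Round-trip a tiny DataFrame through the library to confirm the connection works
import pandas as pd

lib = ac.get_library('intro')
sample = pd.DataFrame({'title': ['hello arctic'], 'source': ['devto']})

lib.write('sanity_check', sample)     # throwaway symbol, kept separate from news_frame
print(lib.read('sanity_check').data)  # should print the sample DataFrame back
print(lib.list_symbols())             # symbols behave like tables in a library

lib.delete('sanity_check')            # clean up the throwaway symbol
```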
Next, let's set up the Lambda function with the Serverless Framework. After `npm install serverless`, we can initialize a Python project. Run `serverless login` to log in to your Serverless account before initialization, then execute `serverless` and choose the scheduled task Python template as the starter.
Once initialized, you'll get a Python project folder with all the necessary files. `handler.py` should contain your function code that connects to ArcticDB and performs the data reads and writes.
```python
# handler.py
import datetime
import json
import logging
import os

import arcticdb as adb
import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()

ac = adb.Arctic(
    f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2"
    f"&access={os.environ['AWS_ACCESS_KEY_ENV']}&secret={os.environ['AWS_SECRET_ACCESS_KEY_ENV']}"
)
lib = ac.get_library('intro', create_if_missing=True)
ac.list_libraries()
lib.list_symbols()  # symbols are equivalent to tables in a library

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def fetch_news():
    url = "https://news.endpoint.com/api?limit=500"  # dummy endpoint
    try:
        response = requests.get(url)
        response.raise_for_status()  # raise an exception for bad status codes
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error fetching news: {str(e)}")
        return None


def run(event, context):
    symbol = 'news_frame'
    current_time = datetime.datetime.now().timestamp() * 1000
    logger.info("Your cron function ran at " + str(datetime.datetime.now().time()))

    # Fetch news data
    news_data = fetch_news()
    if news_data is None:
        return {
            'statusCode': 500,
            'body': json.dumps('Failed to fetch news data')
        }

    df = pd.DataFrame([{
        'time': datetime.datetime.fromtimestamp(int(news['time']) / 1000),  # convert ms to datetime
        'title': str(news.get('title', '')),
        'source': str(news.get('source', '')),
        'news_id': str(news.get('news_id', '')),
        'url': str(news.get('url', '')),
        'icon': str(news.get('icon', '')),
        'image': str(news.get('image', ''))
    } for news in news_data])

    try:
        print(f"\nWriting DataFrame for {symbol}:")
        lib.append(symbol, df)  # use append so it doesn't overwrite old data
        print(f"Successfully wrote {symbol} to ArcticDB")
    except Exception as e:
        print(f"Error writing {symbol} to ArcticDB: {str(e)}")

    logger.info("Successfully processed news articles")
    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Successfully processed news data',
            'time': str(current_time)
        })
    }
```
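One thing to keep in mind with `append`: the endpoint returns the latest 500 items on every run, so consecutive runs will overlap and the same articles can get appended repeatedly. Below is a minimal sketch of one way to filter duplicates before appending; it is a hypothetical helper, not part of the handler above, and it assumes `news_id` uniquely identifies an article:

```python
# Hypothetical dedup step to call before lib.append(symbol, df):
# drop rows whose news_id is already stored in the symbol.
def drop_existing(lib, symbol, df):
    if not lib.has_symbol(symbol):
        return df
    existing = lib.read(symbol, columns=['news_id']).data
    return df[~df['news_id'].isin(existing['news_id'])]
```

Reading the full `news_id` column on every invocation is fine at small volumes; for larger data you would want a cheaper watermark, such as only keeping items newer than the last stored timestamp.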
Now we can deploy the Lambda function, but first make sure `requirements.txt` has all the dependencies:
```text
# requirements.txt
arcticdb; sys_platform != "darwin"
requests
pandas
numpy
python-dotenv
```
Note that we skip arcticdb in the pip install because, at the time of writing, a pip binary for macOS is not yet available. Running `pip install` locally would fail without the `sys_platform != "darwin"` environment marker; it is a workaround so that macOS skips installing arcticdb via pip. You don't need the marker on Windows or Linux.
If you are on a Mac and want to test the code locally, activate a Python virtual environment, install arcticdb with `conda install -c conda-forge arcticdb`, and run `serverless invoke local` to execute the function.
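If `serverless invoke local` gives you trouble, you can also exercise the handler directly from Python. A minimal sketch, assuming the handler module above and a local .env with the access keys; the file name `invoke_local.py` is just a hypothetical example:

```python
# invoke_local.py - run the handler without the Serverless CLI
from handler import run

# Lambda passes an event dict and a context object; this handler uses neither,
# so empty placeholders are enough for a local smoke test.
response = run(event={}, context=None)
print(response)
```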
In the project's package.json, I made sure the `serverless-python-requirements` plugin is included so that, during deployment, the Python dependencies in requirements.txt are packaged as a Layer from which the Lambda function can import its modules.
Next, if you are on Windows or Linux, you can deploy straight from your machine by running `serverless deploy`. Deploying from a Mac can fail because, as mentioned, arcticdb errors out when it can't find a binary distribution. The workaround is to package and deploy the Lambda from a cloud CI/CD pipeline.
The `install-plugin` and `deploy` scripts in package.json will be used in CI/CD. In this case, let's use GitHub Actions as the deployment tool, with the workflow as follows:
```yaml
# deploy.yml
name: deploy serverless
on:
  push:
    branches:
      - main
jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    permissions:
      contents: read
      deployments: write
    strategy:
      matrix:
        node-version: [18.x]
        python-version: [3.9]
    steps:
      - uses: actions/checkout@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
      - run: npm ci --include=dev
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Install Plugin and Deploy
        run: npm run install-plugin && npm run deploy
        env:
          SERVERLESS_ACCESS_KEY: ${{ secrets.SERVERLESS_ACCESS_KEY }}
```
The step that configures your AWS credentials allows Serverless to deploy into your AWS environment. Make sure the IAM user behind that access key has administrative permissions for Lambda and S3.
The GitHub Action above is triggered on every push to the main branch; you can configure the trigger however you like.
After deployment, you can see that an EventBridge rule is automatically set up as the scheduler, and a Layer is uploaded and attached to the Lambda.
Hooray, there we go: a serverless approach to scraping data and saving it into ArcticDB! You can then use a Jupyter Notebook to read and analyze the data locally while the Lambda does its thing in the background.
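For example, here is a minimal read-back sketch from the notebook, assuming the same connection string and library as earlier; the `value_counts` summary is just an illustrative analysis:

```python
# Jupyter Notebook: read the scraped news back for analysis
import os

import arcticdb as adb
import dotenv

dotenv.load_dotenv()
ac = adb.Arctic(
    f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2"
    f"&access={os.getenv('AWS_ACCESS_KEY_ID')}&secret={os.getenv('AWS_SECRET_ACCESS_KEY')}"
)
lib = ac.get_library('intro')

news = lib.read('news_frame').data              # full history appended by the Lambda
print(news.shape)
print(news['source'].value_counts().head())     # e.g. which sources appear most often
print(lib.tail('news_frame', 5).data)           # the five most recently appended rows
```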