DEV Community

Cover image for 🛡️ Data quality, SQL, duckdb and http_client on CI🦆
adriens
adriens

Posted on

🛡️ Data quality, SQL, duckdb and http_client on CI🦆

💭 CI, duckdb et and data protection

To efficiently yet effortlessly manage data quality, we created a GitHub Action to install duckdb :

... but recently I had to face an another challenge : as part of our CI, I had the need to validate data... that were relying on web resources.

I needed to be sure that a GitHub Account was really existing (for example to avoid typos) as part of our CI.

In this very short article, I'll show how to use DuckDB with the http_client extension to verify GitHub handles stored in a table, for example to lint data as part of a CI pipeline thanks to GitHub Duckdb Action... and do the job with a very simple SQL script and CHECK constraints.

🍿 For impatients

🔖 Resources

🕹️ Do the job with SQL

First install extension:

INSTALL http_client FROM community; LOAD http_client; 
Enter fullscreen mode Exit fullscreen mode

Then create a table with some example data:

create or replace table person_gh_member ( sam_accountname varchar primary key, gh_member varchar not null ); -- insert rows INSERT INTO person_gh_member (sam_accountname, gh_member) values ('adriens', 'adriens'), ('jdoe', 'johndoe'), ('asmith', 'annasmithRRRRR'); 
Enter fullscreen mode Exit fullscreen mode

Finally, run a query to check the status of each GitHub handle:

create or replace view v_person_gh_status as select sam_accountname, gh_member, 'https://github.com/' || gh_member as gh_url, cast(http_get(gh_url).status as integer) as http_gh_status from person_gh_member; --where gh_status <> 200; from v_person_gh_status; 
Enter fullscreen mode Exit fullscreen mode

Now, to lint, we can add a check to see if any status is not 200:

from v_person_gh_status where http_gh_status <> 200; 
Enter fullscreen mode Exit fullscreen mode
create or replace table lint_gh_handle( gh_handle varchar primary key, gh_url varchar not null unique check (gh_url like 'https://github.com/%'), gh_status integer check (gh_status = 200) ); insert into lint_gh_handle( gh_handle, gh_url, gh_status) select gh_member, 'https://github.com/' || gh_member as gh_url, cast(http_get(gh_url).status as integer) as gh_status from person_gh_member; 
Enter fullscreen mode Exit fullscreen mode

Top comments (4)

Collapse
 
parag_nandy_roy profile image
Parag Nandy Roy

Very cool use of DuckDB..

Collapse
 
adriens profile image
adriens

duckdb is such a productivity tool 🤩

Collapse
 
adriens profile image
adriens

Collapse
 
adriens profile image
adriens

🎁 Thank you for your extension (blog and video gift) #24

Hi guys,

we are maintaining ℹ️ Setup Duckdb Github Action... and we recently had the need to protect some data on our CI. I discovered your extension... which did a great job 😃

So I decided to document this use case, so here is below some feedback :

Thanks a lot for the great work, it will save us lot of efforts 🦥