May the Blocks Be With You: Parallel Processing in Mage AI

In tourism, many people still get cheated by scam companies. It happens a lot with umrah packages, tourist guides, and travel agencies. Why? Because it is not easy to check whether a company is legitimate.

The government has official websites with lists of banned, blacklisted, or registered names. There is a search function, but the problem is that the data is split across many different lists: one for tourist guides, one for umrah, one for travel agencies, and so on. You must pick the right list first and then search. On top of that, each list is paginated, so you still have to click through page by page, which is slow and unfriendly.

I started to think: what if we built one simple website where people just type a keyword, and it shows whether the name exists in any of the lists? That way, travelers could quickly check whether a company is real or a scam. By the way, I'm doing this for fun; I can't go anywhere during the school holidays because the roads are all jammed, so I'm spending the time on a little project.

The Challenge

The laziest part of this, honestly, is getting all the related data. Copying and pasting by hand is possible and easy, but it is far too much work, the kind that turns a fun project into a depressing project later on, haha. So why not use Mage AI, since I already used it for my previous data project?

At first, I created a normal block with a loop. It worked, but it was too slow because it went step by step through every page (bearable only for lists that don't have many pages). Then I realized: why not try a dynamic block? With dynamic blocks, I can run many requests at the same time with parallel processing. Much faster, much smarter.

Mage AI Dynamic Blocks

Here is where Mage AI helps. Mage AI has dynamic blocks. With this feature, we can scrape many pages in parallel, which makes things faster and easier. To learn more, see the Mage AI documentation on dynamic blocks.

This is how it works:

  1. Generate a list of URLs, including the pagination parameter, using a loader block. Keep in mind that a dynamic block must return a list of two lists of dictionaries (see the sketch after this list).
  2. Scrape each page based on the URL stored in its dictionary, and reduce the results into one set.
  3. Export the data to the destination.
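
To make point 1 concrete, here is a minimal sketch of that return contract; the URLs and block uuids below are placeholders, not the real ones:

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
from typing import Dict, List


@data_loader
def load_data(*args, **kwargs) -> List[List[Dict]]:
    # First list: one dict per child run the dynamic block should spawn.
    data = [
        dict(id=1, url="https://example.com/list?page=1"),
        dict(id=2, url="https://example.com/list?page=2"),
    ]
    # Second list: one metadata dict per child run; block_uuid names that run.
    metadata = [
        dict(block_uuid="scrape_page_1"),
        dict(block_uuid="scrape_page_2"),
    ]
    return [data, metadata]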

Example

First Step
Create a loader block, and make sure you set it as dynamic.

[Screenshot: Enable Dynamic Block]

Once the block is set as dynamic, you can write the loader below. Its purpose is to gather every targeted URL we want to scrape.

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
from typing import Dict, List

import requests
from bs4 import BeautifulSoup


@data_loader
def load_data(*args, **kwargs) -> List[List[Dict]]:
    """
    This loader prepares tasks for scraping multiple MOTAC pages.
    Each entry in 'urls' becomes a separate block run when used with dynamic blocks.
    """
    url = "https://the-targeted-url"

    response = requests.get(url, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The page shows the total record count in a disabled pagination item.
    record_tag = soup.select_one("li.uk-disabled span")
    jumlah_rekod = None
    if record_tag:
        text = record_tag.get_text(strip=True)
        jumlah_rekod = int(text.split(":")[-1].strip())
    if jumlah_rekod is None:
        raise ValueError("Could not find the total record count on the page")

    # The site paginates 20 records at a time via the 'v' offset parameter.
    urls = []
    for offset in range(0, jumlah_rekod, 20):
        if offset == 0:
            urls.append(url)
        else:
            urls.append(f"{url}?s=&n=&v={offset}")

    tasks = []
    metadata = []
    for idx, page_url in enumerate(urls, start=1):
        tasks.append(dict(id=idx, url=page_url))
        metadata.append(dict(block_uuid=f"scrape_page_{idx}"))

    return [
        tasks,
        metadata,
    ]

Second Step
Create a transformer. This transformer does the scraping and pulls all the data from each page. It is automatically treated as dynamic because the first block is dynamic. The only thing we need to do is reduce the output. We reduce because we want to export in one step, so we don't spawn an extra block run for every page just to export.

[Screenshot: Reduce Output]

import requests
from bs4 import BeautifulSoup


@transformer
def scrape_page(row, *args, **kwargs):
    url = row["url"]

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    results = []
    table = soup.find("table")
    if table:
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        for tr in table.find_all("tr")[1:]:  # skip header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                results.append(dict(zip(headers, cells)))

    return {
        "page_id": row["id"],
        "url": url,
        "records": results,
    }
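One detail worth spelling out: after the reduce, the downstream block receives the per-page dicts together, not a flat table. Before the cleaning step below, a small flattening transformer along these lines can turn the nested 'records' into a single DataFrame. This is a minimal sketch, assuming the reduced output arrives as a plain list of those dicts:

import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def combine_pages(data, *args, **kwargs):
    # 'data' is assumed to be the reduced output: one dict per page,
    # each carrying the 'records' list built by scrape_page above.
    rows = []
    for page in data:
        rows.extend(page['records'])
    return pd.DataFrame(rows)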

Third Step
Add another block to clean up the data format, column names, and so on before exporting.

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

import pandas as pd


@transformer
def transform(data, *args, **kwargs):
    df = data

    # Drop the running-number column scraped from the table.
    if '#' in df.columns:
        df = df.drop(columns=['#'])

    # Rename the Malay column headers to snake_case field names.
    df = df.rename(columns={
        'Nama': 'nama',
        'No. TG': 'no_tg',
        'Tempoh Sah': 'tempoh_sah',
        'Tarikh Batal': 'tarikh_batal',
        'Seksyen': 'seksyen',
    })

    # Parse the date columns; invalid values become NaT instead of raising.
    for col in ['tempoh_sah', 'tarikh_batal']:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], format='%d/%m/%y', errors='coerce')

    return df


@test
def test_output(output, *args) -> None:
    # Ensure required columns exist
    required_cols = ['nama', 'no_tg', 'tempoh_sah', 'tarikh_batal', 'seksyen']
    for col in required_cols:
        assert col in output.columns, f'Missing column: {col}'

Fourth Step
For now, since this is just a daily job, I am going to keep it simple and do a full load on every export. No worries, if I'm in the mood I will write a better approach for this :)

from os import path

from pandas import DataFrame

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:
    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.

    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    schema_name = 'public'
    table_name = 'pemandu_pelancong'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        loader.export(
            df,
            schema_name,
            table_name,
            index=False,
            if_exists='replace',
        )

Why Use Dynamic Blocks for Scraping?

Dynamic blocks are powerful because they make scraping large datasets much faster. Instead of one request after another, you can run many requests at the same time. For websites with hundreds of pages, this saves a lot of time.

But there are also a few things to keep in mind:

  1. Respect rate limits: some websites may block you if you send too many requests at once
  2. Error handling: always add retries in case some requests fail (see the sketch after this list)
  3. Data consistency: make sure to clean and validate the data before saving
  4. Ethics and legality: always check whether scraping the website is allowed
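
For point 2, a small helper like this hypothetical fetch_with_retries, swapped in for the bare requests.get calls in the blocks above, is usually enough. It is only a sketch; the retry count and backoff values are arbitrary:

import time

import requests


def fetch_with_retries(url, retries=3, backoff=2.0, timeout=30):
    """Fetch a URL, retrying transient failures with a growing pause."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries:
                raise
            # Sleep a bit longer after each failure to ease off the server.
            time.sleep(backoff * attempt)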

Closing Thoughts

This little holiday project showed me how useful Mage AI’s dynamic blocks can be. With just a few blocks, I turned a slow and boring manual process into a fast, automated pipeline. The scraped data can now be used to build a simple search directory, helping people quickly check if a company is real or a scam.

Dynamic blocks are not only fun; they're practical, powerful, and a great tool for anyone dealing with pagination or large batches of API calls.

So remember: when you face hundreds of pages, don't suffer like Anakin. Let the blocks be with you.
