DEV Community

Alain Airom

Yet another document ingestion project with Docling and IBM Cloud Code Engine (serverless)

A recent project concept: a serverless application powered by Docling's document ingestion and preparation capabilities.


Introduction

As part of my professional activities, I often help our business partners gain hands-on technical experience with the technologies and tools we recommend to them. What follows is one part of a larger project in which we provided a partner with code samples to accelerate the first phase of their work.

> The code provided below is meant to be used as a starter or helper and must be adapted to the real use case. It should not be considered a finished or end-to-end project, but a project starter/helper.

The main idea is:

  • An application lets users upload documents to a cloud file system.
  • A serverless job application using Docling fetches the documents, prepares them for later use, and drops the results in another cloud file system.
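
The two steps above can be sketched as a small batch job. This is only a minimal illustration using local directories in place of the cloud file systems; `run_job` and the `convert` callable are hypothetical names, and `convert` stands in for the Docling step shown later in this post.

```python
from pathlib import Path
from typing import Callable


def run_job(source_dir: Path, target_dir: Path,
            convert: Callable[[Path], str]) -> list[Path]:
    """Convert every file in source_dir and write the results to target_dir.

    `convert` is a placeholder for the Docling step: it takes a document
    path and returns the prepared text (for example Markdown).
    """
    if not source_dir.is_dir():
        return []  # nothing uploaded yet
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(source_dir.iterdir()):
        if not doc.is_file():
            continue  # sub-directories are skipped in this simple sketch
        out_path = target_dir / f"{doc.stem}.md"
        out_path.write_text(convert(doc), encoding="utf-8")
        written.append(out_path)
    return written


if __name__ == "__main__":
    # Stand-in converter; a real job would call Docling here.
    def fake_convert(path: Path) -> str:
        return f"# {path.name}\n"

    for result in run_job(Path("uploads"), Path("prepared"), fake_convert):
        print(f"wrote {result}")
```

Passing the converter in as a parameter keeps the fetch/write logic testable without installing Docling.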

The serverless application, deployed on IBM Cloud Code Engine, fetches its source code and updates from a private GitHub repository.


What is Docling and what is it used for?

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

Features

  • 🗂️ Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
  • 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
  • 🧬 Unified, expressive DoclingDocument representation format
  • ↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
  • 🔒 Local execution capabilities for sensitive data and air-gapped environments
  • 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
  • 🔍 Extensive OCR support for scanned PDFs and images
  • 💻 Simple and convenient CLI

The file uploading application

I proposed two simple applications to upload and store files. I first wrote one using FastAPI.

File uploading using FastAPI

```python
import os

from fastapi import FastAPI, File, Request, UploadFile
from fastapi.responses import HTMLResponse, RedirectResponse
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")

UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)


def get_uploaded_files():
    """Return the sorted names of the files in the upload directory."""
    try:
        files = os.listdir(UPLOAD_DIR)
        files.sort()
        return files
    except FileNotFoundError:
        return []


@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    uploaded_files = get_uploaded_files()
    return templates.TemplateResponse(
        "index.html",
        {"request": request, "filename": None, "message": None, "uploaded_files": uploaded_files},
    )


@app.post("/upload", response_class=HTMLResponse)
async def upload_file(request: Request, file: UploadFile = File(...)):
    filename = file.filename
    filepath = os.path.join(UPLOAD_DIR, filename)

    # If the file already exists, ask the user before overwriting it
    if os.path.exists(filepath):
        return templates.TemplateResponse("confirm.html", {"request": request, "filename": filename})

    with open(filepath, "wb") as f:
        contents = await file.read()
        f.write(contents)
    uploaded_files = get_uploaded_files()  # Refresh file list
    return templates.TemplateResponse(
        "index.html",
        {"request": request, "filename": filename,
         "message": f"File '{filename}' uploaded successfully.", "uploaded_files": uploaded_files},
    )


@app.post("/confirm_replace", response_class=HTMLResponse)
async def confirm_replace(request: Request):
    form = await request.form()
    filename = form.get("filename")
    replace = form.get("replace")

    if not filename or not replace:
        return templates.TemplateResponse(
            "index.html", {"request": request, "message": "Missing filename or replace value."}
        )

    filepath = os.path.join(UPLOAD_DIR, filename)

    if replace == "yes":
        try:
            # The replacement file arrives in the same multipart form
            # (Request has no files() method; uploads are read from the form)
            file = form.get("file")
            if file is None or not getattr(file, "filename", ""):
                return templates.TemplateResponse(
                    "index.html", {"request": request, "message": "No file uploaded for replacement."}
                )
            contents = await file.read()
            with open(filepath, "wb") as f:
                f.write(contents)
            uploaded_files = get_uploaded_files()  # Refresh file list
            return templates.TemplateResponse(
                "index.html",
                {"request": request, "filename": filename,
                 "message": f"File '{filename}' replaced successfully.", "uploaded_files": uploaded_files},
            )
        except Exception as e:
            return templates.TemplateResponse(
                "index.html",
                {"request": request, "filename": filename, "message": f"Error replacing file: {e}"},
            )
    elif replace == "no":
        uploaded_files = get_uploaded_files()  # Refresh file list
        return templates.TemplateResponse(
            "index.html",
            {"request": request, "filename": filename,
             "message": f"No action taken for '{filename}'. File already exists.",
             "uploaded_files": uploaded_files},
        )
    else:
        return templates.TemplateResponse(
            "index.html", {"request": request, "filename": filename, "message": "Invalid response."}
        )


@app.post("/delete", response_class=RedirectResponse)
async def delete_files(request: Request):
    form = await request.form()
    files_to_delete = form.getlist("files")

    for file_to_delete in files_to_delete:
        filepath = os.path.join(UPLOAD_DIR, file_to_delete)
        try:
            os.remove(filepath)
        except Exception as e:
            print(f"Error deleting {file_to_delete}: {e}")

    # Redirect back to the index page whether or not anything was deleted
    return RedirectResponse("/", status_code=303)
```
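
The upload and delete handlers above join a client-supplied file name directly onto `UPLOAD_DIR`, so a name such as `../../etc/passwd` could point outside the upload folder. A hardened version might validate the name first. The helper below is a hypothetical sketch (`safe_upload_path` is not part of the original project), written with `pathlib` only:

```python
from pathlib import Path

UPLOAD_DIR = Path("uploads")


def safe_upload_path(client_name: str, upload_dir: Path = UPLOAD_DIR) -> Path:
    """Return a path inside upload_dir for a client-supplied file name.

    Raises ValueError when the name is empty or would escape upload_dir.
    """
    # Browsers may send Windows-style paths, so normalise separators first,
    # then keep only the final path component.
    name = Path(client_name.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError(f"invalid file name: {client_name!r}")
    candidate = (upload_dir / name).resolve()
    # Defence in depth: the resolved path must still sit under upload_dir
    # (this also rejects symlinks that point elsewhere).
    if upload_dir.resolve() not in candidate.parents:
        raise ValueError(f"file name escapes upload directory: {client_name!r}")
    return candidate
```

In the handlers, `os.path.join(UPLOAD_DIR, filename)` could then be swapped for `safe_upload_path(filename)`, with a `ValueError` turned into an error message for the user.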

Index.html

```html
{# index.html #}
<!DOCTYPE html>
<html>
<head>
    <title>File Upload</title>
    <style>
        body {
            font-family: sans-serif;
            background-color: #f4f4f4;
            color: #333;
            margin: 20px;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center content horizontally */
        }
        h1 {
            color: #007bff; /* Blue heading */
            margin-bottom: 20px;
        }
        form {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            margin-bottom: 20px;
            width: 400px; /* Set a fixed width for the form */
        }
        input[type="file"] {
            margin-bottom: 10px;
        }
        input[type="submit"] {
            background-color: #007bff;
            color: #fff;
            padding: 10px 15px;
            border: none;
            border-radius: 4px;
            cursor: pointer;
        }
        input[type="submit"]:hover {
            background-color: #0056b3;
        }
        h2 {
            margin-top: 20px;
            color: #343a40; /* Darker heading */
        }
        ul {
            list-style: none;
            padding: 0;
        }
        li {
            margin-bottom: 5px;
            display: flex; /* Align checkbox and label */
            align-items: center;
        }
        input[type="checkbox"] {
            margin-right: 5px;
        }
        p {
            color: #d9534f; /* Red message for errors or feedback */
            margin-top: 10px;
        }
        .uploaded-file-list { /* Style the uploaded files list */
            background-color: #fff;
            padding: 15px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            width: 400px; /* Match the form width */
        }
    </style>
    <script>
        function validateForm() {
            const fileInput = document.querySelector('input[type="file"]');
            if (fileInput.files.length === 0) {
                alert("No files selected!");
                return false; // Prevent form submission
            }
            return true; // Allow form submission
        }

        function validateDeleteForm() {
            const checkboxes = document.querySelectorAll('input[type="checkbox"]:checked');
            if (checkboxes.length === 0) {
                alert("No files selected for deletion!");
                return false;
            }
            return true;
        }
    </script>
</head>
<body>
    <h1>Upload a File</h1>
    <form action="/upload" method="post" enctype="multipart/form-data" onsubmit="return validateForm();">
        <input type="file" name="file">
        <input type="submit" value="Upload">
    </form>

    {% if filename %}
    <h2>Uploaded File: {{ filename }}</h2>
    {% endif %}

    {% if message %}
    <p>{{ message }}</p>
    {% endif %}

    <div class="uploaded-file-list">
        <h2>Uploaded Files:</h2>
        <form action="/delete" method="post">
            <ul>
                {% for file in uploaded_files %}
                <li>
                    <input type="checkbox" name="files" value="{{ file }}" id="{{ file }}">
                    <label for="{{ file }}">{{ file }}</label>
                </li>
                {% endfor %}
            </ul>
            <input type="submit" value="Delete Selected">
        </form>
    </div>
</body>
</html>
```

Confirm.html

```html
{# confirm.html #}
<!DOCTYPE html>
<html>
<head>
    <title>Confirm Replace</title>
    <style>
        body {
            font-family: sans-serif;
            background-color: #f4f4f4;
            color: #333;
            margin: 20px;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center content horizontally */
        }
        h1 {
            color: #d9534f; /* Red heading for warning */
            margin-bottom: 20px;
        }
        p {
            margin-bottom: 20px;
        }
        form {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            width: 400px; /* Set a fixed width for the form */
        }
        input[type="file"] {
            margin-bottom: 10px;
            width: calc(100% - 10px); /* Ensures the file input doesn't overflow */
        }
        label {
            margin-right: 10px; /* Space between radio button and label */
        }
        input[type="radio"] {
            margin-right: 5px;
        }
        input[type="submit"] {
            background-color: #007bff;
            color: #fff;
            padding: 10px 15px;
            border: none;
            border-radius: 4px;
            cursor: pointer;
            margin-top: 10px; /* Space above the button */
        }
        input[type="submit"]:hover {
            background-color: #0056b3;
        }
    </style>
</head>
<body>
    <h1>File Already Exists</h1>
    <p>The file '{{ filename }}' already exists. Do you want to replace it?</p>
    <form action="/confirm_replace" method="post" enctype="multipart/form-data">
        <input type="hidden" name="filename" value="{{ filename }}">
        <input type="file" name="file" required><br>
        <input type="radio" id="yes" name="replace" value="yes" required>
        <label for="yes">Yes</label>
        <input type="radio" id="no" name="replace" value="no">
        <label for="no">No</label><br>
        <input type="submit" value="Confirm">
    </form>
</body>
</html>
```

The Dockerfile below builds an image for the application.

```dockerfile
# Use a Python base image
FROM python:3.11-slim-buster

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file (if you have one)
# --- Create this file if you use external packages
COPY requirements.txt .

# Install dependencies from requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Or install dependencies directly (if you don't have a requirements.txt file)
# RUN pip install --no-cache-dir fastapi uvicorn Jinja2 python-multipart

# Copy the application code
COPY . .

# Expose the port that Uvicorn will run on
EXPOSE 8000

# Start the Uvicorn server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

And a sample YAML file for the deployment part (it does not represent the actual cluster).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fastapi-deployment
  namespace: files  # Deploy to the "files" namespace
spec:
  replicas: 3  # Number of pods (adjust as needed)
  selector:
    matchLabels:
      app: my-fastapi-app
  template:
    metadata:
      labels:
        app: my-fastapi-app
    spec:
      containers:
        - name: my-fastapi-container
          image: my-fastapi-image:latest  # Replace with your Docker image name and tag
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: uploads-volume
              mountPath: /app/uploads  # Mount the volume to the uploads directory
          resources:  # Resource requests and limits
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: uploads-volume
          persistentVolumeClaim:  # Use a PersistentVolumeClaim for persistent storage
            claimName: my-fastapi-pvc  # Create this PVC separately
---
apiVersion: v1
kind: Service
metadata:
  name: my-fastapi-service
  namespace: files
spec:
  selector:
    app: my-fastapi-app
  ports:
    - protocol: TCP
      port: 80  # External port
      targetPort: 8000  # Container port
  type: LoadBalancer  # Use a LoadBalancer to expose the service externally
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-fastapi-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ]  # Or ReadWriteMany if needed
  resources:
    requests:
      storage: 1Gi  # Adjust storage size as needed
```

File uploading using Streamlit

However, a framework like Streamlit turned out to be handier and easier to deploy as a containerized application in a cluster-based deployment.

```python
from pathlib import Path

import streamlit as st

UPLOAD_DIR = Path("uploads")  # Use Path for better path handling
UPLOAD_DIR.mkdir(exist_ok=True)  # Create uploads directory if it doesn't exist


def get_uploaded_files():
    return sorted([f.name for f in UPLOAD_DIR.iterdir()])


st.title("File Upload and Management")

uploaded_file = st.file_uploader("Choose a file", type=None)  # Allow any file type

if uploaded_file is not None:
    filepath = UPLOAD_DIR / uploaded_file.name
    if filepath.exists():
        # Default to "No" so an existing file is not overwritten before the user chooses
        replace = st.radio(
            f"File '{uploaded_file.name}' already exists. Replace?", ("Yes", "No"), index=1
        )
        if replace == "Yes":
            with open(filepath, "wb") as f:
                f.write(uploaded_file.getbuffer())
            st.success(f"File '{uploaded_file.name}' replaced successfully.")
        else:
            st.info(f"No action taken for '{uploaded_file.name}'. File already exists.")
    else:
        with open(filepath, "wb") as f:
            f.write(uploaded_file.getbuffer())
        st.success(f"File '{uploaded_file.name}' uploaded successfully.")

st.subheader("Uploaded Files:")
uploaded_files = get_uploaded_files()

if uploaded_files:
    for file in uploaded_files:
        if st.checkbox(file):  # Checkbox for each file
            if st.button(f"Delete {file}"):  # Delete button next to checkbox
                try:
                    (UPLOAD_DIR / file).unlink()  # Delete the file
                except Exception as e:
                    st.error(f"Error deleting '{file}': {e}")
                else:
                    # Refresh the app to reflect changes
                    # (st.rerun replaces the deprecated st.experimental_rerun)
                    st.rerun()
else:
    st.info("No files uploaded yet.")
```
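
The replace prompt above is one way to handle a name clash. An alternative, not part of the original project, would be to keep both files by picking a free name. The helper below is a hypothetical sketch of that idea using only `pathlib`:

```python
from pathlib import Path


def next_available_path(directory: Path, filename: str) -> Path:
    """Return directory/filename, or a numbered variant if that name is taken.

    Note: there is a small window between this check and the actual write,
    so this is only suitable for a single-user helper tool.
    """
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

With this approach the upload branch would write to `next_available_path(UPLOAD_DIR, uploaded_file.name)` and skip the radio prompt entirely.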

Building a container for the code above!

```dockerfile
# Use a Python base image
FROM python:3.11-slim-buster

# Set the working directory
WORKDIR /app

# Copy requirements.txt (recommended)
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the Streamlit port (default is 8501)
EXPOSE 8501

# Run Streamlit (replace main_st.py with your Streamlit file name).
# The comment sits on its own line: trailing text after an exec-form CMD breaks its parsing.
CMD ["streamlit", "run", "main_st.py"]
```

Sample Docling application using the Streamlit framework

Below is some starter code that serves as a helper for a Docling web-based application.

```python
import json
import logging
import time
from pathlib import Path
import os
import shutil  # For copying directories

import streamlit as st
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

# Define the mount paths
KUBERNETES_VOLUME_MOUNT_PATH = "/app/uploads"
SCRATCH_VOLUME_MOUNT_PATH = "/app/scratch"


def process_pdf(input_doc_path, scratch_dir, pipeline_options):
    """Processes a single PDF file."""
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    try:
        conv_result = doc_converter.convert(input_doc_path)
        doc_filename = conv_result.input.file.stem

        # Export the converted document in several formats
        with (scratch_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
            json.dump(conv_result.document.export_to_dict(), fp)

        with (scratch_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_text())

        with (scratch_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_markdown())

        with (scratch_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_document_tokens())

        return True  # Indicate success

    except Exception as e:
        st.error(f"Error processing {input_doc_path}: {e}")
        return False  # Indicate failure


def main():
    logging.basicConfig(level=logging.INFO)

    st.title("Docling Document Conversion")

    # Kubernetes volume directory
    kubernetes_volume_dir = Path(KUBERNETES_VOLUME_MOUNT_PATH)
    if not kubernetes_volume_dir.exists():
        st.error(f"Kubernetes volume not found at {KUBERNETES_VOLUME_MOUNT_PATH}")
        return

    # Scratch directory
    scratch_dir = Path(SCRATCH_VOLUME_MOUNT_PATH)
    scratch_dir.mkdir(parents=True, exist_ok=True)

    # ... (pipeline options, OCR language, number of threads - same as before)
    # ... (Make sure pipeline_options is defined here)

    if st.button("Convert Documents in Volume"):
        with st.spinner("Converting documents..."):
            start_time = time.time()
            success_count = 0
            fail_count = 0

            for file_path in kubernetes_volume_dir.rglob("*.pdf"):  # Recursive search for PDFs
                if process_pdf(file_path, scratch_dir, pipeline_options):
                    success_count += 1
                else:
                    fail_count += 1

            elapsed = time.time() - start_time

            st.write(f"Conversion completed in {elapsed:.2f} seconds.")
            st.write(f"Successfully converted {success_count} PDFs.")
            st.write(f"Failed to convert {fail_count} PDFs.")
            st.write(f"Files saved to {SCRATCH_VOLUME_MOUNT_PATH}")


if __name__ == "__main__":
    main()
```
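
One detail worth noting in the code above: `process_pdf` names its exports after the PDF's stem, while `main()` searches the volume recursively, so two PDFs with the same name in different sub-folders would write to the same output files. One possible way around this, sketched here with hypothetical names (`output_paths`, `EXPORT_SUFFIXES`) and no Docling dependency, is to mirror the source folder structure under the scratch directory:

```python
from pathlib import Path

# The four export formats written by process_pdf
EXPORT_SUFFIXES = (".json", ".txt", ".md", ".doctags")


def output_paths(pdf_path: Path, uploads_root: Path, scratch_root: Path) -> dict[str, Path]:
    """Map one input PDF to its export files, keeping sub-folders.

    uploads/a/report.pdf -> scratch/a/report.json, scratch/a/report.md, ...
    """
    relative = pdf_path.relative_to(uploads_root)
    base = scratch_root / relative.parent / relative.stem
    # with_name (rather than with_suffix) keeps stems that contain dots, e.g. "v1.2"
    return {suffix: base.with_name(base.name + suffix) for suffix in EXPORT_SUFFIXES}
```

A caller would create each target's parent folder with `mkdir(parents=True, exist_ok=True)` before opening the file for writing.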

And a Dockerfile to build an image.

```dockerfile
FROM python:3.11-slim-buster

WORKDIR /app

# Create a requirements.txt with docling and its dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["streamlit", "run", "Docling_st.py"]
```


A YAML helper in case the Docling application is deployed inside a cluster later (for the time being it is a serverless test application).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-deployment
  namespace: files  # Deploy to the same "files" namespace
spec:
  replicas: 1  # Adjust as needed
  selector:
    matchLabels:
      app: docling-app
  template:
    metadata:
      labels:
        app: docling-app
    spec:
      containers:
        - name: docling-container
          image: docling-image:latest  # Replace with your Docling Docker image
          ports:
            - containerPort: 8501  # Streamlit default port
          volumeMounts:
            - name: scratch-volume
              mountPath: /app/scratch  # Mount the scratch volume
            - name: uploads-volume  # Mount the existing uploads volume
              mountPath: /app/uploads  # Or another suitable path
          resources:
            requests:
              cpu: 200m  # Adjust as needed
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
      volumes:
        - name: scratch-volume
          persistentVolumeClaim:
            claimName: docling-pvc  # Create this PVC separately
        - name: uploads-volume  # Use the existing uploads volume
          persistentVolumeClaim:
            claimName: my-fastapi-pvc  # The existing PVC
---
apiVersion: v1
kind: Service
metadata:
  name: docling-service
  namespace: files
spec:
  selector:
    app: docling-app
  ports:
    - protocol: TCP
      port: 8501  # External port
      targetPort: 8501  # Container port
  type: LoadBalancer  # Or ClusterIP if internal access is sufficient
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: docling-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ]  # Or ReadWriteMany if needed
  resources:
    requests:
      storage: 1Gi  # Adjust storage size as needed
```

Conclusion

The code samples provided here are building blocks for a web-based application that prepares a file repository/volume containing document types such as images, PDFs, and Word files. Docling ingests these documents and converts them to Markdown files, which makes them ready for a generative AI application.
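
The Docling sample in this post only searches for `*.pdf` files. To cover the other document types mentioned here, the file discovery step could be widened along these lines. This is a sketch only: `iter_documents` is a hypothetical helper and the extension set is an assumption to be adjusted to the formats actually enabled in Docling.

```python
from pathlib import Path
from typing import Iterator

# Assumed set of extensions; adjust to the formats you actually want to ingest
DOCUMENT_SUFFIXES = frozenset({".pdf", ".docx", ".png", ".jpg", ".jpeg"})


def iter_documents(root: Path, suffixes: frozenset[str] = DOCUMENT_SUFFIXES) -> Iterator[Path]:
    """Yield files under root whose extension is in suffixes (case-insensitive)."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path
```

The conversion loop would then iterate over `iter_documents(kubernetes_volume_dir)` instead of `rglob("*.pdf")`.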

Again, this is not an end-to-end project, but portions of code to be enhanced, industrialized and deployed.

Thanks for reading 🤟
