Deep Dive into Databend UDFs: Implementing Your Data Solutions with Python, WASM, and Beyond

"How do we extract/summarize/complete/validate our data using LLM functions?"

"Can we detect anomalies in this time-series data using our existing proprietary functions?"

"We need to analyze sentiment in these customer reviews, but there's nothing built-in for that."

"How do we connect our gaming matchmaking system to our analytics pipeline?"

If these questions sound familiar, you're not alone. As data teams scale their analytics capabilities, they inevitably hit the limitations of built-in database functions. That's where Databend's User-Defined Functions (UDFs) come in – and they're transforming how we approach custom data processing.

Why UDFs Matter for Modern Data Teams

From financial services to e-commerce to gaming, organizations are leveraging UDFs to:

  • Embed domain expertise and proprietary business logic directly into data pipelines
  • Process data where it lives by eliminating unnecessary data movement while maintaining security and compliance
  • Connect to specialized services – enabling companies to use the same algorithms in both operations and analytics
  • Extend analytics capabilities without waiting for feature requests

Databend's UDF Ecosystem: Power Meets Flexibility

What makes Databend's approach to UDFs particularly powerful is its multi-language support and deployment options:

  • Lambda UDFs: Quick SQL expressions for simple transformations
  • Python & JavaScript UDFs: Familiar languages for complex logic and ML
  • WASM UDFs: Near-native performance for compute-intensive operations
  • External UDF Servers: Scalable microservices for enterprise workloads

In the following sections, we'll explore each approach with practical examples and performance insights to help you choose the right tool for your specific needs.

Lambda UDFs: SQL-Powered Simplicity

Lambda UDFs are the simplest form of user-defined functions in Databend, allowing you to define custom expressions directly in SQL. Think of them as SQL "shorthand" that encapsulates frequently used logic into reusable components.

When to Use Lambda UDFs

Lambda UDFs excel in scenarios where you need:

  • Data cleansing and standardization across multiple queries
  • Simple calculations that combine several built-in functions
  • Text manipulation and pattern matching beyond basic SQL capabilities

For data engineers working with messy data sources or implementing business rules, Lambda UDFs let you define the logic once as a lambda expression and reuse it everywhere, ensuring consistency and reducing query complexity.

Syntax and Examples

-- Simple data cleaning function
CREATE FUNCTION clean_phone AS (s) -> regexp_replace(s, '[^0-9]', '');

-- More complex example with multiple parameters and logic
CREATE FUNCTION calculate_discount AS (base_price, customer_tier, purchase_history) ->
    CASE
        WHEN customer_tier = 'premium' AND purchase_history > 10000 THEN base_price * 0.85
        WHEN customer_tier = 'premium' THEN base_price * 0.90
        WHEN purchase_history > 5000 THEN base_price * 0.95
        ELSE base_price
    END;

-- Using the functions in a query
SELECT
    customer_id,
    clean_phone(raw_phone_number) AS standardized_phone,
    calculate_discount(item_price, customer_tier, ytd_purchases) AS discounted_price
FROM customer_orders;
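
Once created, lambda UDFs are managed like any other database object. Here's a minimal sketch of the lifecycle commands (the regexp tweak in the redefinition is purely illustrative):

-- List the UDFs currently registered
SHOW USER FUNCTIONS;

-- Redefine a function in place (here: also keep a leading '+' for international numbers)
CREATE OR REPLACE FUNCTION clean_phone AS (s) -> regexp_replace(s, '[^0-9+]', '');

-- Drop a function you no longer need
DROP FUNCTION calculate_discount;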

UDF Scripts: Enhancing Your Data with Python, JavaScript, and WASM

While Lambda UDFs are perfect for simple SQL expressions, more complex data processing often requires the full power of programming languages. Databend's UDF Script capability allows you to embed Python and JavaScript code directly into your data pipeline, or compile Rust, Go, C++, or Zig to WebAssembly, bringing sophisticated algorithms right to where your data lives.

Python UDFs: A Taste of Data Science

Python UDFs unlock the vast ecosystem of Python libraries for data science, machine learning, and AI. This is particularly valuable when you need to apply complex algorithms or connect your data natively to external services.

Sample use case: E-Commerce Order Validation

An e-commerce retailer wants to ensure that shipping addresses entered by customers are clean and valid before fulfillment. The data team automates the process of standardizing address strings and flagging obviously invalid entries (such as addresses missing a street number).

To achieve this, the data engineer writes a Python UDF in Databend (using only Python’s standard library). The UDF receives a raw address string, trims whitespace, converts it to title case, and checks if it contains at least one digit (to indicate a street number). It returns a standardized address if valid, or NULL otherwise.

CREATE FUNCTION standardize_address(STRING)
RETURNS STRING
LANGUAGE python
HANDLER = 'clean_address'
AS $$
def clean_address(address):
    if not address or not isinstance(address, str):
        return None
    cleaned = address.strip().title()
    # Check for at least one digit (street number)
    if any(char.isdigit() for char in cleaned):
        return cleaned
    return None
$$;

-- Use the UDF to clean and validate addresses in orders
SELECT
    order_id,
    customer_id,
    standardize_address(shipping_address) AS standardized_address
FROM orders
WHERE standardize_address(shipping_address) IS NOT NULL;
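
Before wiring the UDF into a pipeline, it's worth a quick sanity check against literal inputs. Based on the handler above, a messy-but-valid address should come back trimmed and title-cased, while an address without a street number should come back NULL:

-- Quick sanity check with literal inputs
SELECT
    standardize_address('  123 main st, springfield  ') AS valid_address,   -- '123 Main St, Springfield'
    standardize_address('main street')                  AS missing_number;  -- NULL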

JavaScript UDFs: Lightweight Data Wrangling with Security

JavaScript UDFs in Databend provide a secure, sandboxed environment for data transformations. They are particularly well-suited for handling semi-structured data like JSON, offering a balance of flexibility, security, and performance for common data preparation tasks.

Sample use case: E-commerce Event Log Processing

An e-commerce platform captures raw user interaction events as JSON objects. Before this data can be used for analytics, the data engineering team needs to clean, enrich, and standardize it. Specifically, they need to:

  1. Redact PII: Remove sensitive information like IP addresses.
  2. Enrich Data: Add a processing timestamp to each event.
  3. Standardize Fields: Ensure certain fields, like country codes, adhere to a consistent format.

To handle this, they implement a JavaScript UDF in Databend.

-- This UDF processes a JSON event payload:
-- 1. Deletes the 'ip_address' field from 'user_details'.
-- 2. Adds a 'processed_at_udf' timestamp to 'metadata'.
-- 3. Converts 'country_code' in 'location_details' to uppercase.
-- 4. Adds a default 'source' if missing in 'event_properties'.
CREATE FUNCTION process_event_payload(VARIANT)
RETURNS VARIANT
LANGUAGE javascript
HANDLER = 'transform_event'
AS $$
export function transform_event(event) {
    // Ensure event is a valid object
    if (!event || typeof event !== 'object') {
        return null;
    }

    // 1. Delete PII (e.g., user's IP address)
    if (event.user_details && typeof event.user_details === 'object') {
        delete event.user_details.ip_address;
    }

    // 2. Add a new field (e.g., processing timestamp)
    if (typeof event.metadata !== 'object' || event.metadata === null) {
        event.metadata = {}; // Initialize if not an object or is null
    }
    event.metadata.processed_at_udf = new Date().toISOString();

    // 3. Update/Standardize a field (e.g., ensure country code is uppercase)
    if (event.location_details && typeof event.location_details.country_code === 'string') {
        event.location_details.country_code = event.location_details.country_code.toUpperCase();
    }

    // 4. Add a default value if a field is missing
    if (typeof event.event_properties !== 'object' || event.event_properties === null) {
        event.event_properties = {}; // Initialize if not an object or is null
    }
    if (typeof event.event_properties.source === 'undefined') {
        event.event_properties.source = 'unknown_source_js_udf';
    }

    return event;
}
$$;

-- Using the JavaScript UDF to transform raw event data
WITH raw_events AS (
    SELECT PARSE_JSON('{
        "event_id": "evt_123",
        "user_details": { "user_id": "usr_abc", "ip_address": "192.168.1.100", "email": "test@example.com" },
        "location_details": { "city": "San Francisco", "country_code": "us" },
        "event_properties": { "page_url": "/products/awesome-widget" }
    }') AS payload
    UNION ALL
    SELECT PARSE_JSON('{
        "event_id": "evt_456",
        "user_details": { "user_id": "usr_def", "ip_address": "10.0.0.5" },
        "location_details": { "city": "London", "country_code": "gb" },
        "event_properties": null
    }') AS payload
)
SELECT
    payload:event_id::STRING AS original_event_id,
    process_event_payload(payload) AS processed_payload
FROM raw_events;
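
Because the UDF returns a VARIANT, you can pull individual fields back out with the same : path syntax used above to confirm each transformation landed. A minimal sketch with a trimmed-down event:

-- Verify the transformations field by field
WITH processed AS (
    SELECT process_event_payload(PARSE_JSON('{
        "event_id": "evt_123",
        "user_details": { "user_id": "usr_abc", "ip_address": "192.168.1.100" },
        "location_details": { "city": "San Francisco", "country_code": "us" }
    }')) AS p
)
SELECT
    p:user_details:ip_address       AS redacted_ip,    -- NULL: the field was deleted
    p:location_details:country_code AS country,        -- "US"
    p:event_properties:source       AS default_source  -- "unknown_source_js_udf"
FROM processed;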

WASM UDFs: Near-Native Performance Without Sacrificing Security

For computationally intensive operations requiring high performance, WebAssembly (WASM) UDFs enable near-native execution speeds within a secure, sandboxed environment. Unlike Python and JavaScript UDFs, which interpret source code at runtime, WASM UDFs are compiled to a binary format that executes directly in the engine, both efficiently and securely.

First, set up your Rust environment, add arrow-udf as a dependency in Cargo.toml, and implement your UDF function:

// Rust code (compiled to WASM)
use arrow_udf::function;

#[function("fib(int) -> int")]
fn fib(n: i32) -> i32 {
    let (mut a, mut b) = (0, 1);
    for _ in 0..n {
        let c = a + b;
        a = b;
        b = c;
    }
    a
}

Compile your Rust code to the wasm32-wasip1 target (if the target isn't installed yet, add it first with rustup target add wasm32-wasip1):

cargo build --release --target wasm32-wasip1 

Upload the compiled WASM file (e.g., arrow_udf_example.wasm found in target/wasm32-wasip1/release/) to a Databend stage and create the function:

-- Create a stage if you don't have one
CREATE STAGE IF NOT EXISTS my_wasm_stage;

-- Upload the .wasm file to the stage (replace with your actual path and stage name)
-- Example: PUT fs:///path/to/your/project/target/wasm32-wasip1/release/arrow_udf_example.wasm @my_wasm_stage;
PUT fs:///tmp/arrow_udf_example.wasm @my_wasm_stage;

-- Create the WASM UDF
CREATE OR REPLACE FUNCTION fib_wasm (INT)
RETURNS INT
LANGUAGE wasm
HANDLER = 'fib'
AS $$@my_wasm_stage/arrow_udf_example.wasm$$;

-- Using the WASM UDF
SELECT fib_wasm(10) AS fibonacci_number;
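
Like any scalar UDF, fib_wasm applies per row, so you can run it across a whole column rather than just literals. A quick check using the same range(...) table function the benchmark section below relies on:

-- Apply the WASM UDF over a column of inputs
SELECT n, fib_wasm(n::Int32) AS fib_n
FROM range(1, 11) t(n);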

External UDF Servers: Connect Your Data Seamlessly with Your Existing Services

If you want your computed data to interact seamlessly with your business services, External UDF servers are the way to go. These servers, communicating with Databend via the Arrow Flight protocol, allow you to decouple your UDF logic from the database and integrate with existing microservices or specialized computation engines.

For instance, one of our biggest customers in the gaming industry leverages External UDF servers to process real-time, sub-second computed data. This data is crucial for their game matching algorithms and feature engineering services, handling over 1,000 QPS with a P99 latency of less than 200ms. This showcases the power and scalability of External UDFs for demanding, low-latency applications.

Here's how you can set up and use an External UDF Server, using a Fibonacci example implemented in Python:

1. Implement and Run the External UDF Server (Python Example):

First, you'll need the databend_udf Python package. You can install it via pip:

pip install databend-udf 

Then, create a Python script (e.g., my_udf_server.py) with the following content:

from databend_udf import udf, UDFServer
import logging

logging.basicConfig(level=logging.INFO)

@udf(
    input_types=["INT"],  # Corresponds to Databend's INT type
    result_type="INT",    # Corresponds to Databend's INT type
    skip_null=True,       # Optional: if True, null inputs won't be passed to the function
)
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

if __name__ == "__main__":
    # The UDF server will listen on this address
    udf_server = UDFServer(location="0.0.0.0:8815")
    # Register the function with the server
    udf_server.add_function(fib)
    # Start the server
    udf_server.serve()

Run this Python script. It will start a server listening on 0.0.0.0:8815.

python my_udf_server.py 

2. Create the External Function in Databend:

Once your UDF server is running, you can register it in Databend:

-- Create the external function, pointing to the running server
CREATE OR REPLACE FUNCTION fib_external (val INT)
RETURNS INT
LANGUAGE python
HANDLER = 'fib'                   -- This must match the function name in your Python script
ADDRESS = 'http://0.0.0.0:8815';  -- The address of your UDF server

3. Using the External UDF:

Now you can call this function in your SQL queries just like any other UDF:

SELECT fib_external(10) AS fibonacci_number; -- Expected output: 55 
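
One detail worth knowing: the Arrow Flight protocol moves rows between Databend and the UDF server in batches, not one RPC per row, which is what keeps the per-row overhead tolerable. Applying the function over a table works exactly like the built-in UDF types:

-- Rows are shipped to the UDF server in Arrow batches, not one call per row
SELECT n, fib_external((n % 10)::Int32) AS fib_n
FROM range(1, 1001) t(n);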

Best Practices

Most Databend users don’t care about UDF theory—they care about what works, what breaks, and what saves time. After seeing dozens of teams (and making plenty of mistakes myself), here’s the honest truth:

If you can do it in SQL, do it in SQL. Lambda UDFs are instant, transparent, and you’ll thank yourself later when you need to debug at 2am. But don’t force it—once your logic starts looking like a regex contest or you’re copying the same formula everywhere, switch gears.

Python and JavaScript UDFs are the real workhorses. Python is the default for anything that smells like business logic, validation, LLM, RAG, or data wrangling. JavaScript is the secret weapon for JSON or semi-structured messes—especially if you’re coming from a web background. Both are easy to write, but don’t expect magic performance. If you’re processing millions of rows per second, you’ll hit a wall. For most ETL, though? They’re perfect.

WASM is for the performance-obsessed. If you’re building a recommendation engine, crunching numbers for a game, or your boss keeps asking “can this be faster?”—bite the bullet, learn some Rust, and ship a WASM UDF. It’s not easy, but it’s the only way to get close to native speed inside Databend. Most people never need it. If you do, you’ll know.

External UDF Servers are for teams with real infrastructure—when you already have a service, a model, or a matching engine you trust, and you want Databend to tap into it. There’s network overhead, but you get freedom: scale compute separately, deploy in your own stack, and keep business logic where it belongs. Just remember to batch requests, or you’ll be debugging latency charts all day.

We ran a basic benchmark of the average execution time per row for each Databend UDF type, using the following SQL. Keep in mind that raw speed is not the only factor when choosing a UDF type; weigh it against simplicity, ease of debugging, and ease of maintenance, and measure your own workload before deciding.

SELECT fib((n % 10)::Int32) FROM range(1, 1000000) t(n) IGNORE_RESULT; 

Here are the results:

UDF Type         Avg. Time per Row (µs)   Typical Use Case
Lambda UDF       -                        Simple transforms, prototyping
Python UDF       0.18                     Complex logic, AI integration
JavaScript UDF   2.68                     Lightweight data processing
WASM UDF         0.11                     High-performance computation
External UDF     23.2                     Large-scale, distributed workloads

Note: External UDF timing includes network overhead; actual compute time is often less.

The best teams start simple, refactor when it hurts, and always measure before optimizing. Don’t overthink it. Every UDF type has its place—what matters is using the right one for your problem, not the fanciest one in the docs.

If you’re still unsure, ask around—most Databend users are happy to share what worked (and what didn’t) in production.

Getting Started
