@youngsofun youngsofun commented May 12, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR re-introduces the streaming load API.

Usage

Minimal Example

curl -H "sql: insert into my_table file_format = (type = CSV)" \
     -F "upload=@./data/a.csv" \
     -u db_user:db_pwd \
     "http://localhost:8000/v1/streaming_load"

Notes:

  1. Authentication is required via:
    • HTTP Basic Auth (-u username:password)
    • Or Authorization header
  2. The sql header is required. It currently supports only statements of the form:
    insert into <table> file_format=(...)
    Supported file formats: CSV, TSV, NDJSON (same syntax as COPY INTO)
  3. The multipart form field name must be upload
  4. The insert operation is atomic
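
For clients that do not shell out to curl, the four notes above can be sketched in Python with only the standard library. This is an illustration, not part of the PR: the helper name `build_streaming_load_request` is hypothetical, and the URL, table name, and credentials are copied from the minimal example. The request is only built here, not sent, so the sketch runs without a server.

```python
import base64
import uuid
import urllib.request


def build_streaming_load_request(url, sql, files, user, password):
    """Build (but do not send) a multipart POST for /v1/streaming_load.

    `files` is a list of (filename, bytes) pairs. Each pair becomes a form
    field named "upload", which is the field name the endpoint requires.
    This helper is a hypothetical illustration, not Databend client code.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="upload"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    body = b"".join(parts)

    # HTTP Basic Auth, equivalent to curl's -u user:password
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "sql": sql,  # the required sql header
            "Authorization": f"Basic {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


req = build_streaming_load_request(
    "http://localhost:8000/v1/streaming_load",
    "insert into my_table file_format = (type = CSV)",
    [("a.csv", b"1,foo\n2,bar\n")],
    "db_user",
    "db_pwd",
)
```

Passing `req` to `urllib.request.urlopen` would send it. Appending more `(filename, bytes)` pairs to `files` adds more `upload` parts to the same request.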

Response Format

On Success:

{ "id": "27262da8-f667-4502-9420-2dead99b8723", "stats": { "rows": 10, "bytes": 1000 } }
  • id: Query ID (same as /v1/query API), auto-generated if not specified
  • stats: Number of rows and bytes written

On Failure:

{ "error": { "code": 1001, "message": "some error message" } }
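
A client might handle the two response shapes above as follows. This is a minimal sketch that assumes the body is exactly one of the two JSON documents shown. `StreamingLoadError` and `parse_streaming_load_response` are hypothetical names introduced for illustration.

```python
import json


class StreamingLoadError(Exception):
    """Raised when the response body carries an "error" object."""

    def __init__(self, code, message):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message


def parse_streaming_load_response(body):
    """Return (query_id, rows, bytes_written), or raise StreamingLoadError."""
    payload = json.loads(body)
    error = payload.get("error")
    if error:
        raise StreamingLoadError(error["code"], error["message"])
    stats = payload["stats"]
    return payload["id"], stats["rows"], stats["bytes"]


ok_body = '{ "id": "27262da8-f667-4502-9420-2dead99b8723", "stats": { "rows": 10, "bytes": 1000 } }'
query_id, rows, bytes_written = parse_streaming_load_response(ok_body)
```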

Advanced Example

curl -H "x-databend-query-id: some-uuid" \
     -H "sql: insert /*+ set_var(deduplicate_label='20250511') */ into t1 file_format = (type = CSV)" \
     -F "upload=@./data/0001.csv" \
     -F "upload=@./data/0002.csv" \
     -u root: \
     "http://localhost:8000/v1/streaming_load"

Response:

{ "id": "some-uuid", "stats": { "rows": 0, "bytes": 0 } }

Advanced Features:

  1. Custom query ID via x-databend-query-id header
  2. Settings configuration via SQL hints
  3. Deduplication support:
    • When deduplicate_label is set, duplicate operations will succeed but report 0 rows/bytes
  4. Multiple file uploads in a single request
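
The header values used in the advanced example can be assembled programmatically. The sketch below uses two hypothetical helpers, `streaming_load_sql` and `streaming_load_headers`. It does no quoting or escaping, so it assumes the table name and label come from trusted input.

```python
def streaming_load_sql(table, file_format="CSV", deduplicate_label=None):
    """Build the value of the sql header, optionally with a set_var hint."""
    hint = ""
    if deduplicate_label is not None:
        # Settings are passed as SQL hints rather than as separate headers.
        hint = f"/*+ set_var(deduplicate_label='{deduplicate_label}') */ "
    return f"insert {hint}into {table} file_format = (type = {file_format})"


def streaming_load_headers(sql, query_id=None):
    """Return the request headers; the query ID header is optional."""
    headers = {"sql": sql}
    if query_id is not None:
        headers["x-databend-query-id"] = query_id
    return headers


sql = streaming_load_sql("t1", deduplicate_label="20250511")
headers = streaming_load_headers(sql, query_id="some-uuid")
```

If the same `deduplicate_label` is sent again, the request is expected to succeed with 0 rows and 0 bytes in `stats`, as described in item 3 above.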

Changes from Original Implementation

  1. Simplified Response:

    • Removed unnecessary files and state fields
    • stats now shows write_progress instead of scan_progress
  2. Improved Design:

    • Uses sql header instead of insert_sql for future DML compatibility
    • Response ID now matches query ID
    • Enhanced error handling
  3. Settings Management:

    • Most settings configured via SQL hints
    • Header-based settings removed (may be re-added later as a single settings header)

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) May 12, 2025
@youngsofun youngsofun marked this pull request as draft May 12, 2025 08:46
@youngsofun youngsofun requested review from b41sh and sundy-li May 12, 2025 09:53
@youngsofun youngsofun marked this pull request as ready for review May 12, 2025 09:53
@youngsofun youngsofun merged commit 768c1b2 into databendlabs:main May 13, 2025
76 checks passed
3 participants