Feat: Implement NoSQL Parser #199

trxvorr · 2025-12-13T08:35:30Z

Description

This PR introduces the NoSQL Parser, a core feature that allows users to transform nested, semi-structured NoSQL data (like JSON or MongoDB collections) into normalized, relational tables suitable for analytical storage (Parquet).

It addresses the need for a reusable, standard way to flatten complex document structures while preserving parent-child relationships through automatically generated foreign keys.

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🎨 Code style update (formatting, renaming)
♻️ Refactoring (no functional changes)
⚡ Performance improvement
✅ Test update
🔧 Configuration change
🏗️ Infrastructure/build change

Related Issue(s)

Fixes #107

Changes Made

Core Parser (src/intugle/nosql/parser.py): Implemented recursive logic to split nested lists into separate child tables and generate foreign keys (parent_table_id) to maintain relationships.
Schema Inference (src/intugle/nosql/inference.py): Added logic to scan sample data, resolve type conflicts (e.g., unifying int and str), and auto-detect primary keys (_id, uuid).
Parquet Writer (src/intugle/nosql/writer.py): Implemented ParquetTarget to persist in-memory DataFrames to disk using pyarrow.
Configuration: Added support for custom table renaming and Primary Key overrides via a config dictionary.
Dependencies: Added pandas and pyarrow as optional dependencies under the nosql extra in pyproject.toml.
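
As a rough illustration of the schema inference step listed above, the sketch below scans sample documents, widens conflicting field types to str, and picks a primary key from a list of candidate names. The names `infer_schema`, `detect_primary_key`, and `PK_CANDIDATES`, and the "any conflict becomes str" rule, are assumptions made for this example; they are not taken from src/intugle/nosql/inference.py.

```python
from __future__ import annotations

from typing import Any

# Hypothetical candidate names, following the "_id, uuid" hint in the description.
PK_CANDIDATES = ("_id", "uuid")


def infer_schema(samples: list[dict[str, Any]]) -> dict[str, str]:
    """Map each top-level field to a type name, widening conflicts to 'str'."""
    schema: dict[str, str] = {}
    for doc in samples:
        for field, value in doc.items():
            if value is None:
                # Nulls carry no type information; keep whatever was seen before.
                schema.setdefault(field, "null")
                continue
            seen = type(value).__name__
            current = schema.get(field)
            if current in (None, "null"):
                schema[field] = seen
            elif current != seen:
                # Assumed rule: any conflict (e.g. int vs str) falls back to str.
                schema[field] = "str"
    return schema


def detect_primary_key(schema: dict[str, str]) -> str | None:
    """Return the first candidate key present in the schema, if any."""
    for name in PK_CANDIDATES:
        if name in schema:
            return name
    return None
```

With two samples where `age` is an int in one and a str in the other, `infer_schema` reports `age` as `"str"`, and `detect_primary_key` returns `"_id"` when that field is present.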

Testing

Test Configuration

Python Version: 3.13.2
OS: Windows 11
LLM Provider: N/A

Test Cases

Unit tests pass locally
Manual testing completed
Tested with sample datasets

Test Commands

# Run the NoSQL-specific test suite
pytest tests/nosql/

Screenshots/Examples

from intugle.nosql.parser import NoSQLParser
from intugle.nosql.writer import ParquetTarget

data = [
    {"id": 1, "name": "Trevor", "orders": [{"order_id": 101, "total": 50}]}
]

# 1. Parse (splits into 'root' and 'root_orders')
parser = NoSQLParser()
tables = parser.parse(data)

# 2. Result is a dict of DataFrames
# tables['root'] -> id, name
# tables['root_orders'] -> order_id, total, root_id (FK)

# 3. Write to Parquet
target = ParquetTarget("output_dir")
target.write(tables)
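
To make the split into 'root' and 'root_orders' concrete, here is a minimal dependency-free sketch of the recursive idea. It returns lists of row dicts rather than DataFrames. The function `flatten`, its parameters, and the rule for choosing the foreign key value (the parent's `id` field, else the parent's row position) are assumptions for illustration, not the logic in parser.py.

```python
from __future__ import annotations

from typing import Any

Tables = dict[str, list[dict[str, Any]]]


def flatten(
    docs: list[dict[str, Any]],
    table: str = "root",
    parent: tuple[str, Any] | None = None,
    out: Tables | None = None,
) -> Tables:
    """Split lists of nested documents into child tables linked by a foreign key."""
    out = {} if out is None else out
    rows = out.setdefault(table, [])
    for doc in docs:
        row: dict[str, Any] = {}
        if parent is not None:
            fk_name, fk_value = parent
            row[fk_name] = fk_value
        children: list[tuple[str, list[dict[str, Any]]]] = []
        for key, value in doc.items():
            # Only non-empty lists of dicts become child tables in this sketch;
            # scalars, nested dicts, and other lists stay on the parent row.
            if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
                children.append((key, value))
            else:
                row[key] = value
        # Assumed key rule: use the document's "id" if present, else its row position.
        row_key = doc.get("id", len(rows))
        rows.append(row)
        for key, items in children:
            flatten(items, f"{table}_{key}", (f"{table}_id", row_key), out)
    return out
```

Run on the `data` list from the example, this yields a `root` table with `id` and `name`, and a `root_orders` table whose single row carries `order_id`, `total`, and `root_id` set to 1.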

Checklist

My code follows the code style of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings or linter errors
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published
I have updated the relevant notebooks (if applicable)
I have checked my code and corrected any misspellings

Documentation Updates

README.md updated
Docstrings added/updated
Documentation site updated (if needed)
Notebook examples updated (if applicable)
CHANGELOG updated (if applicable)

Breaking Changes

This PR introduces breaking changes
Migration guide provided (if applicable)

Performance Impact

Performance benchmarks run
No significant performance impact
Performance improvement:
Performance regression:

Additional Context

The parser uses recursive processing. Memory usage scales with the chunk size of the input data. Dependencies (pandas, pyarrow) are optional and must be installed via pip install intugle[nosql].
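
Since memory usage depends on chunk size, one way to bound it is to feed documents in fixed-size batches. The helper below is a sketch under that assumption; `iter_chunks` is a hypothetical name, and this thread does not show how the parser itself chunks its input.

```python
from __future__ import annotations

from itertools import islice
from typing import Any, Iterable, Iterator


def iter_chunks(
    docs: Iterable[dict[str, Any]], chunk_size: int
) -> Iterator[list[dict[str, Any]]]:
    """Yield successive lists of at most chunk_size documents."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(docs)
    while True:
        # islice pulls only chunk_size items, so one chunk is in memory at a time.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk is a plain list of dicts, the same shape that `parser.parse(data)` accepts in the example; how results from several chunks would be merged is left open here.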

Deployment Notes

Users must install the optional extras to use this feature:

pip install .[nosql]
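
Because pandas and pyarrow are optional, a feature like this typically checks for them up front and fails with an actionable message. The sketch below shows one way to do that with the standard library; `missing_modules` and `require_nosql_extras` are hypothetical helpers, not functions from this PR.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_nosql_extras():
    """Raise ImportError with an install hint if the optional extras are absent."""
    missing = missing_modules(("pandas", "pyarrow"))
    if missing:
        raise ImportError(
            "Missing optional dependencies: "
            + ", ".join(missing)
            + ". Install them with: pip install intugle[nosql]"
        )
```

Calling `require_nosql_extras()` before importing the parser modules would surface the install hint instead of a bare `ModuleNotFoundError`.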

raphael-intugle · 2025-12-17T21:22:11Z

Hey @trxvorr, I'm facing an issue while running the example snippet. Can you double-check?

trxvorr · 2025-12-18T00:41:38Z

@raphael-intugle I've resolved the merge conflict in pyproject.toml, so the dependencies should install correctly now.

I also realized the example snippet in the description was slightly outdated compared to the final API implementation. I've updated the PR description with the correct usage:

NoSQLParser is initialized without arguments.
data is passed directly to .parse(data).
Writing is handled via target.write(tables).

It should run perfectly now!

raphael-intugle · 2025-12-18T19:27:37Z

Great start on the NoSQL parser! The core logic for flattening nested documents and generating foreign keys looks solid, and the Parquet export is working well as verified by the tests.

However, comparing this implementation against the original feature requirements, there are several key components missing to fully complete the scope:

Connectors: The MongoSource (and the pluggable adapter interface) is not implemented yet; currently, the parser only accepts a raw list of dictionaries.
CLI & High-Level API: The CLI command (intugle nosql-to-relational) and the top-level NoSQLToRelationalParser class (for the from intugle import ... pattern) are missing.
Metadata Export: The infer_model() capability to emit a structured relational schema/relationship graph is not fully realized.

trxvorr · 2025-12-19T08:53:44Z

@raphael-intugle I've implemented the missing components you requested:

Connectors: Added MongoSource using pymongo with a pluggable NoSQLSource interface
High-Level API: Implemented NoSQLToRelationalParser in api.py as the main orchestrator
CLI: Added the intugle nosql-to-relational command
Metadata Export: infer_model() capability is available via the orchestrator

I also added unit tests for the new components (32 tests passing) and verified the CLI works locally against a MongoDB instance. Ready for another look.
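
For readers unfamiliar with the pluggable-source pattern mentioned above, here is a minimal sketch. `DocumentSource` stands in for the PR's NoSQLSource, whose actual methods are not shown in this thread; the `fetch` method, its `limit` parameter, and the in-memory `ListSource` are all assumptions for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Iterator


class DocumentSource(ABC):
    """Hypothetical stand-in for a pluggable source interface."""

    @abstractmethod
    def fetch(self, limit: int | None = None) -> Iterator[dict[str, Any]]:
        """Yield documents, stopping after `limit` if one is given."""


class ListSource(DocumentSource):
    """In-memory source, useful for tests that should not need a database."""

    def __init__(self, docs: list[dict[str, Any]]) -> None:
        self._docs = docs

    def fetch(self, limit: int | None = None) -> Iterator[dict[str, Any]]:
        # A limit plays the same role as a sampling option: cap the documents read.
        docs = self._docs if limit is None else self._docs[:limit]
        yield from docs
```

A MongoDB-backed source would implement the same method on top of a database cursor, so the orchestrator only depends on the abstract interface.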

trxvorr · 2025-12-19T08:59:09Z

To test the CLI yourself:

# Install with nosql extras
uv sync --all-extras

# View available commands
uv run intugle --help

# Test against a MongoDB instance (replace with your connection details)
uv run intugle nosql-to-relational --uri "mongodb://localhost:27017" --db your_database --collection your_collection --output ./output_parquet

# Or with sampling (fetch only 100 documents)
uv run intugle nosql-to-relational --uri "mongodb://localhost:27017" --db your_database --collection your_collection --output ./output_parquet --sample 100

Run the unit tests:

uv run pytest tests/nosql/ -v

feat: implement NoSQL Parser

b7cec6a

fix: resolve merge conflict in pyproject.toml

b0124ac

feat: implement CLI, MongoSource, and High-Level API

24ee112

style: fix ruff linting issues

61142f2
