Sparknlp-1174 Adding Partitioning Documents Feature #14579

danilojsl · 2025-05-16T13:54:30Z

Description

This PR introduces the new Partition component and PartitionTransformer annotator to spark-nlp, providing a streamlined and user-friendly interface for interacting with Spark NLP readers. It also extends the configurability of these readers by adding support for additional options.

Key Improvements:

Simplifies integration with Spark NLP readers through a unified interface
Adds flexibility by enabling more reader-specific configurations
Enhances the maintainability and scalability of data loading workflows

Motivation and Context

The current approach to interfacing with Spark NLP readers is fragmented and lacks flexibility, often requiring custom code for handling various input sources and options. This makes onboarding harder for new users and hinders reuse across pipelines.

The Partition component abstracts these complexities by:

Unifying access patterns for multiple readers
Reducing boilerplate code in reader configuration
Making it easier to scale and switch between different data sources
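
The unified-access idea above can be illustrated with a small, self-contained sketch. This is not the Spark NLP API: `partition`, `READERS`, the `read_*` functions, and the `include_slide_notes` keyword are hypothetical names chosen for illustration only. The sketch assumes one entry point that picks a format-specific reader from the file extension and forwards any reader-specific options unchanged.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins for format-specific readers (NOT Spark NLP classes).
# Each one returns a list of "elements" describing what it would have parsed.
def read_text(source: str, **options) -> List[dict]:
    return [{"reader": "text", "source": source, "options": options}]

def read_html(source: str, **options) -> List[dict]:
    return [{"reader": "html", "source": source, "options": options}]

def read_powerpoint(source: str, **options) -> List[dict]:
    return [{"reader": "powerpoint", "source": source, "options": options}]

# Registry mapping a file extension to the reader that handles it.
READERS: Dict[str, Callable[..., List[dict]]] = {
    ".txt": read_text,
    ".html": read_html,
    ".pptx": read_powerpoint,
}

def partition(source: str, extension: Optional[str] = None, **options) -> List[dict]:
    """Single entry point: choose a reader and pass reader-specific options through."""
    key = (extension or Path(source).suffix).lower()
    if key not in READERS:
        raise ValueError(f"No reader registered for {key!r}")
    return READERS[key](source, **options)

# Same call shape for every format; only the options differ.
elements = partition("deck.pptx", include_slide_notes=True)
print(elements[0]["reader"])
```

With a registry like this, switching data sources means changing the argument rather than the calling code, which is the kind of boilerplate reduction the list above describes.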

How Has This Been Tested?

Screenshots (if appropriate):

Local Tests
Google Colab Notebooks
Databricks Notebooks

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

coveralls · 2025-05-24T16:11:30Z

Pull Request Test Coverage Report for Build 15257196653

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

550 of 666 (82.58%) changed or added relevant lines in 25 files are covered.
67 unchanged lines in 49 files lost coverage.
Overall coverage increased (+0.8%) to 53.543%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/main/scala/com/johnsnowlabs/partition/HasEmailReaderProperties.scala	4	5	80.0%
src/main/scala/com/johnsnowlabs/partition/HasPowerPointProperties.scala	4	5	80.0%
src/main/scala/com/johnsnowlabs/reader/util/HasPdfProperties.scala	28	29	96.55%
src/main/scala/com/johnsnowlabs/partition/HasExcelReaderProperties.scala	7	9	77.78%
src/main/scala/com/johnsnowlabs/reader/EmailReader.scala	7	9	77.78%
src/main/scala/com/johnsnowlabs/partition/HasHTMLReaderProperties.scala	5	8	62.5%
src/main/scala/com/johnsnowlabs/reader/TextReader.scala	20	23	86.96%
src/main/scala/com/johnsnowlabs/reader/WordReader.scala	22	25	88.0%
src/main/scala/com/johnsnowlabs/reader/util/TextParser.scala	24	27	88.89%
src/main/scala/com/johnsnowlabs/partition/HasReaderProperties.scala	25	29	86.21%

Files with Coverage Reduction	New Missed Lines	%
src/main/scala/com/johnsnowlabs/nlp/annotators/cleaners/Extractor.scala	1	93.66%
src/main/scala/com/johnsnowlabs/nlp/annotators/common/TableData.scala	1	85.19%
src/main/scala/com/johnsnowlabs/nlp/annotators/common/Tagged.scala	1	67.82%
src/main/scala/com/johnsnowlabs/nlp/annotators/cv/util/io/ImageIOUtils.scala	1	54.22%
src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcher.scala	1	97.25%
src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcherTranslator.scala	1	89.35%
src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcherUtils.scala	1	88.29%
src/main/scala/com/johnsnowlabs/nlp/annotators/DocumentCharacterTextSplitter.scala	1	96.72%
src/main/scala/com/johnsnowlabs/nlp/annotators/keyword/yake/YakeKeywordExtraction.scala	1	96.79%
src/main/scala/com/johnsnowlabs/nlp/annotators/MultiDateMatcher.scala	1	99.07%

Totals
Change from base Build 15207655495:	0.8%
Covered Lines:	10344
Relevant Lines:	19319
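
As a quick sanity check, the percentages in this report can be reproduced from the raw line counts. The helper below is a minimal sketch; `coverage_percent` is a hypothetical name and is not part of Coveralls or Spark NLP.

```python
def coverage_percent(covered: int, relevant: int) -> float:
    """Return covered / relevant as a percentage."""
    if relevant <= 0:
        raise ValueError("relevant must be a positive line count")
    return 100.0 * covered / relevant

# Changed or added lines in this pull request: 550 of 666.
changed = coverage_percent(550, 666)
# Whole-project totals from the report: 10344 covered of 19319 relevant.
overall = coverage_percent(10344, 19319)

print(f"changed lines: {changed:.2f}%")
print(f"overall: {overall:.3f}%")
```

Both results match the figures quoted in the report (82.58% and 53.543%) after rounding.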

💛 - Coveralls

* Update conda meta.yaml for 6.0.1 [skip test]
* added documentation to file readers
* updated docs for partition class in scala and python
* fixed typos in sparkNLPReader and added documentation for Partition class
* added parameters to Partition class and updated read function docs in Readers
* updated readers documentation with ipynb path
* updated partition description
* fixed errors in email readers
* fixed errors in email readers
* added docs for partition transformer and pdf reader
* added docs for python partition transformer and pdf reader
* added docs for python partition transformer and pdf reader
* updated docs to render partition and reader
* reverted changes in init.py
* reverted changes in imports
* updated formatting docs for pdf reader
* updated formatting of docs for spark nlp reader
* updated formatting of docs for partition
* updated formatting of docs for partition_transformer
* updating links to notebooks and partition transformer description

---------

Co-authored-by: Devin Ha <devin@trungducha.de>
Co-authored-by: Paulami Bhattacharya <paulamibhattacharya@Paulamis-MacBook-Pro.local>

danilojsl added 13 commits May 14, 2025 18:14

[SPARKNLP-1113] Adding Partition feature

11001bf

[SPARKNLP-1118] Adding headers, ssl-verify, request timeout, page breaks and infer table options

cebc516

[SPARKNLP-1116] Adding groupBrokenParagraphs option

1faa91d

[SPARKNLP-1116] Adding includeSlideNotes option

e98dd26

[SPARKNLP-1116] Adding findSubtable option

4dec816

[SPARKNLP-1116] Adding findSubtable option in SparkNLPReader

edfa581

[SPARKNLP-1116] Renaming findSubtable to appendCells option in SparkNLPReader

38bae06

[SPARKNLP-1116] Handling headers null issue in SparkNLPReader for Python side

4e9cd76

[SPARKNLP-1116] Refactoring parameters spark-nlp reader getters to match Python convention

f1ea42f

[SPARKNLP-1116] Adding Partitioning demo notebook

c4ab336

[SPARKNLP-1174] Adding PartitionTransformer

058a8a5

[SPARKNLP-1174] Adding missing unit tests in readers

48cf2b3

[SPARKNLP-1174] Moving PDF parameters to HasPdfProperties

5b0c581

danilojsl self-assigned this May 16, 2025

danilojsl added the new-feature label (Introducing a new feature) May 16, 2025

danilojsl requested review from DevinTDHa and maziyarpanahi May 16, 2025 13:54

danilojsl added 2 commits May 16, 2025 14:23

[SPARKNLP-1174] Adding validation for partition URL content

ce4b9fb

[SPARKNLP-1174] Formatting modified files

fcbf30f

DevinTDHa changed the base branch from master to release/602-release-candidate May 23, 2025 14:50

DevinTDHa reviewed May 23, 2025

src/main/scala/com/johnsnowlabs/reader/SparkNLPReader.scala

danilojsl added 4 commits May 24, 2025 10:43

[SPARKNLP-1174] Fix reading as text file content

953e03d

[SPARKNLP-1174] Adding PartitionTransformer demo notebook [skip test]

aaa342b

[SPARKNLP-1174] Updates PartitionTransformer demo notebook [skip test]

367504f

[SPARKNLP-1174] Updates PartitionTransformer file link [skip test]

3846eaf

DevinTDHa approved these changes May 26, 2025

DevinTDHa merged commit f13258d into release/602-release-candidate May 26, 2025
4 checks passed

DevinTDHa mentioned this pull request May 26, 2025

Release Spark NLP 6.0.2 #14583

Merged

10 tasks

5 participants
