@danilojsl
Contributor

Description

This PR introduces the new Partition component and the PartitionTransformer annotator to Spark NLP, providing a single, streamlined interface to the Spark NLP readers. It also makes these readers more configurable by adding support for additional options.

Key Improvements:

  • Simplifies integration with Spark NLP readers through a unified interface
  • Adds flexibility by enabling more reader-specific configurations
  • Enhances the maintainability and scalability of data loading workflows

Motivation and Context

The current approach to interfacing with Spark NLP readers is fragmented and lacks flexibility, often requiring custom code for handling various input sources and options. This makes onboarding harder for new users and hinders reuse across pipelines.

The Partition component abstracts this complexity by:

  • Unifying access patterns for multiple readers
  • Reducing boilerplate code in reader configuration
  • Making it easier to scale and switch between different data sources
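As a rough illustration of the unified access pattern listed above, the sketch below dispatches on a content type and passes reader-specific options through one entry point. It is plain Python with hypothetical names (`SimplePartition`, `READERS`, `read_text`, `read_delimited`) and is not the Spark NLP Partition API; it only shows the general idea under those assumptions.

```python
from typing import Callable, Dict, List

Element = Dict[str, str]
Reader = Callable[[str, Dict[str, str]], List[Element]]


def read_text(content: str, options: Dict[str, str]) -> List[Element]:
    """Hypothetical plain-text reader: one element per separator-delimited block."""
    separator = options.get("separator", "\n\n")
    return [
        {"type": "paragraph", "content": block.strip()}
        for block in content.split(separator)
        if block.strip()
    ]


def read_delimited(content: str, options: Dict[str, str]) -> List[Element]:
    """Hypothetical delimited-text reader: one element per non-empty line."""
    delimiter = options.get("delimiter", ",")
    return [
        {"type": "row", "content": " | ".join(cell.strip() for cell in line.split(delimiter))}
        for line in content.splitlines()
        if line.strip()
    ]


# Registry mapping a content type to its reader (illustrative, not Spark NLP's).
READERS: Dict[str, Reader] = {
    "text/plain": read_text,
    "text/csv": read_delimited,
}


class SimplePartition:
    """Single entry point that hides which reader handles the input."""

    def __init__(self, content_type: str, **options: str) -> None:
        if content_type not in READERS:
            raise ValueError(f"Unsupported content type: {content_type}")
        self._reader = READERS[content_type]
        self._options = dict(options)

    def partition(self, content: str) -> List[Element]:
        # Reader-specific options are forwarded unchanged.
        return self._reader(content, self._options)


paragraphs = SimplePartition("text/plain").partition("first\n\nsecond")
rows = SimplePartition("text/csv", delimiter=";").partition("a;b\nc;d")
```

Switching data sources in this sketch means changing only the content type and its options; the calling code stays the same.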

How Has This Been Tested?

  • Local Tests
  • Google Colab Notebooks
  • Databricks Notebooks

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
@danilojsl danilojsl self-assigned this May 16, 2025
@danilojsl danilojsl added the new-feature Introducing a new feature label May 16, 2025
@DevinTDHa DevinTDHa changed the base branch from master to release/602-release-candidate May 23, 2025 14:50
@coveralls
coveralls commented May 24, 2025

Pull Request Test Coverage Report for Build 15257196653

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 550 of 666 (82.58%) changed or added relevant lines in 25 files are covered.
  • 67 unchanged lines in 49 files lost coverage.
  • Overall coverage increased (+0.8%) to 53.543%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/main/scala/com/johnsnowlabs/partition/HasEmailReaderProperties.scala | 4 | 5 | 80.0% |
| src/main/scala/com/johnsnowlabs/partition/HasPowerPointProperties.scala | 4 | 5 | 80.0% |
| src/main/scala/com/johnsnowlabs/reader/util/HasPdfProperties.scala | 28 | 29 | 96.55% |
| src/main/scala/com/johnsnowlabs/partition/HasExcelReaderProperties.scala | 7 | 9 | 77.78% |
| src/main/scala/com/johnsnowlabs/reader/EmailReader.scala | 7 | 9 | 77.78% |
| src/main/scala/com/johnsnowlabs/partition/HasHTMLReaderProperties.scala | 5 | 8 | 62.5% |
| src/main/scala/com/johnsnowlabs/reader/TextReader.scala | 20 | 23 | 86.96% |
| src/main/scala/com/johnsnowlabs/reader/WordReader.scala | 22 | 25 | 88.0% |
| src/main/scala/com/johnsnowlabs/reader/util/TextParser.scala | 24 | 27 | 88.89% |
| src/main/scala/com/johnsnowlabs/partition/HasReaderProperties.scala | 25 | 29 | 86.21% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/main/scala/com/johnsnowlabs/nlp/annotators/cleaners/Extractor.scala | 1 | 93.66% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/common/TableData.scala | 1 | 85.19% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/common/Tagged.scala | 1 | 67.82% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/cv/util/io/ImageIOUtils.scala | 1 | 54.22% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcher.scala | 1 | 97.25% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcherTranslator.scala | 1 | 89.35% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcherUtils.scala | 1 | 88.29% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DocumentCharacterTextSplitter.scala | 1 | 96.72% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/keyword/yake/YakeKeywordExtraction.scala | 1 | 96.79% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/MultiDateMatcher.scala | 1 | 99.07% |

Totals Coverage Status
Change from base Build 15207655495: 0.8%
Covered Lines: 10344
Relevant Lines: 19319
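
The percentages in this report follow from the line counts it lists. The short check below recomputes them; the helper name `coverage_percent` is illustrative only and is not part of Coveralls.

```python
def coverage_percent(covered: int, relevant: int) -> float:
    """Return covered lines as a percentage of relevant lines."""
    return 100.0 * covered / relevant


# Changed or added lines in this pull request: 550 of 666 covered.
changed = coverage_percent(550, 666)
# Whole project after the change: 10344 of 19319 relevant lines covered.
overall = coverage_percent(10344, 19319)

print(f"{changed:.2f}%")   # 82.58%
print(f"{overall:.3f}%")   # 53.543%
```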

💛 - Coveralls
  • Update conda meta.yaml for 6.0.1 [skip test]
  • added documentation to file readers
  • updated docs for partition class in scala and python
  • fixed typos in sparkNLPReader and added documentation for Partition class
  • added parameters to Partition class and updated read function docs in Readers
  • updated readers documentation with ipynb path
  • updated partition description
  • fixed errors in email readers
  • fixed errors in email readers
  • added docs for partition transformer and pdf reader
  • added docs for python partition transformer and pdf reader
  • added docs for python partition transformer and pdf reader
  • updated docs to render partition and reader
  • reverted changes in init.py
  • reverted changes in imports
  • updated formatting docs for pdf reader
  • updated formatting of docs for spark nlp reader
  • updated formatting of docs for partition
  • updated formatting of docs for partition_transformer
  • updating links to notebooks and partition transformer description

---------

Co-authored-by: Devin Ha <devin@trungducha.de>
Co-authored-by: Paulami Bhattacharya <paulamibhattacharya@Paulamis-MacBook-Pro.local>
@DevinTDHa DevinTDHa merged commit f13258d into release/602-release-candidate May 26, 2025
4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request May 26, 2025
10 tasks
Labels

new-feature Introducing a new feature

5 participants