6.0.2 #14590

DevinTDHa · 2025-05-28T15:19:59Z

DevinTDHa
May 28, 2025
Maintainer

📢 Spark NLP 6.0.2: Advancing Multimodal Capabilities and Streamlining Document Processing

We are thrilled to announce the release of Spark NLP 6.0.2! This version introduces powerful new multimodal models and significantly enhances document processing workflows. Upgrade to 6.0.2 to leverage these cutting-edge features and expand your NLP and vision task capabilities at scale.

Stay updated with our latest examples and tutorials by visiting our Medium - Spark NLP blog!

🔥 Highlights

Introducing InternVL: Support for the InternVL 2, 2.5, and 3 model series through the new InternVLForMultiModal annotator, enabling visual question answering.
Introducing Florence-2: Integration of Florence-2 through Florence2Transformer. Florence-2 is a vision foundation model for prompt-based vision and vision-language tasks such as captioning, object detection, and segmentation.
New Document Partitioning Feature: Added the Partition class and the PartitionTransformer annotator, which provide a unified and configurable interface to the Spark NLP readers and simplify loading unstructured data.

🚀 New Features & Enhancements

Advanced Multimodal Model Integrations

This release significantly boosts Spark NLP's multimodal processing power with the integration of two new visual language models:

InternVL: InternVLForMultiModal is a multimodal large language model annotator designed for visual question answering. It supports the InternVL 2, 2.5, and 3 families of models, allowing users to tackle complex visual-linguistic tasks. (Link to notebook)
Florence-2: Introducing Florence2Transformer, an annotator for the Florence-2 vision foundation model. Florence-2 uses a prompt-based approach, so a single model can perform a wide range of vision and vision-language tasks. Users select a task with a simple text prompt, for example image captioning, object detection, or image segmentation. (Link to notebook)
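To make the prompt-based approach more concrete, here is a minimal sketch of how a task is chosen through the prompt. The task tokens are the ones listed on the public Florence-2 model card; the dictionary and the `build_prompt` helper are hypothetical names used only for illustration and are not part of the Spark NLP API. How Florence2Transformer expects the prompt to be supplied is best checked in the linked notebook.

```python
# Illustrative only: task tokens as listed on the public Florence-2 model card.
# FLORENCE2_TASK_TOKENS, TASKS_WITH_TEXT_INPUT and build_prompt are hypothetical
# helpers for this sketch, not Spark NLP API.
FLORENCE2_TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Tasks whose task token is followed by free text (e.g. the phrase to locate).
TASKS_WITH_TEXT_INPUT = {"phrase_grounding", "referring_segmentation"}


def build_prompt(task: str, text: str = "") -> str:
    """Return the prompt string for a task: the task token plus optional text."""
    if task not in FLORENCE2_TASK_TOKENS:
        raise ValueError(f"Unknown task: {task!r}")
    needs_text = task in TASKS_WITH_TEXT_INPUT
    if needs_text and not text:
        raise ValueError(f"Task {task!r} requires additional text input")
    if not needs_text and text:
        raise ValueError(f"Task {task!r} does not take additional text input")
    return FLORENCE2_TASK_TOKENS[task] + text


if __name__ == "__main__":
    print(build_prompt("caption"))
    print(build_prompt("phrase_grounding", "a green car"))
```

The same image can therefore be captioned, searched for objects, or segmented just by changing the prompt string, without switching models.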

Enhanced Unstructured Document Processing

Partitioning Documents: This release introduces the new Partition class and the PartitionTransformer annotator.
- Partition provides a unified interface for extracting structured content from various document formats into Spark DataFrames. It supports input from files, URLs, in-memory strings, or byte arrays and handles formats such as text, HTML, Word, Excel, PowerPoint, emails, and PDFs. It automatically selects the appropriate reader based on file extension or MIME type and allows customization via parameters. (Link to notebook)
- PartitionTransformer exposes the Partition feature as an annotator, so it can be used inside existing Spark NLP pipelines. It extracts structured content from various document types using the Spark NLP readers, supports reading from files, URLs, in-memory strings, or byte arrays, and returns the parsed output as a structured Spark DataFrame. (Link to notebook)
Key Improvements:
- Simplifies integration with Spark NLP readers through a unified interface.
- Adds flexibility by enabling more reader-specific configurations.
- Enhances the maintainability and scalability of data loading workflows.
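The automatic reader selection described above can be pictured with a small standalone sketch. It is not Spark NLP's implementation: the reader labels, both lookup tables, the `select_reader` function, and the rule that an explicit content type takes precedence over the file extension are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical reader labels keyed by MIME type (illustration only).
READER_BY_MIME = {
    "text/plain": "text",
    "text/html": "html",
    "application/pdf": "pdf",
    "application/msword": "word",
    "application/vnd.ms-excel": "excel",
    "application/vnd.ms-powerpoint": "powerpoint",
    "message/rfc822": "email",
}

# Hypothetical reader labels keyed by file extension (illustration only).
READER_BY_EXTENSION = {
    ".txt": "text",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".doc": "word",
    ".docx": "word",
    ".xls": "excel",
    ".xlsx": "excel",
    ".ppt": "powerpoint",
    ".pptx": "powerpoint",
    ".eml": "email",
}


def select_reader(path: str, content_type: Optional[str] = None) -> str:
    """Pick a reader label from an explicit MIME type, else from the extension."""
    if content_type is not None:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";")[0].strip().lower()
        if mime in READER_BY_MIME:
            return READER_BY_MIME[mime]
        raise ValueError(f"Unsupported content type: {content_type!r}")
    extension = PurePosixPath(path).suffix.lower()
    if extension in READER_BY_EXTENSION:
        return READER_BY_EXTENSION[extension]
    raise ValueError(f"Cannot infer a reader for: {path!r}")


if __name__ == "__main__":
    print(select_reader("reports/summary.PDF"))
    print(select_reader("raw-bytes", content_type="text/html; charset=utf-8"))
```

An explicit content type matters mainly for in-memory strings and byte arrays, where there is no file name to inspect.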

🐛 Bug Fixes

Adjusted the Python type annotations for AutoGGUFModel (How does set Grammar works in AutoGGUFModel? #14576)

❤️ Community Support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
JohnSnowLabs: Official Medium
YouTube: Spark NLP video tutorials

⚙️ Installation

Python

# PyPI
pip install spark-nlp==6.0.2

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.2

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.2

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.2
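The four package coordinates above differ only in the artifact name. As a small convenience sketch, the coordinate for a given build can be derived in Python; the `ARTIFACTS` table and the `package_coordinate` helper are hypothetical names for this example, and the values are copied from the commands above.

```python
# Values taken from the --packages commands in this release note.
SPARK_NLP_VERSION = "6.0.2"
SCALA_VERSION = "2.12"

# Hypothetical lookup from a build variant to its artifact name.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant: str = "cpu") -> str:
    """Build the group:artifact:version string passed to --packages."""
    if variant not in ARTIFACTS:
        raise ValueError(f"Unknown variant: {variant!r}")
    artifact = ARTIFACTS[variant]
    return f"com.johnsnowlabs.nlp:{artifact}_{SCALA_VERSION}:{SPARK_NLP_VERSION}"


if __name__ == "__main__":
    # The resulting string can also be set as Spark's spark.jars.packages option.
    for name in ARTIFACTS:
        print(name, package_coordinate(name))
```

Keeping the version in one place makes it harder to mix artifacts from different releases when switching between CPU and GPU builds.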

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.2</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.2</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.2</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.2.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.2.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.2.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.2.jar

What's Changed

SparkNLP - 1123 Introducing InternVL #14578 by @prabod
Documentation for SparkNLP Readers and Partition class #14581 by @paulamib123
Sparknlp-1174 Adding Partitioning Documents Feature #14579 by @danilojsl
SparkNLP 1131 - Introducing Florance-2 #14585 by @prabod

Full Changelog: 6.0.1...6.0.2

This discussion was created from the release 6.0.2.
