Add Kaggle datasets #9
Conversation
@allisonwang-db

`pyspark_datasources/kaggle.py` (Outdated)

```python
from functools import cached_property
from typing import Iterator

import pyarrow as pa
```
Do we have to depend on pyarrow? Can we throw a better error message if pyarrow is not installed?
Good point. Depending on pyarrow should be fine since pyspark data source itself depends on pyarrow. Let's import pyarrow later so that pyspark shows the error message if it's missing.
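As a side note, here is a minimal sketch of the deferred-import pattern being discussed, assuming a reader class shaped like PySpark's `DataSourceReader`; the class name and placeholder data are illustrative, not the code in this PR:

```python
from typing import Iterator

from pyspark.sql.datasource import DataSourceReader


class KaggleDataReader(DataSourceReader):
    def read(self, partition) -> Iterator["pa.RecordBatch"]:
        # Deferred import: if pyarrow is missing, PySpark's own dependency
        # check surfaces the error instead of this module failing at import time.
        import pyarrow as pa

        table = pa.table({"example": [1, 2, 3]})  # placeholder data
        yield from table.to_batches()
```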
🚢
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

This update introduces a new data source, `KaggleDataSource`, for reading Kaggle datasets in Spark.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Spark
    participant KaggleDataSource
    participant kagglehub
    User->>Spark: Read DataFrame (format="kaggle", options)
    Spark->>KaggleDataSource: Initialize with options
    KaggleDataSource->>kagglehub: dataset_load(handle, path, ...)
    kagglehub-->>KaggleDataSource: Returns pandas DataFrame
    KaggleDataSource->>KaggleDataSource: Convert to PyArrow Table & cache
    KaggleDataSource->>Spark: Provide schema and data reader
    Spark->>User: Returns DataFrame
```
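For reference, the flow in the diagram corresponds to the usage below; the dataset handle and file name are the ones used in the PR's test, the rest is a sketch:

```python
from pyspark.sql import SparkSession
from pyspark_datasources import KaggleDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(KaggleDataSource)

# Spark passes the options to KaggleDataSource, which calls kagglehub to
# download the file and exposes it back to Spark as Arrow record batches.
df = (
    spark.read.format("kaggle")
    .options(handle="yasserh/titanic-dataset")
    .load("Titanic-Dataset.csv")
)
df.show()
```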
Actionable comments posted: 4
🧹 Nitpick comments (4)
tests/test_data_sources.py (3)
28-33: Consider making the test more robust against dataset changes

This test successfully validates that the `KaggleDataSource` can load a dataset from Kaggle. However, it relies on a specific external dataset with hardcoded expectations about the row and column counts, which could make the test brittle if the dataset changes.

Consider either:

- Mocking the Kaggle API to avoid external dependencies in unit tests (a rough sketch follows below)
- Adding a fallback assertion to prevent test failures if the dataset is slightly modified
- Using a more stable dataset or one that you control

```diff
 def test_kaggle_datasource(spark):
     spark.dataSource.register(KaggleDataSource)
     df = spark.read.format("kaggle").options(handle="yasserh/titanic-dataset").load("Titanic-Dataset.csv")
     df.show()
-    assert df.count() == 891
-    assert len(df.columns) == 12
+    # Check that data was loaded and has expected structure
+    row_count = df.count()
+    assert row_count > 0, "DataFrame should not be empty"
+    assert len(df.columns) > 0, "DataFrame should have columns"
+    # Log actual values to help debug future failures
+    print(f"Loaded {row_count} rows with {len(df.columns)} columns")
```

🧰 Tools: 🪛 Ruff (0.8.2), 29-29: `KaggleDataSource` may be undefined, or defined from star imports (F405)
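As a rough illustration of the mocking idea in the first bullet, the sketch below stubs `kagglehub.dataset_load` with pytest's `monkeypatch`. The handle, file name, and columns are hypothetical, and the patch is only guaranteed to cover driver-side calls (Spark may run the distributed read in separate worker processes), so treat this as a starting point rather than a drop-in test:

```python
import pandas as pd

from pyspark_datasources import KaggleDataSource


def test_kaggle_datasource_mocked(spark, monkeypatch):
    import kagglehub

    fake_df = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

    # Replace the network-bound download with an in-memory DataFrame.
    monkeypatch.setattr(kagglehub, "dataset_load", lambda *args, **kwargs: fake_df)

    spark.dataSource.register(KaggleDataSource)
    df = spark.read.format("kaggle").options(handle="owner/dataset").load("file.csv")
    assert set(df.columns) == {"PassengerId", "Survived"}
```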
28-29: Explicit imports improve code clarity and avoid F405 warnings

Static analysis tools flagged that `KaggleDataSource` may be undefined due to star imports. While it works since the class is properly exported, explicit imports are preferred for better code clarity and to avoid namespace pollution.

```diff
- from pyspark_datasources import *
+ from pyspark_datasources import FakeDataSource, GithubDataSource, KaggleDataSource
```

🧰 Tools: 🪛 Ruff (0.8.2), 29-29: `KaggleDataSource` may be undefined, or defined from star imports (F405)
28-33: Add tests for error handling scenarios

The current test verifies the happy path, but it's important to test error scenarios as well, such as invalid dataset handles, authentication failures, or network issues.

Consider adding tests for error scenarios:

```python
def test_kaggle_datasource_invalid_handle(spark):
    spark.dataSource.register(KaggleDataSource)
    with pytest.raises(Exception) as excinfo:
        spark.read.format("kaggle").options(handle="invalid/dataset").load("nonexistent.csv")
    assert "dataset not found" in str(excinfo.value).lower() or "invalid handle" in str(excinfo.value).lower()
```

🧰 Tools: 🪛 Ruff (0.8.2), 29-29: `KaggleDataSource` may be undefined, or defined from star imports (F405)
pyspark_datasources/kaggle.py (1)
111-112: Handle large datasets more efficiently

The current implementation reads all data into memory at once, which may not be efficient for very large datasets. Consider supporting batched reading or partitioning.

Add support for partitioning the data:

```diff
-def read(self, partition) -> Iterator["pa.RecordBatch"]:
-    yield from self.source._data.to_batches()
+def read(self, partition) -> Iterator["pa.RecordBatch"]:
+    # Get the number of batches from options with a default value
+    options_copy = self.source.options.copy()
+    batch_size = options_copy.get("batch_size", 10000)
+    try:
+        batch_size = int(batch_size)
+    except (ValueError, TypeError):
+        batch_size = 10000
+
+    # Generate batches with the specified size
+    table = self.source._data
+    yield from table.to_batches(max_chunksize=batch_size)
+
+# Update the docstring to document the new option
+# Add to the Options section:
+# - `batch_size`: The maximum number of rows per batch (default: 10000).
```
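The diff above only tunes the batch size. If true partitioning is wanted instead, here is a rough sketch of that direction using PySpark's Python data source API (`InputPartition`, `partitions()`); the `num_partitions` option, the reader's `source` attribute, and the slicing logic are assumptions for illustration, not code from this PR:

```python
from typing import Iterator, List

from pyspark.sql.datasource import DataSourceReader, InputPartition


class RangePartition(InputPartition):
    def __init__(self, start: int, end: int):
        self.start = start
        self.end = end


class KaggleDataReader(DataSourceReader):
    def __init__(self, source):
        self.source = source

    def partitions(self) -> List[RangePartition]:
        # Split the cached table's row range into N slices so Spark can
        # schedule them as separate tasks.
        table = self.source._data
        n = int(self.source.options.get("num_partitions", 1))
        step = max(1, table.num_rows // n)
        return [
            RangePartition(i, min(i + step, table.num_rows))
            for i in range(0, table.num_rows, step)
        ]

    def read(self, partition: RangePartition) -> Iterator["pa.RecordBatch"]:
        # Each task materializes only its slice of the cached table.
        table = self.source._data
        length = partition.end - partition.start
        yield from table.slice(partition.start, length).to_batches()
```

Whether this pays off depends on where `_data` is materialized; if the whole table still has to be downloaded on the driver, partitioning mainly helps downstream parallelism rather than ingest.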
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `poetry.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (7)

- `docs/datasources/kaggle.md` (1 hunks)
- `docs/index.md` (1 hunks)
- `mkdocs.yml` (1 hunks)
- `pyproject.toml` (1 hunks)
- `pyspark_datasources/__init__.py` (1 hunks)
- `pyspark_datasources/kaggle.py` (1 hunks)
- `tests/test_data_sources.py` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyspark_datasources/__init__.py (1)

- pyspark_datasources/kaggle.py (1): `KaggleDataSource` (13-104)
🪛 Ruff (0.8.2)
tests/test_data_sources.py

- 29-29: `KaggleDataSource` may be undefined, or defined from star imports (F405)

pyspark_datasources/__init__.py

- 5-5: `.kaggle.KaggleDataSource` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
🔇 Additional comments (8)
pyproject.toml (1)
14-20: Dependency additions and extras grouping look correct.

The changes to dependencies and extras are appropriate for supporting the new Kaggle data source. No issues found.

Also applies to: 26-28
mkdocs.yml (1)
27-27: Documentation navigation updated appropriately.

The addition of the Kaggle data source documentation to the navigation is correct.
docs/datasources/kaggle.md (1)
1-5: KaggleDataSource documentation is clear and complete.

The documentation for the new data source is concise and follows project conventions.
docs/index.md (1)
31-39: Data sources table updated correctly.

The new KaggleDataSource entry and improved formatting are accurate and helpful.
pyspark_datasources/__init__.py (1)
5-5: KaggleDataSource import is correct for public API exposure.

The import exposes `KaggleDataSource` at the package level, matching the pattern for other data sources. The Ruff warning about unused import can be safely ignored in this context.

🧰 Tools: 🪛 Ruff (0.8.2), 5-5: `.kaggle.KaggleDataSource` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
pyspark_datasources/kaggle.py (3)
13-60: Excellent documentation for the data source

The comprehensive docstring for `KaggleDataSource` includes clear explanations of the purpose, options, authentication requirements, and usage examples, making it easy for users to understand and use this data source.

1-11: Good type checking and imports organization

The imports are well-organized, and the use of TYPE_CHECKING for pyarrow is a good practice to avoid unnecessary imports at runtime while still providing type hints for static analyzers.

87-89: Good use of temporary directory to avoid permission issues

The use of a temporary directory for caching and explicitly setting `KAGGLEHUB_CACHE` is a good approach to avoid permission issues with the default cache location.
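For readers unfamiliar with the pattern praised in the 1-11 comment, here is a generic sketch of a `TYPE_CHECKING`-guarded pyarrow import (not the file's exact contents):

```python
from typing import TYPE_CHECKING, Iterator

if TYPE_CHECKING:
    # Imported only for static type checkers; no runtime pyarrow import
    # (or ImportError) when the module itself is loaded.
    import pyarrow as pa


def read_batches(table: "pa.Table") -> Iterator["pa.RecordBatch"]:
    # String annotations keep the type hints without a hard import-time dependency.
    yield from table.to_batches()
```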
os.environ["KAGGLE_USERNAME"] = username | ||
os.environ["KAGGLE_KEY"] = key | ||
| ||
kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()} | ||
| ||
# Cache in a temporary directory to avoid writing to ~ which may be read-only | ||
with tempfile.TemporaryDirectory() as tmpdir: | ||
os.environ["KAGGLEHUB_CACHE"] = tmpdir |
🛠️ Refactor suggestion
Avoid setting environment variables directly
Setting environment variables directly can have unintended side effects, especially in multi-threaded environments or when tests run concurrently. Consider using a more isolated approach.
Use a context manager to temporarily set and restore environment variables:
```diff
+import contextlib
+import os
+
+@contextlib.contextmanager
+def temp_environ():
+    """Context manager to temporarily modify os.environ."""
+    env_copy = os.environ.copy()
+    try:
+        yield
+    finally:
+        os.environ.clear()
+        os.environ.update(env_copy)

 # In the _data method:
-        if username or key:
-            if not (username and key):
-                raise ValueError(
-                    "Both username and key must be provided to authenticate."
-                )
-            os.environ["KAGGLE_USERNAME"] = username
-            os.environ["KAGGLE_KEY"] = key
-
-        # Cache in a temporary directory to avoid writing to ~ which may be read-only
-        with tempfile.TemporaryDirectory() as tmpdir:
-            os.environ["KAGGLEHUB_CACHE"] = tmpdir
-            import kagglehub
+        with temp_environ():
+            if username or key:
+                if not (username and key):
+                    raise ValueError(
+                        "Both username and key must be provided to authenticate."
+                    )
+                os.environ["KAGGLE_USERNAME"] = username
+                os.environ["KAGGLE_KEY"] = key
+
+            # Cache in a temporary directory to avoid writing to ~ which may be read-only
+            with tempfile.TemporaryDirectory() as tmpdir:
+                os.environ["KAGGLEHUB_CACHE"] = tmpdir
+                import kagglehub
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
os.environ["KAGGLE_USERNAME"] = username | |
os.environ["KAGGLE_KEY"] = key | |
kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()} | |
# Cache in a temporary directory to avoid writing to ~ which may be read-only | |
with tempfile.TemporaryDirectory() as tmpdir: | |
os.environ["KAGGLEHUB_CACHE"] = tmpdir | |
import contextlib | |
import os | |
import tempfile | |
import ast | |
# Other necessary imports | |
@contextlib.contextmanager | |
def temp_environ(): | |
"""Context manager to temporarily modify os.environ.""" | |
env_copy = os.environ.copy() | |
try: | |
yield | |
finally: | |
os.environ.clear() | |
os.environ.update(env_copy) | |
# ... inside the _data method of the module | |
kwargs = {k: ast.literal_eval(v) for k, v in self.options.items()} | |
with temp_environ(): | |
if username or key: | |
if not (username and key): | |
raise ValueError( | |
"Both username and key must be provided to authenticate." | |
) | |
os.environ["KAGGLE_USERNAME"] = username | |
os.environ["KAGGLE_KEY"] = key | |
# Cache in a temporary directory to avoid writing to ~ which may be read-only | |
with tempfile.TemporaryDirectory() as tmpdir: | |
os.environ["KAGGLEHUB_CACHE"] = tmpdir | |
import kagglehub |
It runs in a separate process, so setting the env variable should be fine.
The Kaggle data source is simply a wrapper around `kagglehub.dataset_load`, which allows loading a dataset as a pandas DataFrame.

Addition of Kaggle Data Source:

- `pyspark_datasources/kaggle.py`: Added the `KaggleDataSource` class for reading Kaggle datasets in Spark, including methods for schema and data reading.
- `pyspark_datasources/__init__.py`: Imported the `KaggleDataSource` class to make it available in the module.

Documentation Updates:

- `docs/datasources/kaggle.md`: Added documentation for the `KaggleDataSource`, including requirements and usage examples.
- `docs/index.md`: Updated the index to include the `KaggleDataSource` in the list of available data sources.

Project Configuration:

- `pyproject.toml`: Added the `kagglehub` library as an optional dependency for the project.

Testing:

- `tests/test_data_sources.py`: Added a new test case for the `KaggleDataSource` to ensure it can read a dataset from Kaggle correctly.
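For context, a minimal sketch of the underlying `kagglehub` call being wrapped, based on kagglehub's pandas adapter; the handle and file name are the ones used in the PR's test, and the exact signature should be checked against your installed kagglehub version:

```python
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Download a single file from a Kaggle dataset and load it as a pandas DataFrame.
pdf = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "yasserh/titanic-dataset",
    "Titanic-Dataset.csv",
)
print(pdf.shape)
```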