@stephantul stephantul commented Jun 2, 2025

This PR normalizes spacing according to how the original tokenizer handles it. This is important for retaining the performance increase for both BPE and non-BPE tokenizers.
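As a rough illustration of what tokenizer-dependent spacing normalization can look like, here is a minimal sketch. It is not the code from this PR: the function name `normalize_spacing` and the `pad_punctuation` switch are hypothetical, and the assumption that the switch would be derived from the original tokenizer's behavior is mine.

```python
import re
import string

# Matches any single ASCII punctuation character.
_PUNCT = re.compile(f"([{re.escape(string.punctuation)}])")
# Matches any run of whitespace.
_SPACES = re.compile(r"\s+")


def normalize_spacing(text: str, pad_punctuation: bool) -> str:
    """Hypothetical helper: normalize spacing in one of two ways.

    pad_punctuation=True  : surround punctuation with spaces, then collapse
                            whitespace (assumed to suit tokenizers that split
                            punctuation off from words).
    pad_punctuation=False : only collapse whitespace (assumed to suit
                            tokenizers that should see the text unchanged
                            apart from repeated spaces).
    """
    if pad_punctuation:
        text = _PUNCT.sub(r" \1 ", text)
    return _SPACES.sub(" ", text).strip()


padded = normalize_spacing("Hello,world!", pad_punctuation=True)
collapsed = normalize_spacing("Hello,  world!", pad_punctuation=False)
print(padded)     # Hello , world !
print(collapsed)  # Hello, world!
```

The point of the switch is that one normalization rule does not fit every tokenizer, so the choice is made per tokenizer rather than hard-coded.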

@stephantul stephantul requested a review from Pringled June 2, 2025 18:17
codecov bot commented Jun 2, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| model2vec/tokenizer/normalizer.py | 85.71% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| model2vec/tokenizer/normalizer.py | 95.23% <85.71%> (-4.77%) ⬇️ |
@stephantul stephantul merged commit 06a478c into main Jun 3, 2025
5 of 6 checks passed
@stephantul stephantul deleted the normalize-punct-switch branch June 3, 2025 09:41
