Contributor

@vicky-iv vicky-iv commented Aug 29, 2025

User description

🔗 Related Issues

Fixes #14271

💥 What does this PR do?

Fixes incorrect capitalization behavior for text-transform: capitalize, addressing:

  • Accented Latin letters (e.g., “expiración” → “Expiración”, not “ExpiraciÓN”).
  • Enye (e.g., “mañana” → “Mañana”).

🔧 Implementation Notes

Replaced the previous boundary regex with a two-step approach:

  • First step uses a negated “separator” class that treats ASCII letters, extended Latin letters (\u00C0–\u02AF, \u1E00–\u1EFF) and combining marks (\u0300–\u036F, \u1AB0–\u1AFF, \u1DC0–\u1DFF) as in-word characters, and excludes _ and apostrophes from being boundaries (so we don’t split snake_case or contractions).

  • Second step: a small second regex capitalizes the first letter after an opening _ or * only when those symbols act as wrappers (preceded by start or a non-word), avoiding interference with snake_case.
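
The two steps above can be sketched as a standalone function. This is a minimal illustration, not the code in dom.js: the name capitalizeSketch is hypothetical, and the two regexes are copied from the snippet quoted in the review of this PR.

```javascript
// Hypothetical standalone helper for illustration; not part of the atoms API.
function capitalizeSketch(text) {
  // Step 1: uppercase a letter that follows the start of the string or a
  // separator. '_', apostrophes, digits, Latin letters and combining marks
  // are listed in the negated class, so they do not count as separators.
  var first = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])/g;
  text = text.replace(first, function (match, boundary, ch) {
    return boundary + ch.toUpperCase();
  });

  // Step 2: uppercase a letter that follows an opening '_' or '*', but only
  // when that symbol itself follows the start or a separator (a wrapper),
  // so the '_' inside snake_case never triggers it.
  var wrapper = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])([_*])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24D0-\u24E9])/g;
  return text.replace(wrapper, function (match, boundary, marker, ch) {
    return boundary + marker + ch.toUpperCase();
  });
}
```

With this sketch, "expiración" becomes "Expiración", "snake_case" becomes "Snake_case", and "_italic_ text" becomes "_Italic_ Text".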

All tests in text_test.html pass locally:

  • Firefox: [screenshot]
  • Chrome: [screenshot]
  • Safari: [screenshot]

💡 Additional Considerations

Scope is limited to Latin scripts and circled letters; other scripts (Greek, Cyrillic, etc.) can be added by extending the ranges if needed.

🔄 Types of changes

  • Bug fix (backwards compatible)

PR Type

Bug fix


Description

  • Fixed text-transform: capitalize for accented Latin letters

  • Added support for enye and extended Unicode ranges

  • Improved boundary detection to preserve snake_case

  • Added test cases for Spanish accented characters


Diagram Walkthrough

flowchart LR
  A["Old regex boundary detection"] --> B["Enhanced Unicode-aware regex"]
  B --> C["Preserve snake_case"]
  B --> D["Support accented letters"]
  E["Add test cases"] --> F["Spanish characters validation"]

File Walkthrough

Relevant files
Bug fix
dom.js
Enhanced text capitalization with Unicode support               

javascript/atoms/dom.js

  • Replaced simple boundary regex with Unicode-aware pattern
  • Added support for extended Latin and combining marks ranges
  • Implemented two-step capitalization to handle edge cases
  • Preserved snake_case and contractions from incorrect splitting
+9/-2     
Tests
text_test.html
Added Unicode capitalization test cases                                   

javascript/atoms/test/text_test.html

  • Added test cases for Spanish accented characters
  • Fixed whitespace expectations in preformatted text tests
  • Added validation for "expiración" and "mañana" capitalization
+12/-2   

@selenium-ci selenium-ci added the B-atoms JavaScript chunks generated by Google closure label Aug 29, 2025
@qodo-merge-pro
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Unicode Range Accuracy

Verify the enclosed/circled letter ranges: the range included in the first regex is U+24B6–U+24E9, which spans the circled capitals A–Z (U+24B6–U+24CF) and continues into the circled small letters a–z (U+24D0–U+24E9); ensure this is intentional and that the mapping truly pairs each small letter (U+24D0–U+24E9) with its corresponding capital without affecting unrelated characters. Also confirm that combining marks following the first letter don’t break capitalization.

// 1) don't treat '_' as a separator (protects snake_case)
var re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])/g;
text = text.replace(re, function () {
  return arguments[1] + arguments[2].toUpperCase();
});
// 2) capitalize after opening "_" or "*"
// Preceded by start or a non-word (so it won't fire for snake_case)
re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])([_*])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24D0-\u24E9])/g;
text = text.replace(re, function () {
  return arguments[1] + arguments[2] + arguments[3].toUpperCase();
});
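
One way to probe both points raised here is a small check like the one below. It is only a sketch: the function names are hypothetical, the first-pass regex is copied from the snippet above, and the result depends on the case mappings of the engine it runs in.

```javascript
// Returns the circled small letters (U+24D0 to U+24E9) whose toUpperCase()
// result is NOT the circled capital 26 code points below them. An empty
// array means the small-to-capital pairing holds in the current engine.
function circledCaseMismatches() {
  var mismatches = [];
  for (var cp = 0x24D0; cp <= 0x24E9; cp++) {
    var small = String.fromCharCode(cp);
    var expectedCapital = String.fromCharCode(cp - 26);
    if (small.toUpperCase() !== expectedCapital) {
      mismatches.push(small);
    }
  }
  return mismatches;
}

// First capitalization pass only, copied from the PR snippet.
function firstPassSketch(text) {
  var re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])/g;
  return text.replace(re, function (match, boundary, ch) {
    return boundary + ch.toUpperCase();
  });
}

// Decomposed input: "o" followed by U+0301 (combining acute accent).
// The mark is in the in-word class, so the "n" after it stays lowercase.
var decomposed = firstPassSketch("expiracio\u0301n");
```

Because U+0300–U+036F is listed in the negated class, a combining mark is treated as part of the word, so the letter after it is not seen as a new word start.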
IE Compatibility

The old code had an IE-specific fallback avoiding /u. New regexes drop that branch and rely on Unicode character classes without the 'u' flag but include non-ASCII ranges; confirm this still behaves in legacy environments the atoms target (especially old IE) and doesn’t degrade performance.
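
A partial sanity check for this concern is sketched below. It relies on one assumption: every range in the new regexes lies in the Basic Multilingual Plane, so each character is a single UTF-16 code unit and a regex without the 'u' flag can compare it directly. The names are hypothetical, and the sketch does not exercise old IE itself.

```javascript
// Upper bounds of every \uXXXX range that appears in the two regexes.
var RANGE_UPPER_BOUNDS = [0x02AF, 0x1EFF, 0x24E9, 0x036F, 0x1AFF, 0x1DFF];

// True when every bound fits in one UTF-16 code unit, i.e. no range needs
// surrogate pairs (which a regex without /u would treat as two characters).
function allRangesInBmp(bounds) {
  for (var i = 0; i < bounds.length; i++) {
    if (bounds[i] > 0xFFFF) {
      return false;
    }
  }
  return true;
}

// The letter class from the first regex, anchored to a single character and
// written without the 'u' flag, as in the PR.
var LATIN_LETTER = /^[A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9]$/;
```

Here LATIN_LETTER matches "ñ" (U+00F1) but not a Cyrillic letter such as U+0416, and not a character outside the BMP, since that occupies two code units.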

False Positives Around Punctuation

The boundary class excludes apostrophes to preserve contractions; confirm cases like l'état, O’Neill, and words after punctuation (e.g., “hello—world”) capitalize correctly and that hyphenated words “bla-bla” still behave as desired.
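
The cases named here can be run through the first pass directly. The sketch below uses a hypothetical helper with the first regex copied from the PR; it shows what the regex does, not what browsers render for text-transform: capitalize.

```javascript
// First capitalization pass only, copied from the PR snippet.
function firstPassOnly(text) {
  var re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])/g;
  return text.replace(re, function (match, boundary, ch) {
    return boundary + ch.toUpperCase();
  });
}

// Inputs use escapes so the exact code points are unambiguous:
// \u00E9 = e with acute, \u2019 = typographic apostrophe, \u2014 = em dash.
var punctuationSamples = [
  "l'\u00E9tat",
  "o\u2019neill",
  "hello\u2014world",
  "bla-bla"
];

var punctuationResults = [];
for (var i = 0; i < punctuationSamples.length; i++) {
  punctuationResults.push(firstPassOnly(punctuationSamples[i]));
}
```

Only the straight apostrophe (') is excluded from the boundary class. The typographic apostrophe (U+2019), the em dash and the hyphen all act as boundaries, so the letter after them is capitalized. Whether that difference is desired is a question for review.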

@qodo-merge-pro
Contributor

qodo-merge-pro bot commented Aug 29, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
Possible issue
Align Unicode ranges and boundaries

Make this regex consistent with the first capitalization pass. Include combining
mark ranges in the boundary class and use the full circled Latin range for the
letter group to avoid missed matches (e.g., uppercase circled letters) and
incorrect boundaries after combining marks.

javascript/atoms/dom.js [1184]

-re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])([_*])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24D0-\u24E9])/g;
+re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF])([_*])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])/g;
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies an inconsistency in Unicode ranges between the two regular expressions used for capitalization, which could lead to bugs in edge cases.
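
One possible way to keep the two patterns from drifting apart is to build both from shared range strings. This is a hypothetical refactoring sketch, not something proposed in the PR: the constant and function names are invented, and it follows the aligned ranges from the suggestion above.

```javascript
// Shared character-class fragments (assumed names, for illustration only).
var LETTER_RANGES = "A-Za-z\\u00C0-\\u02AF\\u1E00-\\u1EFF\\u24B6-\\u24E9";
var MARK_RANGES = "\\u0300-\\u036F\\u1AB0-\\u1AFF\\u1DC0-\\u1DFF";
var IN_WORD = "'_0-9" + LETTER_RANGES + MARK_RANGES;

// Both regexes use the same boundary class and the same letter class.
var FIRST_PASS = new RegExp("(^|[^" + IN_WORD + "])([" + LETTER_RANGES + "])", "g");
var WRAPPER_PASS = new RegExp("(^|[^" + IN_WORD + "])([_*])([" + LETTER_RANGES + "])", "g");

function capitalizeAligned(text) {
  text = text.replace(FIRST_PASS, function (match, boundary, ch) {
    return boundary + ch.toUpperCase();
  });
  return text.replace(WRAPPER_PASS, function (match, boundary, marker, ch) {
    return boundary + marker + ch.toUpperCase();
  });
}
```

The trade-off is readability: the literal regexes in the PR can be read in place, while the string-built version needs double-escaped \u sequences but has a single source of truth for the ranges.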

Medium
Learned
best practice
Use explicit callback parameters

Use named parameters in the replace callbacks instead of indexing into
"arguments" to improve clarity and avoid reliance on the implicit arguments
object.

javascript/atoms/dom.js [1175-1188]

 if (textTransform == 'capitalize') {
   // 1) don't treat '_' as a separator (protects snake_case)
   var re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])/g;
-  text = text.replace(re, function () {
-    return arguments[1] + arguments[2].toUpperCase();
+  text = text.replace(re, function (match, boundary, ch) {
+    return boundary + ch.toUpperCase();
   });
   // 2) capitalize after opening "_" or "*"
   // Preceded by start or a non-word (so it won't fire for snake_case)
   re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])([_*])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24D0-\u24E9])/g;
-  text = text.replace(re, function () {
-    return arguments[1] + arguments[2] + arguments[3].toUpperCase();
+  text = text.replace(re, function (match, boundary, marker, ch) {
+    return boundary + marker + ch.toUpperCase();
   });
 } else if (textTransform == 'uppercase') {
Suggestion importance[1-10]: 5

Why:
Relevant best practice - Prefer clear, language-idiomatic APIs: use explicit callback parameters over the implicit "arguments" object for readability and maintainability.

Low
// Preceded by start or a non-word (so it won't fire for snake_case)
re = /(^|[^'_0-9A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24B6-\u24E9])([_*])([A-Za-z\u00C0-\u02AF\u1E00-\u1EFF\u24D0-\u24E9])/g;
text = text.replace(re, function () {
return arguments[1] + arguments[2] + arguments[3].toUpperCase();

Here you can remove the magic numbers and leave something more descriptive like:

text = text.replace(re, function (_match, prefix, divider, char) {
  return prefix + divider + char.toUpperCase();
});
Contributor Author

Sure, I just followed the existing implementation and coding style

No worries, it's just a coding tip.

Member

@diemol diemol left a comment

Thank you, @vicky-iv!

@diemol diemol merged commit 775cfb3 into SeleniumHQ:trunk Sep 2, 2025
31 of 32 checks passed
@vicky-iv vicky-iv deleted the fix-text-capitalization branch September 5, 2025 19:48
@vicky-iv vicky-iv restored the fix-text-capitalization branch September 5, 2025 19:48
@vicky-iv vicky-iv deleted the fix-text-capitalization branch September 5, 2025 19:48

Labels

B-atoms JavaScript chunks generated by Google closure Review effort 3/5

4 participants