[tangentially related to CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters

Background

RFC 3986 defines a scheme like this:

scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

RFC 2234 defines an ALPHA like this:

ALPHA = %x41-5A / %x61-7A

The WHATWG URL spec defines a scheme like this:

"A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.)."
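The two definitions above agree on what a valid scheme looks like, and the rule can be written as a regular expression. This is only a sketch for illustration: the names SCHEME_RE and is_valid_scheme are made up here and are not part of urllib.parse.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# The character classes are spelled out so that only ASCII letters and digits match.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme):
    """Hypothetical helper: True if the whole string matches the scheme grammar."""
    return SCHEME_RE.fullmatch(scheme) is not None


print(is_valid_scheme("http"))     # True
print(is_valid_scheme("svn+ssh"))  # True
print(is_valid_scheme("."))        # False
print(is_valid_scheme("1http"))    # False
```

fullmatch is used rather than match so that trailing characters, including a trailing newline, cannot slip through.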

The bug

This is the scheme string parsing code from Lib/urllib/parse.py:462-468:

    i = url.find(':')
    if i > 0:
        for c in url[:i]:
            if c not in scheme_chars:
                break
        else:
            scheme, url = url[:i].lower(), url[i+1:]

This is the definition of scheme_chars from Lib/urllib/parse.py:77-80:

    scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                    '0123456789'
                    '+-.')

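Because every character before the colon, including the first, is checked only against scheme_chars, this code erroneously accepts schemes that begin with any of ('.', '-', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'). Both specifications require the first character to be an ASCII letter, so this behavior violates both.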
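One way the check could be tightened is to test the first character separately before scanning the rest. The following is a minimal standalone sketch, not necessarily the change made in the linked PRs: split_scheme is a hypothetical helper, and scheme_chars is rebuilt here from the string module as the same set of characters shown above.

```python
import string

# Same character set as scheme_chars in Lib/urllib/parse.py, rebuilt for this sketch.
scheme_chars = string.ascii_letters + string.digits + "+-."


def split_scheme(url):
    """Hypothetical helper: return (scheme, rest); scheme is '' if none is valid."""
    i = url.find(":")
    # Require a non-empty prefix whose first character is an ASCII letter.
    if i > 0 and url[0] in string.ascii_letters:
        # The remaining characters may be any of scheme_chars.
        if all(c in scheme_chars for c in url[:i]):
            return url[:i].lower(), url[i + 1:]
    return "", url


print(split_scheme("http://example.com"))  # ('http', '//example.com')
print(split_scheme(".://"))                # ('', '.://')
```

The sketch returns an empty scheme for invalid input instead of raising, which mirrors how the existing loop falls through when it hits a disallowed character.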

This bug is reproducible with the following snippet:

    >>> from urllib.parse import urlparse
    >>> urlparse(".://")  # Should error, but doesn't
    ParseResult(scheme='.', netloc='', path='', params='', query='', fragment='')
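To see which of the forbidden leading characters a given interpreter accepts, the same check can be run over all of them. This is a small probing sketch; CANDIDATES and accepted_leading_chars are names invented for it, and the result depends on the Python version in use, so no expected output is shown.

```python
from urllib.parse import urlparse

# Leading characters that both specifications forbid at the start of a scheme.
CANDIDATES = list("+-.0123456789")


def accepted_leading_chars():
    """Return the candidates that the installed urlparse accepts as a scheme."""
    accepted = []
    for c in CANDIDATES:
        # A one-character "scheme" followed by "://", as in the snippet above.
        if urlparse(c + "://").scheme == c:
            accepted.append(c)
    return accepted


# An empty list means none of the forbidden characters were accepted.
print(accepted_leading_chars())
```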

My environment

CPython versions tested on:
- 3.12.0a1+ (fb844e1)
- 3.10.8
Operating system and architecture:
- Arch Linux x86_64

PR: gh-99418: Make urllib.parse.urlparse enforce that a scheme must begin with an alphabetical ASCII character. #99421

PR: [3.11] gh-99418: Make urllib.parse.urlparse enforce that a scheme must begin with an alphabetical ASCII character. (GH-99421) #99446
