Skip to content

Conversation

@fedexman
Copy link
Contributor

Fix media type inference for URLs with query parameters

When using presigned URLs (eg AWS S3) with ImageUrl, AudioUrl, or VideoUrl, the media type inference fails

https://pics.s3.ap-northeast-1.amazonaws.com/test/Capture-2025-11-21-112402.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=fdafdas%2F20251121%2Fap-northeast-1%2Fs3%2Faws4_request&X-Amz-Date=20251121T023200Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDoaD

to the agent we get this error,

Internal server error: Could not infer media type from image URL: ... Explicitly provide a `media_type` instead

the reason is that the _infer_media_type function only check the end of the url with url.endswith('.mkv') but do not parse the url.

I propose to parse the url with

from urllib.parse import urlparse path = urlparse(self.url).path if path.endswith('.mkv'): return 'video/x-matroska'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we using the mimetypes stdlib module? mimetypes.guess_type() already parses URLs and the current implementation doesn't take into account case insensitivity, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Viicos Interestingly we already use that in DocumentUrl._infer_media_type, after checking a bunch of types ourselves :/

@fedexman Can you see if we can use mimetypes.guess_type() for all of these?

The method can be changed to just return str rather than XMediaType, as I don't think that type is used on any public fields.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Viicos Interestingly we already use that in DocumentUrl._infer_media_type, after checking a bunch of types ourselves :/

@fedexman Can you see if we can use mimetypes.guess_type() for all of these?

The method can be changed to just return str rather than XMediaType, as I don't think that type is used on any public fields.

@github-actions
Copy link

This PR is stale, and will be closed in 3 days if no reply is received.

@github-actions github-actions bot added the Stale label Nov 29, 2025
@fedexman
Copy link
Contributor Author

fedexman commented Dec 1, 2025

I'll update this week 🙏🙏

@DouweM DouweM removed the Stale label Dec 1, 2025
@fedexman
Copy link
Contributor Author

fedexman commented Dec 3, 2025

I refactored using mimetypes. Some types are defined in the standard library, if not, they use some os files that is dependent of the machine. For all the types not in the standard library I added them manually to have reliable behavior.
Good for rereview 🙇

@fedexman fedexman requested a review from DouweM December 3, 2025 13:44
@DouweM
Copy link
Collaborator

DouweM commented Dec 9, 2025

@fedexman Please have a look at @Viicos's comments!

@fedexman
Copy link
Contributor Author

Sorry for slow updates, I've used Mimetypes library in the past and it hides a lot of unexpected behaviors.
For example I have a test failing on pydantic ai due to this behavior:

import mimetypes filename = "data.xml" global_guess, _ = mimetypes.guess_type(filename) db = mimetypes.MimeTypes() db_guess, _ = db.guess_type(filename) print(f"Global: {global_guess}") print(f"Fresh: {db_guess}") assert global_guess == db_guess

Output

Global: application/xml Fresh: text/xml Traceback (most recent call last): File "<python-input-2>", line 13, in <module> assert global_guess == db_guess 

Even though the official docs says

class mimetypes.MimeTypes(filenames=(), strict=True)
This class represents a MIME-types database. By default, it provides access to the same database as the rest of this module. The initial database is a copy of that provided by the module, and may be extended by loading additional mime.types-style files into the database using the read() or readfp() methods. The mapping dictionaries may also be cleared before loading additional data if the default data is not desired.
The optional filenames parameter can be used to cause additional files to be loaded “on top” of the default database.

I get this test failing even though the xml test was guessed by this before

tests/test_messages.py:689 test_binary_content_from_path │ - assert binary_content == snapshot(BinaryContent(data=b'<think>about trains</think>', │ │ media_type='application/xml')) │ │ + assert binary_content == snapshot(BinaryContent(data=b'<think>about trains</think>', media_type='text/xml')) │ 

The mimetypes library has a internal _db with newer types, but using a fresh db doesn't contains those and use the os definitions.

I added the new application/xml manually but doing like we might risk using also other olds os definitions. Currently looking for better solution, i cannot find a clean way of creating a new mimetype db with the exact same state as the mimetype internal db

@fedexman
Copy link
Contributor Author

Now tests should be passing and I added Vico's suggestions. Tell me if you have an idea of cleaner solution for the mime type db init

@DouweM
Copy link
Collaborator

DouweM commented Dec 10, 2025

@fedexman Can you have a look at the failing tests please?

@Viicos
Copy link
Member

Viicos commented Dec 10, 2025

Even though the official docs says

Yeah, this is annoying and a documentation update is pending at python/cpython#27750.

One way to do it is to just replicate how the default db is instantiated:

_mime_types = MimeTypes() # The default db is actually different from the ones we can manually instantiate # (see https://github.com/python/cpython/pull/27750). As such, replicate # what is being done in `mimetypes.init()`: _mime_types.read_windows_registry() for file in mimetypes.knownfiles: if os.path.isfile(file): _mime_types.read(file)
Comment on lines 38 to 73
# Document types
_mime_types.add_type('application/msword', '.doc')
_mime_types.add_type('application/pdf', '.pdf')
_mime_types.add_type('application/rtf', '.rtf')
_mime_types.add_type('application/vnd.ms-excel', '.xls')
_mime_types.add_type('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', '.xlsx')
_mime_types.add_type('application/vnd.openxmlformats-officedocument.wordprocessingml.document', '.docx')
_mime_types.add_type('text/markdown', '.mdx')
_mime_types.add_type('text/plain', '.txt')
_mime_types.add_type('text/x-asciidoc', '.asciidoc')

# Image types
_mime_types.add_type('image/gif', '.gif')
_mime_types.add_type('image/jpeg', '.jpeg')
_mime_types.add_type('image/jpeg', '.jpg')
_mime_types.add_type('image/png', '.png')
_mime_types.add_type('image/webp', '.webp')

# Video types
_mime_types.add_type('video/3gpp', '.three_gp')
_mime_types.add_type('video/mp4', '.mp4')
_mime_types.add_type('video/mpeg', '.mpeg')
_mime_types.add_type('video/mpeg', '.mpg')
_mime_types.add_type('video/quicktime', '.mov')
_mime_types.add_type('video/webm', '.webm')
_mime_types.add_type('video/x-flv', '.flv')
_mime_types.add_type('video/x-matroska', '.mkv')
_mime_types.add_type('video/x-ms-wmv', '.wmv')

# Audio types
_mime_types.add_type('audio/aac', '.aac')
_mime_types.add_type('audio/aiff', '.aiff')
_mime_types.add_type('audio/flac', '.flac')
_mime_types.add_type('audio/mpeg', '.mp3')
_mime_types.add_type('audio/ogg', '.oga')
_mime_types.add_type('audio/wav', '.wav')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not required now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added the one not defined natively by python or the one that have a different definition in python

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand: taking .mp4 as an example, it is defined in the mapping and maps to video/mp4?

Copy link
Contributor Author

@fedexman fedexman Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but test are failing on 3.10 even though we add video/mp4 explicitly and it is also added in the python lib 😅 passing locally though so difficult to test. I'll update with test passing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok the query string ignore was supported only from 3.11 😅
python/cpython#66543
on 3.10 it is not supported

in 3.10 this code gives Result: (None, None)

import mimetypes import sys url = "https://example.com/image.png?token=123" result = mimetypes.guess_type(url) print(f"Result: {result}") 
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not updating the PR. In case we are using a python feature that is not supported in 3.10 what is the common way to do, does it mean we cannot use mimetype for this task, or do we add a new branch logic for 3.10 😅 what is the usual solution for these cases ?
the purpose of this PR was to make things cleaner and support query types by using a native python feature, but if we cannot use it, this PR does not make so much sense. I can switch to a urllib.parse logic though 🤔 @Viicos

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry should have come back here. Given that 3.10 is the oldest supported version and we will drop support in less than a year, I'd rather skip the test in this version. Prior to this PR, it wouldn't be supported in any Python version anyway, so I think this is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that in mind, are the manually added mime types still necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i realized during debugging i may have added unnecessary types, i remade a check with https://github.com/python/cpython/blob/3.10/Lib/mimetypes.py and now it should not have any duplicates anymore with native python

@fedexman
Copy link
Contributor Author

good for rereview @Viicos

Copy link
Member

@Viicos Viicos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fedexman 🙏

@Viicos Viicos merged commit c9f5410 into pydantic:main Dec 22, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment