Fix media type inference for URLs with query parameters #3501

fedexman · 2025-11-21T04:25:56Z

Fix media type inference for URLs with query parameters

When using presigned URLs (e.g. AWS S3) with ImageUrl, AudioUrl, or VideoUrl, media type inference fails. For example, when passing

https://pics.s3.ap-northeast-1.amazonaws.com/test/Capture-2025-11-21-112402.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=fdafdas%2F20251121%2Fap-northeast-1%2Fs3%2Faws4_request&X-Amz-Date=20251121T023200Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDoaD

to the agent, we get this error:

Internal server error: Could not infer media type from image URL: ... Explicitly provide a `media_type` instead

The reason is that the _infer_media_type function only checks the end of the URL, e.g. with url.endswith('.mkv'), and does not parse the URL first.

I propose parsing the URL and checking only its path:

    from urllib.parse import urlparse

    path = urlparse(self.url).path
    if path.endswith('.mkv'):
        return 'video/x-matroska'
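The proposal above can be sketched as a standalone function. This is a minimal illustration, not the code from the PR: the function name, the lookup table, and the example URL are hypothetical, and the real mapping covers many more extensions.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup table for illustration; the real mapping is larger.
_SUFFIX_TO_MEDIA_TYPE = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.mkv': 'video/x-matroska',
}


def infer_media_type_from_url(url: str) -> str | None:
    """Return a media type based on the URL path, ignoring query and fragment."""
    # urlparse() separates the path from the query string and fragment.
    path = urlparse(url).path
    # Lower-case the suffix so that '.PNG' and '.png' are treated the same.
    suffix = PurePosixPath(path).suffix.lower()
    return _SUFFIX_TO_MEDIA_TYPE.get(suffix)


# Hypothetical presigned-style URL.
presigned = 'https://example.com/test/capture.png?X-Amz-Expires=3600&X-Amz-SignedHeaders=host'
print(infer_media_type_from_url(presigned))
```

Because only the parsed path is inspected, the signature parameters appended by the storage provider no longer affect the result.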

Viicos · 2025-11-21T09:11:19Z

pydantic_ai_slim/pydantic_ai/messages.py

Why aren't we using the mimetypes stdlib module? mimetypes.guess_type() already parses URLs and the current implementation doesn't take into account case insensitivity, etc.
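To illustrate the case-insensitivity point, here is a small comparison between a plain suffix check and mimetypes.guess_type(). The URL is a made-up example, and the exact result of guess_type() can depend on the Python version and on OS-level MIME files.

```python
import mimetypes

# Hypothetical URL with an upper-case extension.
url = 'https://example.com/Photo.PNG'

# A plain suffix check is case-sensitive, so it misses upper-case extensions.
print(url.endswith('.png'))

# guess_type() returns a (type, encoding) tuple and also matches the
# lower-cased extension.
media_type, _encoding = mimetypes.guess_type(url)
print(media_type)
```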

DouweM · 2025-11-21T17:22:35Z

pydantic_ai_slim/pydantic_ai/messages.py

@Viicos Interestingly we already use that in DocumentUrl._infer_media_type, after checking a bunch of types ourselves :/

@fedexman Can you see if we can use mimetypes.guess_type() for all of these?

The method can be changed to just return str rather than XMediaType, as I don't think that type is used on any public fields.

github-actions · 2025-11-29T14:00:34Z

This PR is stale, and will be closed in 3 days if no reply is received.

fedexman · 2025-12-01T17:04:09Z

I'll update this week 🙏🙏

fedexman · 2025-12-03T13:44:43Z

I refactored this to use mimetypes. Some types are defined in the standard library; the others come from OS files that depend on the machine. For all types not in the standard library, I added them manually to get reliable behavior.
Good for re-review 🙇

DouweM · 2025-12-09T21:59:32Z

@fedexman Please have a look at @Viicos's comments!

fedexman · 2025-12-10T01:35:42Z

Sorry for the slow updates. I've used the mimetypes library in the past, and it hides a lot of unexpected behavior.
For example, I have a test failing in pydantic-ai because of this behavior:

    import mimetypes

    filename = "data.xml"
    global_guess, _ = mimetypes.guess_type(filename)
    db = mimetypes.MimeTypes()
    db_guess, _ = db.guess_type(filename)
    print(f"Global: {global_guess}")
    print(f"Fresh: {db_guess}")
    assert global_guess == db_guess

Output

    Global: application/xml
    Fresh: text/xml
    Traceback (most recent call last):
      File "<python-input-2>", line 13, in <module>
        assert global_guess == db_guess

Even though the official docs say:

class mimetypes.MimeTypes(filenames=(), strict=True)
This class represents a MIME-types database. By default, it provides access to the same database as the rest of this module. The initial database is a copy of that provided by the module, and may be extended by loading additional mime.types-style files into the database using the read() or readfp() methods. The mapping dictionaries may also be cleared before loading additional data if the default data is not desired.
The optional filenames parameter can be used to cause additional files to be loaded “on top” of the default database.

As a result, this test fails, even though the XML media type was already being guessed by mimetypes before:

    tests/test_messages.py:689 test_binary_content_from_path
    - assert binary_content == snapshot(BinaryContent(data=b'<think>about trains</think>', media_type='application/xml'))
    + assert binary_content == snapshot(BinaryContent(data=b'<think>about trains</think>', media_type='text/xml'))

The mimetypes library has an internal _db with newer types, but a fresh db doesn't contain those and uses the OS definitions.

I added the new application/xml mapping manually, but by doing it like this we risk also picking up other old OS definitions. I'm currently looking for a better solution; I cannot find a clean way of creating a new mimetypes db with exactly the same state as the module's internal db.

fedexman · 2025-12-10T02:02:14Z

Tests should be passing now, and I applied Viicos's suggestions. Let me know if you have an idea for a cleaner way to initialize the mimetypes db.

DouweM · 2025-12-10T17:58:52Z

@fedexman Can you have a look at the failing tests please?

Viicos · 2025-12-10T21:26:41Z

Even though the official docs says

Yeah, this is annoying and a documentation update is pending at python/cpython#27750.

One way to do it is to just replicate how the default db is instantiated:

    _mime_types = MimeTypes()
    # The default db is actually different from the ones we can manually instantiate
    # (see https://github.com/python/cpython/pull/27750). As such, replicate
    # what is being done in `mimetypes.init()`:
    _mime_types.read_windows_registry()
    for file in mimetypes.knownfiles:
        if os.path.isfile(file):
            _mime_types.read(file)
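For reference, a self-contained version of the snippet above with its imports might look like the following. The function name build_default_like_db is hypothetical, and the sketch assumes the module default is populated only from the Windows registry and the files listed in mimetypes.knownfiles.

```python
import mimetypes
import os
from mimetypes import MimeTypes


def build_default_like_db() -> MimeTypes:
    """Create a private MimeTypes instance populated like the module default."""
    db = MimeTypes()
    # Returns immediately on platforms without a Windows registry.
    db.read_windows_registry()
    # Load the same OS-level mime.types files that mimetypes.init() would read.
    for file in mimetypes.knownfiles:
        if os.path.isfile(file):
            db.read(file)
    return db


db = build_default_like_db()
# Compare the private db with the module-level guess for the extension
# discussed in this thread.
print(db.guess_type('data.xml')[0], mimetypes.guess_type('data.xml')[0])
```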

Viicos · 2025-12-12T08:12:13Z

pydantic_ai_slim/pydantic_ai/messages.py

+# Document types
+_mime_types.add_type('application/msword', '.doc')
+_mime_types.add_type('application/pdf', '.pdf')
+_mime_types.add_type('application/rtf', '.rtf')
+_mime_types.add_type('application/vnd.ms-excel', '.xls')
+_mime_types.add_type('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', '.xlsx')
+_mime_types.add_type('application/vnd.openxmlformats-officedocument.wordprocessingml.document', '.docx')
+_mime_types.add_type('text/markdown', '.mdx')
+_mime_types.add_type('text/plain', '.txt')
+_mime_types.add_type('text/x-asciidoc', '.asciidoc')
+
+# Image types
+_mime_types.add_type('image/gif', '.gif')
+_mime_types.add_type('image/jpeg', '.jpeg')
+_mime_types.add_type('image/jpeg', '.jpg')
+_mime_types.add_type('image/png', '.png')
+_mime_types.add_type('image/webp', '.webp')
+
+# Video types
+_mime_types.add_type('video/3gpp', '.three_gp')
+_mime_types.add_type('video/mp4', '.mp4')
+_mime_types.add_type('video/mpeg', '.mpeg')
+_mime_types.add_type('video/mpeg', '.mpg')
+_mime_types.add_type('video/quicktime', '.mov')
+_mime_types.add_type('video/webm', '.webm')
+_mime_types.add_type('video/x-flv', '.flv')
+_mime_types.add_type('video/x-matroska', '.mkv')
+_mime_types.add_type('video/x-ms-wmv', '.wmv')
+
+# Audio types
+_mime_types.add_type('audio/aac', '.aac')
+_mime_types.add_type('audio/aiff', '.aiff')
+_mime_types.add_type('audio/flac', '.flac')
+_mime_types.add_type('audio/mpeg', '.mp3')
+_mime_types.add_type('audio/ogg', '.oga')
+_mime_types.add_type('audio/wav', '.wav')


These are not required now, right?

I only added the ones not defined natively by Python, or the ones that have a different definition in Python.

https://github.com/python/cpython/blob/a183a11db8bc2520c52814635de2df118d2d7e8c/Lib/mimetypes.py#L434C1-L604C1

I'm not sure I understand: taking .mp4 as an example, it is defined in the mapping and maps to video/mp4?

Yes, but tests are failing on 3.10 even though we add video/mp4 explicitly and it is also defined in the Python library 😅 They pass locally, though, so it is difficult to test. I'll update once the tests pass.

OK, ignoring the query string is only supported from 3.11 😅
python/cpython#66543
On 3.10 it is not supported.

On 3.10 this code gives Result: (None, None)

    import mimetypes
    import sys

    url = "https://example.com/image.png?token=123"
    result = mimetypes.guess_type(url)
    print(f"Result: {result}")
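One way to see the version difference and a version-independent alternative side by side is sketched below. Splitting the URL with urllib.parse before calling guess_type() is an option raised in this thread, not necessarily what the PR ended up doing.

```python
import mimetypes
import sys
from urllib.parse import urlparse

url = 'https://example.com/image.png?token=123'

# Direct call: per this thread, (None, None) on 3.10 and a real guess on 3.11+.
print(sys.version_info[:2], mimetypes.guess_type(url))

# Guessing from the parsed path does not depend on how guess_type()
# treats query strings in the running Python version.
path = urlparse(url).path
print(mimetypes.guess_type(path))
```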

Sorry for not updating the PR. When we use a Python feature that is not supported in 3.10, what is the usual way to handle it? Does it mean we cannot use mimetypes for this task, or do we add a separate code path for 3.10? 😅
The purpose of this PR was to make things cleaner and support URLs with query strings by using a native Python feature, but if we cannot use it, this PR does not make much sense. I can switch to urllib.parse-based logic, though 🤔 @Viicos

Sorry, I should have come back here. Given that 3.10 is the oldest supported version and we will drop support in less than a year, I'd rather skip the test in this version. Prior to this PR, it wouldn't be supported in any Python version anyway, so I think this is fine.
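Skipping a test on older interpreters could look roughly like this. The sketch uses the standard library's unittest so that it is self-contained; the class name, test name, and URL are hypothetical, and the project's own test suite may express the skip differently.

```python
import mimetypes
import sys
import unittest


class QueryStringInferenceTest(unittest.TestCase):
    # Hypothetical test, for illustration only.
    @unittest.skipIf(
        sys.version_info < (3, 11),
        'mimetypes.guess_type() does not ignore query strings before 3.11',
    )
    def test_url_with_query_string(self) -> None:
        media_type, _ = mimetypes.guess_type('https://example.com/image.png?token=123')
        self.assertEqual(media_type, 'image/png')
```

On 3.10 the test is reported as skipped rather than failed, which matches the trade-off described above.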

With that in mind, are the manually added mime types still necessary?

Yes, I realized while debugging that I may have added unnecessary types. I redid the check against https://github.com/python/cpython/blob/3.10/Lib/mimetypes.py, and now there should no longer be any duplicates with native Python.
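A check like the one described can be approximated programmatically. The sketch below compares a candidate mapping against a fresh MimeTypes instance, which starts from the defaults built into the running Python; the candidate entries are a few examples from the diff in this thread, and the result varies by Python version.

```python
from mimetypes import MimeTypes

# Hypothetical candidate list; a few entries from the diff earlier in the thread.
candidates = {
    '.png': 'image/png',
    '.mkv': 'video/x-matroska',
    '.three_gp': 'video/3gpp',
}

# A fresh instance holds only the defaults built into the running Python,
# without entries read from OS files.
builtin_db = MimeTypes()

# Entries the interpreter already maps to the same media type.
already_builtin = {
    ext: media_type
    for ext, media_type in candidates.items()
    if builtin_db.guess_type(f'file{ext}')[0] == media_type
}
# Entries that would still need an explicit add_type() call.
still_needed = {ext: t for ext, t in candidates.items() if ext not in already_builtin}

print('already built in:', sorted(already_builtin))
print('still needed:', sorted(still_needed))
```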

fedexman · 2025-12-22T09:45:23Z

good for rereview @Viicos

Viicos

Thanks @fedexman 🙏

Felix R added 3 commits November 21, 2025 13:13

chore: Parse the url before inferring type

8dd4492

test: verify the url handles query strings

0e3c089

chore: remove comments

38db162

Viicos reviewed Nov 21, 2025

DouweM requested changes Nov 21, 2025

DouweM self-assigned this Nov 21, 2025

DouweM added the awaiting author revision label Nov 21, 2025

github-actions bot added the Stale label Nov 29, 2025

DouweM removed the Stale label Dec 1, 2025

Feat use mimetypes _infer_media_type

a92874d

fedexman requested a review from DouweM December 3, 2025 13:44

Viicos reviewed Dec 3, 2025

Feat create new module level mime db

d1a7670

Viicos reviewed Dec 6, 2025

Viicos reviewed Dec 6, 2025

h0rv mentioned this pull request Dec 9, 2025

feat: support inferring media types from URLs with query params/fragments #3686

Closed

Felix R added 3 commits December 10, 2025 10:50

Chore keep only one simple test

c6826e3

Chore add all the necessary types

eae8d5d

Chore hostname being None is fine

03d75f1

Fix simulate mimetypes.init() manually

2d0eeee

Viicos reviewed Dec 12, 2025

Viicos reviewed Dec 16, 2025

DouweM mentioned this pull request Dec 19, 2025

BinaryContent.from_path incorrectly infers media type for (some) text files #3776

Open

fedexman and others added 5 commits December 22, 2025 14:34

Update tests/test_messages.py

10d9dca

Co-authored-by: Victorien <65306057+Viicos@users.noreply.github.com>

chore: add md for 3.10

671191d

chore: fix syntax

abb96ef

chore: remove all duplicates with native python

fded364

chore: format

377245f

Viicos reviewed Dec 22, 2025

Update pydantic_ai_slim/pydantic_ai/messages.py

ec5608a

Viicos approved these changes Dec 22, 2025

Viicos merged commit c9f5410 into pydantic:main Dec 22, 2025
31 checks passed

dsfaccini mentioned this pull request Dec 23, 2025

hotfix: Register audio/aac mimetype #3829

Merged
