- Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix media type inference for URLs with query parameters #3501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix media type inference for URLs with query parameters #3501
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why aren't we using the mimetypes stdlib module? mimetypes.guess_type() already parses URLs and the current implementation doesn't take into account case insensitivity, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Viicos Interestingly we already use that in DocumentUrl._infer_media_type, after checking a bunch of types ourselves :/
@fedexman Can you see if we can use mimetypes.guess_type() for all of these?
The method can be changed to just return str rather than XMediaType, as I don't think that type is used on any public fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Viicos Interestingly we already use that in DocumentUrl._infer_media_type, after checking a bunch of types ourselves :/
@fedexman Can you see if we can use mimetypes.guess_type() for all of these?
The method can be changed to just return str rather than XMediaType, as I don't think that type is used on any public fields.
| This PR is stale, and will be closed in 3 days if no reply is received. |
| I'll update this week 🙏🙏 |
| I refactored using mimetypes. Some types are defined in the standard library, if not, they use some os files that is dependent of the machine. For all the types not in the standard library I added them manually to have reliable behavior. |
| Sorry for slow updates, I've used Mimetypes library in the past and it hides a lot of unexpected behaviors. import mimetypes filename = "data.xml" global_guess, _ = mimetypes.guess_type(filename) db = mimetypes.MimeTypes() db_guess, _ = db.guess_type(filename) print(f"Global: {global_guess}") print(f"Fresh: {db_guess}") assert global_guess == db_guessOutput Even though the official docs says
I get this test failing even though the xml test was guessed by this before The mimetypes library has a internal _db with newer types, but using a fresh db doesn't contains those and use the os definitions. I added the new |
| Now tests should be passing and I added Vico's suggestions. Tell me if you have an idea of cleaner solution for the mime type db init |
| @fedexman Can you have a look at the failing tests please? |
Yeah, this is annoying and a documentation update is pending at python/cpython#27750. One way to do it is to just replicate how the default db is instantiated: _mime_types = MimeTypes() # The default db is actually different from the ones we can manually instantiate # (see https://github.com/python/cpython/pull/27750). As such, replicate # what is being done in `mimetypes.init()`: _mime_types.read_windows_registry() for file in mimetypes.knownfiles: if os.path.isfile(file): _mime_types.read(file) |
| # Document types | ||
| _mime_types.add_type('application/msword', '.doc') | ||
| _mime_types.add_type('application/pdf', '.pdf') | ||
| _mime_types.add_type('application/rtf', '.rtf') | ||
| _mime_types.add_type('application/vnd.ms-excel', '.xls') | ||
| _mime_types.add_type('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', '.xlsx') | ||
| _mime_types.add_type('application/vnd.openxmlformats-officedocument.wordprocessingml.document', '.docx') | ||
| _mime_types.add_type('text/markdown', '.mdx') | ||
| _mime_types.add_type('text/plain', '.txt') | ||
| _mime_types.add_type('text/x-asciidoc', '.asciidoc') | ||
| | ||
| # Image types | ||
| _mime_types.add_type('image/gif', '.gif') | ||
| _mime_types.add_type('image/jpeg', '.jpeg') | ||
| _mime_types.add_type('image/jpeg', '.jpg') | ||
| _mime_types.add_type('image/png', '.png') | ||
| _mime_types.add_type('image/webp', '.webp') | ||
| | ||
| # Video types | ||
| _mime_types.add_type('video/3gpp', '.three_gp') | ||
| _mime_types.add_type('video/mp4', '.mp4') | ||
| _mime_types.add_type('video/mpeg', '.mpeg') | ||
| _mime_types.add_type('video/mpeg', '.mpg') | ||
| _mime_types.add_type('video/quicktime', '.mov') | ||
| _mime_types.add_type('video/webm', '.webm') | ||
| _mime_types.add_type('video/x-flv', '.flv') | ||
| _mime_types.add_type('video/x-matroska', '.mkv') | ||
| _mime_types.add_type('video/x-ms-wmv', '.wmv') | ||
| | ||
| # Audio types | ||
| _mime_types.add_type('audio/aac', '.aac') | ||
| _mime_types.add_type('audio/aiff', '.aiff') | ||
| _mime_types.add_type('audio/flac', '.flac') | ||
| _mime_types.add_type('audio/mpeg', '.mp3') | ||
| _mime_types.add_type('audio/ogg', '.oga') | ||
| _mime_types.add_type('audio/wav', '.wav') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not required now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only added the one not defined natively by python or the one that have a different definition in python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand: taking .mp4 as an example, it is defined in the mapping and maps to video/mp4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but test are failing on 3.10 even though we add video/mp4 explicitly and it is also added in the python lib 😅 passing locally though so difficult to test. I'll update with test passing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok the query string ignore was supported only from 3.11 😅
python/cpython#66543
on 3.10 it is not supported
in 3.10 this code gives Result: (None, None)
import mimetypes import sys url = "https://example.com/image.png?token=123" result = mimetypes.guess_type(url) print(f"Result: {result}") There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not updating the PR. In case we are using a python feature that is not supported in 3.10 what is the common way to do, does it mean we cannot use mimetype for this task, or do we add a new branch logic for 3.10 😅 what is the usual solution for these cases ?
the purpose of this PR was to make things cleaner and support query types by using a native python feature, but if we cannot use it, this PR does not make so much sense. I can switch to a urllib.parse logic though 🤔 @Viicos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry should have come back here. Given that 3.10 is the oldest supported version and we will drop support in less than a year, I'd rather skip the test in this version. Prior to this PR, it wouldn't be supported in any Python version anyway, so I think this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With that in mind, are the manually added mime types still necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i realized during debugging i may have added unnecessary types, i remade a check with https://github.com/python/cpython/blob/3.10/Lib/mimetypes.py and now it should not have any duplicates anymore with native python
Co-authored-by: Victorien <65306057+Viicos@users.noreply.github.com>
| good for rereview @Viicos |
Viicos left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fedexman 🙏
Fix media type inference for URLs with query parameters
When using presigned URLs (eg AWS S3) with
ImageUrl,AudioUrl, orVideoUrl, the media type inference failsto the agent we get this error,
the reason is that the
_infer_media_typefunction only check the end of the url withurl.endswith('.mkv')but do not parse the url.I propose to parse the url with