* Add Response(..., default_encoding=...)
* Add tests for Response(..., default_encoding=...)
* Add Client(..., default_encoding=...)
* Switch default encoding to 'utf-8' instead of 'autodetect'
* Make charset_normalizer an optional dependency, not a mandatory one.
* Documentation
* Use callable for default_encoding
* Update tests for new charset autodetection API
* Update docs for new charset autodetection API
* Update requirements
* Drop charset_normalizer from requirements

For a list of all available client parameters, see the [`Client`](api.md#client) API reference.

---
## Character set encodings and auto-detection

When accessing `response.text`, we need to decode the response bytes into a unicode text representation.

By default `httpx` will use `"charset"` information included in the response `Content-Type` header to determine how the response bytes should be decoded into text.

In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.
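The decision rule described here can be sketched with the standard library alone. This is an illustration of the idea, not `httpx`'s implementation: the helper name `pick_encoding` is hypothetical, and `email.message.Message` is used only because it can parse `Content-Type` parameters.

```python
from email.message import Message


def pick_encoding(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or a default.

    Hypothetical helper for illustration; not part of httpx.
    """
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or None when the header carries no charset.
    return message.get_content_charset() or default


# A made-up response body, encoded the way the header claims.
body = "café".encode("iso-8859-1")

with_charset = pick_encoding("text/plain; charset=ISO-8859-1")
without_charset = pick_encoding("text/plain")

# Decode the bytes using whichever encoding was selected.
text = body.decode(with_charset)
```

When the header names a charset it wins; otherwise the fallback applies, which mirrors the "utf-8" default described in this section.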
### Using the default encoding

To understand this better, let's start by looking at the default behaviour for text decoding:
```python
import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type header, or else "utf-8".
print(response.text)      # The text will either be decoded with the Content-Type
                          # charset, or using "utf-8".
```
This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is so widely adopted.
### Using an explicit encoding
In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.
```python
import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type header, or else "shift-jis".
print(response.text)      # The text will either be decoded with the Content-Type
                          # charset, or using "shift-jis".
```
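To see why the default matters, consider what happens to Shift JIS bytes when no charset is declared. The sketch below uses only `bytes.decode` from the standard library with a made-up payload. It stands in for the decoding step and does not reproduce what `httpx` does internally.

```python
# Bytes as a server might send them: Shift JIS encoded, no charset declared.
payload = "こんにちは".encode("shift-jis")

# Strict decoding with the wrong encoding fails outright...
try:
    payload.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ...while the encoding we know the site uses round-trips cleanly.
decoded = payload.decode("shift-jis")

print(utf8_ok, decoded)
```

Setting the known encoding as the client default avoids having to repeat this knowledge at every call site.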
### Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

To use auto-detection you need to set the `default_encoding` argument to a callable instead of a string. This callable should be a function which takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.
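The callable's contract is small: bytes in, encoding name out. As a minimal sketch of that shape, the hypothetical function below tries strict UTF-8 first and otherwise falls back to `cp1252`. Both the name `guess_encoding` and the fallback choice are assumptions made for illustration; a real detector inspects the bytes far more carefully.

```python
def guess_encoding(content: bytes) -> str:
    """Return an encoding name for the given bytes (illustrative heuristic only)."""
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as Windows-1252.
        return "cp1252"
    return "utf-8"


# The same word, encoded two different ways.
utf8_guess = guess_encoding("naïve".encode("utf-8"))
legacy_guess = guess_encoding("naïve".encode("cp1252"))
```

Following the description above, a function of this shape would be passed as `default_encoding=guess_encoding` when constructing the client.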

There are two widely used Python packages which both handle this functionality:

* [`chardet`](https://chardet.readthedocs.io/) - This is a well established package, and is a port of [the auto-detection code in Mozilla](https://www-archive.mozilla.org/projects/intl/chardet.html).
* [`charset-normalizer`](https://charset-normalizer.readthedocs.io/) - A newer package, motivated by `chardet`, with a different approach.

Let's take a look at installing autodetection using one of these packages...
```shell
$ pip install httpx
$ pip install chardet
```
Once `chardet` is installed, we can configure a client to use character-set autodetection.
```python
import httpx
import chardet

def autodetect(content):
    return chardet.detect(content).get("encoding")

# Using a client with character-set autodetection enabled.
client = httpx.Client(default_encoding=autodetect)
```