This repository was archived by the owner on Jun 1, 2022. It is now read-only.

sfarbotka
Contributor

In Python 3.x, PyUnicode_FromString() accepts only UTF-8 encoded strings.
However, the country_code, country_name, and country_continent values are all ISO-8859-1 encoded.
This commit fixes the issue.

Before fix:

Python 3.4.1 (default, Aug 21 2014, 16:21:32)
[GCC 4.6.3] on linux
>>> import GeoIP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 4: invalid continuation byte

After fix:

Python 3.4.1 (default, Aug 21 2014, 16:20:07)
[GCC 4.6.3] on linux
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Curaçao'

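For illustration only (this is not part of the patch), the decode failure can be reproduced in pure Python without the GeoIP extension. The sketch assumes the country name reaches the decoder as the ISO-8859-1 bytes of 'Curaçao'; the names `latin1_bytes` and `try_utf8` are hypothetical.

```python
# Assumed input: the ISO-8859-1 bytes for 'Curaçao', as an
# ISO-8859-1 name table would hold them.
latin1_bytes = "Curaçao".encode("latin-1")  # b'Cura\xe7ao'


def try_utf8(raw):
    """Decode raw bytes as UTF-8; return (text, error) instead of raising."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_utf8(latin1_bytes)

# 0xe7 looks like the lead byte of a three-byte UTF-8 sequence, but the
# next byte is a plain ASCII 'a', so strict decoding fails at index 4.
if err is not None:
    print(err.start, hex(latin1_bytes[err.start]), err.reason)
```

The reported position and byte match the traceback above: offset 4 is the 'ç', which ISO-8859-1 stores as the single byte 0xe7.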
@oschwald
Member

We already set the character set for the C API. The strings coming from it should be UTF-8.

I can't reproduce your issue:

Python 3.4.1 (default, Jul 27 2014, 17:47:19)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Curacao'

What version of libGeoIP are you using? If any of those functions are returning ISO-8859-1, that sounds like a bug in libGeoIP.

@oschwald
Member

I took a closer look at the code in question, and I think the right fix is to populate from the UTF-8 country name array in libGeoIP.

@sfarbotka
Contributor Author

My system details:

Python 3.4.1 (default, Aug 21 2014, 16:21:32)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.lib_version()
'1.4.8'

$ uname -a
Linux linuxhost 3.4.79 #6 SMP PREEMPT Fri Feb 14 23:58:54 CST 2014 armv7l GNU/Linux
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
$ cat /etc/issue
Debian GNU/Linux 7 \n \l

My fixed version of GeoIP prints 'Curaçao', while your output is 'Curacao'. The fifth characters differ.
Also, when I use the unfixed version of GeoIP in Python 2.7, print skips the fifth character in my locale:

Python 2.7.3 (default, Mar 14 2014, 17:55:54)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import GeoIP
>>> GeoIP.country_names['CW']
'Cura\xe7ao'
>>> print GeoIP.country_names['CW']
Curao

I pushed updates. The code now uses GeoIP_utf8_country_name to populate the dictionary.

@oschwald
Member

Thanks. I merged this. I did change it to use GeoIP_country_name for Python 2 since people may be expecting latin1 there.

@oschwald closed this on Aug 22, 2014
@oschwald
Member

I also released 1.3.2 with this fix.
