
I have a directory with ~10,000 image files from an external source.

Many of the filenames contain spaces and punctuation marks that are not DB-friendly or Web-friendly. I also want to append a SKU number to the end of every filename (for accounting purposes). Many, if not most, of the filenames also contain extended Latin characters, which I want to keep for SEO purposes (specifically so the filenames accurately represent the file contents in Google Images).

I have made a bash script which renames (copies) all the files to my desired result. The bash script is saved in UTF-8. After running, it skips approximately 500 of the files (unable to stat file...).
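For illustration, the renaming amounts to roughly this (a Python sketch of the same idea; the SKU value is a placeholder for my real accounting lookup, and it assumes the names already decode as UTF-8, which is exactly where the stragglers fail):

# -*- coding: utf-8 -*-
import os
import re
import shutil

SKU = 'SKU12345'  # placeholder -- the real SKU comes from my accounting data

for name in os.listdir(u'.'):  # unicode listing, so \w matches é, ö, ...
    base, ext = os.path.splitext(name)
    # Collapse spaces/punctuation to '-', keeping extended Latin letters.
    base = re.sub(r'[^\w-]+', u'-', base, flags=re.UNICODE).strip(u'-')
    new = u'%s-%s%s' % (base, SKU, ext)
    if new != name:
        shutil.copy(name, new)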

I have run convmv -f UTF-8 -t UTF-8 on the directory, and discovered that these 500 filenames are not encoded in UTF-8 (convmv is able to detect and ignore filenames already in UTF-8).

Is there an easy way I can find out which language encoding they are currently using?

The only way I've been able to figure it out myself is by setting my terminal encoding to UTF-8, then iterating through all the likely candidate encodings with convmv until it displays a converted name that 'looks right'. I have no way to be certain that these 500 files all use the same encoding, so I would need to repeat this process 500 times. I would like a more automated method than 'looks right'!
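In effect, what I'm doing by hand is something like this (a Python sketch; the candidate list is just a guess at likely encodings):

import os

# Illustrative only: the encodings I would otherwise try one by one with convmv.
candidates = ['iso-8859-1', 'iso-8859-15', 'cp1252']

for name in os.listdir('.'):
    try:
        name.decode('utf-8')      # already valid UTF-8; convmv skips these too
        continue
    except UnicodeDecodeError:
        pass
    for enc in candidates:
        try:
            print '%s as %s -> %s' % (name, enc, name.decode(enc).encode('utf-8'))
        except UnicodeDecodeError:
            pass

Every candidate that decodes cleanly still has to be judged by eye, which is the part I want automated.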

3 Answers

16

There's no 100% accurate way, really, but there is a way to get a good guess.

There is a Python library, chardet, which is available here: https://pypi.python.org/pypi/chardet

e.g.

See what the current LANG variable is set to:

$ echo $LANG
en_IE.UTF-8

Create a filename that'll need to be encoded with UTF-8:

$ touch mÉ.txt 

Change our encoding and see what happens when we try to list it:

$ ls m*
mÉ.txt
$ export LANG=C
$ ls m*
m??.txt

OK, so now we have a filename encoded in UTF-8 and our current locale is C (standard Unix codepage).

So start up python, import chardet and get it to read the filename. I'm using some shell globbing (i.e. expansion through the * wildcard character) to get my file. Change "ls m*" to whatever will match one of your example files.

>>> import chardet
>>> import os
>>> chardet.detect(os.popen("ls m*").read())
{'confidence': 0.505, 'encoding': 'utf-8'}

As you can see, it's only a guess; the "confidence" value shows how good a guess it is.
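Once you trust the guess, you can use it to re-encode the name in the same session, e.g. (a sketch, assuming detect() actually produced a usable encoding; it can return None for the encoding when it has no idea):

>>> raw = os.popen("ls m*").read().strip()
>>> enc = chardet.detect(raw)['encoding']
>>> os.rename(raw, raw.decode(enc).encode('utf-8'))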

  • The script works as described, but in my case chardet didn't find the file's encoding. Commented Feb 1, 2012 at 10:31
8

You may find this useful for testing the current working directory (Python 2.7):

import chardet
import os

for n in os.listdir('.'):
    d = chardet.detect(n)
    print '%s => %s (%s)' % (n, d['encoding'], d['confidence'])

Result looks like:

Vorlagen => ascii (1.0)
examples.desktop => ascii (1.0)
Öffentlich => ISO-8859-2 (0.755682154041)
Videos => ascii (1.0)
.bash_history => ascii (1.0)
Arbeitsfläche => EUC-KR (0.99)

To recurse through the tree from the current directory, cut and paste this into a little Python script:

#!/usr/bin/python
import chardet
import os

for root, dirs, names in os.walk('.'):
    print root
    for n in names:
        d = chardet.detect(n)
        print '%s => %s (%s)' % (n, d['encoding'], d['confidence'])
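If you only care about the problem files, a variant along the same lines first skips anything that already decodes as UTF-8 (a sketch, mirroring what convmv does):

#!/usr/bin/python
import chardet
import os

for root, dirs, names in os.walk('.'):
    for n in names:
        try:
            n.decode('utf-8')     # already valid UTF-8 -- skip it
        except UnicodeDecodeError:
            d = chardet.detect(n)
            print '%s => %s (%s)' % (n, d['encoding'], d['confidence'])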
  • Does that work with Asian encodings too? Or is it Eurocentric? Commented Aug 31, 2012 at 17:29
4

Landing here in 2021 using Python 3, I found @philip-reynoldsn and @klaus-kappel's answers useful but no longer functional, as chardet.detect() expects a bytes-like object. I slightly edited the code to get the encoding of all files in the current working directory, as follows:

import os
import chardet

for n in os.listdir('.'):
    print(n, chardet.detect(os.fsencode(n)))
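Alternatively (a variation, not from the original answers), passing bytes to os.listdir() makes it return the raw filename bytes directly, skipping the round-trip through the filesystem encoding:

import os
import chardet

# A bytes argument makes os.listdir() return raw bytes entries.
for n in os.listdir(b'.'):
    guess = chardet.detect(n)
    print(n, '=>', guess['encoding'], '(%s)' % guess['confidence'])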
