90

I have files with invalid characters like these

009_-_�%86ndringshåndtering.html 

It should be an Æ, but something has gone wrong in the filename.

Is there a way to just remove all invalid characters?

Or could tr be used somehow?

echo "009_-_�%86ndringshåndtering.html" | tr ??? 
2
  • 6
    The characters probably aren't "invalid", else the filesystem wouldn't store them (unless you did something really nasty to the FS). Have you tried changing your locale (e.g. to UTF8) to display the names correctly? Commented Jan 10, 2012 at 14:29
  • Something really nasty like cp -r /mnt/broken_but_mountable_old_flash_disk/ /some/dir can actually happen very easily leading to undeletable files. To save time trying, the perl answer below does work on those: serverfault.com/a/348496/327691 Commented Sep 15, 2021 at 21:24

11 Answers

64

One way would be with sed:

mv 'file' "$(echo 'file' | sed -e 's/[^A-Za-z0-9._-]/_/g')"

Replace file with your filename, of course. This will replace anything that isn't a letter, number, period, underscore, or dash with an underscore. You can add or remove characters to keep as you like, and/or change the replacement character to anything else, or nothing at all.
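A hedged sketch of the same substitution wrapped in a function, in case it has to be applied to many names. The name sanitize_name is hypothetical, and pinning LC_ALL=C is an assumption on my part: it makes sed treat the input as plain bytes, so the A-Z, a-z and 0-9 ranges cover ASCII only and every byte of an accented or damaged character is replaced.

```shell
#!/usr/bin/env bash
# sanitize_name is a hypothetical helper: it prints a cleaned-up version of
# its argument and does not touch any file.
sanitize_name() {
    # LC_ALL=C: match byte by byte, so the ranges below are ASCII-only.
    printf '%s' "$1" | LC_ALL=C sed -e 's/[^A-Za-z0-9._-]/_/g'
}
```

It could then be used as mv -- "$f" "$(sanitize_name "$f")". Note that a two-byte UTF-8 character such as å becomes two underscores under this assumption.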

7
  • 9
    I used: f='file'; mv 'file' ${f//[^A-Za-z0-9._-]/_} Commented Oct 7, 2015 at 15:05
  • 2
    Look for the best solution by H. Hess below... (and my funny comment alongside :) ) Commented Feb 14, 2019 at 15:20
  • 3
    This will fail miserably on accented characters. Also on anything else than ascii. Definitely not the solution for the original question. Commented Jan 3, 2021 at 12:00
  • 1
    This is a great observation by @grin. The solution I offered naively assumes the C locale, which uses the literal byte values of characters for collating. ASCII tends to form the basis of most western character sets, and it was adopted into Unicode with the same byte values. In ASCII, the byte values of the letters A through Z are sequential, as are a to z and 0 to 9. However, other character sets have different collating rules. UTF-8, which is now a pretty common default, includes accented characters in those ranges, so a-z would include ä. Commented Sep 13, 2021 at 3:40
  • 1
    Unfortunately this doesn't handle corrupted characters (filename copied from broken filesystem, looks something like ''$'\265''0ADE9~3.JPG). I got it sorted only by using perl from answer below: serverfault.com/a/348496/327691 Commented Sep 15, 2021 at 21:19
76

I had some Japanese files with broken filenames, recovered from a broken USB stick, and the solutions above didn't work for me.

I recommend the detox package:

The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It'll also translate or cleanup Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI escaped characters.

Example usage:

detox -r -v /path/to/your/files

 -r Recurse into subdirectories
 -v Be verbose about which files are being renamed
 -n Can be used for a dry run (only show what would be changed)
8
  • 10
    This should be much higher, I urge everyone to have a look at detox before essentially reinventing the wheel. If you look at the man page, you will see that it covers all the other proposed solutions here because of its flexibility. Commented Apr 10, 2018 at 8:04
  • 5
    Ezekiel 25:17 - Blessed is he who, in the name of charity and good will upvotes this solution, for he is truly his brother's keeper and the finder of lost children. Commented Feb 14, 2019 at 15:18
  • 6
    Unintuitively, the path can not be '.' in debian. If you use a '.' it finds nothing. Commented Sep 10, 2019 at 18:19
  • 3
    I wonder if it really works; it seems to remove/replace Chinese characters, e.g. 的节奏啊, but those characters are valid in filenames. Commented Sep 11, 2019 at 19:50
  • 4
    be careful with this tool. it's pretty aggressive. it even changes spaces into underscores :/ it also renames __init__.py to init_.py Commented Dec 20, 2020 at 16:35
49

I assume you are on a Linux box and the files were created on a Windows box. Linux systems normally use UTF-8 as the character encoding for filenames, while Windows uses something else. I think this is the cause of the problem.

I would use "convmv". This is a tool that can convert filenames from one character encoding to another. For Western Europe one of these normally works:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .

If you need to install it on a Debian based Linux you can do so by running:

sudo apt-get install convmv 

It works for me every time and it does recover the original filename.
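If it is unclear which source encoding applies, one option is to preview how a name would decode under each candidate before renaming anything. The following is only a sketch: preview_encodings is a hypothetical helper, the three encoding names are the spellings I assume glibc's iconv accepts for the encodings listed above, and nothing is renamed.

```shell
#!/usr/bin/env bash
# preview_encodings is a hypothetical helper: it only prints what the given
# name would look like if its bytes were decoded from each candidate encoding.
preview_encodings() {
    local name=$1 enc decoded
    for enc in WINDOWS-1252 ISO-8859-1 CP850; do
        # The assignment's exit status is iconv's, the last command in the pipe.
        if decoded=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s\t%s\n' "$enc" "$decoded"
        else
            printf '%s\t(not valid in this encoding)\n' "$enc"
        fi
    done
}
```

Whichever line shows the name as it was meant to read indicates the encoding to pass to convmv with -f.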

Source: LeaseWebLabs

4
  • 1
    this looks promising, but any idea how to tell what the encoding is? I have a directory called Save the current file in Word 97-2004 format\sco.workflow that got created on my Mac (via Microsoft Office) and the above encodings don't have any effect. Commented Dec 7, 2016 at 6:49
  • 2
    It's worth pointing out that by default convmv runs in "test" mode, where it just performs a dry run and tells you which files it would move. It will then tell you to run it again with the --notest option to actually rename the files. Commented Jan 28, 2019 at 10:47
  • 1
    This program does not deaccent accented characters. If you try -f utf8 -t ascii it will just complain that it cannot represent the characters in ASCII and refuse to do anything. Commented Apr 25, 2024 at 14:45
  • @SzczepanHołyszewski that is how it should be. If you wanted iconv //TRANSLIT behavior, see here: stackoverflow.com/questions/9930484/… Commented Apr 30, 2024 at 14:11
23

I assume you mean you want to traverse the filesystem and fix all such files?

Here's the way I'd do it

find /path/to/files -type f -print0 | \
perl -n0e '$new = $_; if($new =~ s/[^[:ascii:]]/_/g) { print("Renaming $_ to $new\n"); rename($_, $new); }'

That finds all files with non-ASCII characters and replaces those characters with underscores (_). Use caution, though: if a file with the new name already exists, it will be overwritten. The script can be modified to check for that case, but I didn't put that in, to keep it simple.
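For illustration, here is one way the existence check could look, written as a bash sketch rather than as a change to the Perl one-liner. The function names ascii_name and rename_non_ascii are hypothetical, and unlike the one-liner this version only rewrites the last path component, so non-ASCII directory names are left alone. It uses tr, which works byte by byte, for the replacement.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: replace non-ASCII bytes in file names with "_",
# refusing to overwrite an existing file.

ascii_name() {
    # -c complements the set: every byte outside 0-127 becomes one underscore.
    printf '%s' "$1" | LC_ALL=C tr -c '\000-\177' '_'
}

rename_non_ascii() {
    local root=$1 path dir base new
    find "$root" -type f -print0 |
    while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        new=$(ascii_name "$base")
        # Nothing to do when the name is already pure ASCII.
        [ "$base" = "$new" ] && continue
        if [ -e "$dir/$new" ]; then
            echo "Skipping $path: $dir/$new already exists" >&2
            continue
        fi
        echo "Renaming $path to $dir/$new"
        mv -- "$path" "$dir/$new"
    done
}
```

Calling rename_non_ascii /path/to/files would then skip, and report on stderr, any file whose cleaned-up name is already taken.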

2
  • 2
    the only solution that helped me... Is perl underrated? 🤔 Commented Sep 9, 2021 at 19:23
  • 1
    I can confirm, that only this one helped with actually corrupted characters, copied from broken flash drive. Commented Sep 15, 2021 at 21:25
18

Following the answers at https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters, you can use:

rename 's/[^\x00-\x7F]//g' * 

where * matches the files you want to rename. If you want to do it over multiple directories, you can do something like:

find . -exec rename 's/[^\x00-\x7F]//g' "{}" \; 

You can use the -n argument to rename to do a dry run, and see what would be changed, without changing it.

5
  • Is there a way to modify this to keep foreign characters such as ü and ä for example? Commented Feb 6, 2016 at 22:51
  • Only the second one worked for me. Everything was in the same directory so I'm not sure what's the difference..? Commented Mar 9, 2017 at 14:51
  • 1
    @Shautieh: the -n stops it from actually running. I'll clarify the answer. Commented Mar 13, 2017 at 5:38
  • rename can be slow when dealing with lots of files. If you want to speed this up, push the check into find. I'm not sure how to do that though. Commented Sep 10, 2019 at 18:13
  • This was the one to help me - detox, as nice as it sounded, just errored out with "unsupported unicode length" exactly on the files I wished it fixed :) Commented Jan 29, 2022 at 10:49
7

This shell script sanitizes a directory recursively, to make files portable between Linux/Windows and FAT/NTFS/exFAT. It removes control characters, /:*?"<>\| and some reserved Windows names like COM0.

sanitize() {
  shopt -s extglob;
  filename=$(basename "$1")
  directory=$(dirname "$1")
  filename_clean=$(echo "$filename" | sed -e 's/[\\/:\*\?"<>\|\x01-\x1F\x7F]//g' -e 's/^\(nul\|prn\|con\|lpt[0-9]\|com[0-9]\|aux\)\(\.\|$\)//i' -e 's/^\.*$//' -e 's/^$/NONAME/')
  if (test "$filename" != "$filename_clean")
  then
    mv -v "$1" "$directory/$filename_clean"
  fi
}

export -f sanitize

sanitize_dir() {
  find "$1" -depth -exec bash -c 'sanitize "$0"' {} \;
}

sanitize_dir '/path/to/somewhere'

Linux is less restrictive in theory (/ and \0 are strictly forbidden in filenames) but in practice several characters interfere with bash commands (like *...) so they should also be avoided in filenames.
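To see which names are affected before renaming anything, a check-only function can be handy. This is a rough sketch under my own assumptions: is_portable_name is a hypothetical name, the character set and reserved device names are the ones targeted by the script above, and the ${name^^} expansion needs bash 4 or later.

```shell
#!/usr/bin/env bash
# is_portable_name is a hypothetical helper: it returns 0 when a single path
# component looks safe for FAT/NTFS/exFAT and 1 otherwise. It changes nothing.
is_portable_name() {
    local name=$1 stem
    local bad='[\\/:*?"<>|]'   # characters Windows filesystems reject
    # Unquoted $bad is used as a glob pattern inside [[ ... ]].
    [[ $name == *$bad* ]] && return 1
    # ASCII control characters (tab, newline, DEL, ...)
    [[ $name == *[[:cntrl:]]* ]] && return 1
    stem=${name^^}      # uppercase, for a case-insensitive comparison
    stem=${stem%%.*}    # device names stay reserved whatever the extension
    case $stem in
        NUL|PRN|CON|AUX|COM[0-9]|LPT[0-9]) return 1 ;;
    esac
    return 0
}
```

A loop such as for f in *; do is_portable_name "$f" || echo "$f"; done would then list the problematic names in the current directory.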

Great sources for file naming restrictions:

1
  • 1
    This is what I was looking for! But add quotes to support directories with spaces: find "$1" -depth -exec bash -c 'sanitize "$0"' {} \; Commented May 22, 2017 at 14:02
3

I use this one-liner to remove invalid characters in subtitle files:

for f in *.srt; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g;s/\.\.\././g;s/\.\././g'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done 
  1. Only processes *.srt files (* could be used in place of *.srt to process every file)
  2. Replaces every character except letters A-Za-z, digits 0-9, periods "." and dashes "-" with a period
  3. Collapses double or triple periods into a single period
  4. Checks whether the file name needs changing
  5. If so, renames the file with the mv command, then prints the new name with the echo command

It also works for normalizing the directory names of movies:

for f in */; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g' -e 's/\.\.\././g' -e 's/\.\././g' -e 's/\.*$//'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done 

Same steps as above, but I added one more sed command to remove any periods at the end of the directory name.

X-Men Days of Future Past (2014) [1080p]
Modified to:
X-Men.Days.of.Future.Past.2014.1080p
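As a possible simplification, the fixed-length substitutions for double and triple periods can be folded into a single extended-regex rule that collapses a run of any length. This is a sketch, not the one-liner from this answer: dot_name is a hypothetical name, and it only prints the new name instead of renaming anything.

```shell
#!/usr/bin/env bash
# dot_name is a hypothetical helper that prints the dotted form of a name.
dot_name() {
    printf '%s' "$1" | LC_ALL=C sed -E \
        -e 's/[^A-Za-z0-9.-]/./g' \
        -e 's/\.+/./g' \
        -e 's/\.$//'
    # 1st rule: anything but letters, digits, "." and "-" becomes "."
    # 2nd rule: a run of periods of any length collapses to one
    # 3rd rule: a single trailing period, if any, is dropped
}
```

With the directory name from the example, dot_name 'X-Men Days of Future Past (2014) [1080p]' prints X-Men.Days.of.Future.Past.2014.1080p.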

0
1

If you want to handle embedded newlines, multibyte characters, spaces, leading dashes and backslashes, you are going to need something more robust; see this answer:
https://superuser.com/a/858671/365691

I put the script up on code.google.com if anyone is interested: r-n-f-bash-rename-script

0
1

I know this is a bit old, but I recently discovered that translate-shell, a command-line front end for Google Translate, really helps with foreign-language file names that choke on Unicode. It allows batch renaming with translation in the shell.

$ echo скачать | trans -b
download

https://github.com/soimort/translate-shell

[UPDATE] The Google Translate API tends to block you if you hit it too many times, but I also found a convenient local option called uconv that converts between alphabets. It transliterates phonetically rather than translating:

$ echo скачать | uconv -x 'Any-Latin;Latin-ASCII'
skacat'
1

This is loosely based on @KrisWebDev's search string.

  • doesn't touch files/dirs; creates a batch list to review instead
  • goes via a two-stage temp file (this is faster on my machine)
  • handles more edge cases for Samba (trailing/leading spaces)
  • shows a basic progress indicator

Note: "already exists" errors may occur during the actual rename; these have to be resolved manually.

# tested on: bash linux
# needs: bc
# this function doesn't change files on its own
sanitize_dir() {
  rm -f /tmp/filenames_toreview_$$.txt
  touch /tmp/filenames_toreview_$$.txt
  echo "
  Batch mv review file is gonna be /tmp/filenames_toreview_$$.txt
  "
  # find... and reverse list it, to prevent "file disappeared" (parent dirs are changed last)
  find "$1" -depth | sort | tac >/tmp/filenames$$.txt
  FOUNDNUM=$(cat /tmp/filenames$$.txt | wc | awk '{ print $1 }')
  echo "# found $FOUNDNUM filenames or dirnames to check."
  echo "# found $FOUNDNUM filenames or dirnames to check." >> /tmp/filenames_toreview_$$.txt
  IFS=$'\n'
  shopt -s extglob;
  COUNT=1
  PROC_OLD=N
  for THISLINE in $(cat /tmp/filenames$$.txt);do
    # Some percentage info
    PROC=$(printf %1.f $(echo "($COUNT/$FOUNDNUM)*100" | bc -l))
    if [ "$PROC" != "$PROC_OLD" ];then
      echo "# $PROC%"
      echo "# $PROC%" >> /tmp/filenames_toreview_$$.txt
      PROC_OLD=$PROC
    fi
    filename=$(basename "$THISLINE")
    directory=$(dirname "$THISLINE")
    filename_clean=$(echo "$filename" | sed -E -e 's/[\\/:\*\?"\|\x01-\x1F\x7F]//g' -e 's/^(nul|prn|con|lpt[0-9]|com[0-9]|aux)$/_\1/' -e 's/^$/NONAME/')
    # multi spaces => single spaces
    filename_clean=$(echo "$filename_clean" | sed -E -e 's/\s+/ /g' )
    # leading and trailing spaces
    filename_clean=$(echo "$filename_clean" | sed -E -e 's/^\s+//; s/\s+$//;' )
    if (test "$filename" != "$filename_clean")
    then
      echo "mismatch: '$filename' != '$filename_clean'"
      if [ -d "$THISLINE" ] || [ -f "$THISLINE" ];then
        echo mv -v "'$THISLINE'" "'$directory/$filename_clean'" >> /tmp/filenames_toreview_$$.txt
      else
        echo "File or dir disappeared. This shouldn't happen."
      fi
    fi
    COUNT=$((COUNT+1))
  done
  rm -f /tmp/filenames$$.txt
  echo "
  please review batch rename execution:
  cat /tmp/filenames_toreview_$$.txt
  "
}

sanitize_dir /goto/dir
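One caveat with a review file built from single-quoted names is that a name containing a single quote of its own yields an mv line that cannot be run. A possible workaround, sketched here with a hypothetical helper emit_mv, is to let bash's printf %q do the quoting, so each generated line can be executed as-is after review.

```shell
#!/usr/bin/env bash
# emit_mv is a hypothetical helper: it prints one mv command per call and
# renames nothing itself.
emit_mv() {
    # %q escapes each argument so that bash reads it back as a single word,
    # whatever spaces, quotes or backslashes it contains.
    printf 'mv -v -- %q %q\n' "$1" "$2"
}
```

Inside a loop like the one above, emit_mv "$THISLINE" "$directory/$filename_clean" >> review_file would take the place of the echo mv line, and the reviewed file could then be run with bash.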
1
  • Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. Commented May 17, 2023 at 10:52
-3

for file in *; do mv "$file" $(echo "$file" | sed -e 's/[^A-Za-z0-9.-]//g'); done &

1
  • 3
    You should explain what your code does and use proper formatting. Your code can cause files to be deleted by introducing collisions in the names. And running the entire thing in the background is kind of silly. Commented Jul 4, 2017 at 23:19
