How to remove invalid characters from filenames?

Question

I have files with invalid characters like these

009_-_�%86ndringshåndtering.html

It should be an Æ, but something has gone wrong in the filename.

Is there a way to just remove all invalid characters?

or could tr be used somehow?

echo "009_-_�%86ndringshåndtering.html" | tr ???

The characters probably aren't "invalid", else the filesystem wouldn't store them (unless you did something really nasty to the FS). Have you tried changing your locale (e.g. to UTF8) to display the names correctly? — James O'Gorman
– James O'Gorman, Commented Jan 10, 2012 at 14:29
Something really nasty like cp -r /mnt/broken_but_mountable_old_flash_disk/ /some/dir can actually happen very easily leading to undeletable files. To save time trying, the perl answer below does work on those: serverfault.com/a/348496/327691 — kub1x
– kub1x, Commented Sep 15, 2021 at 21:24

James Sneeringer · Accepted Answer · 2012-01-10 14:22:09Z

64

One way would be with sed:

mv 'file' "$(echo 'file' | sed -e 's/[^A-Za-z0-9._-]/_/g')"

Replace file with your filename, of course. This will replace anything that isn't a letter, number, period, underscore, or dash with an underscore. You can add or remove characters to keep as you like, and/or change the replacement character to anything else, or nothing at all.
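As a minimal sketch of applying the same sed expression a little more defensively, the functions below pin the locale to C so that the ranges match plain ASCII only, and refuse to overwrite an existing file. The names sanitize_name and rename_sanitized are hypothetical helpers, not part of the answer.

```shell
# Hypothetical helpers; only the sed expression comes from the answer above.

# Print the given name with every byte outside A-Za-z0-9._- replaced by "_".
# LC_ALL=C makes the bracket ranges match plain ASCII bytes only.
sanitize_name() {
    printf '%s\n' "$1" | LC_ALL=C sed -e 's/[^A-Za-z0-9._-]/_/g'
}

# Rename one file in place, refusing to overwrite an existing target.
rename_sanitized() {
    local old dir base new
    old=$1
    dir=$(dirname -- "$old")
    base=$(basename -- "$old")
    new=$(sanitize_name "$base")
    # Nothing to do if the name is already clean.
    [ "$base" = "$new" ] && return 0
    if [ -e "$dir/$new" ]; then
        printf 'skipping %s: %s already exists\n' "$old" "$dir/$new" >&2
        return 1
    fi
    mv -- "$old" "$dir/$new"
}
```

Calling rename_sanitized 'some file.html' would rename the file to some_file.html unless that name is already taken. Because the replacement works byte by byte, a multibyte character turns into several underscores.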


9

I used: f='file'; mv 'file' ${f//[^A-Za-z0-9._-]/_}

Louis
– Louis

2015-10-07 15:05:49 +00:00
Commented Oct 7, 2015 at 15:05
2

Look for the best solution by H. Hess below... (and my funny comment alongside :) )

Jan Sila
– Jan Sila

2019-02-14 15:20:57 +00:00
Commented Feb 14, 2019 at 15:20
3

This will fail miserably on accented characters. Also on anything else than ascii. Definitely not the solution for the original question.

grin
– grin

2021-01-03 12:00:51 +00:00
Commented Jan 3, 2021 at 12:00
1

This is a great observation by @grin. The solution I offered naively assumes the C locale, which uses the literal byte values of characters for collating. ASCII tends to form the basis of most western character sets, and it was adopted into Unicode with the same byte values. In ASCII, the byte values of the letters A through Z are sequential, as are a to z and 0 to 9. However, other character sets have different collating rules. UTF-8, which is now a pretty common default, includes accented characters in those ranges, so a-z would include ä.

James Sneeringer
– James Sneeringer

2021-09-13 03:40:55 +00:00
Commented Sep 13, 2021 at 3:40
1

Unfortunately this doesn't handle corrupted characters (filename copied from broken filesystem, looks something like ''$'\265''0ADE9~3.JPG). I got it sorted only by using perl from answer below: serverfault.com/a/348496/327691

kub1x
– kub1x

2021-09-15 21:19:53 +00:00
Commented Sep 15, 2021 at 21:19


H. Hess · Accepted Answer · 2017-08-30 06:57:45Z

76

I had some Japanese files with broken filenames recovered from a broken USB stick, and the solutions above didn't work for me.

I recommend the detox package:

The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It'll also translate or cleanup Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI escaped characters.

Example usage:

detox -r -v /path/to/your/files

-r Recurse into subdirectories
-v Be verbose about which files are being renamed
-n Can be used for a dry run (only show what would be changed)
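A small sketch of a preview-first workflow, assuming detox is installed: the wrapper below always runs in dry-run mode and resolves its argument to an absolute path first, since one commenter reports that passing '.' finds nothing on Debian. The name detox_preview is hypothetical; -n, -r and -v are the options listed above.

```shell
# detox_preview is a hypothetical wrapper name, not part of detox itself.
detox_preview() {
    local dir
    # Resolve the argument to an absolute path so "." is never passed as-is.
    dir=$(cd -- "$1" && pwd) || return 1
    # -n: dry run, -r: recurse, -v: verbose (nothing is renamed).
    detox -n -r -v "$dir"
}
```

After reviewing the output of detox_preview /path/to/your/files, the same command without -n performs the renames.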


10

This should be much higher, I urge everyone to have a look at detox before essentially reinventing the wheel. If you look at the man page, you will see that it covers all the other proposed solutions here because of its flexibility.

emk2203
– emk2203

2018-04-10 08:04:02 +00:00
Commented Apr 10, 2018 at 8:04
5

Ezekiel 25:17 - Blessed is he who, in the name of charity and good will upvotes this solution, for he is truly his brother's keeper and the finder of lost children.

Jan Sila
– Jan Sila

2019-02-14 15:18:48 +00:00
Commented Feb 14, 2019 at 15:18
6

Unintuitively, the path can not be '.' in debian. If you use a '.' it finds nothing.

isaaclw
– isaaclw

2019-09-10 18:19:57 +00:00
Commented Sep 10, 2019 at 18:19
3

I wonder if it really works; it seems to remove/replace Chinese characters, e.g. 的节奏啊, but those characters are valid in filenames.

林果皞
– 林果皞

2019-09-11 19:50:02 +00:00
Commented Sep 11, 2019 at 19:50
4

be careful with this tool. it's pretty aggressive. it even changes spaces into underscores :/ it also renames __init__.py to init_.py

jaksco
– jaksco

2020-12-20 16:35:29 +00:00
Commented Dec 20, 2020 at 16:35


mevdschee · Accepted Answer · 2013-12-25 00:23:02Z

I assume you are on a Linux box and the files were created on a Windows box. Linux typically uses UTF-8 as the character encoding for filenames, while Windows uses something else. I think this is the cause of the problem.

I would use "convmv". This is a tool that can convert filenames from one character encoding to another. For Western Europe one of these normally works:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .

If you need to install it on a Debian based Linux you can do so by running:

sudo apt-get install convmv

It works for me every time and it does recover the original filename.

Source: LeaseWebLabs
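When it is unclear which source encoding applies, one option is to decode the raw name under each candidate and see which result reads correctly. This is only a sketch: it assumes the iconv command is installed, try_encodings is a hypothetical helper name, and iconv's encoding names may not match convmv's exactly.

```shell
# try_encodings is a hypothetical helper, not part of convmv or iconv.
# Usage: try_encodings RAW_NAME ENCODING...
# Prints how RAW_NAME would read if its bytes were in each encoding.
try_encodings() {
    local name enc decoded
    name=$1
    shift
    for enc in "$@"; do
        if decoded=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%-14s %s\n' "$enc" "$decoded"
        else
            printf '%-14s (cannot be decoded)\n' "$enc"
        fi
    done
}
```

For example, try_encodings "$(printf '\306ndring.html')" WINDOWS-1252 ISO-8859-1 CP850 prints one line per encoding. The byte 0xC6 decodes to Æ in both WINDOWS-1252 and ISO-8859-1, so either would be a plausible -f value for such a name.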

this looks promising, but any idea how to tell what the encoding is? I have a directory called Save the current file in Word 97-2004 format\sco.workflow that got created on my Mac (via Microsoft Office) and the above encodings don't have any effect. — Sridhar Sarnobat
– Sridhar Sarnobat, Commented Dec 7, 2016 at 6:49
It's worth pointing out that by default convmv runs in "test" mode, where it just performs a dry run and tells you which files it would move. It will then tell you to run it again with the --notest option to actually rename the files. — Kenny Rasschaert
– Kenny Rasschaert, Commented Jan 28, 2019 at 10:47
This program does not deaccent accented characters. If you try -f utf8 -t ascii it will just complain that it cannot represent the characters in ASCII and refuse to do anything. — Szczepan Hołyszewski
– Szczepan Hołyszewski, Commented Apr 25, 2024 at 14:45
@SzczepanHołyszewski that is how it should be. If you wanted iconv //TRANSLIT behavior, see here: stackoverflow.com/questions/9930484/… — mevdschee
– mevdschee, Commented Apr 30, 2024 at 14:11

phemmer · Accepted Answer · 2012-01-10 14:41:50Z

I assume you mean you want to traverse the filesystem and fix all such files?

Here's the way I'd do it

find /path/to/files -type f -print0 | \
perl -n0e '$new = $_; if($new =~ s/[^[:ascii:]]/_/g) { print("Renaming $_ to $new\n"); rename($_, $new); }'

That would find all files with non-ASCII characters and replace those characters with underscores (_). Use caution though: if a file with the new name already exists, it'll be overwritten. The script can be modified to check for such a case, but I didn't put that in, to keep it simple.
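The existence check mentioned above could look roughly like the following. This is a shell sketch rather than a modification of the Perl one-liner: it uses tr in the C locale to replace every non-ASCII byte with an underscore and skips any rename whose target already exists. The function name ascii_rename_tree is hypothetical.

```shell
# ascii_rename_tree is a hypothetical helper name.
# Replace every non-ASCII byte in the names of regular files under "$1"
# with "_", skipping a rename when the target name is already taken.
ascii_rename_tree() {
    find "$1" -type f -exec sh -c '
        for old do
            dir=$(dirname -- "$old")
            base=$(basename -- "$old")
            # In the C locale tr works on single bytes; -c selects every
            # byte that is NOT in the ASCII range 000-177 (octal).
            new=$(printf "%s" "$base" | LC_ALL=C tr -c "\000-\177" "_")
            [ "$base" = "$new" ] && continue
            if [ -e "$dir/$new" ]; then
                printf "Skipping %s: %s already exists\n" "$old" "$dir/$new" >&2
            else
                printf "Renaming %s to %s\n" "$old" "$dir/$new"
                mv -- "$old" "$dir/$new"
            fi
        done
    ' sh {} +
}
```

Like the original, it only touches regular files (-type f), so directory names are left alone. File names that end in a newline are not handled, because command substitution strips trailing newlines.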

the only solution that helped me... Is perl underrated? 🤔 — d9k
– d9k, Commented Sep 9, 2021 at 19:23
I can confirm, that only this one helped with actually corrupted characters, copied from broken flash drive. — kub1x
– kub1x, Commented Sep 15, 2021 at 21:25

Community · Accepted Answer · 2017-05-23 12:41:19Z

18

Following answers at https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters, you can use:

rename 's/[^\x00-\x7F]//g' *

where * matches the files you want to rename. If you want to do it over multiple directories, you can do something like:

find . -exec rename 's/[^\x00-\x7F]//g' "{}" \;

You can use the -n argument to rename to do a dry run, and see what would be changed, without changing it.

answered May 25, 2015 at 10:52 by naught101; edited May 23, 2017 at 12:41 by Community Bot

Is there a way to modify this to keep foreign characters such as ü and ä for example?

Elder Geek
– Elder Geek

2016-02-06 22:51:22 +00:00
Commented Feb 6, 2016 at 22:51
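One possible way to keep valid characters such as ü and ä while dropping only byte sequences that are not valid UTF-8 is iconv's -c option. This is a sketch, assuming the iconv command is installed and that the names are meant to be UTF-8; strip_invalid_utf8 is a hypothetical helper name.

```shell
# strip_invalid_utf8 is a hypothetical helper name.
# Print the given name with invalid UTF-8 byte sequences removed.
strip_invalid_utf8() {
    # -c tells iconv to omit input it cannot convert. iconv may still exit
    # non-zero after dropping bytes, so its exit status is ignored here.
    printf '%s' "$1" | iconv -c -f UTF-8 -t UTF-8 || :
}
```

The result can then be compared with the original name and passed to mv when it differs, ideally after checking that the new name is not already in use.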
Only the second one worked for me. Everything was in the same directory so I'm not sure what's the difference..?

Shautieh
– Shautieh

2017-03-09 14:51:21 +00:00
Commented Mar 9, 2017 at 14:51
1

@Shautieh: the -n stops it from actually running. I'll clarify the answer.

naught101
– naught101

2017-03-13 05:38:14 +00:00
Commented Mar 13, 2017 at 5:38
rename can be slow when dealing with lots of files. If you want to speed this up, push the check into find. I'm not sure how to do that though.

isaaclw
– isaaclw

2019-09-10 18:13:28 +00:00
Commented Sep 10, 2019 at 18:13
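One way to push the check into find, as a sketch: in the C locale, the glob [! -~] matches any single byte outside the printable ASCII range from space to tilde, so -name can preselect the names worth renaming. The function name list_suspect_names is hypothetical.

```shell
# list_suspect_names is a hypothetical helper name.
# List entries under "$1" whose name contains at least one byte outside
# printable ASCII (space through tilde). -depth lists a directory's
# contents before the directory itself.
list_suspect_names() {
    LC_ALL=C find "$1" -depth -name '*[! -~]*'
}
```

The same test can be placed in front of the -exec from the answer, for example LC_ALL=C find . -depth -name '*[! -~]*' -exec rename -n 's/[^\x00-\x7F]//g' {} + (assuming the Perl rename used above), so rename is started only for matching names. Note that the glob also matches ASCII control characters, which the rename expression leaves untouched.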
This was the one to help me - detox, as nice as it sounded, just errored out with "unsupported unicode length" exactly on the files I wished it fixed :)

Tomáš M.
– Tomáš M.

2022-01-29 10:49:44 +00:00
Commented Jan 29, 2022 at 10:49


KrisWebDev · Accepted Answer · 2017-05-25 08:16:40Z

This shell script sanitizes a directory recursively, to make files portable between Linux/Windows and FAT/NTFS/exFAT. It removes control characters, /:*?"<>\| and some reserved Windows names like COM0.

sanitize() {
  shopt -s extglob;

  filename=$(basename "$1")
  directory=$(dirname "$1")

  filename_clean=$(echo "$filename" | sed -e 's/[\\/:\*\?"<>\|\x01-\x1F\x7F]//g' -e 's/^\(nul\|prn\|con\|lpt[0-9]\|com[0-9]\|aux\)\(\.\|$\)//i' -e 's/^\.*$//' -e 's/^$/NONAME/')

  if (test "$filename" != "$filename_clean")
  then
    mv -v "$1" "$directory/$filename_clean"
  fi
}

export -f sanitize

sanitize_dir() {
  find "$1" -depth -exec bash -c 'sanitize "$0"' {} \;
}

sanitize_dir '/path/to/somewhere'

Linux is less restrictive in theory (/ and \0 are strictly forbidden in filenames) but in practice several characters interfere with bash commands (like *...) so they should also be avoided in filenames.

Great sources for file naming restrictions:

This is what I was looking for! But add quotes to support dirs with spaces: find "$1" -depth -exec bash -c 'sanitize "$0"' {} \;
– mmv-ru, Commented May 22, 2017 at 14:02

Brian Kuepper · Accepted Answer · 2019-11-07 21:04:16Z

I use this one-liner to remove invalid characters in subtitle files:

for f in *.srt; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g;s/\.\.\././g;s/\.\././g'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done

Only processes *.srt files (* could be used in place of *.srt to process every file)
Replaces every character except letters A-Za-z, numbers 0-9, periods "." and dashes "-" with a period
Collapses double or triple periods into a single one
Checks whether the file name needs changing
If so, renames the file with the mv command, then prints the new name with the echo command

It works to normalize directory names of movies:

for f in */; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.]/./g' -e 's/\.\.\././g' -e 's/\.\././g' -e 's/\.*$//'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done

Same steps as above but I added one more sed command to remove a period at the end of the directory

X-Men Days of Future Past (2014) [1080p]
Modified to:
X-Men.Days.of.Future.Past.2014.1080p

Community · Accepted Answer · 2017-03-20 10:16:40Z

If you want to handle embedded newlines, multibyte characters, spaces, leading dashes and backslashes, you are going to need something more robust; see this answer:
https://superuser.com/a/858671/365691

I put the script up on code.google.com if anyone is interested: r-n-f-bash-rename-script

BoeroBoy · Accepted Answer · 2021-12-01 13:27:28Z

I know this is a bit old, but I recently discovered that translate-shell, a command-line front end to Google Translate, really helps with foreign-named files whose Unicode names choke other tools. It is helpful for batch renaming with translation in the shell.

$ echo скачать | trans -b
download

https://github.com/soimort/translate-shell

[UPDATE] The Google Translate API tends to block you if you hit it too many times, but I also found a convenient local option called uconv that converts between alphabets. It transliterates, which helps phonetically, but it does not translate:

$ echo скачать | uconv -x 'Any-Latin;Latin-ASCII'
skacat'

Manu · Accepted Answer · 2023-05-09 20:11:46Z

This is loosely based on @KrisWebDev's search string.

don't touch files/dirs, create batch list instead (to review)
going via a two-stage temp file (is faster on my machine)
more edge cases for samba (trailing/leading spaces)
a basic progress indicator

Note: "already exists" conflicts may occur when doing the actual rename; these have to be resolved manually.

# tested on: bash linux
# needs: bc
# this function doesn't change files on its own
sanitize_dir() {

  rm -f /tmp/filenames_toreview_$$.txt
  touch /tmp/filenames_toreview_$$.txt

  echo " Batch mv review file is gonna be /tmp/filenames_toreview_$$.txt "

  # find... and reverse list it, to prevent "file disappeared" (parent dirs are changed last)
  find "$1" -depth | sort | tac >/tmp/filenames$$.txt
  FOUNDNUM=$(cat /tmp/filenames$$.txt | wc | awk '{ print $1 }')
  echo "# found $FOUNDNUM filenames or dirnames to check."
  echo "# found $FOUNDNUM filenames or dirnames to check." >> /tmp/filenames_toreview_$$.txt

  IFS=$'\n'
  shopt -s extglob;
  COUNT=1
  PROC_OLD=N
  for THISLINE in $(cat /tmp/filenames$$.txt);do

    # Some percentage info
    PROC=$(printf %1.f $(echo "($COUNT/$FOUNDNUM)*100" | bc -l))
    if [ "$PROC" != "$PROC_OLD" ];then
      echo "# $PROC%"
      echo "# $PROC%" >> /tmp/filenames_toreview_$$.txt
      PROC_OLD=$PROC
    fi

    filename=$(basename "$THISLINE")
    directory=$(dirname "$THISLINE")

    filename_clean=$(echo "$filename" | sed -E -e 's/[\\/:\*\?"\|\x01-\x1F\x7F]//g' -e 's/^(nul|prn|con|lpt[0-9]|com[0-9]|aux)$/_\1/' -e 's/^$/NONAME/')
    # multi spaces => single spaces
    filename_clean=$(echo "$filename_clean" | sed -E -e 's/\s+/ /g' )
    # leading and trailing spaces
    filename_clean=$(echo "$filename_clean" | sed -E -e 's/^\s+//; s/\s+$//;' )

    if (test "$filename" != "$filename_clean")
    then
      echo "missmatch: '$filename' != '$filename_clean'"
      if [ -d "$THISLINE" ] || [ -f "$THISLINE" ];then
        echo mv -v "'$THISLINE'" "'$directory/$filename_clean'" >> /tmp/filenames_toreview_$$.txt
      else
        echo "File or dir disappeared. This shouldn't happen."
      fi
    fi
    COUNT=$((COUNT+1))
  done

  rm -f /tmp/filenames$$.txt

  echo " please review batch rename execution: cat /tmp/filenames_toreview_$$.txt "
}

sanitize_dir /goto/dir

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. — Community
– Community Bot, Commented May 17, 2023 at 10:52

Jairo Bernal · Accepted Answer · 2017-07-04 21:53:09Z

-3

for file in *; do mv "$file" $(echo "$file" | sed -e 's/[^A-Za-z0-9.-]//g'); done &


3

You should explain what your code does and use proper formatting. Your code can cause files to be deleted by introducing collisions in the names. And running the entire thing in the background is kind of silly.

kasperd
– kasperd

2017-07-04 23:19:32 +00:00
Commented Jul 4, 2017 at 23:19

