
I'm attempting to use grep to search for text patterns in an ISO-8859-1 encoded file.

When I execute a search, all of the matches are returned, but the accented characters are stripped. For example, if I want to search for all the words ending in -ese:

$ LC_ALL=pt_PT.ISO-8859-1 grep -a ese\$ wordsList 

This results in 58 matches. One of the matches is the word hipótese, but when printed it appears as hiptese (missing the ó character).

How can I prevent the grep output from stripping the accented characters?

  • I just managed to do that by converting my file from ISO-8859-1 to UTF-8 with the command iconv -f ISO-8859-1 -t UTF-8 filename -o filename. Commented Mar 18 at 14:49

1 Answer


How can I prevent the grep output from stripping the accented characters?

grep itself does not strip accented characters; it outputs matching lines exactly as they appear in the input file. It is your terminal (terminal emulator) that does not recognize the ISO-8859-1 bytes as accented characters, so it fails to display them.

Your terminal most likely expects UTF-8. The rest of this answer assumes the terminal does expect UTF-8 and the locale is something.UTF-8 (e.g. pt_PT.UTF-8). It should be so in many modern Unix-like systems by default, certainly in Linux.
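To see why the terminal is the culprit, it can help to look at the raw bytes. The following is a minimal sketch, assuming a POSIX shell with od and iconv available; the variable names latin1_hex and utf8_hex are made up for illustration, and the sample word is typed in with an octal escape rather than read from the asker's file.

```shell
# "hipótese" as ISO-8859-1 bytes: ó is the single byte 0xF3 (octal 363).
latin1_hex=$(printf 'hip\363tese' | od -An -tx1 | tr -d ' \n')

# The same word after conversion to UTF-8: ó becomes the two bytes 0xC3 0xB3.
utf8_hex=$(printf 'hip\363tese' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "ISO-8859-1: $latin1_hex"
echo "UTF-8:      $utf8_hex"

# The character set the current session is configured for (typically UTF-8).
# Guarded, because the locale command may be missing on minimal systems.
locale charmap 2>/dev/null || echo "(locale command unavailable)"
```

A lone 0xF3 byte is not a valid UTF-8 sequence, so a terminal that expects UTF-8 has nothing sensible to draw for it, which would explain the missing ó.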

Possible solutions:

  • You may be able to configure your terminal emulator to use ISO-8859-1, run the command, and then switch back to UTF-8 (e.g. in Konsole, select View, then Set Encoding from the menu). I wouldn't call this the right way, though.

  • Alternatively convert the output of grep to UTF-8 on the fly:

    LC_ALL=pt_PT.ISO-8859-1 grep -a ese\$ wordsList | iconv -f ISO-8859-1 -t UTF-8

  • If you plan to work with the file a lot, convert the content to UTF-8*:

    <wordsList iconv -f ISO-8859-1 -t UTF-8 >wordsList-utf8 

    Then work with the new file without tricks, e.g.:

    grep ese\$ wordsList-utf8 

    Now you can even grep for accented characters in a straightforward way, e.g.:

    grep ó wordsList-utf8 

    In general, Unicode equivalence may be a problem; but here, since the file is a conversion from ISO-8859-1, I expect consistency: every ó should be U+00F3 (0xC3B3 in UTF-8, which the above grep will find), not U+006F followed by U+0301 (0x6FCC81 in UTF-8, which the above grep would not find); the same goes for other accented characters.
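The point about Unicode equivalence can be checked with a small experiment. This is only a sketch, assuming a POSIX shell and a grep that supports -F and -c; the variable names are hypothetical, and LC_ALL=C is used so that grep compares raw bytes and the result does not depend on which locales are installed.

```shell
# Precomposed ó (U+00F3) is the bytes C3 B3 in UTF-8.
precomposed=$(printf 'hip\303\263tese')
# Decomposed ó is a plain o (6F) followed by U+0301 (bytes CC 81).
decomposed=$(printf 'hipo\314\201tese')
# The search pattern is the precomposed ó, as a keyboard would normally type it.
pattern=$(printf '\303\263')

# grep -c exits non-zero when it counts 0 lines, hence the "|| true".
count_pre=$(printf '%s\n' "$precomposed" | LC_ALL=C grep -c -F "$pattern" || true)
count_dec=$(printf '%s\n' "$decomposed" | LC_ALL=C grep -c -F "$pattern" || true)

echo "precomposed form matched on $count_pre line(s)"   # 1
echo "decomposed form matched on $count_dec line(s)"    # 0
```

Both strings would look identical in a UTF-8 terminal, yet only the precomposed one matches, which is why a file freshly converted from ISO-8859-1 is the easy case.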


* I notice you used grep -a, as if you needed grep to treat binary files like text. If your wordsList is truly non-text, converting the whole of it to UTF-8 may fail or give you mangled non-text parts. Since you did not link to a single specific file, I cannot investigate further without guessing. I guess you meant the file linked under "just the file", i.e. the file one can extract from wordsList.zip. With this particular file I do not need -a for grep, as long as I tell grep to use the right encoding (this is what LC_ALL=pt_PT.ISO-8859-1 does).
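One way to check whether a file is valid UTF-8 before or after conversion is to let iconv "convert" it from UTF-8 to UTF-8 and inspect the exit status. The sketch below assumes an iconv that rejects invalid input with a non-zero status (GNU iconv does); the helper name is_valid_utf8 and the sample file names are made up, and a two-word sample stands in for the real wordsList.

```shell
tmpdir=$(mktemp -d)

# A tiny ISO-8859-1 sample: 0xF3 (octal 363) is ó in that encoding.
printf 'hip\363tese\nportuguese\n' > "$tmpdir/sample-latin1"

# Convert it to UTF-8, as in the answer above.
iconv -f ISO-8859-1 -t UTF-8 < "$tmpdir/sample-latin1" > "$tmpdir/sample-utf8"

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

if is_valid_utf8 "$tmpdir/sample-latin1"; then latin1_ok=yes; else latin1_ok=no; fi
if is_valid_utf8 "$tmpdir/sample-utf8"; then utf8_ok=yes; else utf8_ok=no; fi

echo "original valid as UTF-8:  $latin1_ok"
echo "converted valid as UTF-8: $utf8_ok"

rm -rf "$tmpdir"
```

A file that fails this check contains byte sequences that are invalid in UTF-8, which is a likely reason for grep to treat it as binary in a UTF-8 locale.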

  • Indeed, the filename should have been wordsList, and I've updated my original post. Thanks for the detailed explanation. I'm using the kitty terminal, but now that I know the problem is related to the terminal (not grep itself), I know where to focus my efforts. One difference from your comment: regardless of which terminal I'm using (kitty, gnome, xfce4), I have to specify the -a option for grep; otherwise it skips any words with accented characters. Commented Jan 7, 2024 at 20:27
  • @JeffBauer For me grep "thinks" the file is binary when working in UTF-8; but with LC_ALL=pt_PT.ISO-8859-1 it does not, -a is not needed. My grep is GNU grep 3.8. Commented Jan 8, 2024 at 9:09
