I have an awk script that must process millions of records, but I need to remove any containing a multibyte character.

In one environment where I work, the following simplified shell sequence accomplishes exactly what I want:

firstval=$'\x1c'
lastval=$'\xFF'
regex="[^${firstval}-${lastval}]"
awk -v REGEX="${regex}" '{if ($0 !~ REGEX){print $0}}' myfile

However, on my laptop, I get fatal: "invalid regexp: Invalid collation character: /[^-�]/"

This feels like a locale issue, but I have verified that the locale settings on my machine and on the one where it works are identical:

sh-4.2$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

What am I missing, and what do I need to adjust to get the same behavior on my own machine?

1 Answer

I would also try setting LC_ALL. It appears to be empty in your locale output.

LC_ALL=en_US.UTF-8 

If multi-byte character sets still don't work, try running awk in POSIX mode and see whether that helps.

awk -W posix -v REGEX="${regex}" '{if ($0 !~ REGEX){print $0}}' myfile 
  • I did try LC_ALL and then switched it back when it didn't work. I just tried the posix switch. No joy. However, the posix switch prompted me to read the man page (when all else fails...), which alerted me to an option to read things at the byte level rather than the character level, which will work for my purposes. I would still like to solve the underlying problem, though. Commented Mar 9, 2023 at 1:30
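
For anyone landing here later, the byte-level workaround mentioned in the comment above can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name ascii_only is made up, and it forces the C locale for the single awk call with LC_ALL=C instead of using whatever option the man page describes (in gawk that is presumably -b, but I have not verified that it changes regex matching the same way). In the C locale every byte is its own character, so the bracket range is a plain byte range, and any line containing a byte from 0x80 to 0xFF (which every multibyte UTF-8 character does) is dropped.

```shell
# Hypothetical helper (name is illustrative): print only the lines of a file
# that contain no byte in the range 0x80-0xFF, i.e. no multibyte UTF-8 data.
# LC_ALL=C applies to this one awk invocation only; it makes awk work on bytes,
# so the octal range \200-\377 is interpreted as raw byte values.
ascii_only() {
    LC_ALL=C awk '!/[\200-\377]/' "$1"
}

# Example: ascii_only myfile > myfile.ascii
```

Unlike the regex in the question, this keeps control characters below 0x1c; add a second test if those need to be rejected as well. Setting the locale on the command itself also avoids the problem that a bare LC_ALL=... assignment only affects awk if the variable is exported.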
