I have an awk script that must process millions of records, and I need to drop any record that contains a multibyte character.
In one environment where I work, the following simplified shell sequence accomplishes exactly what I want:

    firstval=$'\x1c'
    lastval=$'\xFF'
    regex="[^${firstval}-${lastval}]"
    awk -v REGEX="${regex}" '{if ($0 !~ REGEX){print $0}}' myfile

However, on my laptop, I get fatal: "invalid regexp: Invalid collation character: /[^-�]/"

This feels like a locale issue, but I have verified that the locale settings on my machine and on the one where it works are identical:

    sh-4.2$ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

What am I missing, and what do I need to adjust to get the same behavior on my own machine?
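
One thing that might be worth trying, offered as a sketch rather than a confirmed diagnosis: run awk in the C locale for this one command, so that the bracket expression is compared byte by byte instead of as UTF-8 characters. The version below assumes the real goal is "keep only lines made up entirely of ASCII bytes", which is not exactly the range used above. The function name `ascii_only_lines` is hypothetical, and the range is written with octal escapes so no raw high bytes appear in the pattern.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): print only the
# lines of the file "$1" that contain no byte outside 0x01-0x7F.
# Assumption: in a UTF-8 file, every multibyte character contains at
# least one byte >= 0x80, so such lines are dropped.
ascii_only_lines() {
    # LC_ALL=C applies to this awk invocation only; the rest of the
    # shell session keeps its en_US.UTF-8 settings.
    LC_ALL=C awk '!/[^\001-\177]/' "$1"
}
```

It would be called as `ascii_only_lines myfile`. Whether the two machines also differ in awk implementation or version (for example `awk --version` on GNU awk) is a separate thing to compare, since identical `locale` output does not rule that out.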