AWK: "invalid regexp: Invalid collation character" -- how do I make it valid?

Question

I have an awk script that must process millions of records, but I need to remove any containing a multibyte character.

In one environment where I work, the following simplified shell sequence accomplishes exactly what I want:

firstval=$'\x1c'
lastval=$'\xFF'
regex="[^${firstval}-${lastval}]"
awk -v REGEX="${regex}" '{if ($0 !~ REGEX){print $0}}' myfile

However, on my laptop, I get the error "fatal: invalid regexp: Invalid collation character: /[^-�]/".

This feels like a locale issue, but I have verified that the locale settings on my machine and on the one where it works are identical:

sh-4.2$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

What am I missing, and what do I need to adjust to get the same behavior on my own machine?

FrankBlank · Accepted Answer · 2023-03-09 00:15:02Z

I would also try setting LC_ALL. It appears to be empty in your locale output.

LC_ALL=en_US.UTF-8

If you still can't get the multibyte character set to work, try running awk in POSIX mode and see if that helps.

awk -W posix -v REGEX="${regex}" '{if ($0 !~ REGEX){print $0}}' myfile

I did try LC_ALL and then switched it back when it didn't work. I just tried the posix switch. No joy. However, the posix switch prompted me to read the man page (when all else fails...), which alerted me to an option to read things at the byte level rather than the character level, which will work for my purposes. I would still like to solve the underlying problem, though.

– Kyle Banerjee · Commented 2023-03-09 01:30:28Z
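The byte-level approach mentioned in the comment above could look roughly like the sketch below. The comment does not name the option; in gawk it is presumably -b (--characters-as-bytes), but that is an assumption. The sketch instead forces the C locale for the awk call, which has a similar effect in most awk implementations: every byte counts as one character, so a byte range becomes a valid bracket expression. The function name drop_multibyte and the sample records are made up for illustration. Note that this only drops records containing bytes 0x80-0xFF; unlike the original regex, it does not filter control characters below 0x1c.

```shell
# Hypothetical helper: print only records that contain no byte in 0x80-0xFF.
# In the C locale awk works on single bytes, so the octal range \200-\377
# matches every byte that can occur in a UTF-8 multibyte sequence.
drop_multibyte() {
    LC_ALL=C awk '!/[\200-\377]/' "$@"
}

# Sample input: the second record contains "é" encoded as the UTF-8 bytes 0xC3 0xA9.
printf 'plain ascii\ncaf\303\251 latte\nanother line\n' | drop_multibyte
# prints:
# plain ascii
# another line
```

For the real data this would be called as drop_multibyte myfile. A plausible reason for the original error, not confirmed in this thread, is that the byte 0xFF never occurs in valid UTF-8, so an awk that interprets the range endpoint as a character in a UTF-8 locale may reject it.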
