Posted on May 21

PWC 322 String Format

Task 1 String Format

You are given a string and a positive integer. Write a script to format the string, removing any dashes, in groups of size given by the integer. The first group can be smaller than the integer but should have at least one character. Groups should be separated by dashes.

Example 1
- Input: $str = "ABC-D-E-F", $i = 3
- Output: "ABC-DEF"
Example 2
- Input: $str = "A-BC-D-E", $i = 2
- Output: "A-BC-DE"
Example 3
- Input: $str = "-A-B-CD-E", $i = 4
- Output: "A-BCDE"

Thought process

The obvious first step is going to be to remove any dashes that might already be present.

Then, we'll want to form groups of $i, starting from the right end of the string -- or reverse the string to start from the left.

$str = reverse $str =~ s/-+//gr;

The r modifier on the substitution is going to return the result of the operation, which can then be fed straight into the reverse function. Without it, the s/// operation would return the number of matches, which is not useful this week.
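To make the difference concrete, here is a small sketch of the same substitution with and without the r modifier. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = "ABC-D-E-F";

# With /r the original is left alone and the modified copy is returned.
my $copy = $str =~ s/-+//gr;                # "ABCDEF"; $str still has its dashes

# Assigning to a scalar puts reverse in scalar context, so it reverses the string.
my $rev = reverse $str =~ s/-+//gr;         # "FEDCBA"

# Without /r the string is changed in place and the count of substitutions comes back.
my $count = ($str =~ s/-+//g);              # 3; $str is now "ABCDEF"

print "$copy $rev $count $str\n";
```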

There are at least four ways to operate on groups of characters: (1) regular expression match; (2) substring operators; (3) some kind of split operation; (4) counting while processing characters. Like an aspiring Pokemon trainer, let's try to catch 'em all.

Solution the first: regular expressions

A simple global replacement should work: every group of i characters can be replaced by itself plus a dash, and then we can reverse the string again to get back its original order.

$str = reverse $str =~ s/.{$i}/$&-/gr;

.{$i} is regex for i occurrences of any character
$& is regex for "whatever matched"

There's an untidy detail: if the length of the string is a multiple of i, the substitution leaves an extra dash at the end of the reversed string, which turns into a leading dash once we reverse it back. I'm going to speculate that there's a regular expression involving a look-ahead assertion to test for the end-of-string, but I'm going to do the easy thing and simply trim a leading dash if that's what we ended up with.
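For the curious, here is one way the look-ahead speculation might play out. This is only a sketch, not the solution used in this post; the name strFmtLookahead is invented for the example, and it assumes the input contains no newlines (otherwise the dot would need the s modifier).

```perl
use strict;
use warnings;

# Hypothetical variant: only add a dash after a group when at least
# one more character follows it, so no stray dash is ever produced.
sub strFmtLookahead {
    my ($str, $i) = @_;
    my $rev = reverse $str =~ s/-+//gr;
    return scalar reverse $rev =~ s/.{$i}(?=.)/$&-/gr;
}

print strFmtLookahead("ABC-D-E-F", 3), "\n";   # ABC-DEF
print strFmtLookahead("A-BC-D-E",  2), "\n";   # A-BC-DE
print strFmtLookahead("-A-B-CD-E", 4), "\n";   # A-BCDE
```

The (?=.) assertion fails on the last full group because nothing follows it, which is exactly the case that would otherwise produce the extra dash.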

sub strFmtRE($str, $i)
{
    $str = reverse $str =~ s/-+//gr;
    $str = reverse $str =~ s/.{$i}/$&-/gr;
    return $str =~ s/^-//r;
}

Solution the second: substrings

The most direct implementation would be to take the leading part of the string so that the remainder is a multiple of $i, and then keep taking chunks of i using substr.

sub strFmtSubstr($str, $i)
{
    $str =~ s/-//g;
    my $out = substr($str, 0, length($str)%$i, '');
    while ( $str ne '' )
    {
        $out .= '-' if ( $out ne '' );
        $out .= substr($str, 0, $i, '');
    }
    return $out;
}

As usual, begin by removing the dashes.
Move the leading part of the string into $out, and delete it from the input. One call to substr will do both. With four arguments, substr will replace the segment with the fourth argument, and return the part that was replaced.
Now we know that the string length is a multiple of i. We can keep taking chunks of length i out of $str and append them onto $out. Every time we do, we'll want to insert a dash.
Once again we have that untidy detail that if $str starts out a multiple of $i long, we don't want to insert a dash into the beginning of $out.
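The four-argument substr is the piece that does the heavy lifting here, so a tiny standalone sketch may help. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "ABCDE";
my $i   = 2;

# length is 5 and 5 % 2 == 1, so one leading character is removed and returned.
my $head = substr($str, 0, length($str) % $i, '');
print "$head $str\n";        # A BCDE

# When the length is already a multiple of the group size,
# the removed part is the empty string and the input is unchanged.
my $even = "ABCDEF";
my $none = substr($even, 0, length($even) % 3, '');
print "[$none] $even\n";     # [] ABCDEF
```

The second case is why the loop has to check whether $out is still empty before adding a dash.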

Solution the third: unpack

As it happens, over the weekend I was looking at the draft second edition of Dave Cross's book "Data Munging in Perl". That left the unpack function hovering around in my brain, which is a particularly easy solution to this problem, except for (a) the part where you have to be aware that the unpack function exists, and (b) the part where you have to realize that 90% of the rather daunting pack/unpack tutorial is irrelevant for this.

unpack uses its own pattern language to describe templates for data. When a string matches the template, unpack hands you an array of matching parts. A string of 4 characters is represented by the pattern A4. A repeating set of 4-character substrings is represented as (A4)*.
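A quick sketch of what that template does on its own, outside of any function. The strings are made up for the example. One caveat: the A format strips trailing spaces from each piece, so input containing whitespace might call for lowercase a instead.

```perl
use strict;
use warnings;

# A repeating group of 4-character fields; the last piece may be shorter.
my @parts = unpack("(A4)*", "ABCDEFGHIJ");
print join("|", @parts), "\n";     # ABCD|EFGH|IJ

# The group size can be interpolated into the template.
my $i = 3;
my @groups = unpack("(A$i)*", "FEDCBA");
print join("|", @groups), "\n";    # FED|CBA
```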

sub strFmtUnpack($str, $i)
{
    return scalar reverse join("-", unpack("(A$i)*", reverse $str =~ s/-//gr));
}

Let's read this backwards.

At the end, we have the same reversal of the string with its dashes deleted. This will become the second argument of unpack.
The template argument for unpack is "(A$i)*", to select groups of $i characters.
unpack will return a list of i-character strings. It turns out that the * in the template of unpack will give us the trailing bit of $str that is less than $i long as a bonus list element. How lucky for us.
join puts the dashes in where we want them
except that the string is backwards now, so we need to reverse it again
and we need to force scalar context because otherwise reverse likes to operate on array elements, and we need it to work on the strings in its arguments.
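The context-sensitivity of reverse is easy to trip over, so here is a small sketch of the two behaviors side by side. The variable names are invented for the illustration.

```perl
use strict;
use warnings;

# List context: the order of the arguments is reversed, not their characters.
my @list = reverse "abc", "def";             # ("def", "abc")

# join supplies list context, so a single string comes back unchanged.
my $flat = join "", reverse "abc";           # "abc"

# Scalar context: the arguments are concatenated and the characters reversed.
my $str = scalar reverse "abc";              # "cba"
my $cat = scalar reverse("abc", "def");      # "fedcba"

print "@list $flat $str $cat\n";
```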

Solution the fourth: shift in to out

Another way to process is to move one character at a time from the input string to the output string, and every ith character, insert a dash. Again, we will process from right to left.

sub strFmtShift($str, $i)
{
    my @in = split(//, $str =~ s/-//gr);
    my $out = '';
    my $d = 0;
    while ( @in )
    {
        $out .= pop @in;
        $out .= '-' if ( ++$d % $i == 0 && @in );
    }
    return scalar reverse $out;
}

We begin by removing dashes, and converting the input string to an array of characters.
The last character on the right is appended to the output, and removed from the input.
Every ith iteration, we insert a dash.
At the end, we have the string backward.

Battle of the solutions

Time to benchmark. Last week, we saw that a regular expression solution was a clear winner. I'll make up a long-ish string of, say, 1000 characters, and ask to format it in some random size chunks. My environment is a five-year-old MacBook Pro M1 laptop, running Perl 5.40 on MacOS Sequoia.

sub runBenchmark($repeat)
{
    use Benchmark qw/cmpthese/;

    my $str = "xyzzy" x 200;
    my $i = 23;

    cmpthese($repeat, {
        regex  => sub { strFmtRE($str, $i) },
        substr => sub { strFmtSubstr($str, $i) },
        unpack => sub { strFmtUnpack($str, $i) },
        shift  => sub { strFmtShift($str, $i) },
    });
}

Drum roll please:

           Rate  shift  regex substr unpack
shift    6883/s     --   -90%   -95%   -96%
regex   72165/s   948%     --   -51%   -58%
substr 145833/s  2019%   102%     --   -15%
unpack 170732/s  2380%   137%    17%     --

Not too surprisingly, unpack, which was designed for this kind of thing, kicks butt. Regular expressions disappointed me this week. I was really surprised, however, by how bad the character-shifting solution is -- I would have guessed it would be closer to the substring solution.