
Igor Irianto


Introduction to Awk

Awk is a text-processing language for reading and manipulating data. If you need to quickly process text patterns inside a file, especially a file organized in rows and columns, awk might be the tool for the job.

Let's see some examples.

This command kills the process listening on localhost:3000 (don't worry about understanding the code below; I will go over it later):

lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9 

Let's do something simpler:

awk '{print}' server.rb 

This displays the file content, similar to cat server.rb. Awk also makes it easy to add a filter. If you want to display only the lines that contain the word "run", you can do:

awk '/run/ {print}' server.rb 

Very powerful. I am not even scratching the surface of the awk iceberg.

Basic Syntax

Awk's basic syntax is:

awk 'pattern {action}' file 
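To make the pattern {action} shape concrete, here is a minimal sketch. The two input lines and the sample_lines helper are made up for illustration. It also shows that awk lets you leave out either half: a pattern with no action prints the matching lines, and an action with no pattern runs on every line.

```shell
# Hypothetical two-line input, fed through a pipe instead of a file.
sample_lines() { printf '%s\n' 'run server' 'stop server'; }

# Pattern and action: print only the lines that match /run/.
both=$(sample_lines | awk '/run/ {print}')

# Pattern only: the default action is to print the matching line.
pattern_only=$(sample_lines | awk '/run/')

# Action only: with no pattern, the action runs for every line.
action_only=$(sample_lines | awk '{print}')

echo "$both"
echo "$pattern_only"
echo "$action_only"
```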

One important action is print. Let's do some examples with print; I will go over patterns later.

For this, let's create a file called awk.ward (pun intended):

echo 'Awk. Or do not awk. There is no try' > awk.ward 

To get the content of the file, we can do:

awk '{print $0}' awk.ward 

Let's try another print variation. This time we will hard-code the output:

awk '{print "Hello awk!"}' awk.ward 

This prints "Hello awk!" regardless of what the file content is.

Fields

Earlier we saw:

awk '{print $0}' awk.ward 

You may wonder what $0 is. In awk, $0 represents the whole record, which is usually the entire line. A bare print statement does the same thing ({print $0} is the same as {print}).

In addition, awk splits each line into "fields". By default, fields are delimited by spaces and tabs. Let's check out the fields:

awk '{print $0}' awk.ward
awk '{print $1}' awk.ward
awk '{print $2}' awk.ward
awk '{print $9}' awk.ward

My awk.ward file contains 1 line and 9 fields (each separated by a space). If you ask awk to print a field beyond the last one it captured (like field 10), it prints an empty line:

awk '{print $10}' awk.ward 
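If you want to check this without creating a file, here is a small sketch that pipes the same sentence into awk. The square brackets around $10 are my own addition so the empty field is visible.

```shell
# Same sentence the article stores in awk.ward, fed through a pipe.
line='Awk. Or do not awk. There is no try'

# Field 9 is the last word on the line.
ninth=$(printf '%s\n' "$line" | awk '{print $9}')

# Field 10 does not exist, so awk substitutes an empty string.
tenth=$(printf '%s\n' "$line" | awk '{print "[" $10 "]"}')

echo "$ninth"   # try
echo "$tenth"   # []
```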

You can change the delimiter with -F. In this case, we want fields separated by a period (.), not a space. To tell awk to split on ., we use -F.:

awk -F. '{print $1}' awk.ward
awk -F. '{print $2}' awk.ward
awk -F. '{print $3}' awk.ward

You can also print multiple fields at once:

awk -F. '{print $2, $3, $1}' awk.ward
## Or do not awk There is no try Awk
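One detail worth seeing for yourself: when you split on ., the space that followed each period stays attached to the next field. The sketch below makes that visible with square brackets. The second command is a variation that is not part of the walkthrough above: it passes the regex [.] * (a literal period followed by any number of spaces) as the separator so the fields come out trimmed.

```shell
line='Awk. Or do not awk. There is no try'

# Splitting on a literal period keeps the leading space in field 2.
raw=$(printf '%s\n' "$line" | awk -F. '{print "[" $2 "]"}')

# Variation (my assumption about what you might want): treat a period
# plus any following spaces as the separator.
trimmed=$(printf '%s\n' "$line" | awk -F'[.] *' '{print "[" $2 "]"}')

echo "$raw"       # [ Or do not awk]
echo "$trimmed"   # [Or do not awk]
```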

Pattern matching

Recall our basic awk syntax:
awk 'pattern {action}' file

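Let's talk about patterns now. A pattern is a regular expression between slashes, and awk uses extended regular expression syntax (which is why + works without a backslash). For example, to match one or more letters, upper or lower case: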

awk '/[A-Za-z]+/ {print "I have string"}' awk.ward 

To match an integer (this won't display anything because there is no integer inside awk.ward):

awk '/[0-9]+/ {print "I have integer"}' awk.ward 

Let's create a new file, testFile.txt, with this content:

1. This is first line
2. This is second line
3. This is third line
This is not part of the list

If we run awk '/[0-9]+/ {print}' testFile.txt, we get:

1. This is first line
2. This is second line
3. This is third line

Our command works as expected. It omits "This is not part of the list" because the last line does not contain any integer (/[0-9]+/).
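Here is the same filter as a self-contained sketch, with the four lines piped in rather than read from testFile.txt. The list_lines helper is a name I made up. The second command goes one small step beyond the article: putting ! in front of the pattern negates it, so awk prints only the lines that do not match.

```shell
# The contents of testFile.txt, produced by a hypothetical helper.
list_lines() {
  printf '%s\n' \
    '1. This is first line' \
    '2. This is second line' \
    '3. This is third line' \
    'This is not part of the list'
}

# Lines that contain at least one digit.
numbered=$(list_lines | awk '/[0-9]+/ {print}')

# Negated pattern: lines that contain no digit at all.
unnumbered=$(list_lines | awk '!/[0-9]+/ {print}')

echo "$numbered"
echo "$unnumbered"
```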

Executing awk script from file

When our script grows too big for the command line, we can put it in a file and have awk run it from there.

Awk accepts -f to execute awk scripts. Let's create a script file and we will call it awk.script (you can name this anything):

## awk.script
/[0-9]+/ { print "I have integer" }
/[A-Za-z]+/ { print "I have string" }

Then run it against our awk.ward file: awk -f awk.script awk.ward

You'll see "I have string". This is expected, because awk.ward does not contain an integer.

What do you think will print if we run it against our testFile.txt?

awk -f awk.script testFile.txt
I have integer
I have string
I have integer
I have string
I have integer
I have string
I have string

It returns what we expect. The first 3 lines contain both a string and an integer, so awk prints two lines for each of them. The last line does not contain an integer, so awk only prints the string match output.
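If you would like to try -f without leaving files around, here is a sketch that writes the two rules into a temporary directory, runs them against two of the sample lines, and cleans up afterwards. The use of mktemp and the variable names are my own choices, not something the article prescribes.

```shell
# Scratch directory so the example does not touch your working directory.
tmpdir=$(mktemp -d)

# Write the two rules from awk.script, one rule per line.
printf '%s\n' \
  '/[0-9]+/ { print "I have integer" }' \
  '/[A-Za-z]+/ { print "I have string" }' > "$tmpdir/awk.script"

# Every rule is tested against every input line, in the order written.
out=$(printf '%s\n' '1. This is first line' 'This is not part of the list' |
  awk -f "$tmpdir/awk.script")

echo "$out"
rm -rf "$tmpdir"
```

The first input line triggers both rules and the second triggers only the string rule, so three lines come out in total.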

Chaining awk

In real life, I don't really use awk by itself that often. More often, I combine it with other commands.

Let's take the command from earlier and break it down. By the way, if you are coding along, I have a server running on localhost:3000. Fire up a local server of your own to see that the command actually kills it.

lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9 

Let's walk through each step. lsof -i:3000 gives:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node 48523 iggy 27u IPv4 0xe25443d27b90583f 0t0 TCP localhost:hbci (LISTEN)

lsof -i:3000 | awk '/LISTEN/ {print}' displays only the row with "LISTEN":

node 48523 iggy 27u IPv4 0xe25443d27b90583f 0t0 TCP localhost:hbci (LISTEN) 

Now we need to target the 2nd "field", because that's where our PID is. Keep the "LISTEN" pattern and change the action to print only that field (lsof -i:3000 | awk '/LISTEN/ {print $2}'). This returns our PID:

48523 

Finally, we add xargs kill -9. kill expects the PID as a command-line argument rather than on standard input, so piping the number straight into kill would not work. xargs reads the PID from the pipe and appends it as an argument, so the command that actually runs is kill -9 48523. For more explanation, this SO post explains it well.
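You can watch xargs build that command without killing anything. In the sketch below I feed awk the sample lsof row from above and put echo in front of kill, so the final command is printed instead of executed. The echo stand-in is my own safety measure for the demo.

```shell
# The LISTEN row from the lsof output above, used as canned input.
row='node 48523 iggy 27u IPv4 0xe25443d27b90583f 0t0 TCP localhost:hbci (LISTEN)'

# awk extracts the PID; xargs appends it to "echo kill -9" as an argument.
cmd=$(printf '%s\n' "$row" | awk '/LISTEN/ {print $2}' | xargs echo kill -9)

echo "$cmd"   # kill -9 48523
```

Drop the echo once you are sure the PID is the one you want to terminate.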

Begin, middle, end

An awk script consists of 3 parts: beginning, middle, and end. The beginning runs once, before any input is processed. The middle is our main loop; everything we have done up to this point happened in the main loop, and most of the work in awk is done there. The end runs once, after the main loop is finished.

  • BEGIN { # beginning script }
  • { # main input loop script }
  • END { # end script }

Suppose we have a file hello.txt with content:

Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10

And we run this:

awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' hello.txt 

We should expect 12 lines: 1 from BEGIN, 10 from main loop, and 1 from END. Our actual stdout:

BEGIN
Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10
END

Exactly what is expected.
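Two quick sketches show why the three parts are useful. The first counts lines in the main loop and reports the total in END; the counter n is a variable I introduced for the demo (awk variables start out empty, which counts as 0 in arithmetic). The second runs the article's command on empty input to show that BEGIN and END each still run once even when the main loop never does.

```shell
# printf repeats its format for each argument, producing Hello1..Hello10.
report=$(printf 'Hello%s\n' 1 2 3 4 5 6 7 8 9 10 |
  awk 'BEGIN {print "BEGIN"} {n = n + 1} END {print "lines seen:", n; print "END"}')

# With no input at all, the main loop body never runs.
empty=$(awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' </dev/null)

echo "$report"
echo "$empty"
```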

Field Separator

Recall that we can redefine the delimiter (field separator) with -F. We can also redefine it inside our script with the built-in variable FS (Field Separator). The convention is to set it inside BEGIN, before any input is read and processed.

For example, inside greetings.txt we have this text:

Hello, how are you, sire? 

When we inspect the fields, they are separated by space.

awk '{print $1}' greetings.txt # Hello,
awk '{print $2}' greetings.txt # how
awk '{print $3}' greetings.txt # are
## ... and so on

We want to separate them by comma instead. Here is how you can redefine the separator:

awk 'BEGIN {FS = "," } {print $2}' greetings.txt # how are you
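As a sanity check, here is a sketch that runs the same split twice, once with -F on the command line and once with FS inside BEGIN, using the greeting text piped in. The square brackets are my addition; they show that the space after the comma stays part of the field.

```shell
text='Hello, how are you, sire?'

# Separator given on the command line.
with_flag=$(printf '%s\n' "$text" | awk -F, '{print "[" $2 "]"}')

# Separator set in BEGIN, before the first line is read.
with_fs=$(printf '%s\n' "$text" | awk 'BEGIN {FS = ","} {print "[" $2 "]"}')

echo "$with_flag"   # [ how are you]
echo "$with_fs"     # [ how are you]
```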

Record Separator

By now, you can tell that awk performs operations line-wise. In awk, each line is a record. Each record contains multiple "fields", separated by tabs/ spaces (that we can change with -F or FS). What if we need to read chunks of multiple lines?

What if our data looks like users.txt below?

Iggy
Programmer
123-123-1234

Yoda
Jedi Master
111-222-3333

We need to make the lines from "Iggy" to "123-123-1234" one record, and the lines from "Yoda" to "111-222-3333" another record. How do we tell awk to chunk our data this way?

Luckily, awk has a "Record Separator" (RS) to do this. As you can guess, the default record separator is a newline (\n). Let's change that:

awk 'BEGIN {FS="\n"; RS=""} {print "Name:", $1; print "Rank:", $2; print "\n"}' users.txt 

This returns:

Name: Iggy
Rank: Programmer


Name: Yoda
Rank: Jedi Master

This is exactly what we expected. Now every $1 contains a name, every $2 a rank/title, and every $3 a phone number.

How did it work?

  • We changed the Field Separator (FS) from its space/tab default to a newline (\n). A newline now marks a new field instead of a new record.
  • We changed the Record Separator (RS) from its newline default to "".

You may ask, how does setting the record separator to "" make the chunking above work? That doesn't make sense. Shouldn't we use RS = "\n\n+" for when we have two or more newlines?

When awk sees that RS is the empty string (""), it treats the input as records separated by one or more blank lines. Blank-line-separated records are apparently common enough that awk has this special case for RS="".

In other words, each record is now separated by a blank line. The next record starts after the blank line.

This is a record

This is another record separated by blank line

This is yet another record

For more information about this weird behavior, check out this link.
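To see the blank-line chunking in action without creating users.txt, here is a sketch that pipes the same seven lines into awk and joins each record's three fields on one line. The " | " joiner is just my choice for readable output.

```shell
# The contents of users.txt; the empty string becomes the blank line
# that separates the two records.
summary=$(printf '%s\n' \
  'Iggy' 'Programmer' '123-123-1234' \
  '' \
  'Yoda' 'Jedi Master' '111-222-3333' |
  awk 'BEGIN {FS = "\n"; RS = ""} {print $1 " | " $2 " | " $3}')

echo "$summary"
```

Each blank-line-separated chunk comes out as a single line, and "Jedi Master" stays in one field because only newlines split fields here.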

Conclusion

I think this is a good place to end. There are still many more features I didn't get to cover here: variables, conditionals, functions, etc. I will leave those for you to explore.

Can you do what awk does with a scripting language like Python or Ruby?
Definitely. But if you need something on the fly, awk might be a better choice. Plus, it is included in most Unix-like operating systems, so you don't need to install anything.

Do you need to know awk to be a good developer?
Definitely not. I know many great developers who don't know awk. But knowing a little awk can be very helpful, and it looks really cool.

Thanks for reading. Happy coding!


Top comments (1)

E.R. Nurwijayadi

Good article. Thank you for posting.

Awk can also be used to solve data structure challenges, such as flattening an array or getting its unique elements.

To help more beginners, I have made a working example of awk with source code on GitHub.

🕷 epsi.bitbucket.io//lambda/2021/02/...

First, the data structure, written as comma-separated text.

Awk: Data Structure

Then the awk script to flatten the array.

Awk: Flatten

And finally get the unique array:

Awk: Unique

I hope this helps others who are looking for more example cases.

🙏🏽

Thank you for posting this general introduction.