Let's Learn AWK

AWK is a Turing-complete language. It’s tiny, sure. But it’s alreay on basically every Linux box in the world. So let’s learn it.

The approach I’m going to take here is to introduce each major feature of the AWK language via example. So, this will not be a complete language reference, you can find all that stuff at the bottom.

Awk vs Mawk vs Gawk

There are lots of flavors of awk.

If you’re on nearly any flavor of linux, awk almost certainly comes installed. Try typing “man awk” to see what flavor of awk you have. You probably have MAWK or GAWK (GNU AWK). They have a very similar API and both are fine. I think GAWK is somewhat more consistent across versions and platforms. But who knows.

Example 1 - Field Separators

For our first example, let’s parse a little CSV file. This one shows the radiation dosage from various sources in two different units (rad.csv):

source,dose (Sv),dose (BED)
eating a banana,1.0E-7,1
arm x-ray,1.0e-6,10
dental x-ray,5.0e-6,50
one day natural background,1.0e-5,100
flight from LA to New york,4.e-5,400
mammogram,4e-4,4000
head CT scan,2.0e-3,20000
chest CT scan,7.0e-3,70000
max yearly radiation worker dose,5.0e-2,5e5
lowest yearly dose linked to cancer,0.1,1e6
usually fatal dase,4.0,4e7

The first column in this file is the source of radiation, let’s grab that using awk on the commandline:

$ awk -F, '{print $1}' rad.csv
source
eating a banana
arm x-ray
dental x-ray
one day natural background
flight from LA to New york
mammogram
head CT scan
chest CT scan
max yearly radiation worker dose
lowest yearly dose linked to cancer
usually fatal dase

The second column in the file is the radiation dosage in in the fancy, scientific unit Sieverts (Sv). Boring. The third column is the amount of radiation from each source in the far more approachable Banana Equivalent Dosages (BED). You know one banana is healthy.

But how many bananas would you need to eat to get the radiation dosage from one dental x-ray? Let’s find out!

$ awk -F "," '{print $1,"\t\t",$3}' rad.csv
source 		 dose (BED)
eating a banana 		 1
arm x-ray 		 10
dental x-ray 		 50
one day natural background 		 100
flight from LA to New york 		 400
mammogram 		 4000
head CT scan 		 20000
chest CT scan 		 70000
max yearly radiation worker dose 		 5e5
lowest yearly dose linked to cancer 		 1e6
usually fatal dase 		 4e7

Okay, a dental x-ray is 50 times as much radiation as eating a banana. That seems safe. A chest CT is 70,000 BEDs; probably don’t do that too often.

But you would need to each 40 million bananas to get a lethal dose of radiation. So don’t do that.

Example 2 - Arithmetic Expressions

Well, how does one Banana Equivalent Dose (BED) relate to one Sievert (Sv)? Let’s do a little awk math:

$ awk -F, '{print $3 / $2}' rad.csv 
-nan
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000

Oh, right, that first row in the file is a header without any numbers in it. Let’s skip that:

$ awk -F, 'NR > 1 {print $3 / $2}' rad.csv 
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000
10000000

Okay, the difference is exactly 10 million. That’s easy to remember.

Well, AWK can do some basic math. Cool. Let’s make a little script (f2c.awk):

BEGIN {
  print(f2c(32))
  print(f2c(72))
  print(f2c(100))
}

function f2c(f) {
        return (f - 32.0) / 1.8
}

Which we can run from the command line super easy:

$ awk -f f2c.awk 
0
22.2222
37.7778

So, neat. You can write AWK scripts and run them using awk -f script.awk. We will talk more about that BEGIN function later. And, yeah, you can even create your own functions and do some basic math. That’s starting to seem more like a real language than just a little commnanline utility.

But this rabbit hole goes a lot deeper.

Example 3 - Associative Arrays

“Dictionaries” have made Python all kinds of popular. In Java they’re less convenient, but just as awesome, and folks call them “HashMaps”. In AWK people call them “Associative Arrays”. Either way, it’s a data structure for making collections of key-value pairs.

For instance, let’s say we want to count the number of files owned by different users in a folder. Well, ls -l will give us a list of files with various information, like username:

$ ls /usr/bin/ -l
-rwxr-xr-x  1  root  root     96  Nov 12 2018  2to3-2.7
-rwxr-xr-x  1  root  root  10104  Apr 23 2016  411toppm
-rwxr-xr-x  1  root  root  22696  Sep 27 2018  aa-enabled
...

So, that third column is the username of the owner of the file.

Here is the little script we will use to count files owned by various users, using associative arrays (count_users.awk):

#!/bin/awk -f
{
    if (NF>7) {
        user_count[$3]++;
    }
}
END {
    for (user in user_count) {
        print user, user_count[user];
    }
}

And here we run the file, piping the raw data into it on the command line:

$ ls /usr/bin/ -l | awk -f count_users.awk 
thedoctor 2
noether 4
root 1906

Example 4 - printf

That last little program worked, but the output was ugly and hard to read. Enter printf. If you’ve worked in the C language before, this is approximately what you remember.

Let’s re-write our little user-counter using printf (count_users2.awk):

#!/bin/awk -f
BEGIN {
    # print the header
    printf("%-12s %-12s\n", "USER", "FILES OWNED");
}
{
    # parse the ls output
    if (NF>7) {
        user_count[$3]++;
    }
}
END {
    # print the results
    for (user in user_count) {
        printf("%-12s %-4d\n", user, user_count[user]);
    }
}

Hey look, comments.

And run this from the commandline:

$ ls -l /usr/bin/ | awk -f count_users4.awk 
USER      FILES OWNED 
thedoctor 2
noether   4
root      1906

So, that was a lot easier to read. Notice, we used %-12s to left-justify a string, rather than just %12s to right justify. We also used %d to printf an integer, as opposed to %s to print an integer. This is standard printf stuff from C. And we are finally starting to see what BEGIN and END can be useful for.

Example 5 - Random Numbers

We are going to write a program that randomly selects a line from rad.csv above and prints it in some human-readable way. In the process of writing this we will see a few new features. Throw this code intofun_fact.awk:

#!/bin/awk -f
# older versions off AWK don't have rand() or srand()
BEGIN {
    # seed our random number generator
    srand();
    # pick a random line in the file
    line=int(rand()*10);
    # let's skip the header and first row
    line+=3;
    # commas separate the columns in our file
    FS=",";
}
{
    # find a random line in the file and print our fun fact
    if (NR==line) {
        printf("The radiation from a %s is equal to eating %d bananas.\n", $1, $3);
    }
}

Also, we generated a random number between 0.0 and 1.0 using rand(). And we tried to give our code a better random seed using srand().

Side Note: Two of the language features we used above are super handy when writing those amazing AWK one-liners on the commandline: FS and NR. By default, AWK splits lines into columns based on whitespace, but we can split on any “Field Separator” we like using FS. (On the command line we do this by writing awk -F "," ....) Similarly, we can grab just one line from our input file by specifying the line number or “number of row” with NR.

Enough talk, let’s just run the damn script:

$ awk -f fun_fact.awk rad.csv 
The radiation from a head CT scan is equal to eating 20000 bananas.

Huh. So, if you need a head CT get one. But try not to turn it into a hobby.

Example 6 - Numerical Functions

Yes, AWK has a bunch of built-in math libraries. I’m not going to list them all. But here’s a flavor (math.awk):

#!/bin/awk -f
BEGIN {
# define Pi
PI=3.1415926;

# print header
printf("Degrees  Radians    sin    cos\n");

# use some trig
for (i=0; i<=360; i+=45) {
    r = i * (PI / 180);
    printf("%7d %8.5f %8.3f %6.3f\n", i, r, sin(r), cos(r))
}
exit;
}

And when you run it from the commandline, you get what you’d expect:

$ awk -f math.awk 
Degrees  Radians    sin    cos
      0  0.00000    0.000  1.000
     45  0.78540    0.707  0.707
     90  1.57080    1.000  0.000
    135  2.35619    0.707 -0.707
    180  3.14159    0.000 -1.000
    225  3.92699   -0.707 -0.707
    270  4.71239   -1.000 -0.000
    315  5.49779   -0.707  0.707
    360  6.28319   -0.000  1.000

Moving on.

Example 7 - String Functions

Let’s generate some fixed-width data and throw it into a text file (ls -lah . > ls.txt):

-rw-r--r-- 1 center center  0 Dec  8 21:39 p1
-rw-r--r-- 1 center center 17 Dec  8 21:15 t1
-rw-r--r-- 1 center center 26 Dec  8 21:38 t2
-rw-r--r-- 1 center center 25 Dec  8 21:38 t3
-rw-r--r-- 1 center center 43 Dec  8 21:39 t4
-rw-r--r-- 1 center center 48 Dec  8 21:39 t5

AWK has a ton of super-handy string-manipulation tools built in. Here’s a quick example showing index, substr and lenth (string.awk):

#!/bin/awk -f
{
    if ((x=index($9, "t")) > 0) {
        month = substr($6, 1, 3);
        code = substr($9, x+1, length($9));
        printf("month = %s, code = %s\n", month, code);
    }
}

Which, when run from the command line (cat ls.txt | string.awk):

month = Dec, code = 1
month = Dec, code = 2
month = Dec, code = 3
month = Dec, code = 4
month = Dec, code = 5

Of course, there are another couple dozen common string-manipulation tools available. But they all work exactly as you’d want them to.

Example 8 - User-Defined Functions

The last major tool you will need in AWK is the ability to write a funcion. The following little program will be put into rock_paper_scissors.awk, and will allow us to hone our Rock-Paper-Scissors game in the off season:

#!/bin/awk -f
BEGIN {
    srand()

    opts[1] = "rock"
    opts[2] = "paper"
    opts[3] = "scissors"

    do {
        print ""
        print "1 - rock"
        print "2 - paper"
        print "3 - scissors"
        print "9 - end game"

        ret = getline < "-"

        if (ret == 0 || ret == -1) {
            exit
        }

        val = $0

        if (val == 9) {
            exit
        } else if (val != 1 && val != 2 && val != 3) {
            print "Invalid option"
            continue
        } else {
            play_game(val)
        }

    } while (1)
}

function play_game(val) {
    r = int(rand()*3) + 1
    print "I have " opts[r] " you have "  opts[val]

    if (val == r) {
        print "Tie, next throw."
        return
    }

    if (val == 1 && r == 2) {
        print "Paper covers rock, you loose."
    } else if (val == 2 && r == 1) {
        print "Paper covers rock, you win."
    } else if (val == 2 && r == 3) {
        print "Scissors cut paper, you loose."
    } else if (val == 3 && r == 2) {
        print "Scissors cut paper, you win."
    } else if (val == 3 && r == 1) {
        print "Rock blunts scissors, you loose."
    } else if (val == 1 && r == 3) {
        print "Rock blunts scissors, you win."
    }
}

And now we can play the first game we wrote entirely in AWK:

$ awk -f rock_paper_scissors.awk 

1 - rock
2 - paper
3 - scissors
9 - end game
2
I have scissors you have paper
Scissors cut paper, you loose

1 - rock
2 - paper
3 - scissors
9 - end game
3
I have rock you have scissors
Rock blunts scissors, you loose

1 - rock
2 - paper
3 - scissors
9 - end game
9

Well, I lost. I guess I need to practice more.

Other Resources

One of the best things about AWK is how small it is. You can learn most of the standard libraries in a weekend of light puttering.

But if you spend a little more time in AWK and need actual resources, check these out:

Also, just to prove a point, here is a simple 3D version of DOOM written entirely in AWK:

Published: September 16 2019