grokking awk

This HN post got me to brush up on my awk beyond the basic {print $2,$3} for extracting fields from csv-like files. Here are some of my notes from The AWK Programming Language by Alfred Aho, Brian Kernighan and Peter Weinberger.

an awk program is a sequence of pattern-action statements: pattern { action }
2 types of data: numbers and strings.
fields in the current input line: $1, $2, …; $0 is the entire line.
if there is no pattern, perform action on every line

if there is no action, print lines that match the pattern

$ echo -e 'alice 20\nbob 30\niyer 24' | awk '$2>20'  # all people aged >20
bob 30
iyer 24
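
Pattern and action can also be combined, so that the action only runs on matching lines. A minimal sketch, assuming the same made-up name/age data as above (the shell variables data and result are only there for illustration):

```shell
# hypothetical sample data, same shape as the example above
data='alice 20
bob 30
iyer 24'

# pattern and action together: print only the name of people aged >20
result=$(printf '%s\n' "$data" | awk '$2 > 20 { print $1 }')
echo "$result"
# bob
# iyer
```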

NF - number of fields in current line

$ echo 'a b c d' | awk '{print NF, $1, $NF}'  # print field count, the 1st and last field
4 a d
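
NF also works as a pattern. A small sketch, assuming a made-up input with an empty line in the middle: since an empty line has zero fields, NF > 0 skips it.

```shell
# hypothetical input; the second line is empty
result=$(printf 'a b\n\nc d e\n' | awk 'NF > 0 { print NF ": " $0 }')
echo "$result"
# 2: a b
# 3: c d e
```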

NR - number of the current line (count of lines read so far)

$ echo -e 'alice 20\nbob 30' | awk '{print NR, $0}'  # prefix with line number
1 alice 20
2 bob 30
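
NR is handy as a pattern too, for example to skip a header line. A minimal sketch with made-up data (the header line is an assumption, it is not part of the example above):

```shell
# hypothetical input with a header on line 1
result=$(printf 'name age\nalice 20\nbob 30\n' | awk 'NR > 1')
echo "$result"
# alice 20
# bob 30
```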

simple arithmetic

$ echo 'alice 5.50 22' | awk '{print $1, $2 * $3}'
alice 121

use printf like in C

$ echo 'alice 5.50 22' | awk '{printf("%s $%.2f\n", $1, $2 * $3)}'
alice $121.00
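
The usual C width and alignment flags work as well. A small sketch, assuming a second made-up record for bob; the | characters are only there to make the padding visible:

```shell
# %-8s left-justifies in 8 columns, %8.2f right-aligns with 2 decimals
result=$(printf 'alice 5.50 22\nbob 4 10\n' |
  awk '{ printf("|%-8s|%8.2f|\n", $1, $2 * $3) }')
echo "$result"
# |alice   |  121.00|
# |bob     |   40.00|
```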

string/regex matching

$ echo -e 'susan 20\nbob 24\nsusie 12' | awk '$1=="susie"'
susie 12
$ echo -e 'susan 20\nbob 24\nsusie 12' | awk '$1 ~ /^su/'  # all lines where the 1st field starts with "su"
susan 20
susie 12

use ||, && and ! as in C
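
A quick sketch of the logical operators, reusing the susan/bob/susie data from the matching examples (the age threshold of 18 and the shell variable names are made up for illustration):

```shell
data='susan 20
bob 24
susie 12'

# && : name starts with "su" AND age is at least 18
both=$(printf '%s\n' "$data" | awk '$1 ~ /^su/ && $2 >= 18')
echo "$both"
# susan 20

# ! and || : name does NOT start with "su" OR age is under 18
either=$(printf '%s\n' "$data" | awk '!($1 ~ /^su/) || $2 < 18')
echo "$either"
# bob 24
# susie 12
```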

BEGIN, END - special patterns that match before the first line is read and after the last line has been processed, respectively.

$ echo -e 'susan,20\nbob,24\nsusie,12' | awk 'BEGIN {print "name,age"} {print}'  # add header to csv file
name,age
susan,20
bob,24
susie,12
$ echo -e 'susan 20\nbob 24\nsusie 12' | awk '{sum=sum+$2} END{print sum/NR}' # average age
18.6667
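
BEGIN is also a convenient place to set FS. A sketch that combines both special patterns, assuming a made-up csv input with a header line; n is a hypothetical counter so that the header does not count towards the average:

```shell
result=$(printf 'name,age\nsusan,20\nbob,24\nsusie,12\n' | awk '
BEGIN  { FS = "," }                  # set the field separator before any input is read
NR > 1 { sum += $2; n += 1 }         # skip the header line, accumulate ages
END    { if (n > 0) print sum / n }  # guard against empty input
')
echo "$result"
# 18.6667
```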

if-else, while and for loops work like in C

$ cat /tmp/tmp
name,age
Alice,20
Bob,30
$ awk -F, '{for(i=1;i<=NF;i+=1)if(NR==1)a[i]=$i;else a[i]=a[i] FS $i}END{for(i=1;i<=NF;i+=1)print a[i]}' /tmp/tmp
name,Alice,Bob
age,20,30

The one-liner above transposes the input csv file. Here it is expanded for readability:

{  # since this action block does NOT have a pattern, it will execute for all lines
  for (i = 1; i <= NF; i += 1)
    if (NR == 1)  # if this is the first line
      a[i] = $i
    else
      a[i] = a[i] FS $i  # concatenate current cell to the ith "column" string separated by the Field Separator
}
  
END { # this action block gets executed after all lines have been processed
  for (i = 1; i <= NF; i += 1)  # NF here is the field count of the last line read
    print a[i]
}
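
The transpose above only uses for and if-else. For completeness, a small sketch with a while loop that prints the fields of a line in reverse order (made-up input; i and out are just illustrative variable names):

```shell
result=$(printf 'a b c d\n' | awk '
{
  i = NF
  out = ""
  while (i >= 1) {         # walk the fields from last to first
    if (out == "")
      out = $i             # first piece, no separator needed
    else
      out = out " " $i     # later pieces are joined with a space
    i -= 1
  }
  print out
}')
echo "$result"
# d c b a
```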

use length to get the length of a string

$ echo -e 'name age\nalice 20\nbob 30' | awk '{print length($0)}'  # print length of each line
8
8
6
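
length can be used in a pattern as well. A sketch that finds the longest line, with the same input as above; max and longest are hypothetical variable names, and max starts out as 0 because uninitialized awk variables are treated as 0 in numeric comparisons:

```shell
result=$(printf 'name age\nalice 20\nbob 30\n' | awk '
length($0) > max { max = length($0); longest = $0 }  # strictly longer, so the first of equal-length lines wins
END { print max, longest }')
echo "$result"
# 8 name age
```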