This HN post got me to brush up on my awk beyond the basic {print $2,$3} for extracting fields from csv-like files. Here are some of my notes from The AWK Programming Language by Alfred Aho, Peter Weinberger and Brian Kernighan.

  • an awk program is a sequence of pattern-action statements: pattern { action }
  • 2 types of data: numbers and strings.
  • fields in current input line: $1, $2, ...; $0 is the entire line.
  • if there is no pattern, perform action on every line
  • if there is no action, print lines that match the pattern
    $ echo -e 'alice 20\nbob 30\niyer 24' | awk '$2>20'  # all people aged >20
    bob 30
    iyer 24
    
  • NF - number of fields in current line
    $ echo 'a b c d' | awk '{print NF, $1, $NF}'  # print field count, the 1st and last field
    4 a d
    
  • NR - number of lines read so far, i.e. the current line number
    $ echo -e 'alice 20\nbob 30' | awk '{print NR, $0}'  # prefix with line number
    1 alice 20
    2 bob 30
    
  • simple arithmetic
    $ echo 'alice 5.50 22' | awk '{print $1, $2 * $3}'
    alice 121
    
  • use printf like in C
    $ echo 'alice 5.50 22' | awk '{printf("%s $%.2f\n", $1, $2 * $3)}'
    alice $121.00
    
  • string/regex matching
    $ echo -e 'susan 20\nbob 24\nsusie 12' | awk '$1=="susie"'
    susie 12
    $ echo -e 'susan 20\nbob 24\nsusie 12' | awk '$1 ~ /^su/'  # all lines where the 1st field starts with "su"
    susan 20
    susie 12
    
  • use ||, && and ! as in C
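    A minimal sketch of combining conditions with these operators, reusing the sample people data from above. The variable names (people, and_not, either) are only illustrative, and printf is used instead of echo -e because it behaves the same across shells.

```shell
# sample input: name and age, one person per line
people=$(printf 'alice 20\nbob 30\niyer 24')

# && and !: older than 20 AND name does not start with "b"
and_not=$(printf '%s\n' "$people" | awk '$2 > 20 && !($1 ~ /^b/)')
echo "$and_not"

# ||: younger than 21 OR named "bob"
either=$(printf '%s\n' "$people" | awk '$2 < 21 || $1 == "bob"')
echo "$either"
```

    The first command should print only the iyer line, the second the alice and bob lines.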

  • BEGIN, END - special patterns that match before the first line is read and after the last line has been processed.
    $ echo -e 'susan,20\nbob,24\nsusie,12' | awk 'BEGIN {print "name,age"} {print}'  # add header to csv file
    name,age
    susan,20
    bob,24
    susie,12
    $ echo -e 'susan 20\nbob 24\nsusie 12' | awk '{sum=sum+$2} END{print sum/NR}' # average age
    18.6667
    
  • if-else, while and for loops similar to C
    $ cat /tmp/tmp
    name,age
    Alice,20
    Bob,30
    $ awk -F, '{for(i=1;i<=NF;i+=1)if(NR==1)a[i]=$i;else a[i]=a[i] FS $i}END{for(i=1;i<=NF;i+=1)print a[i]}' /tmp/tmp
    name,Alice,Bob
    age,20,30
    

    The one-liner above transposes the input csv file. Here it is expanded for readability:

    {  # since this action block does NOT have a pattern, it will execute for all lines
      for (i = 1; i <= NF; i += 1)
        if (NR == 1)  # if this is the first line
          a[i] = $i
        else
          a[i] = a[i] FS $i  # concatenate current cell to the ith "column" string separated by the Field Separator
    }
      
    END { # this action block gets executed after all lines have been processed
      for (i = 1; i <= NF; i += 1)  # in END, NF keeps its value from the last line read
        print a[i]
    }
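
    The transpose example only exercises for and if-else. As a rough sketch of the while loop mentioned in the bullet, this prints the fields of each line in reverse order. The variable name reversed is only illustrative.

```shell
reversed=$(echo 'a b c d' | awk '{
  i = NF                      # start from the last field
  while (i > 0) {
    # separate fields with a space, end the line after field 1
    printf("%s%s", $i, (i > 1 ? " " : "\n"))
    i -= 1
  }
}')
echo "$reversed"
```

    For the input above this should print d c b a.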
    
  • use length for finding string length
    $ echo -e 'name age\nalice 20\nbob 30' | awk '{print length($0)}'  # print length of each line
    8
    8
    6