Programming with awk

Generating reports

awk is especially useful for producing reports that summarize and format information. Suppose you want to produce a report from the file countries in which the continents are listed alphabetically, and the countries on each continent are listed beneath it in decreasing order of population:

   Africa:
           Sudan          19
           Algeria        18

   Asia:
           China         866
           India         637
           USSR          262

   Australia:
           Australia      14

   North America:
           USA           219
           Canada         24

   South America:
           Brazil        116
           Argentina      26

As with many data processing tasks, it is much easier to produce this report in several stages. First, create a list of continent-country-population triples, in which each field is separated by a colon. This can be done with the following program triples, which uses an array pop indexed by subscripts of the form continent:country to store the population of a given country. The print statement in the END section of the program creates the list of continent-country-population triples that are piped to the sort routine.

   BEGIN { FS = "\t" }
         { pop[$4 ":" $1] += $3 }
   END   { for (cc in pop)
           print cc ":" pop[cc] | "sort -t: +0 -1 +2nr" }
The arguments for sort deserve special mention. The -t: argument tells sort to use : as its field separator. The +0 -1 arguments make the first field the primary sort key. In general, +i -j makes fields i+1, i+2, . . ., j the sort key. If -j is omitted, the fields from i+1 to the end of the record are used. The +2nr argument makes the third field, numerically decreasing, the secondary sort key (n is for numeric, r for reverse order). Invoked on the file countries, this program produces as output
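   Africa:Sudan:19
   Africa:Algeria:18
   Asia:China:866
   Asia:India:637
   Asia:USSR:262
   Australia:Australia:14
   North America:USA:219
   North America:Canada:24
   South America:Brazil:116
   South America:Argentina:26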
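
The +pos -pos key notation used above is the historical sort syntax; POSIX sort expresses the same keys with -k, and some current implementations no longer accept the older form. The following is a minimal sketch of the equivalent -k keys, assuming a sort that supports -k. The function names triples_sample and sort_triples are made up for illustration, and the sample lines are a deliberately shuffled subset of the triples.

```shell
# Hypothetical helper: emit a few triples in no particular order.
triples_sample() {
    printf '%s\n' \
        'North America:Canada:24' \
        'Asia:India:637' \
        'North America:USA:219' \
        'Asia:China:866'
}

# Historical form used in the text:  sort -t: +0 -1 +2nr
# POSIX form of the same two keys:
#   -k1,1  first field only, as the primary key
#   -k3nr  third field to end of line, numeric, reversed
sort_triples() {
    sort -t: -k1,1 -k3nr
}

# Prints the Asia lines first (866, then 637),
# followed by the North America lines (219, then 24).
triples_sample | sort_triples
```

The field numbers differ by one because the historical notation counts fields from 0, while -k counts from 1.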
This output is in the right order but the wrong format. To transform the output into the desired form, run it through a second awk program format:
   BEGIN  { FS = ":" }
   {      if ($1 != prev) {
               print "\n" $1 ":"
               prev = $1
          }
          printf "\t%-10s %6d\n", $2, $3
   }
This is a control-break program that prints only the first occurrence of a continent name and formats the country-population lines associated with that continent in the desired manner. The command line
   $ awk -f triples countries | awk -f format
gives the desired report. As this example suggests, complex data transformation and formatting tasks can often be reduced to a few simple awk commands and sorts.
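
To try the two stages together, the sketch below builds a small sample input and runs the pipeline from a shell script. It makes several assumptions: the input layout (country, area, population, continent, separated by tabs) is inferred from the fields the triples program reads, the area column is a placeholder value that neither program uses, the sort keys are written in the POSIX -k form rather than the +0 -1 +2nr form, and the names workdir and make_report are hypothetical.

```shell
#!/bin/sh
# Scratch directory for the sample data and the two awk programs.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Sample input: country, placeholder area (0), population, continent.
# printf reuses the format string for each group of three arguments.
printf '%s\t0\t%s\t%s\n' \
    Canada 24  'North America' \
    India  637 'Asia' \
    USA    219 'North America' \
    China  866 'Asia' > "$workdir/countries"

# Stage 1: continent:country:population triples, sorted by continent,
# then by decreasing population (-k1,1 -k3nr matches +0 -1 +2nr).
cat > "$workdir/triples" <<'EOF'
BEGIN { FS = "\t" }
      { pop[$4 ":" $1] += $3 }
END   { for (cc in pop)
            print cc ":" pop[cc] | "sort -t: -k1,1 -k3nr" }
EOF

# Stage 2: control break on the continent name.
cat > "$workdir/format" <<'EOF'
BEGIN { FS = ":" }
{
    if ($1 != prev) {        # continent changed: print a new heading
        print "\n" $1 ":"
        prev = $1
    }
    printf "\t%-10s %6d\n", $2, $3
}
EOF

# Hypothetical wrapper around the command line shown in the text.
make_report() {
    awk -f "$workdir/triples" "$workdir/countries" | awk -f "$workdir/format"
}

make_report
```

Each continent heading is preceded by a blank line because the format program prints "\n" before the name, so the report begins with an empty line.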

© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004