Programming with awk


awk provides one-dimensional arrays. Arrays and array elements need not be declared; like variables, they spring into existence by being mentioned. An array subscript may be a number or a string.

As an example of a conventional numeric subscript, the statement

   x[NR] = $0
assigns the current input line to the NRth element of the array x. In fact, it is possible in principle (though perhaps slow) to read the entire input into an array with the awk program
        { x[NR] = $0 }
   END  { . . . processing . . . }
The first action merely records each input line in the array x, indexed by line number; all processing is done in the END action.
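
As a minimal sketch of this pattern, the following fills in the processing step with one possible task: printing the input in reverse order. The three sample lines and the shell variable reversed are made up for illustration, and the awk program is wrapped in a shell command only so that it can be run as it stands.

```shell
# Store every input line in x, then print the lines in reverse order from END.
reversed=$(printf 'one\ntwo\nthree\n' | awk '
      { x[NR] = $0 }                 # record each line under its line number
END   { for (i = NR; i >= 1; i--)    # NR still holds the line count in END
            print x[i] }
')
printf '%s\n' "$reversed"
```

Because the whole input is held in memory until END, this approach suits small inputs best.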

Array elements may also be named by nonnumeric values. For example, the following program accumulates the total population of Asia and Africa into the associative array pop. The END action prints the total population of these two continents.

   /Asia/	{ pop["Asia"] += $3 }
   /Africa/	{ pop["Africa"] += $3 }
   END	{ print "Asian population in millions is", pop["Asia"]
   	  print "African population in millions is", pop["Africa"] }
On the file countries, this program generates

Asian population in millions is 1765
African population in millions is 37

In this program, if you had used pop[Asia] instead of pop["Asia"], the expression would have used the value of the variable Asia as the subscript; since that variable is uninitialized, the values would have been accumulated in pop[""].
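
The difference can be seen in a small sketch. The two input lines and the array names quoted and unquoted are invented for this illustration; they are not taken from the file countries.

```shell
# Compare a string-constant subscript with an uninitialized variable as subscript.
out=$(printf 'Asia 10\nAsia 5\n' | awk '
/Asia/ { quoted["Asia"] += $2      # string constant subscript
         unquoted[Asia] += $2 }    # Asia is an uninitialized variable here
END    { print "quoted:", quoted["Asia"]
         print "empty subscript:", unquoted[""] }
')
printf '%s\n' "$out"
```

Both lines of output show the same total, which confirms that the unquoted form accumulated its values under the empty string.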

Suppose your task is to determine the total area in each continent of the file countries. Any expression can be used as a subscript in an array reference. Thus

   area[$4] += $2
uses the string in the fourth field of the current input record to index the array area and, in that entry, accumulates the value of the second field:
   BEGIN { FS = "\t" }
         { area[$4] += $2 }
   END   { for (name in area)
                     print name, area[name] }
Invoked on the file countries, this program produces
   Africa 1888
   South America 4358
   North America 7467
   Australia 2968
   Asia 13611

This program uses a form of the for statement that iterates over all defined subscripts of an array:

   for (i in array) statement
executes statement with the variable i set in turn to each subscript for which array[i] has been defined. The loop is executed once for each defined subscript; the order in which the subscripts are visited is unspecified and depends on the implementation. Results are unpredictable when i or array is altered during the loop.
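
Because the order of traversal cannot be relied on, one common way to get repeatable output is to pass the result through sort. The sketch below assumes tab-separated input with the area in the second field and the grouping name in the fourth, as in the program above; the three input records are invented for the example.

```shell
# Sum field 2 by the name in field 4, then sort the output for a stable order.
sorted=$(printf 'a\t3\tx\tNorth\nb\t4\tx\tSouth\nc\t5\tx\tNorth\n' | awk '
BEGIN { FS = "\t" }
      { area[$4] += $2 }
END   { for (name in area)
            print name, area[name] }
' | sort)
printf '%s\n' "$sorted"
```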

awk does not provide multi-dimensional arrays, but it does permit a list of subscripts. They are combined into a single subscript with the values separated by a string that is unlikely to occur in the data (stored in the variable SUBSEP, whose default value is "\034"). For example,

   for (i = 1; i <= 10; i++)
   	for (j = 1; j <= 10; j++)
   		     arr[i,j] = ...
creates an array which behaves like a two-dimensional array; the subscript is the concatenation of i, SUBSEP, and j.
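
The following sketch goes slightly beyond the text above: it tests for a combined subscript with the parenthesized form (i,j) in arr, and it recovers the individual subscripts by splitting the stored key on SUBSEP. The array part and the stored value are hypothetical names chosen for the example.

```shell
# Store one element under a two-part subscript, test for it, and take the key apart.
out=$(awk 'BEGIN {
    arr[2,3] = "x"                      # stored under the single key 2 SUBSEP 3
    if ((2,3) in arr)                   # parenthesized list tests a multiple subscript
        print "found"
    for (key in arr) {
        split(key, part, SUBSEP)        # recover the individual subscripts
        print part[1], part[2]
    }
}')
printf '%s\n' "$out"
```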

You can determine whether a particular subscript i occurs in an array arr by testing the condition i in arr, as in

   if ("Africa" in area) ...
This condition performs the test without the side effect of creating area["Africa"], which would happen if you used
   if (area["Africa"] != "") ...
Note that neither is a test of whether the array area contains an element with the value "Africa".
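
The side effect can be observed by counting the elements of the array after each kind of test. This is a minimal sketch; the single initial element and the counter variables n and k are assumptions made for the illustration.

```shell
# Count the elements of area after an "in" test and again after a direct reference.
out=$(awk 'BEGIN {
    area["Asia"] = 1
    if ("Africa" in area) print "unexpected"       # does not create area["Africa"]
    n = 0; for (k in area) n++
    print "after in test:", n
    if (area["Africa"] != "") print "unexpected"   # the reference creates the element
    n = 0; for (k in area) n++
    print "after reference:", n
}')
printf '%s\n' "$out"
```

The count rises from 1 to 2 only after the direct reference, even though neither condition is true.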

It is also possible to split any string into fields in the elements of an array using the built-in function split. The function

   split("s1:s2:s3", a, ":")
splits the string s1:s2:s3 into three fields, using the separator :, and stores s1 in a[1], s2 in a[2], and s3 in a[3]. The number of fields found, here three, is returned as the value of split. The third argument of split is an extended regular expression to be used as the field separator. If the third argument is missing, FS is used as the field separator.
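
A short sketch of both forms follows. The first call repeats the example above and uses its return value; the second omits the third argument, so the default FS applies and runs of blanks count as a single separator. The second string and the array b are made up for the example.

```shell
# Call split with an explicit separator, then with the default FS.
out=$(awk 'BEGIN {
    n = split("s1:s2:s3", a, ":")
    print n, a[1], a[3]
    m = split("alpha beta  gamma", b)   # no third argument: FS (default blank) is used
    print m, b[3]
}')
printf '%s\n' "$out"
```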

An array element may be deleted with the delete statement:

   delete arrayname[subscript]
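
As a brief illustration, the sketch below deletes one element and uses the in test to confirm that only that element is gone. The array contents are hypothetical values chosen for the example.

```shell
# Delete one element and check membership afterwards.
out=$(awk 'BEGIN {
    pop["Asia"] = 1; pop["Africa"] = 2
    delete pop["Asia"]
    if (!("Asia" in pop)) print "Asia removed"
    if ("Africa" in pop) print "Africa kept"
}')
printf '%s\n' "$out"
```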


© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004