How to spot a DMA memory shortage.

This is a writeup on how to determine whether a system has deadlocked because all DMA-able memory has been used up. Often a shortage of DMA-able memory results only in a system slowdown, because the kernel will try to free up as much memory as possible when it needs to allocate more. But sometimes the system gets into a situation where there isn't a single drop of DMA-able memory left to be found.

First of all, let me explain what DMA-able memory means. DMA-able memory is memory which can be used directly by the hardware for Direct Memory Access operations. An example is a disk controller which takes a pointer to a memory buffer and then copies disk sectors directly into that buffer without the assistance of the CPU. Devices which are fully EISA-capable (32-bit) can DMA anywhere in memory. Those which are ISA (16-bit) can only DMA within the first 16MB of memory, because the ISA bus does not have enough address bits to reach any higher. I don't think any 8-bit boards do DMA. If they did, they would only be able to reach 640K worth of memory space.
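
For reference, the arithmetic behind that 16MB limit: the ISA bus carries 24 address lines, so the highest address a bus master or DMA controller can generate is

        2^24 = 16,777,216 bytes = 16MB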

Our kernel has a tunable parameter called MAXDMAPAGE which defines an upper limit on DMA space. Its value is the highest-numbered memory page to be considered DMA-able. If the value given is zero, or if it's found to be greater than or equal to the total memory size (in pages) at boot time, no separate DMA-able memory pool will be created; all memory is then generic and also DMA-able. If MAXDMAPAGE is a non-zero value less than the total memory size, system memory is split in two. Memory is then allocated out of either the DMA-able or the generic pool, depending on a flag passed to page_get(), the low-level kernel memory allocator. This can lead to a situation where there is lots of free generic memory in the system but the DMA-able memory is all used up. Unix was not developed with the shortcomings of the ISA/Intel architecture in mind, so many parts of the kernel assume all memory is the same. For this reason, memory is often allocated from the DMA pool by default, which causes it to disappear rather quickly.
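
To make the split concrete, here is a minimal sketch (in pseudo-C) of the decision page_get() makes when memory is divided. Only the names P_DMA, dma_freemem and freemem are real, and all three appear later in this writeup; the pool structures and the pool_take() helper are invented for illustration. Note also that the real page_get() takes a size in bytes (as the backtrace below shows), while the sketch works in pages:

        /* Minimal stand-ins so the sketch is self-contained; the real
         * kernel declarations differ. */
        typedef struct page page_t;
        struct pool;
        extern unsigned long dma_freemem;       /* free DMA-able pages */
        extern unsigned long freemem;           /* free generic pages */
        extern struct pool dma_pool, generic_pool;
        extern page_t *pool_take(struct pool *, unsigned int);
        #define P_DMA   0x0004                  /* caller wants DMA-able pages */

        /* Assumes memory is split (dma_check_on != 0, see below). */
        page_t *
        page_get_sketch(unsigned int npages, int flags)
        {
                if (flags & P_DMA) {
                        /* Caller needs pages below MAXDMAPAGE. */
                        if (dma_freemem >= npages)
                                return pool_take(&dma_pool, npages);
                        return NULL;    /* shortage: caller may sleep and retry */
                }
                /* Generic request: pages above MAXDMAPAGE will do. */
                if (freemem >= npages)
                        return pool_take(&generic_pool, npages);
                return NULL;
        }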

The good news is that since 1.1.1 it is considered safe to always set the value of MAXDMAPAGE to 0. All of our supported drivers will take steps to ensure that memory they need for DMA to/from ISA boards really is DMA-able. NOTE WELL: Third-party device drivers for 16-bit ISA boards are NOT guaranteed to work properly in this case. The MAXDMAPAGE tunable is basically a kludge to "dumb down" the kernel to accommodate ISA boards on an EISA system. As of 1.1.1, all this stupidity has been moved into the appropriate drivers. If it should ever be necessary to limit DMA space in the kernel on behalf of a third-party board/driver, the only sensible value of MAXDMAPAGE is 4096. That is 16MB worth of pages, which is the address range of the ISA bus. Anything less is self-defeating; anything more will result in an unstable kernel. By the way, on the 6000/60 as of 1.1.1QT2, MAXDMAPAGE is ignored and always set to 4096; since the 60 is an ISA-only system, no other value makes sense. Setting other "nonsense" values of MAXDMAPAGE will not cause any errors or warnings from idbuild. Like I said, it's a kludge. Note also that the discussion of MAXDMAPAGE in the 1.2 Release Notes is wrong. Trust me, this is how it *really* works.
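
For the record, on an SVR4-style system the change would normally be made with the ID tools followed by a kernel rebuild; assuming the stock tool locations, something like:

        /etc/conf/bin/idtune MAXDMAPAGE 0       # set the tunable
        /etc/conf/bin/idbuild                   # rebuild the kernel, then reboot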

When looking at a kernel with kcrash, the variable "dma_check_on" will indicate whether memory has been split into two areas or not. If dma_check_on == 0, there is no DMA-able pool and none of the other variables having to do with DMA-able memory are meaningful. If dma_check_on is non-zero, then there are two distinct memory pools.

When dma_check_on is non-zero, the variable "dma_freemem" contains the number of currently free DMA-able memory pages in the system (a page is 4096 bytes). The variable "freemem" will contain the number of generic memory pages that are free. When dma_check_on is zero, dma_freemem will always be zero and freemem will contain the number of free pages, all of which are considered to also be DMA-able.
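
In other words (a sketch only; the three variable names are real kernel symbols from this writeup, the helper is invented):

        extern int dma_check_on;
        extern unsigned long dma_freemem, freemem;

        /* How many free pages are DMA-able right now? */
        unsigned long
        free_dma_pages(void)
        {
                if (dma_check_on == 0)
                        return freemem; /* one pool: all free memory is DMA-able */
                return dma_freemem;     /* split pools: count the DMA pool only */
        }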

A field in the "tune" structure in the kernel will tell you how much memory was placed in the DMA pool at boot time. This is the value of MAXDMAPAGE. If you have the macro named "tune", it will display all the fields of the tune variable. The field which corresponds to MAXDMAPAGE is called "t_dmalimit". If you don't have this macro, the value is located at "tune+2c". The default value is 0x1000, which is 4096 decimal. That represents 4096 pages x 4096 bytes = 16MB.
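
As a sanity check on that offset (assuming the preceding fields are all 32-bit quantities):

        0x2c = 44 bytes = 11 x 4 bytes

so t_dmalimit is the twelfth 32-bit word of the tune structure, which matches the first word of the "dl tune+2c" output shown later.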

Now, let's take a look at an example from a real crash dump. This was a 3-processor 6000/65; the customer complained that their Progress database "crashed". The backtrace showed it was a pushbutton panic, as expected. Examining the process list showed everything was sleeping and all CPUs were idle. Here are the first few lines of the ps list:

ADDRESS  PID   PPID  UID   FLAGS    K U R WCHAN    ST  COMMAND
D1973400 09744 02063 00000 00002018 - - - D120C998 SLP /usr/sbin/inetd
D1C21200 09738 00001 00000 00102018 - - - D01914FC SLP find . -name *.p -print
D1C08C00 09735 00001 00000 02002018 - - - D01914FC SLP in.tftpd
D1289400 09702 09025 00000 00502018 - - - D01914FC SLP /var/dlc/_progres -p pros
tart.p -pf /var/dlcft/mproft.pf -! /var/dlcft/ftmsgs -
We are interested in what's going on with Progress, so let's look at its backtrace:
S> btproc D1289400
E0004A50: 00000000(D01914FC,2)  <- a call to sleep()
E0004A88: page_get+373(5000,D)
E0004AA8: segkmem_alloc+5A(D0270CE0,D1FC9000,5000,0)
E0004AD0: sptalloc+19F(5,1,0,0)
E0004B10: kmem_allocbpool+CA(0)
E0004B40: kmem_alloc+4F7(1000,0)
E0004B54: kmem_zalloc+13(1000,0)
E0004B80: getfreeblk+3B7(1000,0)
E0004BA0: ngeteblk+14(1000)
E0004BD0: indirtrunc+64(E0004C04,6E754,FFFFFFFF,0)
E0004D64: ufs_itrunc+41F(D1215218,0)
E0004D84: ufs_iinactive+19B(D1215218)
E0004D90: ufs_inactive+F(D1215220,D153BC80)
E0004DA8: vn_inactive+1E(D1215220)
E0004DC8: closef+115(D15CAB80)
E0004DDC: close+38(E00050B0,E0004E0C)
E0004E28: systrap+2A6(E0004E34)
E0004E34: sys_call+38()
We see from this that Progress was closing a file and the kernel was trying to allocate space to pull in an i-node block to update the information about it. It went down through kmem_alloc and arrived at the low-level memory page allocator, page_get(). The arguments to page_get are the size (in bytes) and some flags. Hex 1000 is 4K (1 page), so hex 5000 is 5 pages. One of the bits in the flags indicates it wanted DMA-able memory. This flag value is in a private header file, but even without kernel sources we can make a reasonable guess that it is asking for DMA-able memory by looking further. The header where these flags are defined is vm/page.h; here are the possible flags for page_get():
        #define P_CANWAIT       0x0001
        #define P_PHYSCONTIG    0x0002
        #define P_DMA           0x0004
        #define P_NORESOURCELIM 0x0008
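In fact, we can decode the flags argument from the backtrace directly. Hex D breaks down as:
        0xD = 0x0008 | 0x0004 | 0x0001
            = P_NORESOURCELIM | P_DMA | P_CANWAIT
So the P_DMA bit was indeed set: this call was asking for DMA-able memory.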
At this point it looks like the process was put to sleep because the memory request could not be satisfied. Let's look at a few variables to see what we can find:
S> dl dma_check_on
dma_check_on:  00000001 00000000 00000000 00000000  ................  .
We see from this that memory *is* divided on this system and that dma_freemem should be valid. Let's take a look:
S> dl dma_freemem
dma_freemem:  00000001 00000001 D019102C D17D7480  ........,....t}.  .
Aha! This shows there is only one free page of DMA-able memory left. The call to page_get was asking for 5 pages. Let's check how much generic memory is left:
S> dl freemem
freemem:  00000515 00000000 00000022 00000000  ........".......  .
That's 0x515 = 1301 pages; 1301 x 4096 = 5,328,896 bytes. Plenty.

We can make a reasonable assumption here that the kernel was trying to allocate 5 pages of DMA-able memory and couldn't get it, especially since it was apparently setting up to do disk I/O to read in an i-node block.

Now, let's take another look at the ps line for Progress:

D1289400 09702 09025 00000 00502018 - - - D01914FC SLP /var/dlc/_progres -p pros
tart.p -pf /var/dlcft/mproft.pf -! /var/dlcft/ftmsgs -
We see that the address it is sleeping on is D01914FC:
S> dl D01914FC
freemem:  00000515 00000000 00000022 00000000  ........".......
We see that's the address of the freemem variable, which is a reasonable address for the memory allocator to sleep on. By sending the output of the ps command to a file, then grepping for D01914FC, we see:
D1C21200 09738 00001 00000 00102018 - - - D01914FC SLP find . -name *.p -print
D1C08C00 09735 00001 00000 02002018 - - - D01914FC SLP in.tftpd
D1289400 09702 09025 00000 00502018 - - - D01914FC SLP /var/dlc/_progres -p pros
tart.p -pf /var/dlcft/mproft.pf -! /var/dlcft/ftmsgs -
D1A5D200 09009 00001 00000 00102018 - - - D01914FC SLP -ksh
There are a total of four processes all waiting for memory. Since we saw that there is plenty of generic memory available, they must all be waiting for DMA-able memory. If there were any DMA-able memory to be freed up anywhere, the kernel would have found it by now. We must have a deadlock condition where all DMA-able memory is in use and there is no hope of getting any more.
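
As background, this is the classic UNIX sleep/wakeup idiom at work. The sketch below is pseudo-C: the sleep() call on the address of freemem with priority 2 is exactly what the top frame of the backtrace shows, but the wrapper functions and the enough_pages() helper are invented for illustration:

        #include <sys/types.h>          /* caddr_t */

        extern unsigned long freemem;
        extern int sleep(caddr_t chan, int pri);        /* classic sleep/wakeup */
        extern void wakeup(caddr_t chan);
        extern int enough_pages(unsigned int npages, int flags);  /* invented */

        /* Allocator side: block until someone frees pages. */
        void
        wait_for_pages(unsigned int npages, int flags)
        {
                while (!enough_pages(npages, flags))
                        sleep((caddr_t)&freemem, 2);    /* cf. sleep(D01914FC,2) */
        }

        /* Page-freeing side: rouse every process sleeping on &freemem. */
        void
        pages_freed(void)
        {
                wakeup((caddr_t)&freemem);
        }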

Now, let's see how big the DMA pool is:

S> dl tune+2c
tune+2C:  00001000 0000012C 00000010 00000019  ....,...........  .
Hex 1000 = 4096 pages; 4096 pages x 4096 bytes = 16MB. How much total memory? The appropriate line displayed by the stat macro shows:
Memsz           64552960
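Working that out: 64,552,960 bytes / 4096 bytes per page = 15,760 pages of physical memory, of which only the first 4096 (about 26%) are eligible for DMA.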
So, only about a quarter of total memory is eligible for use by DMA. We know this is a 6000/65 running 1.1.1, so there is no reason to restrict the DMA space on this system. We recommend that the customer change the value of MAXDMAPAGE to 0. This will allow them to take full advantage of all the memory in their system.