
Kernel Korner

Memory Allocation

Memory allocation of some sort is required in practically any program, but in the Linux kernel it is more complex than in user-level code--for good reason.

by Michael K. Johnson

Memory allocation in the Linux kernel is complex, because there are significant constraints involved--and different ways of allocating memory have different constraints. This means that anyone writing Linux kernel code needs to understand the various ways of allocating memory, including the tradeoffs involved. This makes for more efficient use of memory and CPU time--you can specify exactly what you need--but it also makes for more demanding programming.

There are essentially five different ways of allocating memory in the kernel. That's a white lie, but it is close enough to the truth for anyone who needs to read this article to learn about kernel memory allocation. Three (which provide dynamic allocation) are generally useful, and two (which provide static allocation) are deprecated, and are mostly historical artifacts that should not be used. We will discuss the advantages and limitations of the useful ways first, and will only briefly mention the two deprecated ways at the end of this article so that you know what to avoid.

Memory Allocation Strategies

There are a few rules that apply no matter what form of dynamic kernel memory allocation you attempt to do. Whenever you attempt to allocate memory in kernel space, you must be prepared for an allocation error. Always check the value returned from the allocation function, and if it is 0, you will need to handle it cleanly, somehow. User-space code can be terminated with a segmentation violation if it ignores memory allocation errors, but the kernel can easily crash, bringing down the whole system.

There are several common error-handling strategies. One strategy is to attempt to allocate critical memory at the top of a function, where you are less likely to have committed yourself and can more likely return an error cleanly. This is usually the best way to handle the problem.
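For example, here is a minimal sketch of this strategy, using the kmalloc() interface described below. The structure and function names are invented for illustration, and the header locations follow kernels of roughly this vintage:

#include <linux/malloc.h>    /* kmalloc() and kfree() */
#include <linux/mm.h>        /* GFP_* priorities */
#include <linux/errno.h>

struct foo_state {
        char buf[512];
};

/* Allocate everything we need at the top, before committing to anything,
 * so a failure can be reported cleanly with a simple return. */
int foo_do_work(void)
{
        struct foo_state *state;

        state = kmalloc(sizeof(struct foo_state), GFP_KERNEL);
        if (state == NULL)
                return -ENOMEM;    /* nothing has been committed yet */

        /* ... do the real work using state ... */

        kfree(state);
        return 0;
}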

Another strategy, usually used together with allocation at the top of the function, is to allocate an "easy" amount of memory for the memory management system to provide, and then parcel it out for various purposes during the life of the function, effectively doing your own memory management. Several subsystems in the kernel do this, such as the high-level SCSI drivers and the network code. Both include special memory allocation functions which are only supposed to be used in those subsystems. These are not documented here, on the assumption that the documentation for those subsystems should cover their subsystem-specific memory allocation routines.

Yet another strategy, which will only work if you are not in "critical" sections of code, is to allow the kernel to schedule another process by calling schedule() and then to try again later, when schedule returns. Note that some kinds of allocation are not safe to call even once from within critical code; that will be covered when we discuss the individual functions.
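A sketch of that retry-after-schedule() approach, assuming ordinary process context where sleeping is allowed (the function name is invented, and looping indefinitely is purely illustrative--under real memory pressure such a loop could spin for a long time):

#include <linux/malloc.h>
#include <linux/mm.h>        /* GFP_* priorities */
#include <linux/sched.h>     /* schedule() */

/* Keep retrying a modest allocation, yielding the processor between
 * attempts.  Only legal outside interrupt handlers and critical code. */
void *alloc_or_yield(unsigned int size)
{
        void *p;

        while ((p = kmalloc(size, GFP_KERNEL)) == NULL)
                schedule();    /* let other processes run, then retry */

        return p;
}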

The fundamental rule is not to write algorithms that commit themselves to completing without having been guaranteed the resources they need in order to complete. Memory is one of the scarcest and most commonly needed of the resources that must be guaranteed, and the only way to guarantee that memory will be available is to allocate it.

Kmalloc

The kmalloc() function allocates memory at two levels: it uses a "bucket" system to allocate memory in units up to nearly a page (4Kb on the i86) in length, and uses a "buddy" system on lists of different sizes of contiguous chunks of memory to allocate memory in units up to 128Kb (on the i86) in length. Only in recent kernels has it been able to allocate memory in units over 4Kb in length, and allocating large amounts of memory with kmalloc is very likely to fail, especially in low-memory situations, and especially on machines with less memory.

Kmalloc is very flexible, as demonstrated by its calling convention:

void * kmalloc(unsigned int size, int priority);

Note the priority argument: this is what makes kmalloc so flexible. It is possible to use kmalloc in very constrained circumstances, such as from an interrupt handler. Interrupt-driven code, or code that cannot be pre-empted but still needs to allocate memory, can call kmalloc with the GFP_ATOMIC priority. This will be more likely to fail, because it cannot swap or do anything else which would cause implicit or explicit I/O to occur. Code with relaxed requirements, which may legitimately be pre-empted, should instead call kmalloc with the GFP_KERNEL priority. This may cause paging and may cause schedule() to be called, but it has a higher chance of success.
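For instance, a hypothetical driver might use the two priorities like this (the function and buffer names are mine, not from any real driver):

#include <linux/malloc.h>
#include <linux/mm.h>        /* GFP_* priorities */
#include <linux/errno.h>

/* Called from an interrupt handler: we may not sleep, so use GFP_ATOMIC
 * and be prepared for the allocation to fail. */
static void mydev_intr_buffer(void)
{
        char *buf = kmalloc(256, GFP_ATOMIC);

        if (buf == NULL)
                return;        /* drop the data; we cannot wait for memory */
        /* ... fill in buf and queue it for later processing ... */
}

/* Called from ordinary process context: GFP_KERNEL may sleep,
 * but is much more likely to succeed. */
static int mydev_setup_buffer(char **bufp)
{
        *bufp = kmalloc(1024, GFP_KERNEL);
        if (*bufp == NULL)
                return -ENOMEM;
        return 0;
}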

In order to dynamically allocate memory that can be accessed via DMA, the GFP_DMA priority should be used. It does stress the memory system, particularly if large amounts of memory are requested, and is quite likely to fail, so be prepared to try again. It should be noted that GFP_DMA is only likely to fail on systems with severe limitations on DMA transfers--such as computers using the common ISA bus. Not all platforms are affected by this problem.

Memory allocated with kmalloc() is freed with kfree() (or kfree_s()).
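Putting GFP_DMA and kfree() together, here is a sketch of a driver setting up a DMA-capable buffer. An ISA-style device and all of the names are assumptions made for the sake of the example:

#include <linux/malloc.h>
#include <linux/mm.h>        /* GFP_* priorities */
#include <linux/errno.h>

#define MYDEV_DMA_BUF_SIZE (32 * 1024)    /* hypothetical transfer size */

static char *mydev_dma_buf;

/* Allocate a buffer the ISA DMA controller can actually reach.  GFP_DMA
 * requests are hard on the allocator, so fail gracefully if necessary. */
int mydev_setup_dma(void)
{
        mydev_dma_buf = kmalloc(MYDEV_DMA_BUF_SIZE, GFP_DMA);
        if (mydev_dma_buf == NULL)
                return -ENOMEM;
        return 0;
}

void mydev_release_dma(void)
{
        kfree(mydev_dma_buf);
        mydev_dma_buf = NULL;
}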

Vmalloc

For allocating large areas of virtually contiguous memory that do not have to be physically contiguous for interfacing with hardware, the new vmalloc() function (with the same calling convention as conventional malloc()) will cause less stress on the memory subsystem. It allocates possibly non-contiguous blocks of free memory and maps them into one contiguous space in high memory. It is less likely to fail than kmalloc in many situations. It does not take a priority argument like kmalloc does. It cannot be called from within an interrupt, and it may implicitly cause pre-emption to occur.

Memory allocated with vmalloc is not DMA-able, even on systems without DMA restrictions, because DMA under Linux assumes a 1-1 logical-physical page mapping. This simplifies memory management in several ways, and is not a severe restriction, because kmalloc provides a way to get DMA-capable memory.

Just because it is addressed virtually does not mean that this memory is subject to paging to disk, despite rumors to the contrary. The "virtual" in vmalloc refers only to the addressing, which is not a 1-1 mapping from virtual to physical address space, unlike the rest of the kernel. Swapping may be initiated to provide the memory during a call to vmalloc(), but the vmalloced memory will not then be swapped out.

Memory allocated with vmalloc() is freed with vfree().
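As an example, here is a sketch of allocating a large, virtually contiguous scratch table with vmalloc(). The size and names are illustrative only, and the header that declares vmalloc() has moved around between kernel versions:

#include <linux/mm.h>        /* vmalloc()/vfree() in kernels of this vintage */
#include <linux/errno.h>

/* Half a megabyte of virtually contiguous memory: fine for the CPU to
 * use, but, as noted above, not usable for DMA. */
int build_big_table(void)
{
        unsigned long *table = vmalloc(512 * 1024);

        if (table == NULL)
                return -ENOMEM;

        /* ... fill in and use the table ... */

        vfree(table);
        return 0;
}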

get_free_pages

Now we learn what the GFP above stands for: get_free_page (well, more precisely, __get_free_pages). The GFP_* priority simply specifies how this function goes about attempting to find free pages of memory. As you may guess, the same GFP_* values are used for these functions as well.

This is the way to request an amount of memory that is easy for the memory subsystem to allocate. This is the lowest-level--and therefore the lowest-overhead--way of dynamically allocating memory. If you need a chunk of memory larger than half a page but no larger than a page (when deciding this, be aware that page size varies from architecture to architecture; it is 4Kb on the i86 and 8Kb on the DEC Alpha, for instance), especially if you only need it for the duration of the current procedure, this can be the right way to go. Also, if you are working on subsystem-specific memory management, you almost certainly want to allocate your memory this way.

If you want only one page, call get_free_page(priority), where priority is one of the GFP_* values. Of course, the same rules about which GFP_* value is correct apply as for kmalloc. If you only want one page and don't care if it has been cleared (set to all zero values), use __get_free_page(priority) instead, since most of the overhead of allocating a page with get_free_page goes to clearing the page.
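A brief sketch of the single-page calls (the function name is invented; freeing the pages is covered below):

#include <linux/mm.h>        /* get_free_page() and friends, GFP_* */
#include <linux/errno.h>

int grab_two_pages(void)
{
        unsigned long table, scratch;

        /* A cleared page, useful when stray old data would be a problem. */
        table = get_free_page(GFP_KERNEL);

        /* A page we will overwrite completely, so skip the clearing. */
        scratch = __get_free_page(GFP_KERNEL);

        if (!table || !scratch) {
                if (table)
                        free_page(table);
                if (scratch)
                        free_page(scratch);
                return -ENOMEM;
        }

        /* ... use the pages ... */

        free_page(table);
        free_page(scratch);
        return 0;
}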

If you need to allocate more than one consecutive page, you can do so, although this is more likely to fail than allocating a single page, and the more pages you wish to allocate, the less likely you are to succeed. You can only allocate a number of pages which is a power of two. __get_free_pages(priority, order) is called with the same priority argument; the order argument gives the size according to the formula PAGE_SIZE * 2^order, so an order of 0 gives one page, of 1 gives two pages, of 2 gives four pages, and so on (in current kernels, at least) up to 5, which gives 32 pages, which on the i86 architecture is 128Kb. PAGE_SIZE, as you may have guessed, is the standard macro for the number of bytes in a page.

The __get_dma_pages() function works exactly like __get_free_pages(), except that it allocates pages which are capable of being used for DMA, and it puts more stress on the memory allocation system.

Pages allocated with get_free_page() or __get_free_page() are freed with free_page(), and pages allocated with __get_free_pages() and __get_dma_pages() are freed with free_pages().
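Putting the multi-page interface together, here is a sketch of allocating and freeing an eight-page (order 3) buffer, following the calling convention described above. The exact prototypes have varied a little between kernel versions, and the names are invented:

#include <linux/mm.h>        /* __get_free_pages(), free_pages(), GFP_* */
#include <linux/errno.h>

#define MYBUF_ORDER 3                      /* 2^3 = 8 contiguous pages */

int use_contiguous_buffer(void)
{
        unsigned long buf;

        buf = __get_free_pages(GFP_KERNEL, MYBUF_ORDER);
        if (buf == 0)
                return -ENOMEM;            /* large orders fail easily */

        /* ... PAGE_SIZE * 8 bytes of physically contiguous memory ... */

        free_pages(buf, MYBUF_ORDER);
        return 0;
}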

Device initialization

Now we approach the deprecated strategies. They may be useful in some circumstances, mostly in situations where they are the "easy way" to get a driver working. In those cases, it is usually best to eventually find another way to do the same thing, as neither strategy is very flexible. Neither strategy is applicable to loadable modules.

When a device which is compiled into the kernel is initialized, it is passed a pointer to available memory. It is then required to return a pointer. If the pointer it returns is higher than the pointer it got, the memory in between the two pointers is reserved for the device. That memory will be in the first few megabytes of memory. The exact location will depend on how the kernel is booted. This is (perhaps unfortunately...) documented in the Linux Kernel Hackers' Guide.

Once allocated, that memory cannot be freed.

Memory initialization

This particular method is extremely deprecated, and is architecture-dependent as well. It is possible to add a function call within the body of mem_init(), which resides, for the i86 platform, in the file arch/i386/mm/init.c. In the middle of this function, two functions for initializing SCSI and sound-driver memory are provided. Also, arch/i386/kernel/head.S provides another platform-dependent way to allocate memory. This is where initial memory management is set up.

If you understand these well enough to muck with them, you don't need my help. These are last resorts for memory allocation, and you need to know exactly what you need to do, and why the dynamic allocation strategies will not work for you, before considering these "hacks".

Michael K. Johnson is the Editor of Linux Journal, and pretends to be a Linux guru in his spare time. He can be reached via email as johnsonm@ssc.com.