Linux Kernel Architecture


Chapter 18: Page Reclaim and Swapping


Recall that isolate_lru_pages also picks pages adjacent to the page frame of a page on the free list
if lumpy reclaim is used. If the allocation order of the request that led to the current reclaim pass is
larger than the threshold order specified in PAGE_ALLOC_COSTLY_ORDER, lumpy reclaim is allowed
to use both active and inactive pages when picking pages surrounding the tag page. For small allocation
orders, only inactive pages may be used. The reason behind this is that larger allocations usually
cannot be satisfied if the kernel is restricted to inactive pages: on a busy kernel, the chance that a
large interval contains an active page is simply too big. PAGE_ALLOC_COSTLY_ORDER is set to 3 by
default, which means that the kernel considers allocations of 8 or more contiguous pages as
complicated.
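
The order check described above can be sketched in plain user-space C. This is only an illustration, not the kernel's code: the enum lumpy_mode and the helpers pages_for_order and lumpy_mode_for_order are hypothetical names introduced here, and PAGE_ALLOC_COSTLY_ORDER is redefined locally with the value quoted in the text.

```c
/* Value quoted in the text; the kernel defines this in its own headers. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical names for the two selection modes of lumpy reclaim. */
enum lumpy_mode {
    LUMPY_INACTIVE_ONLY,        /* only inactive neighbours may be picked */
    LUMPY_ACTIVE_AND_INACTIVE   /* active and inactive neighbours may be picked */
};

/* An allocation of a given order covers 2^order contiguous pages. */
static unsigned long pages_for_order(unsigned int order)
{
    return 1UL << order;
}

/* Only orders above the threshold may also take active pages. */
static enum lumpy_mode lumpy_mode_for_order(unsigned int order)
{
    return order > PAGE_ALLOC_COSTLY_ORDER ? LUMPY_ACTIVE_AND_INACTIVE
                                           : LUMPY_INACTIVE_ONLY;
}
```

With the default threshold, pages_for_order(3) yields 8, and the comparison as written lets active pages be used from order 4 onward.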

Although all pages on the inactive list are guaranteed to be inactive, lumpy reclaim can lead to active
pages on the result list of isolate_lru_pages. To account for these pages properly, the auxiliary function
clear_active_flags iterates over all pages, counts the active ones, and clears the page flag PG_active
from each of them. Finally, the page list can be pushed onward to shrink_page_list for writeout. Notice
that the asynchronous mode is employed.
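
The counting step can be illustrated with a small user-space model. The sketch below is not the kernel's clear_active_flags: struct toy_page, the bit PG_ACTIVE_BIT, and the function toy_clear_active_flags are hypothetical stand-ins, and a simple singly linked list takes the place of the kernel's list handling.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's PG_active page flag. */
#define PG_ACTIVE_BIT (1UL << 0)

/* Simplified page descriptor used only for this illustration. */
struct toy_page {
    unsigned long flags;
    struct toy_page *next;
};

/*
 * Walk the isolated list, clear the active bit wherever it is set,
 * and return how many pages had it set.
 */
static unsigned long toy_clear_active_flags(struct toy_page *head)
{
    unsigned long nr_active = 0;

    for (struct toy_page *page = head; page != NULL; page = page->next) {
        if (page->flags & PG_ACTIVE_BIT) {
            page->flags &= ~PG_ACTIVE_BIT;
            nr_active++;
        }
    }
    return nr_active;
}
```

Other flag bits are left untouched, so a second call on the same list returns 0.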

Notice that it is not certain that all pages selected for reclaim can actually be reclaimed.
shrink_page_list leaves such pages on the passed list and returns the number of pages for
which it succeeded in initiating writeout. This figure must be added to the total number of swapped-out
pages to determine when work may be terminated.

Direct reclaim requires one more step:

mm/vmscan.c
        if (nr_freed < nr_taken && !current_is_kswapd() &&
                        sc->order > PAGE_ALLOC_COSTLY_ORDER) {
                congestion_wait(WRITE, HZ/10);
                ...
                nr_freed += shrink_page_list(&page_list, sc,
                                        PAGEOUT_IO_SYNC);
        }

If not all pages that were supposed to be reclaimed could be reclaimed, that is, if nr_freed <
nr_taken, some pages on the list have been locked and could not be written out in asynchronous mode.^13
If the kernel is performing the current reclaim pass in direct reclaim mode, that is, was not called from
the swapping daemon kswapd, and reclaims to fulfill a high-order allocation, then it first waits for any
congestion on the block devices to settle. Afterward, another writeout pass is performed in synchronous
mode. This has the drawback that higher-order allocations are somewhat delayed, but since they do not
happen so often, this is not an issue. Allocations whose order does not exceed PAGE_ALLOC_COSTLY_ORDER,
which arise much more frequently, are not disturbed.
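
The three conditions that trigger the synchronous pass can be isolated in a small predicate. The following is a user-space sketch only: needs_sync_writeout_pass is a hypothetical helper, the is_kswapd parameter stands in for the result of current_is_kswapd(), and PAGE_ALLOC_COSTLY_ORDER is redefined locally with the value quoted in the text.

```c
#include <stdbool.h>

/* Value quoted in the text; the kernel defines this in its own headers. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Return true if a second, synchronous writeout pass is warranted:
 * some isolated pages were not freed, the caller is not kswapd
 * (direct reclaim), and the request is a high-order allocation.
 */
static bool needs_sync_writeout_pass(unsigned long nr_freed,
                                     unsigned long nr_taken,
                                     bool is_kswapd,
                                     unsigned int order)
{
    return nr_freed < nr_taken &&
           !is_kswapd &&
           order > PAGE_ALLOC_COSTLY_ORDER;
}
```

If any one of the three conditions fails, the predicate is false and the extra delay is not incurred.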

Finally, the non-reclaimable pages must be returned to the LRU lists. Lumpy reclaim and failed writeout
attempts might have led to active pages on the local list, so both the active and the inactive LRU lists
are possible destinations. To preserve the LRU order, the kernel iterates over the local list from tail to
head. Depending on whether the page is active or not, it is returned to the start of the appropriate LRU
list using either add_page_to_active_list or add_page_to_inactive_list. Once again, the usage
counter of each page must be decremented by 1 because it was incremented accordingly at the start of
the procedure. The now familiar page vectors are used to ensure that this is done as quickly as possible
because they perform processing block-by-block.
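
Why walking from tail to head preserves the order can be seen in a toy model. The sketch below is an assumption-laden illustration and not the kernel's implementation: struct lru_page, struct toy_lru, and toy_putback_pages are hypothetical, the local list is modelled as an array whose index 0 is the head, and page vectors and locking are left out entirely.

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified page descriptor used only for this illustration. */
struct lru_page {
    int id;
    bool active;             /* models the PG_active flag */
    int count;               /* models the usage counter */
    struct lru_page *next;
};

/* Each LRU list is modelled by a pointer to its first element. */
struct toy_lru {
    struct lru_page *active;
    struct lru_page *inactive;
};

/*
 * Return the pages in local[0..n-1] (index 0 is the head) to the LRU
 * lists. Walking from tail to head and always inserting at the start
 * of the destination list keeps the pages in their original relative
 * order. The usage counter taken at isolation time is dropped again.
 */
static void toy_putback_pages(struct toy_lru *lru,
                              struct lru_page **local, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        struct lru_page *page = local[i];
        struct lru_page **dst = page->active ? &lru->active
                                             : &lru->inactive;

        page->next = *dst;   /* insert at the start of the list */
        *dst = page;
        page->count--;
    }
}
```

For a local list holding an inactive page, an active page, and another inactive page (head to tail), the two inactive pages end up on the inactive list in the same relative order, and the active page ends up at the start of the active list.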

(^13) There can also be other reasons for this, for instance, a failed writeout, but the reason mentioned is the essential cause.
