Brief: Linux 2.2.3 bforget() bug fix
Chuck Lever, Netscape Communications Corp.
chuckl@netscape.com
Andrea Arcangeli
andrea@e-mind.com
$Id: bforget.html,v 1.5 1999/11/12 20:12:54 cel Exp $
Abstract
Linux 2.2.x sports a redesigned VM and buffer cache system that
can self-tune based on offered system load.
This brief describes a bug in the buffer cache that was identified
and fixed by the authors.
This document is Copyright © 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.
Introduction
Linux is an open-source POSIX-compliant operating system that runs
on commodity hardware.
A feature of recent releases of Linux is its ability to self-tune
system parameters, such as buffer cache size, according to offered
system load.
Part of this self-tuning ability is implemented by dynamically
partitioning physical RAM between the traditional-style
buffer cache and the virtual memory system.
However, allowing the buffer cache to grow and shrink on-demand
introduces some interesting design problems, and, in this case,
a significant bug.
In this brief, we describe the bug and how it was identified.
We provide performance measurements to show the significance of
the problem.
Finally, we detail the fix as we proposed it and as it was
adopted by the Linux kernel, and report on the performance
improvement.
Identifying the bug
Several developers, including the authors, had noticed that
early releases of Linux 2.2.x were experiencing a memory leak
of some kind under heavy file system and VM loads.
Symptoms included poor system performance after copying or
removing large files, sporadic "out of memory" errors, and
performance degradation on long-running jobs.
In order to stress the system and perhaps identify the cause
of the problem, the SPEC S-DET benchmark suite was used to
generate significant loads on a 4-way Dell PowerEdge 6300-450 with
512M of RAM and an 18G Ultra2 LVD SCSI hard drive.
The SPEC S-DET benchmark consists of a script that is designed
to emulate a software developer by running programs such as
cc, nroff, cpio, and ed.
The software developer workload stresses many aspects of an
operating system, including the file and VM subsystems.
Multiple concurrent instances of the script can run to simulate
reproducible and increasing amounts of system load.
The S-DET benchmark is carefully designed to provide a consistent
workload across runs, and to error-check the output of each
script to quickly catch problems.
It quickly became apparent that running S-DET with a large
number of scripts could reliably reproduce the performance
degradation.
Figure 1. Consecutive runs of 128 scripts on Linux 2.2.3
Notice below how the buffer cache continues to grow without bounds during
the benchmark runs.
Eventually, the buffer cache causes the system
to begin flushing pages unnecessarily.
This pressure on the VM system results in, among other things,
pages being stolen back from the buffer cache.
Since the buffer cache doesn't have any aging or replacement policy,
any buffer can be stolen, even buffers for heavily-used data.
The system doesn't usually recover from this condition until it is rebooted.
 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 1 0 0     0 413404 74064  7940   0   0    0  759  178 1852  14  13  73
 0 0 0     0 412948 74508  7940   0   0    1   79  119 1754  14  14  72
 1 0 0     0 412572 74944  7940   0   0    1   89  117 1749  13  14  72
161 0 0     0 375724 75916 10572   0   0   69   88  140 1529  19  30  50
149 3 0     0 319992 97796 34716   0   0  697  144  256  793  15  85   0
116 0 0     0 255888 136664 49480   0   0  284 2184 1217 3828  37  53  10
119 4 0     0 232072 153492 46196   0   0   17  138  277  688  15  85   0
139 0 0     0 234232 161756 49000   0   0   25  408  228  638  21  79   0
...
 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
15 107 0    88  2892 388056 42528   0   0  480   81  320 10401  24  43  33
29 92 1    88  3092 387488 43832   0   0  376  116  310 13246   9  14  77
10 111 0    88  1584 387376 45524   0   0  360   94  307 13222   2   9  89
13 108 1    88  6076 385548 42692   0   0  359  224  310 12467   3   9  88
 2 122 1    88  4812 383928 44592   0   0  444   67  304 10600   6  12  82
 9 111 0    88  1504 381636 48740   0   0  400  143  303 10244   3  13  83
 3 117 1    88  3040 380332 47396   0   0  392  122  299 7481   4   6  91
43 78 1    88  3936 380432 50056   0   0  427  180  301 10556  19  28  52
21 98 1    88  2724 382424 48456   0   0  455   55  311 12743   9  10  81
19 100 1    88  7484 379236 46612   0   0  336   84  302 11554   7  11  83
Figure 2. "vmstat" output excerpts of benchmark runs on Linux 2.2.3
Other features of this scenario include low CPU utilization and a large
number of blocked processes.  The size of the buffer cache and the
size of the free memory list are continually fluctuating,
suggesting a high flow of pages into and out of the buffer cache.
After several unsuccessful guesses at the cause of the problem,
we noticed that the rate at which blocks were being read was low
during benchmark runs with good performance, but increased
significantly during poorly performing runs.
This suggested that the buffer cache was somehow becoming
ineffective over time.
The elevated read rate also interferes with disk write bandwidth,
because Linux prioritizes read requests over write requests;
a read request is usually more time-critical.
The original bforget()
The authors, independently of one another, discovered that buffers
were being orphaned: many buffers were ending up unlinked from the
hash table, so they could no longer be found by a
find_buffer() request.
Each time a buffer is orphaned, another buffer is allocated
for the same logical block of data, and another read operation
is requested to pull the same data in from the disk.
This also causes the buffer cache to grow in size as more and more
copies of the same data blocks appear in the cache.
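To make the duplicate-read effect concrete, here is a small user-space
toy model, not kernel code; the names only echo the kernel's
find_buffer() and getblk().  It shows how unlinking a buffer from the
hash table without recycling it forces every later access to the same
logical block to allocate a fresh buffer.

/*
 * Toy user-space model of the orphaned-buffer problem -- not kernel
 * code.  Buffers live in a hash table keyed by block number;
 * "forgetting" a buffer by unlinking it from the hash without putting
 * it on a free list leaves its memory allocated but unreachable, so
 * the next lookup for the same block allocates a duplicate.
 */
#include <stdio.h>
#include <stdlib.h>

#define HASH_SIZE 64

struct buffer {
        int block;              /* logical block number */
        struct buffer *next;    /* hash chain */
};

static struct buffer *hash_table[HASH_SIZE];
static int buffers_allocated;

static struct buffer *find_buffer(int block)
{
        struct buffer *bh;

        for (bh = hash_table[block % HASH_SIZE]; bh; bh = bh->next)
                if (bh->block == block)
                        return bh;
        return NULL;
}

static struct buffer *getblk(int block)
{
        struct buffer *bh = find_buffer(block);

        if (bh)
                return bh;      /* cache hit: no disk read needed */

        /* Cache miss: allocate a new buffer; in the kernel this is
         * where another disk read would be scheduled. */
        bh = calloc(1, sizeof(*bh));
        bh->block = block;
        bh->next = hash_table[block % HASH_SIZE];
        hash_table[block % HASH_SIZE] = bh;
        buffers_allocated++;
        return bh;
}

/* Models the broken bforget(): unlink from the hash, never recycle. */
static void bforget_broken(struct buffer *bh)
{
        struct buffer **p = &hash_table[bh->block % HASH_SIZE];

        while (*p && *p != bh)
                p = &(*p)->next;
        if (*p)
                *p = bh->next;
        /* bh is now orphaned: still allocated, but unreachable */
}

int main(void)
{
        int i;

        for (i = 0; i < 5; i++)
                bforget_broken(getblk(42));     /* same logical block */

        printf("buffers allocated for one block: %d\n", buffers_allocated);
        return 0;
}

Running the model prints five allocations for a single logical block,
which is the same growth pattern seen in the "buff" column of the
vmstat output above.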
A review of the source that manages buffers (linux/fs/buffer.c)
revealed that the bforget() function, used by ext2 when files are
truncated or deleted, removes buffers from the hash table,
but then it abandons them without recycling them.
/*
 * bforget() is like brelse(), except it removes the buffer
 * from the hash-queues (so that it won't be re-used if it's 
 * shared).
 */
void __bforget(struct buffer_head * buf)
{
        mark_buffer_clean(buf);
        clear_bit(BH_Protected, &buf->b_state);
        remove_from_hash_queue(buf);
        buf->b_dev = NODEV;
        refile_buffer(buf);
        if (!--buf->b_count)
                return;
        printk("VFS: forgot an in-use buffer! (count=%d)\n",
                buf->b_count);
}
Figure 3. The original bforget()
In Linux 2.0, refile_buffer() was probably responsible for ensuring
that the buffer was properly added to the free list.
However, rewrites of the VM and buffer cache subsystems in Linux
have since removed that functionality from refile_buffer().
Proposed and final versions
The authors proposed the following replacement to correct this
problem.
void __bforget(struct buffer_head * buf)
{
        clear_bit(BH_Protected, &buf->b_state);
        remove_from_queues(buf);
        put_last_free(buf);
        if (!--buf->b_count)
                return;
        printk("VFS: forgot an in-use buffer! (count=%d)\n",
                buf->b_count);
}
Figure 4. Proposed fix
After more analysis and testing, Linus Torvalds' final version of the
routine looks like this (from Linux 2.2.5):
/*
 * bforget() is like brelse(), except it puts the buffer on the
 * free list if it can.. We can NOT free the buffer if:
 *  - there are other users of it
 *  - it is locked and thus can have active IO
 */
void __bforget(struct buffer_head * buf)
{
        if (buf->b_count != 1 || buffer_locked(buf)) {
                __brelse(buf);
                return;
        }
        buf->b_count = 0;
        remove_from_queues(buf);
        put_last_free(buf);
}
Figure 5. Final fix
The "if" construct is simply defensive programming.
Studying the current implementation of ext2, it appears unlikely
that 
bforget() will be called with a buffer that has a usage
count other than one.
However, it was thought that eventually, other file systems may want
to use 
bforget(), or ext2 may change how it uses
bforget(), so leaving the safety check in there would help
catch software problems that may be introduced later.
The rest
of the routine ensures that the buffer's usage count is zero
before finally inserting it into the free list.
In general, a non-zero usage count prevents try_to_free_buffers()
from releasing pages containing buffers.
If buffers with non-zero usage counts appear in the buffer free lists,
they will be skipped over by try_to_free_buffers() in favor of
buffers that may still have useful data.
Having a large number of these buffers in the free list can even
cause a severe system-wide memory shortage.
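The following fragment is a conceptual sketch of that check, modelled
loosely on the page-level scan performed by try_to_free_buffers(); the
struct and field names are simplified stand-ins, not the real
buffer_head.  It illustrates why a single buffer left with a non-zero
usage count is enough to keep an entire page pinned.

/*
 * Conceptual sketch, not the actual kernel routine.  Before a page
 * holding buffers can be released, every buffer sharing that page must
 * be idle; one buffer with a non-zero usage count pins the whole page.
 */
struct buffer_model {
        int b_count;                      /* users of this buffer */
        int dirty;                        /* needs writeback */
        int locked;                       /* I/O in flight */
        struct buffer_model *b_this_page; /* circular list of buffers
                                             sharing one page */
};

static int page_is_reclaimable(struct buffer_model *first)
{
        struct buffer_model *bh = first;

        do {
                if (bh->b_count || bh->dirty || bh->locked)
                        return 0;       /* this buffer pins the page */
                bh = bh->b_this_page;
        } while (bh != first);
        return 1;                       /* all buffers idle: page can go */
}

This is why the final fix only puts a buffer on the free list once its
usage count has actually reached zero.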
 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 1 0 0     0 414080 73704  7636   0   0    1  123  130 1759  13  14  73
 0 0 0     0 413516 74140  7636   0   0    1   89  117 1745  13  15  72
 0 0 0     0 413136 74576  7636   0   0    1   90  117 1749  13  15  72
159 0 0     0 372752 75372 11088   0   0  130   88  145 1379  19  40  41
134 1 0     0 332664 89228 32172   0   0  688   75  259  749  13  87   0
110 30 0     0 322096 89232 38256   0   0   79  466  337  746  22  78   0
132 0 0     0 303544 89232 47352   0   0  104   68  166  751  54  46   0
144 0 0     0 301464 89232 45012   0   0   36  285  244  777  39  61   0
139 1 0     0 305996 89232 49180   0   0    6  211  235  666  36  65   0
142 0 0     0 307616 89232 52908   0   0   15  141  161  606  31  69   0
140 0 0     0 303904 89232 49568   0   0   27  199  249  669  27  73   0
139 3 0     0 299460 89232 51920   0   0    5  317  310  924  31  69   0
139 0 0     0 307004 89232 51676   0   0   33   61  167  458  26  74   0
138 2 0     0 304716 89232 49880   0   0    5  415  287  728  34  66   0
...
138 0 0     0 286980 91856 53820   0   0   10   20  126  510  33  67   0
137 7 0     0 291312 91856 56588   0   0    8  374  293  716  31  69   0
147 0 0     0 293088 91856 58432   0   0    7   78  144  593  38  62   0
141 0 0     0 287608 91856 57548   0   0    0  290  230  667  36  64   0
142 1 0     0 306696 91856 51956   0   0    0  178  203  643  35  65   0
135 0 0     0 303112 91856 54436   0   0    1  127  144  480  27  73   0
Figure 6. "vmstat" output excerpts of benchmark runs on Linux 2.2.3
with final bforget()
The final bforget() corrects the performance degradation scenario.
CPU utilization is maximized, and few processes are blocked.
The block read rate is low, and the buffer cache expands to a
reasonable working set and then remains at a steady size.
The following graph shows that performance is flat across
consecutive runs of the 128-script benchmark.
Figure 7. Consecutive runs of 128 scripts on Linux 2.2.3 with final
bforget()
The fixed version of bforget() is contained in Linux 2.2.5 and later.
This document was written as part of the Linux Scalability Project.
For more information, see
our home page.
If you have comments or suggestions, email
linux-scalability@umich.edu