Work scope for the Linux Scalability Project
Chuck Lever, Niels Provos, Naomaru Itoi
linux-scalability@citi.umich.edu
Abstract
We outline areas of research for the
Linux Scalability
research project, and list specific deliverables.
This working document is Copyright © 1998, 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.
Introduction
This document outlines areas
of research and development for the
Linux Scalability research project.
The primary goal of this 
research is to
help make the Linux operating system
enterprise-ready
by improving its performance and scalability.
Linux is beginning to compete with other 
UNIX and non-UNIX
operating systems in the server market,
and is becoming more popular among
ISPs and ESPs who are using it to provide
enterprise-class network services to their customers.
We are specifically interested
in finding immediate and practical
improvements to Linux that will increase
the performance of Netscape's network server 
suite, which includes an LDAP directory 
server, an IMAP electronic mail server, 
and a web server, among others.
To achieve our primary goal, we will 
select a number of areas of potential 
improvement to the Linux operating system,
prioritize the areas
based on their estimated improvement pay-off
versus their implementation cost,
then implement the highest 
priority improvements.
We will evaluate each improvement
using server and OS benchmarking methodologies
that are as close 
to standard as possible to allow scientific 
comparison with other research in this area.
Finally we will work with Linux developers to
incorporate our improvements into the baseline
Linux source code.
Freeware operating systems such as Linux afford independent researchers the opportunity to improve an operating system's existing features, and to add new ones, in order to boost application-specific performance.
In the following sections, we outline
the areas where we think we can have the
most success.
In addition to technical achievements,
part of our work will include building
collaborative relationships among freeware
advocates, and with system and software vendors.
Network server performance issues
There are several common performance
and scalability issues recognized by 
researchers and server developers.
Here we summarize 
some of the most important factors we have
considered.
File descriptor scalability
As the number of network users and 
clients grows, the number of concurrent 
open file descriptors on network servers 
can easily exhaust system limits. The 
number of file descriptors maintained on 
servers often grows proportionally with 
the number of concurrent clients served.
For example, IMAP servers need to 
maintain a socket to connect with each 
client, and an open file descriptor for the 
client's mailbox. A system-wide file 
descriptor limit of 1024 prevents such a 
server from supporting more than about 
500 concurrent users.
Data throughput
Service scalability depends on the 
ability of network servers to deliver 
more and more data at higher and higher
rates. Operating system architecture
and implementation can have significant
effects on data bandwidth.  To improve
a server's effectiveness, we need to
address issues in the operating system
and application that limit the amount
and rate of data flowing from the server's
disk to the network.
On some types of network servers, such
as mail servers, the disk read-write ratio
is significantly skewed towards writes.
Metadata updates and data writes are
among the most expensive disk and file
system operations.
Careful analysis of these operations
may be of great benefit.
Memory bandwidth is also important
in this regard.
Memory allocation and system memory
management can be optimized to make
good use of hardware memory caches.
As well, maintaining I/O data cached
in main memory can improve overall
server efficiency.
Network traffic generated by heavily-used 
network servers exhibits unique
characteristics not easily reproduced when 
analyzing server performance. Clients 
are often situated behind high latency 
network connections, resulting in a high 
degree of server packet retransmission. 
Packet retransmission creates unnecessary
levels of network congestion. Furthermore,
servers often maintain an increasingly
large number of concurrent connections
because most clients retrieve data slowly,
and thus hold their connections open longer.
Research has suggested ways to improve TCP
congestion management and startup behavior. 
The good news is that these changes can 
be implemented on the server, benefiting 
server network data throughput without 
dependencies on client networking software.
Specific OS dependencies
Lock contention
Locks are used extensively in server applications,
so the performance of an OS's lock primitives
is very important.
Also, support for mutexes that can be shared
among processes is required.
Memory management
Server applications make heavy use of shared
regions, anonymous maps, and mapped files.
Special features like locking down regions
so they aren't swapped, fast 
mmap(),
support for allocating very large shared
regions and memory areas, and efficient memory
allocation are especially useful.
For example,
malloc() in the C run-time library
appears not to scale well across multiple
processors, since sharing the heap requires
a single global heap lock.
MP scalability
To provide more processing power to 
a server application, we can add more 
CPUs to a server, but first we must be 
sure that the operating system and the 
server application can take full
advantage of more than one or two
CPUs at a time.
Network servers are generally I/O bound,
however.  Increasing the number of CPUs,
while not directly increasing the I/O
bandwidth of a system, may have other
benefits, such as increasing the amount
of CPU available for handling interrupts
and processing network protocols.
The very latest versions of Linux use MP
hardware significantly more efficiently than
some earlier versions do.  However, there
is still room to improve.
Asynchronous events and thread dispatching
Network servers require an integrated
approach to asynchronous I/O and thread
dispatching.
Most modern server architectures make heavy
use of both asynchronous I/O and threads.
Asynchronous I/O support helps keep the
amount of kernel resources and number of
outstanding read buffers to a minimum.
Having an asynchronous I/O model that
is easy to program with and allows
reuse of server software among various OS
platforms is a big win.
Most importantly, an OS-provided integrated
asynchronous I/O and event dispatching facility
has been shown by researchers to be critical
to the performance and scalability of
internet servers.
More flexible and efficient system 
call interfaces
Under some circumstances, Netscape 
server products appear to perform better 
on Windows NT than on UNIX platforms.
Many have conjectured that NT 
has better system call interfaces
for network servers than UNIX.
A way of improving 
server performance and scalability is to 
help the server application itself make 
more efficient use of the operating system
and the resources it provides. We 
can do this by adding improved interfaces,
or by making the current interfaces,
such as 
poll(), more efficient.
System interfaces should also support
64-bit files and filesystems, as well as
very large address spaces and more than
a few gigabytes of physical RAM.
Prioritizing our work
To decide which improvements provide
the most benefit, we will rate each 
potential improvement in the following 
categories.
Estimated pay-offs
- 
Measurable throughput, performance, scalability improvements.
 
- These are the most important gains we hope
for. We can estimate these pay-offs
by surveying preexisting literature
and by running simple microbenchmarks.
 
- 
Added stability during overload.
 
- While some improvements may yield
little or no performance or scalability
gain, or even a slight loss, they
might significantly improve system
behavior during overload conditions.
 
- 
Synergy with other potential improvements.
 
- Several potential improvements to
Linux can be accomplished in 
different ways. Choosing to implement 
one improvement may make others 
much simpler to implement.
 
Estimated implementation costs
- 
Implementation time with existing resources.
 
- We want to make sure the 
work we plan is feasible for our developers,
and can be completed successfully during the 
course of this project.
 
- 
Estimated complexity.
 
- Complexity
directly affects the amount of
testing required,
and increases the likelihood
that improvements will introduce new
bugs. We are also concerned about introducing
improvements that require 
significant changes to applications, especially
changes that are not backwards compatible.
 
- 
Potential introduction of security or 
scalability problems.
 
- While an improvement might be easily implemented, 
it also might introduce other problems 
that make it unsuitable, such as unstable 
overload behavior, or unacceptable security exposures.
 
- 
Amount of server re-engineering required.
 
- We have a strong bias towards work that requires
no modification of system interfaces, or only
incremental changes, so that everyone can take
advantage of our changes immediately.
 
- 
Degree of expected acceptance by Linux developers.
 
- We'd like to provide 
improvements that are likely to be accepted
as permanent modifications to the 
Linux kernel, as maintained by the 
Linux developers. Our improvements 
lose value if they have to be installed or 
included separately.
 
Coordination efforts
These efforts help build collaborative 
relationships among research and
corporate entities to forward the mission of 
our research.
EECS research on performance 
and NT-like system call APIs
Graduate research at U-M's College 
of Engineering may be approaching
several of these issues already. We would 
like to coordinate with this work so we 
don't duplicate it.
Input from Netscape server product teams
We will consult with members of 
Netscape server product teams to itemize 
and prioritize work that the teams would 
like to see to improve Linux/server 
product performance.
Collaboration with Linux development community
We will co-operate with members of 
the Linux development community to 
determine the current state of Linux,
and determine how that work
affects Netscape server product
performance. We will offer development
resources for work on 
scalability and performance issues.
ISV relationships
We will work with interested system and software
vendors to build support for Linux as a server
platform.  This work may range from helping Veritas
create a Linux version of their VxFS file system,
to working with Intel's performance engineers,
or working with the makers of Purify to port their
products to run on Linux.
Staged delivery
We've broken our project goals and deliverables
into three stages.
Project prioritization is based on what expertise
and resources are available to our project, and
what will have the highest pay-off and probability
of success.
In other words, we will start with "low-hanging
fruit," and as our resource base and experience
grows, we will tackle more difficult and
riskier problems.
Stage one
Initially, we are interested in providing
improvements that require no changes to
application architecture or to the system
interface.
These changes are easy and have a high
probability of pay-off, with little risk of
introducing new bugs or performance problems.
This is a period during which we will build
our expertise, and create ties to the Linux
development community.
We also anticipate forming relationships with
several ISVs.
Finally, we will construct and benchmark a 
small local test harness in preparation for 
measuring later implementations.
Our deliverables during stage one include
a finalized version of this work 
scope document, scholarly papers and status 
reports describing our progress, and the 
construction of our local test harness.
We will also establish OS and application
benchmark levels with microbenchmarks and
application-level benchmarks.
The benchmark results will provide
a base-level performance measurement, and
suggest specific improvement initiatives.
Specific stage one projects include:
- 
Removing file descriptor limits
 
- 
This will involve increasing the default file descriptor
limits in select(), poll(),
and get_unused_fd(), as well as measuring and
analyzing file-descriptor-related kernel functions to
verify that they scale well with the number of file descriptors.
We will also explore improvements and changes to the C
run-time library.
 
- 
Improving memory bandwidth
 
- 
We will implement and measure new versions of memset()
and memcpy() in the kernel and in the C run-time library
that approach hardware memory bandwidth limits much
more closely.
We will also tune malloc() in the kernel and the C run-time
library to help mitigate latencies in underlying system
resource providers, and to help these routines lay out memory
in a more hardware cache-friendly way.
 
- 
Improving TCP bandwidth
 
- 
Several interesting innovations can help boost TCP
throughput, and reduce latency due to lost packets.
We will implement and study a new mechanism which connects recovery
processing on one connection to all other connections between
the server and a particular client.
We will also tune and improve current TCP recovery mechanisms,
including SACK, duplicate ACK, slow start, and fast retransmit.
Finally, we will dynamically analyze Linux's current TCP
implementation to check that it complies with TCP standards
and that its congestion behavior is neighborly.
 
- 
Improving mmap() efficiency
 
- 
The mmap() system call is used pervasively in Linux
and in server applications.
We can examine and improve mmap()
performance on Linux to help make the whole system faster
and more scalable.
The mmap() system call uses system disk I/O and memory
resources heavily, so this work may also give us an excuse
to fiddle with pieces of the Linux VM system, kernel memory
allocation, the C run-time library, and disk device drivers.
 
- 
Constructing and benchmarking our test harness
 
- 
To measure our improvements scientifically, we will
require a local test harness on which to stress and benchmark
Linux and the various network server applications.
Selecting the specific hardware, and choosing which versions
of server and benchmarking software to use, will be part of
stage one.
 
Stage two
In stage two, we will explore solutions that
may involve some changes to Linux's system
API and/or to server application architecture.
We may also try some of the more risky
or more complicated improvements, now that
we have some experience under our belts.
We will also have built constructive relationships
with some ISVs and with parts of the Linux
development community.
Our deliverables during stage two include
the improvements themselves, 
with scholarly papers and reports
describing our improvements.
Specific stage two projects include:
- 
Implementing a caching send_file()
 
- 
There already exists a Linux implementation of
send_file(), but there may be room for improvement.
Integrating send_file() with the kernel's SKBUFF cache
will improve its performance significantly, for example.
This project would involve finding ways to make send_file()
more efficient, then re-architecting one or more servers
to use it.
 
- 
Improving NSPR
 
- 
We can focus some effort on the Linux version of NSPR to help it
make the best use of system resources, including new system
calls such as send_file(), or new features such as
asynchronous I/O completion notification.
 
- 
Discovering Linux scalability limits
 
- 
Linux may have some unfortunate system limits that we will
need to discover in order to address them within Netscape
server product software.  Examples of such limits might be:
 
- small process address space size
 
- small kernel address space size
 
- inability to page most kernel data structures
 - fixed limits on kernel data structure size
 
- 32-bit limits on file system interfaces (like VFS)
 
- 
This work would attempt to stress various parts of the
Linux kernel to determine where its limits lie.  We can
also engage in research and communication with Linux
developers to uncover architected limits, and find ways
to relieve the limits.
 
- 
Exploring asynchronous I/O models
 
- 
The latest version of the GNU C run-time library (glibc 2.1)
has asynchronous I/O support, based on the POSIX.4 spec, built
into it.  Recent versions of the Linux development kernel
support glibc's aio API with POSIX.4 realtime signals, and
I/O completion notification via these signals.
 
- 
This work would explore the usefulness of the current support,
and also implement other system interfaces that provide
different programming models, to compare the efficiency of
different implementations, and to see
which are easier to use in applications.
It may also involve modifications to base server product
code to try out some of the new interfaces.
 
- 
High-performance filesystems
 
- 
Linux is scheduled to get a journalling file system, as
well as support for 64-bit files, in the near future.
It is important that the underlying filesystem implementations
can realistically scale to provide these features.
 
Some such areas might include boosting the ability to create
many files in the same directory, providing support for
swapping memory-based filesystems,
improving the efficiency of metadata operations and
data writes, and supporting very large
filesystems via variable block sizes (for use with RAID
subsystems).
Stage three
During stage three, we will subject 
our most successful improvements
from early stages to more stringent performance
and scalability testing by 
working directly with Netscape's
server development 
teams, and by providing some of our 
work to service providers we know well, 
such as Netcenter.
As part of our stage three efforts, we
might also focus on creating a Linux
Center of Excellence at CITI.
This Center of Excellence could host
other researchers and provide hardware
arenas for advanced research and
development of the Linux operating system.
Our deliverables during stage three 
include scholarly papers and reports
describing the deployment and
performance measurement results.
We will also identify a comprehensive
set of performance measurement tools
and methodologies.
Finally, we will complete a Linux
server "Best Practices" guide that
describes Linux-specific configuration
options and enhancements to help
customers get the most out of their
Linux-based network servers.
Specific stage three projects include:
- 
Measuring and improving SMP scalability
 
- 
Our specific interest here is finding where SMP scalability
is constrained within the kernel, and relieving those areas.
Many recent versions of Linux have greatly improved SMP
support, including support for up to 16 processors, and
multiprocessor interrupt support.
However, there may be areas where resource contention or
scheduler limitations, for instance, might limit the amount
of real scalability obtainable.
 
- 
Optimizing PCI performance on multiple buses
 
- 
This work would require a server configuration with
high memory bandwidth and multiple PCI buses.  We would
attempt to understand the interaction between the
operating system and multiple PCI buses, and try
some improvements based on our analysis.
As well, we will explore ways of improving the efficiency
of SCSI drivers by increasing the capacity of the
device driver to handle overlapped I/O and RAID.
 
- 
Linux device driver support for ATM cards
 
- 
ATM networking can increase
server throughput over and above FDDI
or fast Ethernet technologies, so with
ATM we can explore the edges of server
performance much further. It's
not clear, though, how well ATM and
other types of high-performance networking
are supported in the Linux kernel.
 
- 
Improved interrupt handling
 
- 
This work would combine SMP enhancements with the addition of
a hybrid polling/interrupt-driven interrupt model to the kernel
to allow device drivers to handle batches of interrupts rather
than one interrupt at a time.
Such support already exists for serial devices; we may find that
it significantly improves the performance of disk and high-bandwidth
network devices, too.
 
- 
Zero-copy networking
 
- 
Reducing or eliminating data copy operations that
result from processing a network packet can help improve
application data and request bandwidth. 
Mechanisms for improving networking 
efficiency include checksum caching, 
reducing the number of data copy operations, and moving data
directly from one driver to another without context 
switches (using a mechanism such as IO-Lite). Some of these
changes are easy, but something like IO-Lite would be a 
significant undertaking.
 
Benchmark methodologies
The current Linux development kernel (v2.1)
is about to be rolled over into the next
version of the stable kernel (v2.2).
The 2.1 kernel has been "feature-frozen" since
the Spring of 1998, meaning that bug fixes are
gladly accepted, but new features generally are
not.
Since the 2.2 kernel is so close, it is likely that
most or all of our enhancements will be added
to the 2.3 kernel when it arrives.
Wherever possible, we will work with
the current development tree, since it contains
a number of enhancements that are required by the
Netscape server products.
The development tree 
contains many improvements to the
kernel, but is sometimes made unusable by work 
in progress, so it may be a source of delays.
To provide truly useful measurements of
performance and scalability, we 
will choose benchmarking systems that 
performance researchers and Netscape's 
own performance engineers use most 
often. This permits comparison and 
repetition of our work, increasing its 
value over time. At the same time, we 
recognize that standard benchmarks are 
often inadequate for measuring certain 
types of performance problems, so we 
will use other benchmarks as well.
There will also be cases where we 
want to examine directly the effects of 
certain modifications to operating
system features. For analyzing OS-specific 
modifications, McVoy's microbenchmarks
and the Byte Linux Benchmarks 
will be useful. File system benchmarks, 
such as Bonnie, the Modified Andrew
Benchmark, and SPEC's SDET and KENBUS 
benchmarks, will provide cross-sections 
of overall system performance.
We are especially interested in
application performance, so
application-specific benchmarks will also be used to 
measure our progress. Webstone and 
SPECweb96 appear to be the standard 
web server benchmarks. However,
S-client and httperf have features that 
would exercise pathological network 
behavior, and may be useful in judging 
networking improvements.
Directory-Mark is Netscape's directory server 
benchmark of choice.
We have a strong bias towards Web-server
benchmarks, even though our work will initially
be focused on the Netscape Directory and Messaging
Servers, for several reasons:
- 
Using freeware web servers and benchmarks means CITI
and others can participate without nondisclosure
agreements while still making significant contributions.
 
- 
Many web server performance issues are common to other
types of network servers, and are easier to measure
in a web server.
 
- 
There are a variety of web servers and web server
benchmarks available.
 
Hardware
High speed networking technologies 
will be an integral part of our test harness.
Either switched fast Ethernet or
ATM will make up our test harness
network.
We will also have multiple CPU 
hardware on hand to implement and test 
SMP changes. It may be advisable to use 
the more powerful machines to drive 
server loads on smaller machines in the 
test harness to approach server performance
limits more quickly and repeatably.
Testing and evaluation of large-scale server
configurations is beyond the scope of this
project.  We can go as far as understanding
compatibility and Linux-specific performance
issues with large-scale and esoteric configurations,
but our expertise is focused on software
optimization.  Besides, we believe that all our
operating system optimizations will
benefit moderate and large-scale server
deployments.  And, as our work progresses, we
will be better positioned later to investigate
large-scale server performance issues.
Milestones
In this section, we list project milestones by date.
- November 1, 1998
U-M has selected a
GSRA. This work scope document is available to
Netscape and U-M researchers for
review. Stage one begins. 
- December 1, 1998
Research agreement is executed by 
Netscape and U-M.
Netscape has on-site employee up to 
speed with orientation and project
supplies. 
- January 1, 1999
Test harness hardware is available
to U-M researchers.
Initial reports on stage one improvements
available from U-M.
 
- February 1, 1999
Work scope draft is accepted. U-M 
has working test harness, and initial draft 
of development priorities. Research and 
prototyping is nearing completion.
Scientific benchmark results of
test harness are available. 
- March 1, 1999
U-M provides three to five papers 
describing their work to date. A
technical exchange is scheduled during
March for Netscape and U-M to meet and 
discuss progress. 
- April 1, 1999
Stage two begins.
U-M makes developments available 
for testing at Netscape. Two to four 
more papers are available from U-M.
U-M and Netscape select arenas for further 
testing in production environments.
 - July 1, 1999
Stage three begins.
U-M is working with Netscape
and several service
providers to test and deploy some of the most 
promising developments. 
- September 1, 1999
U-M provides two to three papers 
describing deployment experiences.
U-M provides final status paper for 
Netscape to review. 
- November 1, 1999
Netscape and U-M agree on final 
work completion. Stage three is
complete. 
If you have comments or suggestions, email
linux-scalability@citi.umich.edu
Revision: $Id: workscope.html,v 1.12 1999/11/12 20:15:46 cel Exp $