Work scope for the Linux Scalability Project
Chuck Lever, Niels Provos, Naomaru Itoi
linux-scalability@citi.umich.edu
Abstract
We outline areas of research for the
Linux Scalability
research project, and list specific deliverables.
This working document is Copyright © 1998, 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.
Introduction
This document outlines areas
of research and development for the
Linux Scalability research project.
The primary goal of this 
research is to
help make the Linux operating system
enterprise-ready
by improving its performance and scalability.
Linux is beginning to compete with other 
UNIX and non-UNIX
operating systems in the server market,
and is becoming more popular among
ISPs and ESPs who are using it to provide
enterprise-class network services to their customers.
We are specifically interested
in finding immediate and practical
improvements to Linux that will increase
the performance of Netscape's network server 
suite, which includes an LDAP directory 
server, an IMAP electronic mail server, 
and a web server, among others.
To achieve our primary goal, we will 
select a number of areas of potential 
improvement to the Linux operating system,
prioritize the areas
based on their estimated improvement pay-off
versus their implementation cost,
then implement the highest 
priority improvements.
We will evaluate each improvement
using server and OS benchmarking methodologies
that are as close 
to standard as possible to allow scientific 
comparison with other research in this area.
Finally we will work with Linux developers to
incorporate our improvements into the baseline
Linux source code.
Freeware operating systems such as Linux afford independent researchers the opportunity to improve an operating system's existing features, and to add new ones, in order to boost application-specific performance.
In the following sections, we outline
the areas where we think we can have the
most success.
In addition to technical achievements,
part of our work will include building
collaborative relationships among freeware
advocates, and with system and software vendors.
Network server performance issues
There are several common performance
and scalability issues recognized by 
researchers and server developers.
Here we summarize 
some of the most important factors we have
considered.
File descriptor scalability
As the number of network users and 
clients grows, the number of concurrent 
open file descriptors on network servers 
can easily exhaust system limits. The 
number of file descriptors maintained on 
servers often grows proportionally with 
the number of concurrent clients served.
For example, IMAP servers need to 
maintain a socket to connect with each 
client, and an open file descriptor for the 
client's mailbox. A system-wide file 
descriptor limit of 1024 prevents such a 
server from supporting more than about 
500 concurrent users.
Data throughput
Service scalability depends on the 
ability of network servers to deliver 
more and more data at higher and higher
rates. Operating system architecture
and implementation can have significant
effects on data bandwidth.  To improve
a server's effectiveness, we need to
address issues in the operating system
and application that limit the amount
and rate of data flowing from the server's
disk to the network.
On some types of network servers, such
as mail servers, the disk read-write ratio
is significantly skewed towards writes.
Metadata updates and data writes are
among the most expensive disk and file
system operations.
Careful analysis of these operations
may be of great benefit.
Memory bandwidth is also important
in this regard.
Memory allocation and system memory
management can be optimized to make
good use of hardware memory caches.
As well, maintaining I/O data cached
in main memory can improve overall
server efficiency.
Network traffic generated by heavily-used 
network servers exhibits unique
characteristics not easily reproduced when 
analyzing server performance. Clients 
are often situated behind high latency 
network connections, resulting in a high 
degree of server packet retransmission. 
Packet retransmission creates unnecessary
levels of network congestion. Furthermore,
servers often maintain an increasingly
large number of concurrent connections
because most clients retrieve data slowly,
and thus hold their connections open longer.
Research has suggested ways to improve TCP
congestion management and startup behavior. 
The good news is that these changes can 
be implemented on the server, benefiting 
server network data throughput without 
dependencies on client networking software.
Specific OS dependencies
Lock contention
Locks are used extensively in server applications,
so the performance of an OS's lock primitives
is very important.
Also, support for mutexes that can be shared
among processes is required.
Memory management
Server applications make heavy use of shared
regions, anonymous maps, and mapped files.
Special features like locking down regions
so they aren't swapped, fast 
mmap(),
support for allocating very large shared
regions and memory areas, and efficient memory
allocation are especially useful.
For example,
malloc() in the C run-time library
appears not to scale well across multiple
processors, since sharing the heap requires
a single global heap lock.
MP scalability
To provide more processing power to 
a server application, we can add more 
CPUs to a server, but first we must be 
sure that the operating system and the 
server application can take full
advantage of more than one or two
CPUs at a time.
Network servers are generally I/O bound,
however.  Increasing the number of CPUs,
while not directly increasing the I/O
bandwidth of a system, may have other
benefits, such as increasing the amount
of CPU available for handling interrupts
and processing network protocols.
The very latest versions of Linux use MP
hardware significantly more efficiently than
some earlier versions do.  However, there
is still room to improve.
Asynchronous events and thread dispatching
Network servers require an integrated
approach to asynchronous I/O and thread
dispatching.
Most modern server architectures make heavy
use of both asynchronous I/O and threads.
Asynchronous I/O support helps keep the
amount of kernel resources and number of
outstanding read buffers to a minimum.
Having an asynchronous I/O model that
is easy to program with and allows
reuse of server software among various OS
platforms is a big win.
Most importantly, an OS-provided integrated
asynchronous I/O and event dispatching facility
has been shown by researchers to be critical
to the performance and scalability of
internet servers.
More flexible and efficient system 
call interfaces
Under some circumstances, Netscape 
server products appear to perform better 
on Windows NT than on UNIX platforms.
Many have conjectured that NT 
has better system call interfaces
for network servers than UNIX.
A way of improving 
server performance and scalability is to 
help the server application itself make 
more efficient use of the operating system
and the resources it provides. We 
can do this by adding improved interfaces,
or by making the current interfaces,
such as 
poll(), more efficient.
System interfaces should also support
64-bit files and filesystems, as well as
very large address spaces and more than
a few gigabytes of physical RAM.
Prioritizing our work
To decide which improvements provide
the most benefit, we will rate each 
potential improvement in the following 
categories.
Estimated pay-offs
- 
Measurable throughput, performance, scalability improvements.
 
- These are the most important gains we hope
for. We can estimate these pay-offs
by surveying preexisting literature
and by running simple microbenchmarks.
 
- 
Added stability during overload.
 
- While some improvements may yield
little or no performance or scalability
gain, or even a slight loss, they
might significantly improve system
behavior during overload conditions.
 
- 
Synergy with other potential improvements.
 
- Several potential improvements to
Linux can be accomplished in 
different ways. Choosing to implement 
one improvement may make others 
much simpler to implement.
 
Estimated implementation costs
- 
Implementation time with existing resources.
 
- We want to make sure the 
work we plan is feasible for our developers,
and can be completed successfully during the 
course of this project.
 
- 
Estimated complexity.
 
- Complexity
directly affects the amount of
testing required,
and increases the likelihood
that improvements will introduce new
bugs. We are also concerned about introducing
improvements that require 
significant changes to applications, especially
changes that are not backwards compatible.
 
- 
Potential introduction of security or 
scalability problems.
 
- While an improvement might be easily implemented, 
it also might introduce other problems 
that make it unsuitable, such as unstable 
overload behavior, or unacceptable security exposures.
 
- 
Amount of server re-engineering required.
 
- We have a strong bias towards work that requires
no modification of system interfaces, or only
incremental changes, so that everyone can take
advantage of our changes immediately.
 
- 
Degree of expected acceptance by Linux developers.
 
- We'd like to provide 
improvements that are likely to be accepted
as permanent modifications to the 
Linux kernel, as maintained by the 
Linux developers. Our improvements 
lose value if they have to be installed or 
included separately.
 
Coordination efforts
These efforts help build collaborative 
relationships among research and
corporate entities to forward the mission of 
our research.
EECS research on performance 
and NT-like system call APIs
Graduate research at U-M's College 
of Engineering may be approaching
several of these issues already. We would 
like to coordinate with this work so we 
don't duplicate it.
Input from Netscape server product teams
We will consult with members of 
Netscape server product teams to itemize 
and prioritize work that the teams would 
like to see to improve Linux/server 
product performance.
Collaboration with Linux development community
We will co-operate with members of 
the Linux development community to 
determine the current state of Linux,
and determine how that work
affects Netscape server product
performance. We will offer development
resources for work on 
scalability and performance issues.
ISV relationships
We will work with interested system and software
vendors to build support for Linux as a server
platform.  This work may range from helping Veritas
create a Linux version of their VxFS file system,
to working with Intel's performance engineers,
or working with the makers of Purify to port their
products to run on Linux.
Staged delivery
We've broken our project goals and deliverables
into three stages.
Project prioritization is based on what expertise
and resources are available to our project, and
what will have the highest pay-off and probability
of success.
In other words, we will start with "low-hanging
fruit," and as our resource base and experience
grows, we will tackle more difficult and
riskier problems.
Stage one
Initially, we are interested in providing
improvements that require no changes to
application architecture or to the system
interface.
These changes are easy and have a high
probability of pay-off, with little risk of
introducing new bugs or performance problems.
This is a period during which we will build
our expertise, and create ties to the Linux
development community.
We also anticipate forming relationships with
several ISVs.
Finally, we will construct and benchmark a 
small local test harness in preparation for 
measuring later implementations.
Our deliverables during stage one include
a finalized version of this work 
scope document, scholarly papers and status 
reports describing our progress, and the 
construction of our local test harness.
We will also establish OS and application
benchmark levels with microbenchmarks and
application-level benchmarks.
The benchmark results will provide
a base-level performance measurement, and
suggest specific improvement initiatives.
Specific stage one projects include:
- 
Removing file descriptor limits
 
- 
This will involve increasing the default file descriptor
limits in select(), poll(),
and get_unused_fd(), as well as measuring and
analyzing file-descriptor-related kernel functions to
verify that they scale well with the number of file descriptors.
We will also explore improvements and changes to the C
run-time library.
 
- 
Improving memory bandwidth
 
- 
We will implement and measure new versions of memset()
and memcpy() in the kernel and in the C run-time library
that approach hardware memory bandwidth limits much
more closely.
We will also tune malloc() in the kernel and the C run-time
library to help mitigate latencies in underlying system
resource providers, and to help these routines lay out memory
in a more hardware cache-friendly way.
 
- 
Improving TCP bandwidth
 
- 
Several interesting innovations can help boost TCP
throughput, and reduce latency due to lost packets.
We will implement and study a new mechanism which connects recovery
processing on one connection to all other connections between
the server and a particular client.
We will also tune and improve current TCP recovery mechanisms,
including SACK, duplicate ACK, slow start, and fast retransmit.
Finally, we will dynamically analyze Linux's current TCP
implementation to check that it complies with TCP standards
and that its congestion behavior is neighborly.
 
- 
Improving mmap() efficiency
 
- 
The mmap() system call is used pervasively in Linux
and in server applications.
We can examine and improve mmap()
performance on Linux to help make the whole system faster
and more scalable.
The mmap() system call uses system disk I/O and memory
resources heavily, so this work may also give us an excuse
to fiddle with pieces of the Linux VM system, kernel memory
allocation, the C run-time library, and disk device drivers.
 
- 
Constructing and benchmarking our test harness
 
- 
To measure our improvements scientifically, we will
require a local test harness on which to stress and benchmark
Linux and the various network server applications.
Selecting the specific hardware, and choosing which versions
of server and benchmarking software to use, will be part of
stage one.
 
Stage two
In stage two, we will explore solutions that
may involve some changes to Linux's system
API and/or to server application architecture.
We may also try some of the more risky
or more complicated improvements, now that
we have some experience under our belts.
We will also have built constructive relationships
with some ISVs and with parts of the Linux
development community.
Our deliverables during stage two include
the improvements themselves, 
with scholarly papers and reports
describing our improvements.
Specific stage two projects include:
- 
Implementing a caching send_file()
 
- 
There already exists a Linux implementation of
send_file(), but there may be room for improvement.
Integrating send_file() with the kernel's SKBUFF cache
will improve its performance significantly, for example.
This project would involve finding ways to make send_file()
more efficient, then re-architecting one or more servers
to use it.
 
- 
Improving NSPR
 
- 
We can focus some effort on the Linux version of NSPR to help it
make the best use of system resources, including new system
calls such as send_file(), or new features such as
asynchronous I/O completion notification.
 
- 
Discovering Linux scalability limits
 
- 
Linux may have some unfortunate system limits that we will
need to discover in order to address them within Netscape
server product software.  Examples of such limits might be:
 
- small process address space size
 
- small kernel address space size
 
- inability to page most kernel data structures
 - fixed limits on kernel data structure size
 
- 32-bit limits on file system interfaces (like VFS)
 
- 
This work would attempt to stress various parts of the
Linux kernel to determine where its limits lie.  We can
also engage in research and communication with Linux
developers to uncover architected limits, and find ways
to relieve the limits.
 
- 
Exploring asynchronous I/O models
 
- 
The latest version of the GNU C run-time library (glibc 2.1)
has asynchronous I/O support, based on the POSIX.4 spec, built
into it.  Recent versions of the Linux development kernel
support glibc's aio API with POSIX.4 realtime signals, and
I/O completion notification via these signals.
 
- 
This work would explore the usefulness of the current support,
and also implement other system interfaces that provide
different programming models, to compare the efficiency of
different implementations, and to see
which are easier to use in applications.
It may also involve modifications to base server product
code to try out some of the new interfaces.
 
- 
High-performance filesystems
 
- 
Linux is scheduled to get a journalling file system, as
well as support for 64-bit files, in the near future.
It is important that the underlying filesystem implementations
can realistically scale to provide these features.
 
Some such areas might include boosting the ability to create
many files in the same directory, providing support for
swapping memory-based filesystems,
improving the efficiency of metadata operations and
data writes, and supporting very large
filesystems via variable block sizes (for use with RAID
subsystems).
Stage three
During stage three, we will subject 
our most successful improvements
from early stages to more stringent performance
and scalability testing by 
working directly with Netscape's
server development 
teams, and by providing some of our 
work to service providers we know well, 
such as Netcenter.
As part of our stage three efforts, we
might also focus on creating a Linux
Center of Excellence at CITI.
This Center of Excellence could host
other researchers and provide hardware
arenas for advanced research and
development of the Linux operating system.
Our deliverables during stage three 
include scholarly papers and reports
describing the deployment and
performance measurement results.
We will also identify a comprehensive
set of performance measurement tools
and methodologies.
Finally, we will complete a Linux
server "Best Practices" guide that
describes Linux-specific configuration
options and enhancements to help
customers get the most out of their
Linux-based network servers.
Specific stage three projects include:
- 
Measuring and improving SMP scalability
 
- 
Our specific interest here is finding where SMP scalability
is constrained within the kernel, and relieving those areas.
Many recent versions of Linux have greatly improved SMP
support, including support for up to 16 processors, and
multiprocessor interrupt support.
However, there may be areas where resource contention or
scheduler limitations, for instance, might limit the amount
of real scalability obtainable.
 
- 
Optimizing PCI performance on multiple buses
 
- 
This work would require a server configuration with
high memory bandwidth and multiple PCI buses.  We would
attempt to understand the interaction between the
operating system and multiple PCI buses, and try
some improvements based on our analysis.
As well, we will explore ways of improving the efficiency
of SCSI drivers by increasing the capacity of the
device driver to handle overlapped I/O and RAID.
 
- 
Linux device driver support for ATM cards
 
- 
ATM networking can increase
server throughput over and above FDDI
or fast Ethernet technologies, so with
ATM we can explore the edges of server
performance much further. It's
not clear, though, how well ATM and
other types of high-performance networking
are supported in the Linux kernel.
 
- 
Improved interrupt handling
 
- 
This work would combine SMP enhancements with the addition of
a hybrid polling/interrupt-driven interrupt model to the kernel
to allow device drivers to handle batches of interrupts rather
than one interrupt at a time.
Such support already exists for serial devices; we may find that
it significantly improves the performance of disk and high-bandwidth
network devices, too.
 
- 
Zero-copy networking
 
- 
Reducing or eliminating data copy operations that
result from processing a network packet can help improve
application data and request bandwidth. 
Mechanisms for improving networking 
efficiency include checksum caching, 
reducing the number of data copy operations, and moving data
directly from one driver to another without context 
switches (using a mechanism such as IO-Lite). Some of these
changes are easy, but something like IO-Lite would be a 
significant undertaking.
 
Benchmark methodologies
The current Linux development kernel (v2.1)
is about to be rolled over into the next
version of the stable kernel (v2.2).
The 2.1 kernel has been "feature-frozen" since
the Spring of 1998, meaning that bug fixes are
gladly accepted, but new features generally are
not.
Since the 2.2 kernel is so close, it is likely that
most or all of our enhancements will be added
to the 2.3 kernel when it arrives.
Wherever possible, we will work with
the current development tree, since it contains
a number of enhancements that are required by the
Netscape server products.
The development tree 
contains many improvements to the
kernel, but is sometimes made unusable by work 
in progress, so it may be a source of delays.
To provide truly useful measurements of
performance and scalability, we 
will choose benchmarking systems that 
performance researchers and Netscape's 
own performance engineers use most 
often. This permits comparison and 
repetition of our work, increasing its 
value over time. At the same time, we 
recognize that standard benchmarks are 
often inadequate for measuring certain 
types of performance problems, so we 
will use other benchmarks as well.
There will also be cases where we 
want to examine directly the effects of 
certain modifications to operating
system features. For analyzing OS-specific 
modifications, McVoy's microbenchmarks
and the Byte Linux Benchmarks 
will be useful. File system benchmarks, 
such as Bonnie, the Modified Andrew
Benchmark, and SPEC's SDET and KENBUS 
benchmarks, will provide cross-sections 
of overall system performance.
We are especially interested in
application performance, so
application-specific benchmarks will also be used to 
measure our progress. Webstone and 
SPECweb96 appear to be the standard 
web server benchmarks. However,
S-client and httperf have features that 
would exercise pathological network 
behavior, and may be useful in judging 
networking improvements.
Directory-Mark is Netscape's directory server 
benchmark of choice.
We have a strong bias towards Web-server
benchmarks, even though our work will initially
be focused on the Netscape Directory and Messaging
Servers, for several reasons:
- 
Using freeware web servers and benchmarks means CITI
and others can participate without nondisclosure
agreements while still making significant contributions.
 
- 
Many web server performance issues are common to other
types of network servers, and are easier to measure
in a web server.
 
- 
There are a variety of web servers and web server
benchmarks available.
 
Hardware
High speed networking technologies 
will be an integral part of our test harness.
Either switched fast Ethernet or
ATM will make up our test harness
network.
We will also have multiple CPU 
hardware on hand to implement and test 
SMP changes. It may be advisable to use 
the more powerful machines to drive 
server loads on smaller machines in the 
test harness to approach server performance
limits more quickly and repeatably.
Testing and evaluation of large-scale server
configurations is beyond the scope of this
project.  We can go as far as understanding
compatibility and Linux-specific performance
issues with large-scale and esoteric configurations,
but our expertise is focused on software
optimization.  Besides, we believe that all our
operating system optimizations will
benefit moderate and large-scale server
deployments.  And, as our work progresses, we
will be better positioned later to investigate
large-scale server performance issues.
Milestones
In this section, we list project milestones by date.
- November 1, 1998
U-M has selected a
GSRA. This work scope document is available to
Netscape and U-M researchers for
review. Stage one begins. 
- December 1, 1998
Research agreement is executed by 
Netscape and U-M.
Netscape has on-site employee up to 
speed with orientation and project
supplies. 
- January 1, 1999
Test harness hardware is available
to U-M researchers.
Initial reports on stage one improvements
available from U-M.
 
- February 1, 1999
Work scope draft is accepted. U-M 
has working test harness, and initial draft 
of development priorities. Research and 
prototyping is nearing completion.
Scientific benchmark results of
test harness are available. 
- March 1, 1999
U-M provides three to five papers 
describing their work to date. A
technical exchange is scheduled during
March for Netscape and U-M to meet and 
discuss progress. 
- April 1, 1999
Stage two begins.
U-M makes developments available 
for testing at Netscape. Two to four 
more papers are available from U-M.
U-M and Netscape select arenas for further 
testing in production environments.
 - July 1, 1999
Stage three begins.
U-M is working with Netscape
and several service
providers to test and deploy some of the most 
promising developments. 
- September 1, 1999
U-M provides two to three papers 
describing deployment experiences.
U-M provides final status paper for 
Netscape to review. 
- November 1, 1999
Netscape and U-M agree on final 
work completion. Stage three is
complete. 
If you have comments or suggestions, email
linux-scalability@citi.umich.edu
Revision: $Id: workscope.html,v 1.12 1999/11/12 20:15:46 cel Exp $