Adding Reliability Support to Nswap,
a Network Swapping System for Linux Clusters

Swarthmore College CREU Project, 2004-2005
Funded by CRA-W CREU Grant

Students:

America Holloway, Heather Jones, Jenny Barry

Faculty Sponsor:

Tia Newhall

Project Abstract:

We are working with Nswap (a network swapping module for Linux clusters). Nswap allows clustered or networked computers to store overflow memory in the idle memory of another computer rather than writing it to the hard disk. Currently, even on a 1 or 10 Gbit Ethernet, Nswap almost always outperforms swapping to disk. However, the system has some issues. In particular, if one computer in a cluster crashes, it could potentially crash programs running on any other computer in the cluster that stores pages on the faulty node. Our research focuses on methods for solving this problem. We are still in the research stage, but we are considering applying a RAID-like approach in our solutions. (Full CREU Proposal)


Jenny and America at CCSCNE'05

Final Report

Goals and Purpose

The goal of our project was to develop, implement, and test reliability algorithms for a network swapping system running on Linux clusters. Network swapping systems allow individual cluster nodes with over-committed memory to use the idle memory of remote nodes as their backing store, and to swap their pages over the network. Network swapping is motivated by two observations: first, network speeds are improving more quickly than disk speeds, and this disparity is likely to grow [3]; second, there is often a significant amount of idle memory space in a cluster that can be used for storing remotely swapped pages [1, 2, 4]. Thus, swapping to local disk will be slower than transferring pages over a fast network and using remote idle memory as a "swap device".

As the number of nodes in a cluster increases, it becomes more likely that a node will fail or become unreachable, making it important that such a system provide reliability support. Without reliability, a single node crash can affect programs running on other cluster nodes by losing remotely swapped page data that was stored on the crashed node. Any reliability support adds extra time and space overhead to remote swapping: reliability information uses idle RAM space that could otherwise store remote pages, and it typically requires extra computation and message-passing overhead. RAID-based [5] reliability schemes are likely to provide a good balance between reliability and cost, as well as some flexibility.
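As a toy illustration of the parity idea behind RAID-style schemes (not Nswap code), the C snippet below rebuilds one lost "page" (here a single byte) by XORing the parity with the surviving data:

    /* Toy illustration of RAID-style parity recovery: the parity is the
     * XOR of the data, so any one lost item can be rebuilt by XORing the
     * parity with the survivors. Values here are arbitrary examples. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t a = 0x3C, b = 0xA7, c = 0x51;   /* three "pages", one byte each */
        uint8_t parity = a ^ b ^ c;             /* kept on the parity server */

        uint8_t rebuilt_b = a ^ c ^ parity;     /* recover b after its node fails */
        assert(rebuilt_b == b);
        printf("recovered 0x%02X\n", rebuilt_b);
        return 0;
    }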

Nswap has design features that make implementing a strict RAID-like reliability scheme difficult. First, Nswap is designed to adapt to each node's local memory needs. The amount of local RAM space each node makes available to Nswap for storing remotely swapped pages grows and shrinks in response to the node's local processing needs: the amount of "swap" space is not fixed in size, and an individual node's Nswap storage capacity changes over time. The second difficulty is caused by Nswap's support for migrating remotely swapped pages between cluster nodes. Remote page migration occurs when a node needs to reclaim some of its RAM space from Nswap to use for local processing. Page migration complicates reliability support. For example, two pages in the same parity group could end up on the same node, resulting in a loss of reliability for that parity group. Lastly, because Nswap's protocols are asynchronous, there is no guarantee about the global state of the system at any given time. Prior work on reliability schemes for network swapping systems has relied on fixed placement of page and reliability data and is not applicable to Nswap.

Process

We spent the first semester learning about cluster computing, researching reliability schemes, and familiarizing ourselves with the existing Nswap system. Based on our research, we developed our own reliability scheme that accounts for the unique aspects of Nswap.

Our solution uses parity logging and dynamic parity groups. In a cluster of m nodes, m-1 are normal Nswap nodes and the remaining node is a dedicated parity server, which stores parity pages. We use parity logging to reduce network traffic to the parity server. In parity logging, a client calculates the parity page (the bitwise XOR of the page data) for a set of n page swap-outs. Only after n swap-outs does it send the parity page to the parity server.
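The user-space C sketch below illustrates the parity-logging idea just described, assuming a 4 KB page size and a group size of n = 4; the names used are illustrative placeholders, not Nswap's actual kernel interfaces.

    /* Minimal sketch of client-side parity logging: XOR each swapped-out
     * page into a running parity page, and only send one message to the
     * parity server per GROUP_SIZE swap-outs. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE  4096
    #define GROUP_SIZE 4        /* n swap-outs per parity page (assumed) */

    static uint8_t parity_page[PAGE_SIZE];   /* running XOR of the current group */
    static int     pages_logged;

    /* Stub: in the real system this would be a network send to the parity server. */
    static void send_to_parity_server(const uint8_t *page)
    {
        printf("parity page sent (first byte 0x%02X)\n", page[0]);
    }

    /* Called on each page swap-out: fold the page into the local parity log. */
    void log_swap_out(const uint8_t page[PAGE_SIZE])
    {
        for (int i = 0; i < PAGE_SIZE; i++)
            parity_page[i] ^= page[i];

        if (++pages_logged == GROUP_SIZE) {
            /* Only now does a message go to the parity server. */
            send_to_parity_server(parity_page);
            memset(parity_page, 0, sizeof parity_page);
            pages_logged = 0;
        }
    }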

The protocol for migrating a page under our scheme is slightly more complicated than in the original Nswap. When a server needs to migrate a page, it first tries to move the page to a node where it can remain in the same parity group. If no such node is available, the page is still migrated, but the parity server changes the page's parity group membership. The parity server is also responsible for initiating the recovery algorithm that restores lost page data after a node failure.
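The sketch below outlines that migration decision; all of the helper names are hypothetical placeholders standing in for Nswap's internal routines rather than its real interfaces.

    /* Sketch of the page-migration decision: prefer a target node that
     * keeps the page in its current parity group; otherwise migrate anyway
     * and tell the parity server to regroup the page. */
    #include <stdio.h>

    typedef int node_id;        /* -1 means "no suitable node" */

    /* Stub: search for a node holding no other page of this parity group. */
    static node_id find_node_outside_group(int group) { (void)group; return -1; }

    /* Stub: pick any node with free Nswap space. */
    static node_id find_any_node(void) { return 3; }

    /* Stub: ship the page to its new home over the network. */
    static void migrate_page(node_id to, int page_id)
    { printf("migrating page %d to node %d\n", page_id, to); }

    /* Stub: ask the parity server to move the page into a new parity group. */
    static void regroup_at_parity_server(int page_id, node_id new_home)
    { printf("page %d regrouped; now on node %d\n", page_id, new_home); }

    /* Migrate one remotely stored page off this node. */
    void migrate_out(int page_id, int group)
    {
        node_id target = find_node_outside_group(group);

        if (target != -1) {
            /* The page stays in its parity group: no parity change needed. */
            migrate_page(target, page_id);
        } else {
            /* Migrate anyway; the parity server updates group membership. */
            target = find_any_node();
            migrate_page(target, page_id);
            regroup_at_parity_server(page_id, target);
        }
    }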

In the second semester, we implemented a limited version of our reliability scheme on a six-node Linux cluster. We created a simple parity server that only accepted parity pages, and we implemented parity logging. Although we did not fully implement the parity server or dynamic parity groups, these aspects of the algorithm are unlikely to significantly affect cluster performance; system performance is dominated by message passing, and parity logging accounts for the only major increase in message traffic.

Conclusion and Results

We tested our system against swapping to disk, against mirroring, and against the original Nswap algorithm without reliability, using three workloads designed to cover both cases in which disk swapping is expected to perform well and cases in which it is expected to perform poorly. We were pleased with the results: Nswap with reliability was only marginally slower than the original Nswap system and faster than mirroring. As expected, all three versions of Nswap were significantly faster than swapping to disk.

Publications

"Reliability for Nswap", Jenny Barry, America Holloway, Heather Jones, Advisor: Tia Newhall, Tenth Annual Consortium for Computing Sciences in Colleges Northeastern Conference Student Research Poster Session, Providence, RI, April 2005

www.cs.swarthmore.edu/~newhall/creu

References

  1. Anurag Acharya and Sanjeev Setia. Availability and Utility of Idle Memory on Workstation Clusters. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 35-46, May 1999.
  2. Remzi H. Arpaci, Andrea C. Dusseau, Amin M. Vahdat, Lok T. Liu, Thomas E. Anderson, and David A. Patterson. The Interaction of Parallel and Sequential Workloads on a Network of Workstations. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 267-278, 1995.
  3. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann, 2002.
  4. Evangelos P. Markatos and George Dramitinos. Implementation of a Reliable Remote Memory Pager. In USENIX 1996 Annual Technical Conference, 1996.
  5. David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In ACM SIGMOD International Conference on Management of Data, pages 109-116, 1988.