Harvard University: Parallel Database Systems and a Distributed Data Coherence Strategy

In the following paper I will explore how database applications attempt to exploit a variety of parallel architectures. I will look briefly at two approaches to the text search problem. Next, I will explore in more depth commercial Relational Database Management Systems (RDBMSs), which sit in a layer below a variety of database applications. The applications layered on top of the RDBMS might require high transaction volumes from many concurrent users or lengthy batch-type processing. A scalable database application should support both scaleup and speedup. Scaleup enables the data volume to be increased along with the hardware without increasing execution time; speedup requires execution time to decrease as hardware is added [2]. It is important to note that it is the RDBMS which actually exploits the parallel hardware.
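
Stated as ratios, following the definitions in [2] (the notation below is mine):

```latex
% Speedup: N times more hardware applied to a fixed problem should cut
% elapsed time by a factor of N (linear speedup).
\mathrm{speedup} = \frac{T_{\text{small system}}}{T_{\text{big system}}}

% Scaleup: N times more hardware applied to an N-times-larger problem
% should hold elapsed time constant; linear scaleup keeps this ratio at 1.
\mathrm{scaleup} = \frac{T_{\text{small problem on small system}}}{T_{\text{big problem on big system}}}
```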

The hardware architectures focused on will include SMP architectures, clustered systems, loosely coupled systems, and combinations of the three. In each case, several software strategies address a particular hardware architecture. A detailed look at shared memory utilization, the use of a Distributed Lock Manager (DLM), and data partitioning follows. These strategies allow an RDBMS to utilize many nodes and/or processes in parallel.

These systems all provide some level of scalability by utilizing parallel hardware. At the same time, the strategies do suffer from several drawbacks. A clustered system is limited by high contention for a single critical resource or process, and there can be large administrative complexity associated with data partitioning for loosely coupled machines. I will also propose a solution which builds on an area of extensive research.

Much work has been done on protocols to ensure cache coherence in shared memory systems. Implementing these in the RDBMS layer would provide scalability across a large number of distributed computers. The RDBMS would take advantage of an existing network of data and application servers. A particular data server might run the RDBMS on a 32-node hypercube or on a single RISC processor. I propose that the RDBMS run in parallel across the application and data servers, merging the line between the two. A directory-based protocol could be used to enforce distributed data coherence across the servers, and the RDBMS would take advantage of coherence commands directly supported by current or future hardware. Caching data on the application server would improve performance by minimizing remote server access.

2. THE TEXT SEARCH PROBLEM

A. The LEXIS Legal Decision Retrieval System

The LEXIS service provides the legal profession with timely search and retrieval operations on millions of legal documents. The system constructs very large index structures on all key words in each document [1]. This operation is very time and space intensive; because of this, LEXIS is unable to address a quickly changing database. The scalability of LEXIS could improve with several strategies used by RDBMSs. A parallel index build such as the one implemented by Informix Software could allow LEXIS to take specific advantage of SMP architectures and thereby support a more dynamic database. Additionally, LEXIS could support more users accessing a single database if it took advantage of a clustered system's shared disk. This is especially true considering the application is read-only. If the application were read/write, it might use a DLM in order to take advantage of a clustered architecture. Both schemes are examined in greater detail below.

B. The Connection Machine Approach

While the Connection Machine (CM) does address a variety of applications, its massively parallel architecture is especially applicable to the text search problem. The CM proves to be very scalable: both scaleup and speedup are supported by the text search implementation, and subdividing the entire task is very straightforward. The CM's primary strategy revolves around distributing both articles and their required processing across the available nodes. Each node hashes its articles into vectors. The vectors are saved to disk for future use, but remain in the node's memory for most of the search operation. The search criteria are combined into another vector, and the relationship between the search and article vectors predicts a "hit" with 99.99% accuracy [1]. No path selection phase is required by the search; a "full scan" of the data is performed each time. The optimization phase required by an RDBMS makes such massive partitioning of data and processing much more difficult, since a single relational database object cannot be effectively distributed at such a high granularity.
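
The vector scheme can be sketched as a superimposed-coding signature test. The following is a minimal illustration of the idea, not the CM implementation; the vector width, hash count, and all names are arbitrary choices:

```python
import hashlib

VECTOR_BITS = 1024   # width of each article's signature vector (assumed)
NUM_HASHES = 3       # bit positions set per word (assumed)

def word_bits(word: str) -> set[int]:
    """Hash one word to a small set of bit positions."""
    return {
        int(hashlib.sha256(f"{i}:{word}".encode()).hexdigest(), 16) % VECTOR_BITS
        for i in range(NUM_HASHES)
    }

def article_vector(text: str) -> int:
    """Superimpose the bit patterns of every word in the article."""
    vector = 0
    for word in text.lower().split():
        for bit in word_bits(word):
            vector |= 1 << bit
    return vector

def may_contain(article_vec: int, query: str) -> bool:
    """A "hit" requires every query bit to be set in the article vector.
    False positives are possible (hence ~99.99% accuracy, not 100%);
    false negatives are not."""
    query_vec = article_vector(query)
    return article_vec & query_vec == query_vec

# Each CM node would hold many article vectors and test them in parallel.
docs = ["the court reversed the ruling", "the patent claim was denied"]
vectors = [article_vector(d) for d in docs]
print([may_contain(v, "patent claim") for v in vectors])  # expect [False, True]
```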

3. RDBMS EXPLOITATION OF PARALLEL HARDWARE ARCHITECTURES

In this section I will focus on how RDBMSs take advantage of SMP hardware, clustered systems, and loosely coupled systems, pointing out various advantages and limitations. Database parallelism can take several forms. Relational operators map easily into a pipelined structure, but each stage might be lengthy and require individual completion; search and sort are good examples of such operators. These operations can be subdivided if a combine phase is added to the operator. This type of parallelism is often called partitioned parallelism, as opposed to pipeline parallelism [2]. Because I/O plays such a significant role, partitioning data across disks increases throughput by allowing parallel access. This requires that CPU-intensive work be partitioned hand in hand with the data. The strategy is used with both SMP systems and loosely coupled systems.
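
As a concrete sketch of partitioned parallelism (my own illustration, not code from [2]): each partition is scanned and sorted independently, and a combine phase merges the sorted partial results.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

def scan_and_sort(partition: list) -> list:
    """Per-partition work: filter (scan) the local rows, then sort them."""
    return sorted((row for row in partition if row[1] > 100),
                  key=lambda row: row[1])

def partitioned_sort(partitions: list) -> list:
    # Partitioned phase: each disk partition is processed independently,
    # ideally by a worker running close to the disk that holds it.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(scan_and_sort, partitions))
    # Combine phase: merge the sorted partial results into one stream.
    return list(heapq.merge(*partials, key=lambda row: row[1]))

if __name__ == "__main__":
    parts = [[("a", 150), ("b", 50)], [("c", 300), ("d", 120)]]
    print(partitioned_sort(parts))  # [('d', 120), ('a', 150), ('c', 300)]
```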

A. Exploiting A Symmetric Multi Processor (SMP) Architecture

RDBMSs take advantage of an SMP machine in several ways, all of which utilize the shared memory to cache data rows: I/O is minimized and concurrency is optimized. The individual CPUs implement their own internal cache coherence transparent to the RDBMS. Read/write isolation and concurrency at the table level are implemented by the RDBMS. Typically, read and write locks are supported via latches, which control access to a shared resource in memory. A process desiring a latched resource typically spins while another process is in a critical section. A single multithreaded database server can also be split up into virtual servers.
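
A latch of this kind reduces to a test-and-set spinlock; a minimal sketch of the idea (not any vendor's implementation):

```python
import threading
import time

class Latch:
    """Minimal spinning latch guarding a shared in-memory resource."""

    def __init__(self):
        self._held = threading.Lock()   # stands in for a test-and-set word

    def acquire(self):
        # Spin: keep retrying while another process is in its critical section.
        while not self._held.acquire(blocking=False):
            time.sleep(0)               # yield the CPU between attempts

    def release(self):
        self._held.release()

buffer_latch = Latch()
buffer_latch.acquire()
# ... read or modify the shared buffer-cache entry ...
buffer_latch.release()
```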

Figure 1

One class of server, perhaps an optimizer, can operate while an I/O server is blocking. Additional servers of a particular class can be added when needed to increase throughput. This allows database use to scale without excessive overhead: hardware parallelism is exploited to allow more applications to access the database simultaneously. Batch-oriented decision support applications can also exploit the hardware parallelism of SMP platforms.

Informix Software and Sequent Computer Systems are involved in joint development of a Parallel Data Query (PDQ) project. PDQ allows an index build operation to be serviced by numerous processes. First, many processes perform an optimized merge sort on the data to be indexed. Then these processes cooperatively manipulate the index structures in memory, while other processes write dirty pages to disk in synchronization with the manipulating processes. This feature has cut index build time by several orders of magnitude [6].

The project also applies partitioned parallelism to the sort, scan, and join operators [6]. At the heart of the system is an optimizer with detailed knowledge of how data is partitioned across disks. When a table is created, the partitioning method is specified: data can be separated by range, by hashing, or in a round-robin fashion.

Figure 2

Round-robin partitioning is especially good for applications that need sequential access, while hashing is very effective if tables are to be joined in the retrieve operation [2]. The optimizer spawns processes based upon the query, the data partitioning, and the data distribution. These processes can communicate and share data via the hardware-supported shared memory. The PDQ project has realized large gains in both speedup and scaleup.
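
Each declustering method reduces to a small function mapping a row to a disk; a minimal sketch (the four-disk count and range boundaries are arbitrary):

```python
NUM_DISKS = 4

def round_robin_partition(row_number: int) -> int:
    """Good for sequential scans: rows spread evenly in insertion order."""
    return row_number % NUM_DISKS

def hash_partition(key) -> int:
    """Good for joins and exact-match lookups: equal keys land together."""
    return hash(key) % NUM_DISKS

RANGES = [(0, 1000), (1000, 5000), (5000, 20000), (20000, float("inf"))]

def range_partition(key: float) -> int:
    """Good for range queries: each disk holds one contiguous key range."""
    for disk, (low, high) in enumerate(RANGES):
        if low <= key < high:
            return disk
    raise ValueError("key below the lowest range boundary")

print(round_robin_partition(10), range_partition(2500))  # 2 1
```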

B. Loosely Coupled Architectures

The Gamma database system extends the communication and cooperation of PDQ to a loosely coupled architecture. The system currently runs on a 32-node Intel iPSC, where each node has its own private disk. All three partitioning methods mentioned above are supported, as well as several hybrid strategies [2]. Communication between the nodes is handled by a combination of hardware and software routing. The system is more complicated than PDQ because it does not allow the same level of data sharing: in order to access a particular disk, a process must run on the node to which that disk is attached, and the data must then be routed to the appropriate nodes. This creates a possible hot spot or contention point. Even so, Gamma displays excellent scaleup and speedup. Next I will look at another loosely coupled system.

The NCR 3600 is a highly configurable hardware platform. It can contain up to eight Application Processors (APs), several Parse Engines (PEs), and up to ten Access Module Processors (AMPs) with disk subsystems. Data is partitioned across the AMPs in a round-robin fashion. These components are connected by a high-speed Interconnection Network (IN) called the Y-NET [7].

Figure 3

The APs are collections of two to eight 486 processors. They support a global shared memory with a directory-based implementation; this is required over a snoopy-cache protocol due to the dual-bus architecture of the Y-NET. The APs support virtual memory and RAID disk subsystems, can exchange messages via TCP and UNIX signals, and support unique process identifiers across APs. The APs are used to run applications making database requests, while the PEs and AMPs together retrieve the data. The PEs determine the execution method for an SQL statement, including which AMPs should search their disk subsystems. The AMPs send the data over the Y-NET, and it is presented to the AP as a single collection of data. The Y-NET uses a binary tree structure with dual channels.

Figure 4

It is able to combine data from up to ten AMPs and return it in a sorted fashion [7]. This specialized piece of hardware separates the 3600 from a generic hypercube. It speeds response time by implementing a basic database function at the hardware level.
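
In software terms, the Y-NET performs a tree of pairwise merges over the AMPs' sorted output streams; a purely illustrative analogue:

```python
import heapq

def tree_combine(streams: list) -> list:
    """Merge sorted AMP output streams pairwise up a binary tree,
    mimicking the Y-NET's dual-channel tree of merge nodes."""
    level = [list(s) for s in streams]
    while len(level) > 1:
        next_level = [
            list(heapq.merge(level[i], level[i + 1]))
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:                 # an odd stream passes up unmerged
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else []

amp_outputs = [[1, 7], [3, 4], [2, 9], [5, 6]]   # already sorted per AMP
print(tree_combine(amp_outputs))                 # [1, 2, 3, 4, 5, 6, 7, 9]
```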

C. Clustered Approaches

The NCR 3600 also supports another method of parallelism: up to four APs can be clustered around a single disk. Many hardware providers allow a single disk system to be accessed simultaneously from multiple computers, which must synchronize their activity to assure integrity on the disk. This type of architecture was highly developed in Digital's VAX Cluster. The technology enables a new computer to be added to the cluster, so that more database users can access a single database. The size of the cluster is limited by the actual disk and bus hardware; current technology does not allow the system to scale past about ten machines. A Distributed Lock Manager (DLM) is used to manage all I/O. The VAX's lock manager supports system calls to enqueue, dequeue, and report the status of lock requests. It directly supports local buffer caching to decrease I/O in low-frequency-update situations; alternatively, resource versioning is supported for high-update situations [3]. It is up to the application to choose the type of locking.
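
The enqueue/dequeue interface can be modeled as follows. This is a toy, single-process stand-in for the cluster-wide service described in [3], not the actual VMS API; all names are mine:

```python
import threading
from enum import Enum

class Mode(Enum):
    SHARED = "read"        # many readers may hold the lock together
    EXCLUSIVE = "write"    # one writer excludes all other holders

class DistributedLockManager:
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}   # resource name -> list of (owner, mode)

    def enqueue(self, owner: str, resource: str, mode: Mode) -> bool:
        """Grant the lock if compatible with current holders, else refuse.
        (A real DLM would queue the request and notify on grant.)"""
        with self._guard:
            holders = self._locks.setdefault(resource, [])
            all_shared = all(m is Mode.SHARED for _, m in holders)
            if holders and not (all_shared and mode is Mode.SHARED):
                return False
            holders.append((owner, mode))
            return True

    def dequeue(self, owner: str, resource: str) -> None:
        with self._guard:
            self._locks[resource] = [
                (o, m) for o, m in self._locks.get(resource, []) if o != owner
            ]

    def status(self, resource: str) -> list:
        with self._guard:
            return list(self._locks.get(resource, []))

dlm = DistributedLockManager()
print(dlm.enqueue("node1", "page-42", Mode.SHARED))     # True
print(dlm.enqueue("node2", "page-42", Mode.SHARED))     # True: readers share
print(dlm.enqueue("node3", "page-42", Mode.EXCLUSIVE))  # False: must wait
```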

The Oracle Parallel Server (OPS) currently uses such a strategy. It is implemented on many clustered systems, including the Encore MultiMax. Because of the DLM's overhead, OPS minimizes its use: it implements a local versioning strategy for read isolation [8], has been optimized to generate unique sequence numbers across nodes, and allows concurrent inserts on each node without contention during space allocation [4]. Even with these optimizations, scaleup and speedup are limited by the disk technology and by contention for the DLM. The fact remains that this is a well-understood technology and has historically enabled performance gains for an existing hardware solution [5].

The Sequoia Series 400 is a fault-tolerant multiprocessor machine with shared memory support. Cache coherence is supported with a non-write-through cache, because of the special memory checkpointing required for fault tolerance. A dual bus supports up to 64 tightly coupled processing elements [10]. An RDBMS cannot use the Series 400's memory like a traditional SMP platform: two processes running on different processors cannot write to the same piece of memory efficiently, because of the non-write-through cache and the large block size used when writing memory from the cache to main memory. OPS can nevertheless take advantage of the Series 400's many processors. Multiple instances of OPS are run on the different processors, each with its own private memory, and all the instances share a single database on disk. The DLM is used to control parallel access to the database, and the entire task must be split by the developer across the available instances of OPS. Even though the instances cannot use shared memory to store shared data, they are able to pass messages via the shared memory. A DLM would not be necessary if the disks were assigned to individual instances of the RDBMS. This type of system is more scalable than the clustered systems, but requires very complicated partitioning strategies; the partitioning is both critical to performance and a manual task.

4. DISTRIBUTED DATA COHERENCE

My proposal is an RDBMS which allows a single data page to be cached in the memory of many different computers. Data coherence could be enforced with an invalidation-based scheme. Each computer would also use its private disk to store actual data pages, and the physical disks attached to each computer in the network would form a logical disk for the entire system. This would be conceptually similar to the distributed global memory of the DASH project, and the system would use similar notions of home and remote pages. The RDBMS would be able to take advantage of hardware-supported cache coherence commands.

In a shared memory machine we have a fixed amount of memory and an easy way to reference it. Instead of recording the state of individual memory addresses, the RDBMS would ensure distributed data coherence by tracking the state of individual disk pages at their home nodes. Each computer's optimizer would determine a query path based on a single global disk layout. The query optimizer would run on each computer and issue the appropriate coherency commands based upon the state and location of the required disk pages. This does increase the demands on the optimizer. Research could help determine the optimal page size for reads and for write-through.
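
A minimal sketch of the directory state a home node might keep per disk page, loosely following DASH-style invalidation (the state names and methods are my own simplification):

```python
from enum import Enum

class PageState(Enum):
    UNCACHED = 0   # only the home disk copy exists
    SHARED = 1     # clean copies cached at one or more remote servers
    EXCLUSIVE = 2  # one server holds a writable (possibly dirty) copy

class HomeDirectory:
    """Per-page coherence directory kept at the page's home node."""

    def __init__(self):
        self.state = {}    # page id -> PageState
        self.sharers = {}  # page id -> set of server ids caching the page

    def read_request(self, page: int, server: str) -> None:
        """Grant a cached read copy; a real protocol would first fetch the
        dirty copy back from an exclusive holder."""
        self.sharers.setdefault(page, set()).add(server)
        self.state[page] = PageState.SHARED

    def write_request(self, page: int, server: str) -> None:
        """Invalidate every other cached copy, then grant exclusivity."""
        for sharer in self.sharers.get(page, set()) - {server}:
            self.invalidate(page, sharer)
        self.sharers[page] = {server}
        self.state[page] = PageState.EXCLUSIVE

    def invalidate(self, page: int, server: str) -> None:
        # A real protocol sends an invalidation message to the server's cache.
        print(f"invalidate page {page} at {server}")

directory = HomeDirectory()
directory.read_request(7, "app-server-1")
directory.read_request(7, "app-server-2")
directory.write_request(7, "app-server-1")  # invalidates app-server-2's copy
```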

The optimizer could track where a disk page is accessed most often. The data on that page could be moved to the computer that uses it the most. In other words, certain rows might move their “home” location. This would provide an optimal data partitioning that could change along with actual access patterns. The partitioning would be automatic.

The DASH system must deal with potentially long write latency. The RDBMS gives us some relief from this problem: in a database application we can perform asynchronous disk writes as long as we write the log file synchronously. If each system writes its log synchronously and defers its writes to remote "home" disk pages, we can avoid a long latency for the application.
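
A sketch of the ordering rule (my illustration of the standard write-ahead idea, not a particular product's code):

```python
class LoggedPageWriter:
    """Write-ahead ordering: the local log record is forced to disk
    synchronously; the remote home-page write is deferred."""

    def __init__(self):
        self.pending_home_writes = []

    def commit_update(self, page: int, data: bytes) -> None:
        self.force_log(page, data)                    # short, local latency
        # The slow write to the remote home page no longer blocks commit:
        self.pending_home_writes.append((page, data))

    def force_log(self, page: int, data: bytes) -> None:
        # A real system appends to the local log file and fsyncs it here.
        print(f"log forced for page {page}")

    def flush_home_pages(self) -> None:
        """Background task: push deferred writes across the network."""
        while self.pending_home_writes:
            page, _ = self.pending_home_writes.pop(0)
            print(f"asynchronous write of page {page} to its home node")

writer = LoggedPageWriter()
writer.commit_update(7, b"new row image")  # returns after the local log force
writer.flush_home_pages()                  # long-latency work happens later
```
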
Hopefully this paper has served as an introduction to existing forms of database parallelism. The alternative proposed here could integrate both parallel and non-parallel RDBMSs into a single distributed system: an RDBMS capable of caching data at many sites simultaneously.

Bibliography
[1] A. R. Hurson, et al., "Specialized Parallel Architectures for Textual Databases", Advances in Computers, Vol. 30, 1990.
[2] D. DeWitt, J. Gray, "Parallel Database Systems, The Future …", Communications of the ACM, June 1992, Vol. 35, No. 6.
[3] Digital Equipment Corp., "Lock Management Services", Chapter 12, VMS Cluster Reference, 1990.
[4] Encore Computer Corp., "The Encore Infinity 90 Series and The Oracle Parallel Server", SM-ECC-0992, 1992.
[5] Gartner Group, "The Rush to Clusters", Research Note, March 15, 1993.
[6] Informix Software and Sequent Computer Systems, "Parallel Data Query", Document number 000-20182-70 or DB-1030, June 1991.
[7] NCR Corp., "NCR System 3600 Product Description", page 1-1, NCR E&M San Diego, ST-2119-91, 1992.
[8] Oracle Corp., "Oracle 7 for Digital Alpha AXP", Part No. 53129.0293, February 1993.
[10] Sequoia Systems, Inc., "Series 400 Technical Summary", page 1-1, Marlborough, Mass., June 1992.
Figure 2: page 86 of [2]
Figure 3: page 14 of [7]
Figure 4: page 28 of [7]