Description
This is the training course for which I searched, in vain, when I was given my first terabyte database project. It’s also the training course I wish I’d had, as I watched my software grind on for several hours, while ‘top’ showed me that the machine was 80% idle. Also, to add insult to injury, my multi-threaded program was sitting stalwartly on one CPU, leaving the other eleven CPU’s doing virtually nothing.
If you need to manipulate data in databases where several hundred tables, each containing over five million rows, and you need to finish the task before the office opens in the morning, then the software architecture techniques described here will help you. If you work with smaller data volumes, then your software will run in five minutes, instead of one hour. In either case, choosing the correct architecture for your software will make an order of magnitude difference to the run time.
This training course is neither language-specific nor database-specific. However, the Unix operating system is written in ‘C’, and the kernel system call interface is a ‘C’ interface. If you can afford the five to nine times performance hit resulting from wrapping all the system code described here in a C++ layer, then that is your choice. It’s a bit like tuning your car for speed, then hitching a trailer to it.
Most of the major database vendors support the ANSI standard embedded SQL format, with few variations. Although the code examples refer to Pro*C, throughout this training course, all of the example code will work equally well on either DB2 or Informix and, possibly, Sybase.
It may be evident, from the above, that this is not a training course for those new to programming, or to the Unix operating system. The design of software architecture occurs at the next higher level of abstraction, so we have to assume a fairly expert knowledge of, at least, the ‘C’ programming language. Also, although we will devote a little time to the construction of embedded SQL programs, we will assume that the reader is proficient in the design of efficient SQL queries. Clearly, this implies a high level of familiarity with the workings of a relational database, such as Oracle, DB2, Sybase or Informix.
The aim of this training course is to help you, the expert programmer, to optimally use the resources of your machine to minimize the run time of your program.
In order to achieve this objective, we will describe, in moderate detail and with copious code examples, where appropriate:
• The kernel system calls which we need for the creation and manipulation of processes and daemons, including:
• fork()
• exec()
• select()
• Performance-optimal methods of dynamic memory allocation and reallocation
• Inter-process communication methods, including:
• Sockets and the design of client-server systems – forking, multi-threaded, and queued.
• Queues and FIFO’s
• Shared memory
• Software interrupts
• Pipes
• The principles of applying multi-threading techniques, using pthreads.
• The design of process schedulers
• Using hash tables for random access to data.
• Preferred methods of writing embedded SQL, portable across Oracle Pro*C, Informix ESQL, and IBM embedded SQL. This includes:
• Multiple connections to the same or different databases
• Cursor creation
• Array cursor SELECTs
• Array INSERTs
• Using qsort() instead of ORDER BY
Most training courses on programming techniques include an example application, which is usually totally useless, and which the reader is invited to code and compile as part of the learning process. We have departed from this model slightly, in so far as we define a generic application, and provide modular code, which can be reconfigured to suit individual variations.
We will look at two basic architectures which feature fairly heavily in the information technology field.
The message hub
This architecture receives data messages, as files, TCP/IP streams, or any other format. It performs some kind of processing on the data, and then passes it on to the next destination – not necessarily a database.
The data manipulation hub
The functionality embodied in this architecture is a superset of that in the message hub. This process extracts data from a database, performs some kind of intermediate processing, then puts the data back into either the same, or a different database.
This is the application on which we will concentrate our coding efforts.
Since either of the above may be implemented as a daemon or background process, which wakes up at a particular time to do its processing. We will also examine the design of a simple, efficient scheduling mechanism.
It is always easier to relate to an application, if it happens to coincide with one’s own line of work. In view of this, we make the following suggestions, as to what our generic application actually does, and invite readers to select that which is most suitable.
• A bank database, comprising of customer data, account details, details of financial services, transactions etc.
• A major Telco database, comprising of customer data, account details, details of telecommunication services etc.
• A retail chain database, comprising customer data, account details, details of stock, stores, prices etc.
• Any other field, where databases of over a terabyte need to be interrogated, or supplied with data.
INSIDE
03 Introduction
06 Database Query Optimization for Huge Databases
16 Inserting and Retrieving Data from Huge Databases
26 Implementations to Achieve Performance Goals in The Context of Multi-Dimensional Architecture
29 Processing Data in Parallel Using Multi-threading
37 Processing Data in Parallel Using Multithreading – the Mutex
43 Software Design for High Performance Applications
45 High Performance Applications – Performance Considerations
50 The Design of Hash Tables for Fast Retrieval of Data
54 Binary Trees and Binary Search
58 Introdution to Unix Kernel
65 Unix Kernel – System Calls
72 Basic Unix Queuing Techniques
77 Server and Client Code
85 Advanced Unix Queuing Techniques
113 Unix Inter-process Communication Using Pipes
121 Unix Inter-process Communication Using Shared Memory
133 Unix Inter-process Communication Using TCP/IP
145 Dynamic Memory Allocation in Unix Systems
158 Introduction to Signals and Interrupt Handlers
162 Interrupt Mask
170 About the Author
Mark Sitkowski
Mark Sitkowski is a Chartered Engineer and a Corporate Member of the Institution of Electrical Engineers in London. His early career revolved around the writing of analog and digital circuit simulators, and digital signal processing applications. In Australia, he moved to writing financial software for the major banks, and telecommunications software for Telcos, besides conducting training courses on Unix and database applications. Formerly a consultant to Forticode Security, he currently works with Design Simulation Systems on mobile multi-factor authentication systems.
Design Simulation Systems Ltd
http://www.designsim.com.au
[email protected]