|
Summary
Use/Applications
Technical
Multi-core Enabled
Pocket Soft's Bona Fides
RTPatch OEM
More Information
 
|
RTPatch Server: Multi-core Enabled
::BackgroundRTPatch Server software product (previously called dfc-gorilla), uses a patented byte-level differencing algorithm to extract only the changes between two revisions of a file. RTPatch Server employs a resource efficient differencing algorithm that utilizes a fixed memory window and streaming read (i.e., single pass) of the input files, yielding linear build time ("patch" creation), with respect to input file size at a given change pattern. Resource efficiency and linear build time enables RTPatch Server to process multi-gigabyte sized files in a restricted and predictable timeframe.
As with other forms of byte-level differencing, byte-level differencing in RTPatch Server is CPU-bound.
::Problem
RTPatch Server, version 4 and previous, used a single threaded differencing step, therefore execution on systems with multiple cores produced results no faster than a single core/processor alternative. Additionally, since the individual core clockspeed tends to be slower on a multi-core machine compared to the single processor alternative, Build times on "faster" multi-core machines were slightly longer than on contemporary single core/processor hardware.
RTPatch Server's inability to access all of the multi-core CPU power of a typical server or developer's workstation was a technical limitation. Although this limitation was anticipated during the initial development of the RTPatch Server Build algorithm, the challenges of implementing a solution were significant. One option to make use of additional processors when computing differences for multiple files is to assign a file pair to each core; however, since RTPatch Server 5 is designed for very large file processing (multi-gigabyte), it is not uncommon in the "real world" that only a single file pair need be processed by a user of RTPatch Server 5. Assigning file pairs to different cores, therefore, provides no benefit to the common RTPatch Server 5 user. Further, even in cases where multiple files pairs require differencing, as is the case in online file backup applications, there typically remains a file that is large relative to all other files, and causes a bottle-neck on the overall application.
For these reasons, the problem in RTPatch Server 5 is utilization of multiple cores when differencing a single file pair.
::Solution
RTPatch Server 5 introduces multithreaded byte-level differencing, dramatically reducing Build time. Goals set at the start of development:
- Achieve 90+% CPU utilization when differencing a file of sufficient size, and when instructed to use all available cores.
- No Loss of Patch Efficiency (i.e., patch size). Dividing a single file into non-overlapped sections, differencing on those sections, then assembling into one patch is a relatively easy to implement approach; however, depending on the scattering of change and the choice of division points, the generated patch file could be significantly larger than the single threaded approach, even in the common case.
- Overall Application Speedup. Portions of RTPatch Server 5 remain single-threaded (e.g., checksum generation) or do not otherwise benefit significantly from additional processing power, i.e., Amdahl's/Gustafson's Laws. However, the typical RTPatch Server input is relatively large (greater than 1GB), so the vast majority of the processing time is spent in the multi-core enabled portion of code - identifying the changes and similarities between two files. Though overall application speedup is the primary goal of RTPatch Server 5, the amount of speedup will still depend on the input itself. In any case, the maximum application speedup that was considered achievable was n-1 times faster where "n" is the number of physical cores present on the test machine (e.g., 7x faster on an 8 core machine). Consider three examples:
Example 1: A file pair that is a typical file update, resulting in a 90% reduced patch;
Example 2: A file pair that is completely dissimilar and results in a 40% reduced patch (i.e., approximating simple compression);
Example 3: A file pair that is nearly identical, and results in a 99+% reduced patch.
In the case of Example 1, overall application time is primarily devoted to the actual difference processing (which is multithreaded). As a percentage of overall single-threaded runtime, the portion of code that have little to no impact by multithreading is relatively small, thus overall application speedup can in fact approach n-1 times faster. On the other hand, in Examples 2 and 3, the percentage of overall single-threaded runtime is either spent on writing out the patch bytes (Example 2), which is file I/O bound, or in setting up the difference process, computing checksums, etc. (Example 3). In the cases of Examples 2 and 3, overall application speedup should occur, but cannot approach the degree of speedup attributed to the multithreaded algorithm portion of code. Because the RTPatch Server 5 test suite is comprised of real-world data with varying change patterns and similarities of file pairs, our goal with RTPatch Server 5 was to realize broad range of overall application speedup between 2x faster and n-1 times faster as the max attainable.
- Take advantage of Intel® Hyper-Threading when available. Rather than limit tests to use a maximum of "n" cores, where "n" is the number of physical cores available on the machine, our tests were designed to use "2n" when Hyper-Threading is detected. As the results below show (dual quad core, i.e., 8 physical cores), setting a thread count to twice the number of physical cores yielded faster results, indicating that hyperthreading provided measurable benefits.
::Environment and Tools
Compiler: Visual Studio 2005, Microsoft
Language: C, assembly code
UnThread™: A new parallel task API for use with CPU-bound applications. UnThread is a product of Postulate5, a division of Pocket Soft.
Hardware/OS: Windows 7 Ultimate, x64 Edition, dual Intel Xeon Quad Core @ 3.33Ghz (W5590 chips; 8 core total, w/HT enabled), 32GB RAM, Seagate 15K.6 SAS hard drive.
::Results, Application Speedup
Pocket Soft's test suite is gathered from real-world applications of its byte-level differencing products. Note in particular that overall patch compression was not the main focus of these test results. Since RTPatch Server includes a range of speed and resource usage options, optimizing patch size for a file pair is a separate topic, and outside the scope of this document. For more details on resource usage and efficiency, as well as patch compression results, please contact Pocket Soft.
Several file pairs of note are listed below showing application speedup results, on an Intel dual quad-core machine (8 physical cores total):
Description: US Postal Service Data
Description: Structured Data File (proprietary)
Description: Japanese Map Data
* 16 cores used to take advantage of Intel Hyper-Threading (see Goal #4 above).
The Japanese city map data results are of particular note, as it is this file pair that was a primary driving factor in creation of a multi-core version of RTPatch Server. Provided by an existing Fortune 500 customer, this data file suffered from poor simple compression results (less than 30%), as well as exceedingly long runtime in the single threaded versions of RTPatch Server. Though an optimal patch output of 77.8% compression is achievable with RTPatch Server, the results shown above are presented both to show variation, as well as for test expediency – the 77.8% compression case also benefited significantly from new version 5, yielding a 7.26x faster speedup at 16 cores, thus reducing the build time to several hours, down from over a day when using a single core.
For more statistics on any file set, please contact Pocket Soft.
::Summary
Conversion of RTPatch Server to a multithreaded application met or exceeded all goals:
- CPU Utilization. Given sufficiently sized inputs and using all available cores, CPU utilization plateaus of 99% are common, with an average CPU utilization during the differencing phase, exceeding 90%.
- No Loss of Patch Efficiency (i.e., patch size). Less than a +/- 1% variation in patch size was observed in the general case, when comparing single threaded differencing to multithreaded differencing.
- Overall Application Speedup. Overall speedup for all file sets averaged 3.78x faster at full CPU utilization, versus the single-threaded case. In typical file updates, differencing GB+ sized files with significant similarities (i.e., approximately 90% similar at the byte-level), RTPatch Server 5 obtained a typical 7x faster build times on an 8 core system, achieving the maximum stated speedup goal of n-1 times faster.
- Take advantage of Intel® Hyper-Threading when available. The max speedup that enabled RTPatch Server to achieve Goal #3 was due in part to the use of Hyper-Threading on the test machine. The extra threads served as a "slack threads" to utilize spare cycles on otherwise idle cores.
::What's Next
Nehalem EX features chips with up to 8-cores per processor. Our expectation is that RTPatch Server 5 will continue to scale on Nehalem EX's 8-core processor. We plan to test a dual 8-core Nehalem EX, providing 16 cores (32 with hyperthreading).
Next up for RTPatch Server 5, is to confirm and adjust if necessary, to ensure that n-1 speedup and 90% CPU utilization is maintained on the pending next generations of developer workstations and servers, such as Intel's 48-core cloud chip. Our goal in RTPatch Server is to continue to scale to n-1 peak speedups, until the law of diminishing returns sets some limit.
RTPatch Server 4 supports Windows and Linux/Unix. RTPatch Server 5 currently supports Windows only, due to availability of a required library on Windows only. A version of our new multi-core development library for Linux/Unix is planned for a 2010 release, at which time RTPatch Server 5 will include multi-core support for Linux/Unix as well.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
|