Using Inline::C to Speed Up Perl
Revision 1.0 2024-08-29
Perl makes it easy to call functions written in C using the Inline::C module. This can greatly speed up a Perl script, particularly one that performs mathematical calculations. As is usually the case with scripting languages, Perl is convenient for reading and writing stored data but relatively slow at numerical calculation. A common pattern in scientific computing with Perl is therefore to use Perl to read data files or query a database and pre-process the data, crunch the numbers with functions written in C, then use Perl again to post-process and write out the results. It is also useful to prototype and develop a function in Perl, then port it to C, and potentially compare different implementations. To illustrate and investigate this approach, this report uses Perl and Inline::C to develop different implementations of Franke's Function, and uses the Benchmark module to compare their speeds.
Implementation
Franke's Function (https://www.sfu.ca/~ssurjano/franke2d.html) is a mathematical function in two dimensions often used as a test function in interpolation and regression problems, and in other computational mathematics problems. It is not particularly onerous to implement in code, but it does require the evaluation of four exponential functions per point, which can make it relatively slow when evaluated many times. In the following tests, Franke's Function is evaluated at every point of a 2D grid formed from a set of X values and a set of Y values. Franke's Function is implemented in the following ways:
1. A Perl implementation which uses Perl loops to iterate over the grid points, and Perl code to evaluate Franke's Function at each point;
2. A hybrid Perl-C implementation which uses Perl loops to iterate over the grid points, and a C function to evaluate Franke's Function at each point;
3. A C implementation which receives all the X and Y points from Perl and uses C to iterate over the grid points and evaluate Franke's Function at each point;
4. A C implementation in the same form as implementation 3, but using C intrinsics to access the AVX registers in an attempt to optimize the inner loop. Working with doubles, this allows four (X,Y) points to be evaluated at a time. The Sleef library (https://sleef.org) is used to vectorize the exponential function; however, no attempt is made to align memory;
5. A C implementation in the same form as implementation 4, but using aligned memory.
It was expected that the pure C implementations would be faster than the Perl and hybrid Perl-C implementations. However, with optimizations enabled in the C compiler, it was unknown whether vectorizing the inner loop by hand would have any effect; likewise, it was unknown whether using aligned memory, as best practice recommends, would increase the speed.
Testing
Franke's Function is typically evaluated over the range x,y ∈ [0,1]. Two tests were run: the first with Benchmark performing 1000 iterations of each implementation on a grid of 1023 X points and 767 Y points equally spaced within the range, and the second performing 2000 iterations on a grid of 1503 X points and 1503 Y points equally spaced within the range. These numbers of X and Y points are not multiples of four, so they force the vectorized code to handle tail elements. As such they represent a kind of "worst case" for that code, although the tail handling is unlikely to have a significant overall effect.
The Benchmark module was used to compare the relative speeds of the above implementations. The platform was a t3.large AWS EC2 instance running Amazon Linux. The system Perl reported its version as v5.32.1, and Inline::C used gcc with the -O3 optimization flag to compile the C code. The final code used is in the listing below.
Results
1000 iterations at 1023x767
Implementation | Rate | Perl | Perl and C | C no AVX | C/AVX with alignment | C/AVX no alignment |
---|---|---|---|---|---|---|
Perl | 0.657/s | -- | -91% | -96% | -97% | -97% |
Perl and C | 6.92/s | 954% | -- | -60% | -69% | -69% |
C no AVX | 17.4/s | 2545% | 151% | -- | -23% | -23% |
C/AVX with alignment | 22.5/s | 3334% | 226% | 30% | -- | -1% |
C/AVX no alignment | 22.7/s | 3355% | 228% | 31% | 1% | -- |
2000 iterations at 1503x1503
Implementation | Rate | Perl | Perl and C | C no AVX | C/AVX no alignment | C/AVX with alignment |
---|---|---|---|---|---|---|
Perl | 0.234/s | -- | -90% | -96% | -97% | -97% |
Perl and C | 2.43/s | 941% | -- | -60% | -69% | -69% |
C no AVX | 6.07/s | 2500% | 150% | -- | -22% | -23% |
C/AVX no alignment | 7.80/s | 3238% | 221% | 28% | -- | -1% |
C/AVX with alignment | 7.89/s | 3281% | 225% | 30% | 1% | -- |
The relative speed differences were very similar in both sets of results. In terms of raw points evaluated, the pure Perl implementation computes approximately 520,000 points per second and the vectorized C approximately 17,800,000 per second; as expected, the pure Perl implementation was quite slow. Merely implementing Franke's Function in C gave a speed-up of about 10 times, and moving the loops into C as well gave a speed-up of about 26 times, clearly indicating the value of writing the function in C. Vectorizing the inner loop also had clear benefits: the C functions using the AVX registers ran roughly 30% faster than the non-vectorized C and some 34 times as fast as the original Perl code. Using aligned memory, however, did not appear to pay off: the two AVX variants were within about 1% of each other, and which one came out ahead differed between the two tests. That might be due to the way the functions are tested, and further investigation is required.
Conclusions
Replacing mathematical Perl code with C code in Perl programs can result in significant performance increases, and is easy to achieve using the Inline::C Perl module. Sending arrays of data from Perl to C (and receiving them back again) is also very straightforward thanks to the Perl API (https://perldoc.perl.org/perlapi) and can give further speed increases where loops are required. Although largely a C implementation issue, using C intrinsics to vectorize functions can result in further speed increases, but testing is required and it is difficult to know in advance whether it will be worth the effort when compiler optimizations are available.
A recent concern is the energy usage of code: code written in interpreted languages consumes more energy to run and is therefore potentially worse for the environment. C is listed as the most energy-efficient programming language, and Perl appears to be second-worst (behind Python). However, as others have pointed out, overall development and maintenance times when using high-level languages like Perl can be much shorter, saving on energy costs. Using Inline::C to write the computationally intensive parts of a Perl program in C can provide the best of both worlds, allowing programmers to soothe any anxieties they might have about using their CPU resources in a sub-optimal manner.