#$Id: hackingeupxs.pod 1367 2010-06-09 23:26:41Z jimk $

=head1 Hacking on Ancient Perl: Corehacking and the Case of ExtUtils::ParseXS

=head2 Introduction

At last year's YAPC in Pittsburgh, Chip Salzenburg and others issued a
call for more hacking on the Perl 5 source code. That was soon extended
to a call for hacking on the Perl modules that accompany the core,
particularly the so-called Perl toolchain modules like Module::Build and
CPAN.pm.

In July 2009, David Golden, who had become co-maintainer of several of
these dual-life distributions, asked me to take a look at one of them:
ExtUtils::ParseXS. David wondered whether I could get ParseXS to better
reflect Perl best practices. In particular, he wanted ParseXS to run
under C<use strict>.

David asked me to take this on because six years ago I had participated
in the Phalanx refactoring and testing project. More recently, I
refactored many of the Parrot project's configuration and build
programs. I accepted.

This talk is a report on my progress to date: what we have been able to
accomplish and what we have I<not> been able to accomplish as well. We
will also see some techniques for refactoring legacy code still used in
production.

=head2 What Is XS?

XS is a language for writing extensions to Perl 5. At the user level
these extensions typically appear as subroutines in packages such as
List::Util. They are not implemented in Perl. They are written in XS,
then translated into C source code, and finally compiled into object
code. XS provides mappings between Perl internal data types and C data
types.

ExtUtils::ParseXS is a Perl 5 module that takes a file written in XS as
input and produces a C source code file as output. Depending on whether
you are using ExtUtils::MakeMaker or Module::Build, F<make> or F<Build>
then invokes your C compiler to generate the object code. To be more
precise, your build program invokes a program called F<xsubpp> to parse
and compile the XS file.

As far as I can tell, F<xsubpp> was written by Larry Wall and introduced
with Perl 5 itself. In the early part of this decade Ken Williams took
the guts of F<xsubpp> and placed them in a package called
ExtUtils::ParseXS. F<xsubpp> is now simply a wrapper around ParseXS's
most important subroutines. David Golden became co-maintainer of this
distribution in 2009.

Notwithstanding this history, if you were to look at, say, the
F<xsubpp> that shipped with Perl 5.004 in 1996 and compare it to the
equivalent code that shipped with Perl 5.12 in April of this year, you
would see that it is very much the same. It is truly Ancient Perl. In
fact, it was not until December 2007, when Perl 5.10 was released, that
the modularized version of this code first shipped with the Perl core
distribution.

=head2 Standards and Objectives for Hacking on ExtUtils::ParseXS

When David Golden asked me to hack on ParseXS, he mentioned that many
core hackers and Perl 5 porters felt that all Perl code that shipped in
the core distribution ought to run under C<use strict>; ParseXS did not.
Other than that he gave me no specific directives, so I interpreted his
request more broadly: I<bring ParseXS up to the standards of modern Perl
code.> For the purpose of this talk, such code should:

=over 4

=item * C<use strict>

Which implies: No global variables.

=item * Have tight scoping

Variables should be declared so that they live in the smallest possible
scopes. Where a variable can be declared as a lexical or C<my>
variable, it should be. Otherwise it will have to be declared as a
package global or C<our> variable.

=item * Have self-documenting symbol names

The names of variables should describe what they do and should be
distinct from one another.

=item * Have fully encapsulated subroutines

All variables declared outside a subroutine but used within it should be
explicitly passed in its argument list. (Partial exception: closures.)
We can then extract the subroutine into a separate package, where it can
be called in test suites.

=item * Have skimmable code

Michael Schwern and others have argued that, wherever possible, code
should be written in blocks small enough for a human to take in at a
single glance.

=item * Have high test coverage

The extent to which the test suite exercises the code should be measured
by coverage tools such as Devel::Cover. Wherever possible coverage
should be maximized at the subroutine, statement, branch and condition
level.

=item * Have thorough documentation of interface

All publicly available subroutines should be documented with respect to
their purpose, arguments, return values and side effects.

=back

The payoff for writing Perl code which meets these standards is that it
will work as documented. And you won't have to be a guru to read it or
to have a good idea of what it does.

If these standards apply to all contemporary Perl code, they ought to
apply to the Perl 5 toolchain as well. XS is the canonical way of
extending Perl. If we want to improve it or use the other ways of
extending Perl, then we have to clearly understand what ParseXS is
doing. That's the rationale for bringing ParseXS up to modern Perl
standards, and that's why I've been working on it since last July.

=head2 ExtUtils::ParseXS: The Reality

"The horror! The horror!"

The versions of ExtUtils::ParseXS shipped with Perl 5.10 and 5.12 meet
few of the standards of modern Perl. Older versions of F<xsubpp> were
even worse. For instance:

=over 4

=item * Global variables

By my count, the version of F<xsubpp> that shipped with Perl 5.6.0 had
113 global variables. It follows by definition that the program could
not run under C<use strict>. Its variables were not tightly scoped, and
most of its subroutines were not fully encapsulated.

The modularization begun by Ken Williams early in this decade improved
matters somewhat. For example, Ken was able to move 56 of those global
variables into a C<use vars> statement. However, he also recognized
that the day when ParseXS would run under C<use strict> was a long way
off:

    # use strict;  # One of these days...

=item * Confusing variable names

Among those 113 global variables in the 5.6.0 version of F<xsubpp> were
these:

    $Interfaces
    $interface
    $interface_macro
    $interface_macro_set
    %Interfaces

... as well as these:

    $XsubAliases
    %XsubAliasValues
    %XsubAliases

=item * Non-skimmable code

The 5.6.0 version of F<xsubpp> contained one loop 533 lines long. That
has grown to 566 lines in the equivalent code in the version of
ExtUtils::ParseXS that shipped recently with 5.12.0. That loop is
contained in a subroutine that is 1025 lines long.

=item * Inadequate test coverage

There was no unit testing of F<xsubpp> in earlier versions of Perl.
Some testing was added in 5.10 and 5.12. But because little of the code
has been encapsulated into subroutines, there is little unit testing --
only some overall functional testing.

=back

=head2 Development of Tools for Use in Refactoring

I began work on ExtUtils::ParseXS in July 2009. I set up a Yahoo! Group
for a mailing list and other services. David Golden had pulled the code
into a F<git> repository. My progress was limited at first because I
was intimidated as much by F<git>'s learning curve as by the complexity
of ParseXS itself. I got help with F<git> on IRC I<#corehackers> from
several people there.

But the main problem I faced was the difficulty of proving that any
changes I made in ParseXS would do no harm.

When I am writing code from scratch, I write unit tests which exercise
every statement in the code. The thoroughness of the test coverage all
but guarantees that when I break something, I learn about it fast. But
when I am refactoring legacy code -- particularly legacy code with
inadequate test coverage -- I face an additional burden: I have to
prove that my refactorings don't cause damage in places where the code
is currently used in production. In this case, I<production> means:
"Others already depend on it."

I faced a similar challenge three years ago when I was refactoring the
Parrot build tools.
There, I could always tell when my refactorings were wrong because such
errors would simply cause Parrot's F<make> to fail and the F<parrot>
executable would not get built.

By August I recognized that any revisions I made to ParseXS would have
to be backward compatible with all the XS code that ships with the Perl
core or that is found on CPAN. I recognized that testing against a
I<minicpan> repository would be helpful. But it was not until February
that I realized I still had enough space left on this 6-year-old iBook
G4 to create a minicpan.

Once I had my minicpan, I had to figure out how to use it to test
ParseXS. At the February meeting of Perl Seminar New York, we had a
collective hacking session aimed at improving techniques for
I<visiting> CPAN distributions and learning something interesting about
them. The results were two new CPAN-visitation distributions, one by
David Golden and one by me called CPAN::Mini::Visit::Simple.

I first used CPAN::Mini::Visit::Simple to identify distributions on CPAN
that contained XS code. By late February I had identified approximately
1800 such distributions -- approximately one-tenth of all distributions
on CPAN. I then visited these distributions to learn which used
ExtUtils::MakeMaker as their build tool and which used Module::Build.
(MakeMaker is still used about 10 times as often as Module::Build.)

I then attempted to build all the CPAN distributions that contained XS
and that built with MakeMaker. Over 600 distributions built
successfully. The majority did not, mainly because of missing
prerequisites: either CPAN distributions I had not installed or C
libraries for which the XS modules provide Perl interfaces.

I hypothesized that if the C source code files built by running those
600+ CPAN distributions through my revised ParseXS were (with the
exception of whitespace) identical to those built by running those same
distributions through the core version of ParseXS, I would be doing no
harm.

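The comparison at the heart of that hypothesis can be sketched in a few
lines of Perl. This is only an illustration, not the code I actually
ran: the subroutine names C<normalize_whitespace> and
C<same_apart_from_whitespace> and the sample C fragments are
hypothetical, and the sketch compares strings in memory rather than
files on disk. It assumes that "identical except for whitespace" means
"identical once every run of whitespace is collapsed to one space."

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace to a single
# space and trim both ends, so only non-whitespace differences remain.
sub normalize_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;
    return $text;
}

# Hypothetical helper: true if two generated C sources are identical
# apart from whitespace.
sub same_apart_from_whitespace {
    my ($old_c, $new_c) = @_;
    return normalize_whitespace($old_c) eq normalize_whitespace($new_c);
}

# Made-up stand-ins for the C generated by the core and revised ParseXS.
my $from_core    = "int\nadd(int a, int b)\n{\n    return a + b;\n}\n";
my $from_revised = "int add(int a, int b)\n{\n\treturn a + b;\n}\n";
my $broken       = "int add(int a, int b)\n{\n\treturn a - b;\n}\n";

print "revised vs. core: ",
    (same_apart_from_whitespace($from_core, $from_revised)
        ? "no harm done" : "DIFFERENT"), "\n";
print "broken vs. core:  ",
    (same_apart_from_whitespace($from_core, $broken)
        ? "no harm done" : "DIFFERENT"), "\n";
```

Collapsing whitespace rather than deleting it keeps the check strict:
C<int x> and C<intx> still count as different.
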
I wrote a program which, in effect, functioned as a test harness:

=over 4

=item *

Start with a list of over 600 distributions with XS known to build
correctly on this machine.

=item *

Visit each distribution, first building with the existing version of
ParseXS. Identify the F<.c> files so created.

=item *

Next, build with the revised version of ParseXS.

=item *

Finally, F<diff> the resulting F<.c> files against those built with the
existing ParseXS. The result should be: no differences at all.

=back

This test harness enabled me to refactor with confidence and gave me
some space in which to write more unit tests.

=head2 My Refactorings: C<use strict>

My refactoring included the following:

=over 4

=item * Eliminate remaining global variables

First, there were over 15 global variables not found in the C<use vars>
statement. I converted that C<use vars> list to C<our> variables.
(That was okay because we're only going to support the toolchain for
Perl 5.6 and later.) I then redefined each remaining global as C<our>.
After each such conversion, I ran the refactored code over the minicpan
modules (or a randomly selected subset thereof). If the C code so
generated was unchanged, I considered the redefinition a success.

=item * Identify possible lexically-scoped variables

Some of the C<our> variables didn't really need to be package globals
because they were used only within a single lexical scope. I redefined
them as C<my> variables.

=item * Rename variables for distinctiveness and self-documentation

I renamed many variables, but there's more work to be done. With so
many variables needed to keep state, we really need to describe their
purposes in a glossary.

=item * Encapsulate code into subroutines

Wherever I could encapsulate a block of code into a subroutine, I did
so. Then I moved the subroutines to a separate package. I documented
them in that package and wrote tests for them in files organized around
individual subroutines.

=back

The result of these refactorings was that by March 20th, for the first
time, I was able to run ExtUtils::ParseXS under C<use strict>. I
thereby met the principal objective set by the Corehackers project.

=head2 My Refactoring: Beyond C<use strict>

However, I was not satisfied with simply getting ParseXS to obey
strictures. Based on my experience with comparable programs in the
Parrot project, I hypothesized that the state being kept in the over 70
C<our> variables could just as well be kept in a single variable: an
ExtUtils::ParseXS object. That object could then be passed to the over
30 internal subroutines. Since those subroutines would then presumably
be fully encapsulated methods, I could make them subject to unit tests
and thereby dramatically increase ParseXS's test coverage.

It almost worked. I can eliminate more than 90% of the C<our>
variables. But in this case 90% is not good enough.

The reason for this limitation is that ParseXS depends heavily on
so-called string evals.

    eval EXPR;

Some of these string evals permit the author of an F<.xs> file to write
XS code that contains what I would describe as C-ish strings with what
looks like a Perl scalar plunked down in the middle:

    C_ish_string_with_$var_inside

The XS author intends that when such a C-ish string is evaluated by a
string eval, the then-current content of a similarly named C<$var>
inside ParseXS will be written to the F<.c> source code file. But this
means that we are required to spell the variable as C<$var>. Something
like C<$self-E<gt>{var}> will not work here.

=head2 Conclusions

It may be that the best we can do with respect to improving
ExtUtils::ParseXS is to get it to run under C<use strict>. Transforming
ExtUtils::ParseXS into an object-oriented module may simply not be
possible. Before we make that determination, however, we would like to
hear from people more experienced with the Perl guts than we are. We
would also like to hear from people who have experience converting
string evals into other code such as closures.

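For anyone tempted to take up that invitation, here is a minimal sketch
of the string-eval constraint described above. It is not code from
ParseXS: the template text and the subroutine names
C<expand_with_named_variables> and C<expand_with_object> are
hypothetical, chosen only to show why the spelling C<$var> matters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a C-ish string taken from XS input.  Single
# quotes keep $var and $arg as literal text until the string eval runs.
my $template = '$var = (int)SvIV($arg)';

# State kept in variables spelled $var and $arg: the string eval can
# see them, so the template expands.
sub expand_with_named_variables {
    my ($text, $var, $arg) = @_;
    return eval qq{"$text"};
}

# State kept only inside an object: no $var or $arg is in scope, so
# under "use strict" the eval fails to compile and yields undef.
sub expand_with_object {
    my ($text, $self) = @_;
    return eval qq{"$text"};
}

my $expanded = expand_with_named_variables($template, 'count', 'ST(0)');
print "named variables: $expanded\n";

my $object = { var => 'count', arg => 'ST(0)' };
my $failed = expand_with_object($template, $object);
print "object:          ",
    (defined $failed ? $failed : "eval failed: $@"), "\n";
```

The first call prints C<count = (int)SvIV(ST(0))>. The second reports
a compile-time failure inside the eval, because the template still asks
for a variable literally named C<$var>.
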
More broadly speaking, we can say that ExtUtils::ParseXS can be made to
run under C<use strict> and that there are ways we can test elements of
the Perl 5 toolchain against all the code currently existing on CPAN.

Welcome back to the refactory!

=cut

# vim: tw=72 ts=2 sw=2 et: