#$Id: hackingeupxs.pod 1367 2010-06-09 23:26:41Z jimk $

=head1 Hacking on Ancient Perl: Corehacking and the Case of ExtUtils::ParseXS

=head2 Introduction

At last year's YAPC in Pittsburgh, Chip Salzenburg and others issued a
call for more hacking on the Perl 5 source code.  That was soon
extended to a call for hacking on the Perl modules that accompany the
core, particularly the so-called Perl toolchain modules like
Module::Build and CPAN.pm.

In July 2009, David Golden, who had become co-maintainer of several of
these dual-life distributions, asked me to take a look at one of them:
ExtUtils::ParseXS.   David wondered whether I could get ParseXS to
better reflect Perl best practices. In particular, he wanted ParseXS to
run under C<use strict;>.  David asked me to take this on because six
years ago I had participated in the Phalanx refactoring and testing
project.  More recently, I refactored many of the Parrot project's
configuration and build programs.

I accepted.  This talk is a report on my progress to date:  what we have
been able to accomplish and what we have I<not>.  We will also see some
techniques for refactoring legacy code still used in production.

=head2 What Is XS?

XS is a language for writing extensions to Perl 5.  At the user level
these extensions typically appear as subroutines in packages such as
List::Util.  They're not implemented in Perl.  They are written in XS,
then translated into C source code, and finally compiled into object
code.  XS provides mappings between Perl internal data types and C data
types.

ExtUtils::ParseXS is a Perl 5 module that takes a file written in XS as
input and produces a C source code file as output.  Depending on whether
you are using ExtUtils::MakeMaker or Module::Build, F<make> or
F<Build> then invokes your C compiler to generate the object code.

To be more precise, your build program invokes a program called F<xsubpp> to
parse and compile the XS file.  As far as I can tell, F<xsubpp> was
written by Larry Wall and introduced with Perl 5 itself.  In the early
part of this decade Ken Williams took the guts of F<xsubpp> and placed
them in a package called ExtUtils::ParseXS.  F<xsubpp> is now simply a
wrapper around ParseXS's most important subroutines.  David Golden
became co-maintainer of this distribution in 2009.

Notwithstanding this history, if you were to look at, say, the F<xsubpp>
that shipped with Perl 5.004 in 1997 and compare it to the
F<lib/ExtUtils/ParseXS.pm> that shipped with Perl 5.12 in April of this
year, you'd see that the two are very much the same.  It is truly Ancient
Perl.  In fact, it was not until December 2007, when Perl 5.10 was
released, that the modularized version of this code first shipped with
the Perl core distribution.

=head2 Standards and Objectives for Hacking on ExtUtils::ParseXS

When David Golden asked me to hack on ParseXS, he mentioned that many
core hackers and Perl 5 porters felt that all Perl code that shipped in
the core distribution ought to run under C<use strict;>; ParseXS did
not.  Other than that he gave me no specific directives, so I
interpreted his request more broadly:  I<All the Perl code that ships
with core ought to be written in whatever is considered 'modern Perl' as
of its release date.>  For the purpose of this talk, such code should:

=over 4

=item * C<use strict;>

Which implies:  No global variables.

=item * Have tight scoping

Variables should be declared so that they live in the smallest possible
scopes.  Where a variable can be declared as a lexical or C<my> variable
it should.  Otherwise it will have to be declared as a package global
or C<our> variable.

=item * Have self-documenting symbol names

The names of variables should describe what they do and should be
distinct from one another.

=item * Have fully encapsulated subroutines

All variables declared outside a subroutine but used within it should be
explicitly passed in its argument list.  (Partial exception:  closures.)
We can then extract the subroutine into a separate package. The
subroutine can then be called in test suites.

=item * Have skimmable code

Michael Schwern and others have argued that, wherever possible, code
should be written in blocks small enough for a human to take in at a
single glance.

=item * Have high test coverage

The extent to which the test suite exercises the code should be measured
by coverage tools such as Devel::Cover.  Wherever possible coverage
should be maximized at the subroutine, statement, branch and condition
levels.

=item * Have thorough documentation of interface

All publicly available subroutines should be documented with respect to
their purpose, arguments, return values and side effects.

=back

The payoff to writing Perl code which meets these standards is that it
will work as documented.  And you won't have to be a guru to read it or
to have a good idea as to what it does.
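
As a small illustration of the encapsulation standard, here is a sketch
contrasting a subroutine that silently reads a package global with one
that receives everything it needs in its argument list.  The names
(C<$Prefix>, C<c_name_global>, C<c_name>) are hypothetical and are not
taken from ParseXS:

  use strict;
  use warnings;

  # Hypothetical "before": the result depends on a package global
  # that is declared and modified somewhere else in the file.
  our $Prefix = 'XS_';
  sub c_name_global {
      my ($name) = @_;
      return $Prefix . $name;
  }

  # Hypothetical "after": every input arrives in the argument list,
  # so the subroutine can move to another package and be unit-tested.
  sub c_name {
      my ($prefix, $name) = @_;
      return $prefix . $name;
  }

  print c_name('XS_', 'Demo_sum'), "\n";

Because C<c_name> depends only on its arguments, a test file can call it
directly without first arranging any global state.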

If these standards apply to all contemporary Perl code, they ought to
apply to the Perl 5 toolchain as well.  XS is the canonical way of
extending Perl.  If we want to improve it or use the other ways of
extending Perl, then we have to clearly understand what ParseXS is
doing.  That's the rationale for bringing ParseXS up to modern Perl
standards, and that's why I've been working on it since last July.

=head2 ExtUtils::ParseXS:  The Reality

"The horror!  The horror!"

The versions of ExtUtils::ParseXS shipped with Perl 5.10 and 5.12 meet
few of the standards of modern Perl.  Older versions of F<xsubpp> were
even worse.  For instance:

=over 4

=item * Global variables

By my count, the version of F<xsubpp> that shipped with Perl 5.6.0 had
113 global variables.  It follows by definition that the program
could not run under C<use strict;>.  Its variables were not tightly scoped,
and most of its subroutines were not fully encapsulated.

The modularization begun by Ken Williams early in this decade improved
matters somewhat.  For example, Ken was able to move 56 of those global
variables into C<use vars>.  However, he also recognized that the day
when ParseXS would run under C<use strict;> was a long way off:

  # use strict;  # One of these days...

=item * Confusing variable names

Among those 113 global variables in the 5.6.0 version of F<xsubpp> were
these:

  $Interfaces
  $interface
  $interface_macro
  $interface_macro_set
  %Interfaces

... as well as these:

  $XsubAliases
  %XsubAliasValues
  %XsubAliases

=item * Non-skimmable code

The 5.6.0 version of F<xsubpp> contained one loop 533 lines long.
That's grown to 566 lines in the equivalent code in the version of
ExtUtils::ParseXS that shipped recently with 5.12.0.  That loop is
contained in a subroutine that is 1025 lines long.

=item * Inadequate test coverage

There was no unit testing of F<xsubpp> in earlier versions of Perl.
Some testing has been done in 5.10 and 5.12.  But because
little of the code has been encapsulated into subroutines, there is
little unit testing -- only some overall functional testing.

=back

=head2 Development of Tools for Use in Refactoring

I began work on ExtUtils::ParseXS in July 2009.  I set up a Yahoo! Group
for a mailing list and other services
(F<http://tech.groups.yahoo.com/group/parsexs/>).  David Golden had
pulled the code into a F<github> repository.  My progress was limited at
first because I was intimidated as much by F<git>'s learning curve as by
the complexity of ParseXS itself.  I got help with F<git> on IRC
I<#corehackers> from I<mst>, I<apeiron>, I<mikecanz> and, of course,
I<xdg>.

But the main problem I faced was the difficulty of proving that any
changes I made in ParseXS would do no harm.  When I am writing code from
scratch, I write unit tests which exercise every statement in the code.
The thoroughness of the test coverage all but guarantees that when I
break something, I learn about that fast.

But when I am refactoring legacy code -- particularly legacy code with
inadequate test coverage -- I face an additional burden:  I have to
prove that my refactorings don't cause damage in places where the code
is currently used in production.  In this case, I<production> means:
"Others already depend on it."

I faced a similar challenge three years ago when I was refactoring the
Parrot build tools.  There, I could always tell when my refactorings
were wrong because such errors would simply cause Parrot's F<make> to
fail and the F<parrot> executable would not get built.

By August I recognized that any revisions I would make to ParseXS would
have to be backward compatible with all the XS code that ships with
the Perl core or that is found on CPAN.  I recognized that testing
against a I<minicpan> repository would be helpful.  But it was not until
February that I realized I still had enough space left on this
6-year-old iBook G4 to create a minicpan.

Once I had my minicpan, I had to figure out how to use it to test
ParseXS.  At the February meeting of Perl Seminar New York, we had a
collective hacking session aimed at improving techniques for I<visiting>
CPAN distributions and learning something interesting about them
(F<http://tech.groups.yahoo.com/group/perlsemny/message/948>).  The
results were two new CPAN-visitation distributions, one by David Golden
and one by me called CPAN::Mini::Visit::Simple
(F<http://search.cpan.org/dist/CPAN-Mini-Visit-Simple/>).  

I first used CPAN::Mini::Visit::Simple to identify distributions on CPAN
that contained XS code.  By late February I had identified approximately
1800 such distributions -- approximately one-tenth of all
distributions on CPAN.  I then visited these distributions to learn which used
ExtUtils::MakeMaker as their build tool and which used Module::Build.
(MakeMaker is still used about 10 times as often as Module::Build.)

I then attempted to build all the CPAN distributions that contained XS
and that built with MakeMaker.  Over 600 distributions built
successfully.  Most did not, largely because of missing prerequisites,
whether those were uninstalled CPAN distributions or C libraries for
which XS modules provide Perl interfaces.

I hypothesized that if the C source code files built by running those
600+ CPAN distributions through my revised ParseXS were (with the
exception of whitespace) identical to those built by running those same
distributions through the core version of ParseXS, I would be doing no
harm.  I wrote a program which, in effect, functioned as a test harness:

=over 4

=item *

Start with a list of over 600 distributions with XS known to build correctly
on this machine.

=item *

Visit each distribution, first building with the existing version of
ParseXS.  Identify F<.c> files so created.

=item *

Next build with the revised version of ParseXS.

=item *

Finally, F<diff> the resulting F<.c> files against those built with
existing ParseXS.   The result should be:  no differences at all.

=back

This test harness enabled me to refactor with confidence and gave me
some space in which to write more unit tests.
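
The comparison step of such a harness might be sketched as follows.
This is not the program I actually ran; it is a minimal illustration
that assumes the two generated F<.c> files have already been read into
strings.  The subroutine names C<normalize_c> and C<same_c_output> are
hypothetical:

  use strict;
  use warnings;

  # Collapse every run of whitespace (including newlines) to a single
  # space and trim both ends, so that only non-whitespace differences
  # remain visible.
  sub normalize_c {
      my ($text) = @_;
      $text =~ s/\s+/ /g;
      $text =~ s/^ | $//g;
      return $text;
  }

  # True if the old and new C source differ only in whitespace.
  sub same_c_output {
      my ($old, $new) = @_;
      return normalize_c($old) eq normalize_c($new);
  }

  my $old = "int x;\n\n  return x;\n";
  my $new = "int x;\nreturn x;";
  print same_c_output($old, $new) ? "unchanged\n" : "DIFFERENT\n";

A sketch like this treats any change in the amount of whitespace as
harmless, which is a looser test than a plain F<diff> but matches the
"identical except for whitespace" hypothesis stated above.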

=head2 My Refactorings

My refactoring included the following:

=over 4

=item * Eliminate remaining global variables

First, there were over 15 global variables not found in the C<use vars> statement.
I converted that C<use vars> list to C<our> variables.  (That was okay
because we're only going to support the toolchain for Perl 5.6 and
later.)  I then redefined each remaining global as C<our>.  After each
such conversion, I ran the refactored code over the minicpan modules (or a
randomly selected subset thereof).  If the C code so generated was
unchanged, I considered the redefinition a success.

=item * Identify possible lexically-scoped variables

Some of the C<our> variables didn't really need to be package globals
because their scope of operation was confined to lexical scopes.  I
redefined them as C<my> variables.

=item * Rename variables for distinctiveness and self-documentation

I renamed many variables, but there's more work to be done.  With so many
variables needed to keep state, we really need to describe their
purposes in a glossary.

=item * Encapsulate code into subroutines

Wherever I could encapsulate a block of code into a subroutine, I did
so.  Then, I moved the subroutines to a separate package.  I documented them
in that package and wrote tests for them in files organized around individual
subroutines.

=back

The result of these refactorings was that by March 20th, for the
first time I was able to run ExtUtils::ParseXS under C<use strict;>.  I
thereby met the principal objective set by the Corehackers project.
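
The progression from C<use vars> to C<our> to C<my> described in the
first two items can be illustrated with a small sketch.  The packages
and the variable C<$line_count> are hypothetical stand-ins, not names
from ParseXS:

  use strict;
  use warnings;

  {
      package Demo::Stage1;
      # Older idiom: predeclare a package global with 'use vars'.
      use vars qw($line_count);
      $line_count = 0;
  }
  {
      package Demo::Stage2;
      # Perl 5.6 and later: the same package global declared with 'our'.
      our $line_count = 0;
  }
  {
      package Demo::Stage3;
      # The state is only needed inside one subroutine, so it becomes
      # a lexical that lives and dies within that scope.
      sub count_lines {
          my ($text) = @_;
          my $line_count = () = $text =~ /\n/g;
          return $line_count;
      }
  }

  print Demo::Stage3::count_lines("line one\nline two\n"), "\n";

All three stages compile under C<use strict;>.  Only the last one
removes the shared state altogether.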

=head2 My Refactoring:  Beyond C<use strict;>

However, I was not satisfied with simply getting ParseXS to obey
strictures.  Based on my experience with comparable programs in the
Parrot project, I hypothesized that the state being kept in the over 70
C<our> variables could just as well be kept in a single variable: an
ExtUtils::ParseXS object.  That object could then be passed to the over 30
internal subroutines.  Since those subroutines would then presumably be
fully encapsulated methods, I could make them subject to unit tests and
thereby dramatically increase ParseXS's test coverage.

It almost worked.  I can eliminate more than 90% of the C<our>
variables.  But in this case 90% is not good enough.

The reason for this limitation is that ParseXS depends heavily on
so-called string evals.

  eval EXPR;

Some of these string evals permit the author of an F<.xs> file to write
XS code that contains what I would describe as C-ish strings with what
looks like a Perl scalar plunked down in the middle:

  C_ish_string_with_$var_inside

The XS author intends that when such a C-ish string is evaluated by a
string eval, the then-current content of a similarly named C<$var> inside
ParseXS will be written to the F<.c> source code file.  But this means
that we are required to spell the variable as C<$var>.  Something like
C<$self-E<gt>{var}> will not work here.
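
The following sketch shows the problem in miniature.  It is an
illustration only: the two packages, the C<expand> subroutines and the
template string are hypothetical, and the real evals inside ParseXS are
more involved.  The first package keeps its state in C<our> variables;
the second keeps the same state in an object:

  use strict;
  use warnings;

  {
      package Demo::Globals;
      # State lives in package variables named exactly as the
      # template spells them.
      our ($arg, $var) = ('ST(0)', 'RETVAL');
      sub expand {
          my ($template) = @_;
          # Wrap the template in double quotes and let Perl interpolate.
          return eval qq{"$template"};
      }
  }
  {
      package Demo::Object;
      # The same state, but held in a hash-based object.
      sub new { return bless { arg => 'ST(0)', var => 'RETVAL' }, shift }
      sub expand {
          my ($self, $template) = @_;
          # No $arg or $var is in scope here, so under strictures the
          # eval fails to compile and returns undef.
          return eval qq{"$template"};
      }
  }

  my $template = 'sv_setiv($arg, (IV)$var);';

  my $from_globals = Demo::Globals::expand($template);
  print "$from_globals\n";

  my $from_object = Demo::Object->new->expand($template);
  print defined $from_object ? "$from_object\n" : "eval failed: $@";

In the first package the template expands as the XS author intended.
In the second, the values are available only as
C<$self-E<gt>{arg}> and C<$self-E<gt>{var}>, which the template knows
nothing about, so the eval cannot succeed.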

=head2 Conclusions

It may be that the best we can do with respect to improving
ExtUtils::ParseXS is to get it to run under C<use strict;>.
Transforming ExtUtils::ParseXS into an object-oriented module may simply
not be possible.

Before we make that determination, however, we would like to hear from
people more experienced with the Perl guts than we are.  We would also
like to hear from people who have experience converting string evals
into other code such as closures.

More broadly speaking, we can say that ExtUtils::ParseXS can be made to
run under C<use strict;> and that there are ways we can test
elements of the Perl 5 toolchain against all the code currently existing
on CPAN.

Welcome back to the refactory!

=cut

# vim: tw=72 ts=2 sw=2 et: