Pittsburgh Perl Workshop: The Joys of Open Source
November 8 2014
Once every year or so, I find myself struggling with a software development problem at my day job that I don't manage to get solved by the time I head out for the weekend on Friday night. But often putting some space and time between me and the job enables me to have flashes of insight into the problem. I often see that the problem is not just my problem, but could be any Perl developer's problem -- and sometimes even any developer's problem, regardless of the language he or she is working in. So if I solve the problem for myself, I can solve it for a whole lot of other people as well. The way I solve it for other people is simple: I put it on CPAN.
So it's often the case that I start writing a program or library on a Friday night, put its first version up on CPAN on Saturday morning, and continue refining it through the weekend. By Monday morning it's sufficiently refined, tested, debugged and documented that I can use it on the job. In that sense, such a program is a weekend wonder.
This evening I'm going to introduce you to several of these CPAN distributions I've written over the years. For each, I'll describe the original software development problem I was facing, give a high-level view of the CPAN distribution, and explain how it solved that problem.
None of these modules are ground-breaking. They are, by my own admission, modest modules aimed at solving rather narrowly defined problems in a satisfactory way. Don't get too hung up on the details; I'm not "teaching" these modules tonight. What I do hope you take away from tonight's presentation is a heightened awareness of where writing some modest but useful modules and sharing them can help you as a developer.
From 2006 to 2012 I worked at a large email service provider headquartered in New York City. The team on which I worked was responsible for receiving data files in custom formats transmitted to us by clients over SFTP and preparing that data for loading into a database via an API. We became very adept at elementary ETL (extract, transform and load). Usually we opened up the incoming file, read the header row, then iterated over each record to write a munged record to a new file which was sent to the API.
Sometimes, however, we received data files that had different formats. One such format had two different sections within the file:
The header consisted of key-value pairs which constituted, in some sense, metadata.
# comment
a=alpha
b=beta,charlie,delta
c=epsilon zeta eta
d=1234567890
e=This is a string
f=,
This particular header consists of key-value pairs delimited by = signs. The key is the substring to the left of the first delimiter. Everything to the right is part of the value.
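As a concrete sketch (an illustration, not the module's internal code), a header line such as b=beta,charlie,delta can be split into key and value with a two-field split:

# Split on the first delimiter only; everything to its right belongs to the value.
my ($key, $value) = split /\s*=\s*/, $line, 2;
# e.g. 'b=beta,charlie,delta' yields $key eq 'b', $value eq 'beta,charlie,delta'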
The body section consisted of data records, which were either delimited or fixed-width.
some,body,loves,me
I,wonder,wonder,who
could,it,be,you
This particular body consists of comma-delimited strings. Whether in the body or the header, comments begin with a # sign and are ignored.
The header and the body are separated by one or more empty records.
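Putting those pieces together, a complete input file looks like this -- header, an empty record, then the body:

# comment
a=alpha
b=beta,charlie,delta
c=epsilon zeta eta
d=1234567890
e=This is a string
f=,

some,body,loves,me
I,wonder,wonder,who
could,it,be,you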
Our instruction was to apply business rules to the metadata from the header to determine whether or not to proceed with processing the records in the body of the file. If certain criteria were not met after processing the header, we would simply close the file and ignore the body records. Only body records would ever be considered for presentation to the API.
Suppose you are told that you should proceed to parse the body if and only if the following conditions are met in the header:
There must be a metadata element keyed on d.
The value of metadata element d must be a non-negative integer.
There must be a metadata element keyed on f.
This file would meet all three criteria and the program would proceed to parse the three data records.
If, however, metadata element f were commented out:
#f=,
the file would no longer meet the criteria and the program would cease before parsing the data records.
What stumped me for a long time was how to read and parse some of the lines in the file in one manner, then parse the remaining lines of the file in a different manner. Sitting at my desk at work I couldn't clear my head enough to come up with the solution. It was only when I got home and relaxed that the solution came to me -- a solution which, by the end of the weekend, I had put up on CPAN as Parse::File::Metadata.
Tonight I'm only going to give you a high-level, over-simplified description of Parse::File::Metadata.
Setup
You set up the object by calling a constructor, which you provide with the following (assembled into a single sketch after this list):
A path to the file you want to process;
file => 'path/to/myfile',
A pattern describing how to split a header row on a delimiter; and
header_split => '\s*=\s*',
An array of rules -- each a code reference paired with a label -- which spell out whether, once the header rows have been read, you should proceed to process the body rows.
rules => [
    {
        rule  => sub { exists $metaref->{d}; },
        label => q{'d' key must exist},
    },
    {
        rule  => sub { $metaref->{d} =~ /^\d+$/; },
        label => q{'d' key must be non-negative integer},
    },
],
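Assembled, the setup looks roughly like the sketch below. This is from memory rather than copied from the module's synopsis -- in particular, the $metaref hashref (which the parser populates with the header's key-value pairs and which the rules close over) and the exact constructor arguments should be checked against the Parse::File::Metadata documentation.

use Parse::File::Metadata;

# Hashref the parser populates with the header's key-value pairs;
# the rules below close over it.
my $metaref = {};

my $self = Parse::File::Metadata->new( {
    file         => 'path/to/myfile',
    header_split => '\s*=\s*',
    metaref      => $metaref,
    rules        => [
        {
            rule  => sub { exists $metaref->{d}; },
            label => q{'d' key must exist},
        },
        {
            rule  => sub { $metaref->{d} =~ /^\d+$/; },
            label => q{'d' key must be non-negative integer},
        },
        {
            rule  => sub { exists $metaref->{f}; },
            label => q{'f' key must exist},
        },
    ],
} );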
Run
Once you have parsed the header, you write another code reference specifying how to process each record in the body of the file.
$dataprocess =
sub { my @fields = split /,/, $_[0], -1; print "@fields\n"; };
You pass that code reference to method process_metadata_and_proceed().
$self->process_metadata_and_proceed( $dataprocess );
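Assuming the parser hands each body record to that callback as its first argument (the $_[0] in the snippet above), the sample body shown earlier would produce this output:

some body loves me
I wonder wonder who
could it be you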
To refresh my memory as to how Parse-File-Metadata's development proceeded, I consulted the Changes file in the distribution.
I put the first version up on CPAN on a Sunday night.
0.01 Sun Jan 17 14:17:07 2010
- original version; created by ExtUtils::ModuleMaker 0.51
Over the next two weeks I refined its API and corrected problems that were showing up on CPANtesters reports.
0.04 Fri Jan 29 21:36:00 2010
- In tests, use 'File::Spec->catfile()' to create paths for files used
in testing (in hope that we get better CPANtesters results on Win32).
And there things sat for three months. The project at the day job which had inspired Parse-File-Metadata was stalled -- but at least that gave me time to persuade my co-workers that using this module was a good idea. At last the project moved forward. One of my co-workers caught some documentation errors which I then corrected on CPAN.
0.07 Fri May 14 21:32:58 EDT 2010
- Correct misleading documentation, per suggestion of Preston Cody.
So, by the point when we put Parse-File-Metadata into production, it had been improved as a result of both being put on CPAN and being reviewed by my colleague.
When I started my current job, I was far from fluent in SQL. I worked my way through a MySQL book in 2004, but from 2006 to 2012 I only had to write SQL about once a year. When I came to MediaMath, however, I had to write SQL from the get-go.
Toward the end of 2012 I had a big project which entailed a big overhaul of the mapping of advertising audience targets to regional codes. Some regions, such as the new nation of South Sudan, were coming into the regional coding system for the first time.
Top-Level Region              Code
Bonaire/Sint Eustatius/Saba   bq
South Sudan                   ss
Within those top-level regions, new regions were being introduced.
Region              Top-Level Region              Code
Bonaire             Bonaire/Sint Eustatius/Saba   98172
Saba                Bonaire/Sint Eustatius/Saba   98173
Sint Eustatius      Bonaire/Sint Eustatius/Saba   98174
Central Equatoria   South Sudan                   98175
Upper Nile          South Sudan                   98176
Some existing regions were getting new code numbers.
Region                         Code
Australian Capital Territory   72986
New South Wales                72987
Queensland                     72988
Tasmania                       72989
Victoria                       72990
And some new regions were being created within existing parent regions.
Region                   Top-Level Region
Espaillat                Dominican Republic
La Romana                Dominican Republic
La Vega                  Dominican Republic
Maria Trinidad Sanchez   Dominican Republic
Monsenor Nouel           Dominican Republic
Puerto Plata             Dominican Republic
Santiago                 Dominican Republic
Santo Domingo            Dominican Republic
I knew what the RegionalTargeting table currently looked like, and I knew what I wanted it to look like after the data migration. But I didn't know how to get from here to there. I was getting lost in a forest of sub-SELECTs and WHERE clauses.
As my deadline approached, I was getting more and more desperate. To boost my confidence (as much as anything) I decided to first solve the problem in Perl, then re-tackle it in SQL. If the RegionalTargeting table were a Perl data structure, what steps would I take to transform that data structure into a second Perl data structure? If I understood the logic in Perl, I hypothesized, I would have a better shot at reproducing it in SQL.
But first I had to get the data out of a PostgreSQL table and into a Perl data structure. I was already familiar with the excellent CPAN distribution Text-CSV, which dates to 1997 and has received many improvements over the years, including both XS and pure-Perl versions. I realized that I could use the psql copy command to save the RegionalTargeting table to a CSV file, i.e., to a plain-text file holding records of comma-separated values. The new regional targeting data arrived in various file formats, but I could convert those to plain-text CSV files as well.
Once all the data was in CSV format, I could convert it to a Perl data structure and manipulate it to my heart's content. Once data munging was complete, I could transform the Perl data structure into a new CSV file, load that to a temporary table in Postgres and compare the two tables to see if I had performed the data migration correctly.
I realized, however, that for the purpose of writing little programs that would serve as my development tools, Text-CSV's interface was somewhat overkill. I simply wanted to say:
This CSV file represents a PostgreSQL table.
The table's primary key is: _____.
Turn this file into a hash.
This was the impetus for the creation of CPAN module Text::CSV::Hashify. Text::CSV::Hashify has a modest object-oriented interface, but it has an even simpler functional interface:
use Text::CSV::Hashify;
$hash_of_hashes = hashify('/path/to/file.csv', 'primary_key');
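To give a flavor of how I used it -- this is a sketch with illustrative file and column names, not the actual migration code -- once the table was hashified I could munge records as ordinary Perl data, then write the result back out with Text-CSV for loading into a temporary table:

use Text::CSV;
use Text::CSV::Hashify;

# 'code' stands in for whatever column is the table's primary key.
my $hoh = hashify('/path/to/regional_targeting.csv', 'code');

# Munging is ordinary hash manipulation.
for my $code ( keys %{$hoh} ) {
    $hoh->{$code}{region} =~ s/\s+\z//;   # e.g., trim trailing whitespace
}

# Write the munged data to a new CSV file.
my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );
open my $OUT, '>', '/path/to/new_regional_targeting.csv' or die $!;
$csv->print( $OUT, [ qw( code region top_level_region ) ] );
for my $code ( sort keys %{$hoh} ) {
    my $rec = $hoh->{$code};
    $csv->print( $OUT, [ $code, $rec->{region}, $rec->{top_level_region} ] );
}
close $OUT or die $!;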
Once I had the data in a hash, I could transform it one step at a time until I got another hash which I thought represented the desired final state of the regional targeting codes. I could then iterate through that hash and print it to a nicely formatted plain-text file. I could then send that plain-text file to the Product Manager supervising the project and ask, "Is this the way the RegionalTargeting codes should wind up?"
So, in effect, solving the problem first through Perl not only enabled me to understand the problem more easily; it also provided a basis for validation of my approach -- what in my days in the printing industry we would simply call proofreading.
Once the Product Manager and I were in agreement as to the desired final state of the data, I could then go back to the SQL. Where in Perl I had used hashes to hold the result of each step in the data migration process, in SQL I used temporary tables. When I got to the point of writing a temporary table which had the same records as the final Perl hash, I had essentially solved the problem. At that point, I was able to go back over my SQL, optimize it to make its logic more SQL-ish and prepare the patch which was finally applied to the production database to effect the data migration.
At MediaMath, before merging branches into trunk, we are required to run our tests through a smoke server as well as pass human code review. Even before I think a particular branch is ready for code review, I like to fire off a smoke test, for two reasons:
so as not to tie up my laptop while I'm continuing to code; and
to see if my changes break any tests in files I don't think are germane to those changes.
As Chris Masto described at a New York Perlmongers meetup in 2013, we use Jenkins to run the smoke tests. Several months ago Jenkins was giving us a lot of problems. Test suite runs that had previously taken 15 to 20 minutes were now taking 40 -- and sometimes didn't complete at all. There were also cases where tests would pass on our laptops but tickle something on Jenkins that would result in failures. In either case, I, for one, needed to get results out of Jenkins faster than Jenkins was then capable of giving me.
Now, in many development environments, when you push a branch to a central location, smoke tests are triggered if and only if you name the branch in a certain manner. For example, in the Perl 5 core distribution, where we use git for version control, if you push a branch as follows:
git push origin mybranch:smoke-me/jkeenan/mybranch
... that branch will be channeled to a smoke-testing system where it gets picked up by several different machines at different locations around the world running different operating systems in different configurations.
Here at MediaMath, we also use git, but we trigger smoke-testing by creating a topic branch:
git push origin mybranch:topic/mybranch
I hypothesized: What if, instead of having Jenkins run a suite of more than 100 individual test files, I had it run a reduced suite of the three or four files I felt germane to the assignment at hand?
I should note that this thought did not come to me while I was sitting at my desk getting frustrated over the problems with Jenkins. I had to actually be away from my desk in both space and time to achieve the mental clarity needed to come up with this hypothesis. That is how I was able to come up with my newest CPAN distribution: Git-Reduce-Tests.
Git-Reduce-Tests provides a command-line utility, reduce-tests, implemented by a library whose principal module is Git::Reduce::Tests.
reduce-tests \
--dir=/path/to/git/workdir \
--branch=master \
--remote=origin \
--include=t/90-load.t,t/91-unload.t \
--prefix=smoke-me \
--verbose
Tell the program where your git checkout is; the name of the branch whose tests you are reducing; where the remote is; which test files to include (or exclude) in the reduced branch; and what to prepend (or append) to the branch's name to kick off a smoke test.
Creating a branch with a greatly reduced number of tests enabled me to get results out of the smoke server faster than running the full test suite -- provided the smoke server was actually completing the smoke run and exiting cleanly. If the smoke server was failing to complete a run, it did not matter how many tests were in the suite.
I've also found reduce-tests useful when I simply want a branch with a small number of tests -- regardless of whether I'm sending the reduced branch to a smoke server or not. In this case, I add an option to tell reduce-tests not to push the branch to the origin. For example, if I were working in the Parrot virtual machine's master branch, I could say:
reduce-tests \
--dir=/home/jkeenan/gitwork/parrot \
--include=t/src/embed/api.t,t/src/embed/pmc.t,t/src/embed/strings.t \
--branch=master \
--prefix=jkeenan/reduced_ \
--remote=origin \
--verbose \
--no_push=1
This would create a branch called reduced_master whose make test would only run three tests found in t/src/embed.
The three CPAN distributions I've discussed today have two things in common:
The major development of each of these modules, from creating the distribution structure with ExtUtils::ModuleMaker to getting version 0.01 up to CPAN, was done in a weekend.
Each of these distributions has its genesis in very specific software needs and is therefore tightly focused. As a consequence, the number of files or functions in each is small. But that makes writing tests easier, so each of these distributions has good test coverage.
The three distributions differ in a number of ways:
Only Parse-File-Metadata was intended for use on a production server. Text-CSV-Hashify certainly could be used in production, but it was originally created as a development tool. Instead of making the computer go faster, it helped the developer to get the job done faster. Git-Reduce-Tests was also created as a development tool.
Parse-File-Metadata has no non-Perl5-core dependencies. Text-CSV-Hashify, as its name implies, depends on Text-CSV. Git-Reduce-Tests, as its name implies, assumes you have git and are working in a git checkout directory. It is currently implemented as a wrapper around Git-Wrapper, a CPAN distribution originally written by Hans Dieter Pearcey and subsequently maintained by Chris Prather and John Anderson. Git-Wrapper has both core and non-core dependencies.
But none of these distributions has "heavy" non-core dependencies in the way that something like Catalyst, DateTime or Dist::Zilla does. That's part of their modesty.
Step away from the keyboard
Sometimes you need to put time and space between you and your problem in order to see that it is a general problem, not just one specific to your work situation.
Is there an open-source solution already?
Then use it.
If not, design the interface.
Which, in effect, means writing the documentation first!
Then write the code and the tests.
Get your tools ready!
Source code repository, e.g., github.com
If solution is Perl, CPAN account
Share!
Thank you very much.