Wednesday, June 22, 2011

Making things happen one unittest at a time


There has been a lot of discussion about our (PyPy's) plan to reimplement
Numpy. I would like to give a slightly more personal view of how things are
going, as well as some arguments for the approach in general.

Maybe let's start with a bit of background: the numpy effort in PyPy is the work
of volunteers who either need to extend it a little or find it fun to work on.
As of now it implements a very small subset of numpy – single-dimensional float
arrays with a couple of ufuncs, to be precise – and is relatively fast.

There are two obvious questions: (1) whether the approach of reimplementing numpy
might potentially work and (2) whether it makes sense from a long-term perspective.

The first part I'll leave alone. I would think that we have enough street cred
that we can build things that work reasonably well, but hey, predicting the future
is hard.

To answer the second part, there are two dimensions to the problem. One is the
actual technical perspective in the short, mid and long term; the other is how
likely people are to be willing to spend time on it. It's actually pretty crucial that
both conditions are fulfilled. Creating something impossible is hard
(it has been tried before), while creating something that's tedious from
the start makes people not want to work on it. That is maybe less of a problem
in a corporate environment, but in open source it's completely crucial.

Technical part

Everyone seems to agree, with varying degrees of trust, that a JITted numpy
is the way to go in the long term. What can a JIT give you? Faster array
manipulations (even faster than numexpr) and, most importantly, faster
array iterations without hacks like using cython or weave. This is the
thing you get for free when you implement numpy in RPython and don't
get at all when using cpyext. Note that it'll still reuse all parts of numpy
and scipy that are written in C -- this is most of it. The only part requiring
rewriting is the interface part.
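
To make "array iterations" concrete, here is a minimal sketch of the kind of
code I mean: a plain Python loop over a one-dimensional float array. It is an
illustration only, not code from PyPy's numpy. The function name
scale_and_sum is made up for this example, and the standard library's
array.array stands in for a numpy array so that the snippet runs anywhere.

```python
from array import array


def scale_and_sum(values, factor):
    """Multiply every element by `factor` and return the sum of the results.

    This is an ordinary element-by-element loop with no vectorized call.
    A loop like this is what one would normally push down into cython or
    weave to make it fast on an interpreter without a JIT.
    """
    total = 0.0
    for x in values:
        total += x * factor
    return total


# array("d", ...) is a 1-D array of C doubles, used here as a stand-in
# for a single-dimensional float array.
data = array("d", [0.5, 1.5, 2.0])
result = scale_and_sum(data, 2.0)
print(result)
```

The hope expressed above is that a JIT can make a loop written this way fast
enough that the cython or weave rewrite is not needed; how much faster depends
on the workload, so treat this as a shape of code rather than a benchmark.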

With cpyext:

  • short term: nothing works, segfaults

  • mid term: crappy slow numpy, 100% compatible

  • long term: ? I really don't know, start from scratch?

With reimplementing parts in RPython:

  • short term: nice, clean and fast small subset of numpy

  • mid term: relatively complete numpy implementation, not everything though,
    super fast, reusing most parts of pure C or Fortran

  • long term: complete JITted numpy, hopefully achieving a better split
    of numpy into those parts that are CPython-specific and those that actually implement functionality.

If you present it like that, there is really not that much choice, is there?

To be fair, there is one missing part, which is that the first approach
gives you a much better cpyext, but that's not my goal for now :)

Social part

The social aspects are quite often omitted: how pleasant is it to work on a problem,
and how reasonable is it to expect to achieve one's goals within the foreseeable
future? This is a really tricky but important part. If your project is run by volunteers
only, it has to have some sort of "coolness" factor, otherwise people
won't contribute. Some people also won't contribute if the intermediate result
is not useful to them at all, so we want something usable (albeit limited)
from the very beginning. There is a great difference here between the cpyext
approach, which means either adding boring APIs or fixing ugly segfaults (with the
latter being more common and more annoying), and writing RPython code,
which is a relatively pleasant language. With PyPy's numpy we have already
had quite good success with people popping up out of the blue and implementing
a single piece that they really need right now. In my opinion this is how
you can make things happen - one unittest at a time.

Personal part

I plan to spend some time in the near future working on making numpy on PyPy
happen, without any other day job. If you have a thing that requires numpy
and would greatly benefit from having a fast Python interpreter with a fast
numpy, this is the right time to contact me. Money can make some APIs
appear faster than others :)