Thursday, October 27, 2011

PyPy and the road towards SciPy


PyPy's recent effort to bring NumPy to PyPy, together with the associated
fundraiser, caused a lot of discussion in the SciPy community regarding PyPy,
NumPy, SciPy and the future of numeric computing in Python.

There were discussions on the topic as well as various blog posts
from the SciPy community that addressed a few issues. It seems there was a lot
of talking past each other, and I would like to clarify a few points here,
although this should be taken as my personal opinion on the subject.

So, let's start from the beginning. There are no plans for PyPy to
reimplement everything that's out there in RPython. That has been pointed
out from the beginning as a fallacy of our approach -- we simply don't plan
to do that. We agree that Python is a great glue language and we would like
to keep it that way. PyPy can interface nicely with C using ctypes, with
a slightly worse story for C++ (even though there have been experiments).
What we know by now is that the CPython C API is not a very good glue for PyPy:
it's too tied to CPython and it prevents a lot of interesting optimizations
from happening. There are a few contenders, with Cython being the favorite
for now. However, for Cython to be usable we need to have a story for C++
(I know Cython does have a story, but it's unclear how that would work with
the PyPy backend).
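
To make the ctypes route concrete, here is a minimal sketch of calling a C
function from Python without going through the CPython C API. It assumes a
POSIX system where the C standard library can be located; the wrapper name
c_strlen is made up for illustration.

```python
import ctypes
import ctypes.util

# Locate the C standard library. On POSIX systems, falling back to None
# makes CDLL open the symbols already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data):
    """Return the length of a bytes object as computed by C's strlen."""
    return libc.strlen(data)


length = c_strlen(b"PyPy")
```

Nothing here depends on interpreter internals, which is the property that
makes this style of glue attractive for an alternative implementation.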

Which brings me to the second point: while a lot of code in packages like
SciPy or matplotlib should be reusable in PyPy, it's probably not reusable in
its current form. Either a lot of it has to move to Cython, or some other
way of interfacing with C will come along. This should make it clear that
we want to interface with SciPy and reuse as much as possible.

Another recurring question is why we don't just reuse Cython
for NumPy instead of reimplementing everything. The problem is that we need
a robust array type with the complete interface before we can start using
Cython for anything. Since we're going to implement it anyway, why not go all
the way and implement the full NumPy module? That is exactly the topic of the
current funding proposal -- to provide a full NumPy module. That
would be a very good start for integrating the full stack of SciPy,
matplotlib and all the other libraries out there.

But the trick is also that a robust array module can go a long way alone.
It allows you to prototype a lot of algorithms on its own and generally has
its uses, without having to worry "but if I read all the elements from the
array it's going to be dog slow".
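
As an illustration of that kind of prototyping, here is a small sketch of an
algorithm that reads every element of an array in a plain Python loop. It
uses the standard-library array module as a stand-in for a NumPy array, and
the function name moving_average is invented for this example.

```python
from array import array


def moving_average(values, window):
    """Return the simple moving average of `values` over `window` elements."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    result = array("d")
    # Sum of the first window, then slide it one element at a time.
    running = sum(values[:window])
    result.append(running / window)
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        result.append(running / window)
    return result


samples = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = moving_average(samples, 2)
```

Element-by-element loops like this one are the case where a JIT is expected
to matter most, since no vectorized library call hides the per-element cost.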

The last accusation is that we're trying to split the community. The answer is
simply no. We have a relatively good roadmap for getting to support what's out
there in the scientific community and, ideally, supporting all people out
there. This will however take some time, and the group of people who can run
their stuff on top of PyPy will grow over time. This is indeed precisely what
is happening in other areas of the Python world -- more and more stuff runs on
PyPy and people find it more and more interesting to try to adapt their
own stuff to run on it.

To summarize, I don't really think there is that much of a gap between us
and the SciPy people. We'll start small (by providing a full NumPy implementation)
and then gradually move forward reusing as much as possible from the entire


Monday, October 17, 2011

Wikipedia, tag clutter, pypy and the dangers of bureaucracy

So, the PyPy article on Wikipedia first got tagged with "primary sources",
then, after a not so civil discussion on my side, with "potentially not
notable". These tags are going to stay for the time being, until someone goes
ahead and does the laborious work of trying to prove that PyPy is notable, or
tries to delete the article. As far as I'm concerned the discussion is largely
irrelevant -- PyPy is a fairly notable subject to me personally and that likely
won't change because of the Wikipedia article. I did make contributions to
this precise article in the past, mostly trying to keep it up to date, bumping
the release numbers, correcting links etc.

The reason why the article got tagged is silly -- the grand general notability
guidelines are not cut out for open source projects. Indeed, there are no books
written or anything, even though at most Python conferences everyone knows
what PyPy is and people are using it quite a bit. For all I know, PyPy is
not notable according to the guidelines written on Wikipedia. I would put
it up for deletion myself if I were to follow the rules exactly.

But this is precisely the problem here -- applying rules, which I presume
are called guidelines for a reason, without thinking. For anyone living in
the open source world, it's relatively clear what constitutes "notability", and
it is something different than for most Wikipedia articles. For some
information, like compiler optimizations, the best source I can find is a
post on the Lua mailing list by Mike Pall. You can't change it -- no published
book will change it. This is original research done somewhere
outside of academia, yet pushing the boundaries of human knowledge forward.

The solution doesn't seem to be to simply establish rules for Open Source in
general. In my opinion the problem is with people who don't understand, or
refuse to understand, and stubbornly try to adhere to written rules.

What do you think?


Saturday, October 8, 2011

PyPy's future directions

The PyPy project was long criticised for being insufficiently
transparent about the direction of its development. This changed
drastically with the introduction of the PyPy blog, Twitter stream,
etc., but I think there is still a gap between the achievements
reported in the blog and our ongoing plans.

This post is an attempt to bridge that gap. Note, however, that it is
not a roadmap -- merely a personal opinion about some interesting
directions currently being pursued in the PyPy project. It is not
intended to be exhaustive.

NumPy for PyPy

Even though people might not quite believe that we can deliver it,
there is an ongoing effort to bring NumPy to PyPy by reimplementing
in RPython the interface pieces originally written in C. A lot of work
has recently been done by Justin Peel and Alex Gaynor, and there have
been many smaller contributions from various volunteers.
This is very exciting, since PyPy is shining in numerics, which means that
with the full power of NumPy, we can provide a good alternative to
Matlab, etc. We also have a vague plan to leverage platform-level vector
instructions like SSE to provide an even faster NumPy. Stay tuned!

Concurrent GC

There is a branch where Armin is experimenting with a simple
concurrent GC. This will offload your GC work to another thread
transparently in the background. Besides improved performance, this
should also remove GC pauses, which is crucial for real-time
applications like games.

JSON improvements

There is ongoing work to make JSON encoding fast. We aim to beat the C
extension in CPython's standard library by using only pure Python.
Stay tuned, we'll get there. :-)

GIL removal

There is another branch and an advertised plan to remove the GIL using
software transactional memory. While implementing an STM inside a
dynamic language with lots of side effects is clearly a research
project, the prospects look promising. There is a risk that the
overhead per thread will end up fairly high, but we hope to avoid this
(the JIT may help here) -- and Armin Rigo is well known for
delivering the impossible.
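
For context, the code that stands to benefit is CPU-bound work spread over
threads. With a GIL, only one thread executes Python bytecode at a time, so
the threads below take turns instead of running in parallel. This is only a
sketch of the workload, not of the STM design; the function names are
illustrative.

```python
import threading


def sum_of_squares(n):
    """CPU-bound work: no I/O, just bytecode executed in a tight loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_threads(inputs):
    """Run sum_of_squares on each input in its own thread, keeping order."""
    results = [None] * len(inputs)

    def worker(index, n):
        # Each thread writes to its own slot, so no lock is needed here.
        results[index] = sum_of_squares(n)

    threads = [threading.Thread(target=worker, args=(i, n))
               for i, n in enumerate(inputs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_in_threads([10, 100, 1000])
```

The results are the same with or without a GIL; what would change is whether
adding threads shortens the wall-clock time.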

Minor improvements left and right

Under the radar, PyPy is constantly improving itself. Current trunk is
faster than 1.6 and has fewer bugs. We're always looking at bug
reports and improving the speed of various common constructions, such
as str % tuple, str.join(list), itertools or the filter
function. Individually, these are minor changes, but together they
speed up applications quite significantly from release to release.
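
To see how such constructions behave on a given interpreter, a rough
micro-benchmark along these lines can be used. It is a sketch only: the
inputs are hypothetical, the helper name measure is made up, and timings from
loops this small are noisy, especially before a JIT has warmed up.

```python
import timeit

# Hypothetical inputs, chosen only to exercise each construction.
words = ["spam"] * 100
numbers = list(range(100))

snippets = {
    "str % tuple": lambda: "%s-%d" % ("pypy", 16),
    "str.join(list)": lambda: ",".join(words),
    "filter": lambda: list(filter(None, numbers)),
}


def measure(repeat=3, number=10000):
    """Return the best time in seconds for each snippet over `repeat` runs."""
    return {
        name: min(timeit.repeat(func, repeat=repeat, number=number))
        for name, func in snippets.items()
    }


timings = measure(repeat=1, number=100)
for name, seconds in sorted(timings.items()):
    print("%-15s %.6f s" % (name, seconds))
```

Running the same script under two interpreters, or two releases, gives a
crude per-construction comparison.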

All of the above is ongoing work. Most of it will probably work out
one day, but there is no deadline. It is, however, exciting to see so
many different opportunities arising within the PyPy project.