PHP Manual Masterpieces

Better late and spearheaded by another corporation than never

Of interest to the readers of PHP Manual Masterpieces may be the advent of the PHP specification.

Twenty years after PHP was started, that is, and apparently mostly being pushed by HHVM i.e. Facebook. Meh, my toy language doesn’t have a spec so who am I to criticize?

It’s extremely long and I haven’t read most of it yet but I assume that formalizing some long-standing insanity as bug-for-bug compatible is an inescapable requirement. It doesn’t cover the standard library, however, which is what really makes PHP what it is (both good and bad). 

But mostly I was just baffled by comment #4 on that announcement page. What on earth does PHP’s speed have to do with anything when talking about specifications? Y’all PHP fans can have a discussion about the language without bringing up that it’s fast and supports C extensions (or rather it’s fast when what you’re doing is implemented mostly in C extensions). Try it some time! 

Jul 3

This blog writes itself.

DateTimeImmutable::modify()

Alters the timestamp. Like DateTime::modify() but works with DateTimeImmutable.

Duh. I don’t see what’s so hard about this, @pda.

Of course, it doesn’t actually do what the name says and modify an unmodifiable object at all. So, at least there’s that; PHP didn’t actually go through with its threat.
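To make that concrete, here is a minimal sketch of what the method does in practice. The date and timezone are arbitrary values picked for illustration.

```php
<?php
// DateTimeImmutable::modify() leaves the receiver alone and returns a new object.
$original = new DateTimeImmutable('2014-03-03 12:00:00', new DateTimeZone('UTC'));
$shifted  = $original->modify('+1 day');

echo $original->format('Y-m-d'), "\n"; // 2014-03-03
echo $shifted->format('Y-m-d'), "\n";  // 2014-03-04
var_dump($original === $shifted);      // bool(false)
```

So "modify" here really means "copy, then modify the copy".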

Mar 3

Nothing Is Deprecated, Everything Is Permitted

Blogging something real quick while I maybe look to see what kind of fun factor I can get out of the Mt. Gox leak:

It is absolutely 100% true that PHP, the platform, has finally deprecated a bunch of Dumb Stuff, and in some cases has even gotten past the deprecation stage to remove the Dumbest Stuff completely. This is good! This is great.

But that doesn’t mean PHP, the community, is suddenly absolved of the problems those misfeatures brought in the first place. For one thing, most PHP deployments are not continuously upgrading to the newest version of PHP or even the newest stable. The last time I wanted to test something, I had to pull PHP and compile it from scratch because the features in question weren’t in Debian yet. However, some people were already saying that the assorted changes in PHP 5.4 meant I couldn’t pick on those things anymore, before 5.4 was available through standard package distribution!

More critical, though, is that the amount of PHP code in production which is old will always exceed that which is new. The normal lifecycle of a piece of code is to write it once, make notable adjustments once or twice, and put it on minimum maintenance forever until the website goes away. The majority of commercial PHP code that I have seen with my own eyes clearly dates to the PHP 4 era or was written by someone who stopped learning in the PHP 4 era, which was the perfect storm of popularity and screwed-up-ness. This code did not magically get better when it was acknowledged that certain language features were bad news. It stayed more or less exactly the same. It’s still running.

Old code does not magically improve. New code is often written with reference to old code. Heck, new code is often literally just copy-pasted from a forum comment written in 2003. PHP’s misfeatures will persist many years past the official attempt to fix them. And that’s terrible :(

Jan 8

I thought @eevee was putting me on. And now you think I’m putting you on. But I’m not.

But it’s not like anyone would ever notice - no-one is blocking javascript on php.net after the malicious javascript incident! You know, the one we’ve been waiting for the full postmortem on since October? In all seriousness, if they cannot determine the root cause of the breach, they should say so for the record: it happens. It means whatever went wrong could probably go wrong again, but, it happens. From my point of view, their incident response kind of fell apart for a while due to confusion; I hope they have regrouped and learned from the experience so that it goes smoother next time. I’m not being snarky. Incident response is hard.

I’d laugh and cry myself to death, though, if this was them getting hacked again but apparently it’s just someone with commit access being cutesy.

Jan 1

Regarding "On the interest of being fair", h t t p : / / a x o n f l u x . c o m / 5 - q u o t e s - b y - t h e - c r e a t o r - o f - p h p - r a s m u s - l e r d o r f -- why wouldn't tumblr let me include links?!

Anonymous

Scorched earth anti-spam policy. I guess we know where Twitter learned it from. For readers’ convenience: link that Tumblr’s markdown parser better figure out godsdam

I personally am of the opinion that not everybody needs to like programming and take an artist’s pride in their results - but I do kinda maybe want the people who make the tools intended for reuse by thousands to have that quality.

I really like how Tumblr creates answers to asks in WYSIWYG then when I edit them they suddenly become markdown except with HTML tags everywhere. Isn’t this site written in PHP…? ;)

In the interest of being fair

I felt compelled to add this note that I understand Rasmus Lerdorf was about 20 years younger than he is now when he started PHP. I understand that most of PHP’s problems are rooted in gaining too much traction too quickly and nobody wanted to introduce breaking changes. I lament that PHP won, not that it ever existed. It’s pretty typical of a personal project from the 90s.

But dang. He was actually older when he released PHP than I am now. … yep back to feeling judgmental and smug

I’m crying. Literally crying. Actual tears in my eyes. Salty. They sting. PHP is physically hurting me.

All of these freaking ridiculous function names we’ve been stuck with for twenty years are because of a POORLY CHOSEN HASH FUNCTION ON A DATASET OF ONE HUNDRED SHORT STRINGS.

Screencap via @DefuseSec because the actual site is down, presumably from everyone gawking in sheer disbelief.

Language Field Trip: IDL

All aboard the school bus, we’re going on a field trip. Did you know there are things that might actually be worse than PHP? It’s true! It’s true and it causes me to doubt the goodness of the cosmos. If you are a tumblr URL purist, I apologize for the deviation from the strict theme of the PHP manual, but I promise there are truly some other masterpieces of program design to be unearthed.

A few years ago, as I was finishing up my degree, I tried very hard to get a job as a programmer for some radio astronomers because radio astronomy freakin’ rules. Unfortunately I graduated right into the very heart of the bad economy, so that didn’t pan out and now I’m a professional hacker or something (I’m not really sure). Preparing for the interviews, however, brought me into sustained contact with a commercial programming language environment called Interactive Data Language aka IDL. It’s for scientific programming, it has some neat things like built-in cartography data, and it’s terrible.

IDL dates to the late 1970s and it shows in every facet of its being. There is of course a reason anyone ever used it in the first place: it is a language oriented to efficient transforms of entire arrays, which is exactly what scientists working on datasets want. In modern times, languages like Python have filled this role about six bajillion times better, but the dark legacy of, well, legacy code lives on. There have been improvements in recent years – apparently it now has automatic GC(!) and a lot of new graph types – but most of the things I’ll point out here can’t be changed without breaking legacy code, and perusing the current code samples on the site does not make the language seem particularly fundamentally improved.

To avoid the tedium of retypesetting several tables and code listings, this manual masterpiece is structured around screenshots taken from a book called Practical IDL Programming by Liam E. Gumley. It’s a bit dated but, as mentioned, they had a legacy problem then and they have a legacy problem now. (The current website of IDL does a good job of not making it at all obvious where the official documentation is. It’s here.) The screenshots constitute a very small portion of the overall book, mostly from chapter 2, used for critique purposes bla bla bla. (If you are in tumblr dashboard view, click/tap on any image thumbnail to expand all of them.)

Let’s begin with giving you a taste of what we’re dealing with: this is a while loop from a larger program.

I want to point out one thing in particular. on_ioerror sets a goto for any future IO errors within the current function scope (so why is the statement inside a while loop?). That should set the tone for how this language works. (For the record, I am a fan of a well-placed goto in low-level code; after all, sometimes I program in asm for fun.)

I don’t even have any idea what order I should present these in. It’s just a steady trickle of arbitrary WTF.

Tiny Integers

Quick! What’s the default integer size in a Big Data scientific programming environment? 64-bit, or do we cheap out and use 32-bit to align with the native width of more modest machines? Or do we define it to be the width of the currently executing machine?

Don’t be ridiculous! Integers are sixteen bit. 32 bits are for long types! (And note how it freely admits that Typecast Hell is a threat you must stand ever vigilant against, as though a programming language actively creating problems for you is simply how they are.)

There may be some vague idea that this is to align with 16-bits-per-pixel image storage formats which I presume were more dense on the ground in the 80s. Or maybe most scientists really did have 16-bit machines (can’t afford a VAX?) and the performance penalty of using larger integers was a huge problem. I don’t know, I wasn’t born yet. In any case, this is a wonderful inheritance passed down from generation to generation: having to remember in 2013 that all your low integer literals are being declared as signed 16-bit. Check this out:

That’s right! You have to remember to explicitly cast your literals if you want them to be comparable to a number north of about thirty two thousand!
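For anyone who has never had to think about 16-bit storage, here is a rough PHP simulation of what a signed 16-bit integer does to a value that does not fit. This is not IDL code: toInt16() is a hypothetical helper written for illustration, and it only models ordinary two's-complement wraparound.

```php
<?php
// Hypothetical helper (not IDL, not a PHP built-in): wrap an integer into
// the signed 16-bit range the way two's-complement storage would.
function toInt16(int $n): int
{
    $n &= 0xFFFF;                            // keep the low 16 bits
    return $n >= 0x8000 ? $n - 0x10000 : $n; // reinterpret the top bit as the sign
}

var_dump(toInt16(32767));     // int(32767), the largest value that fits
var_dump(toInt16(32767 + 1)); // int(-32768)
var_dump(toInt16(40000));     // int(-25536)
```

Anything past 32767 comes back negative, which is why a comparison against a "small" literal can quietly go wrong.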

Has your head hit the desk yet? Get a pillow. Trust me.

Odd and Even Booleans

Hey, you know what would save like, one whole opcode in the runtime’s boolean routine? If we only checked the lowest bit of an integer! BRILLIANT here is your Christmas bonus, Engineer Shortsighted! Oh no your Christmas bonus is an even number of dollars so if(ChristmasBonus) doesn’t evaluate to true.

So… 2.0 is true and 2 is false? – faithful follower @sakjur
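A minimal sketch of the integer half of that rule, simulated in PHP. lowBitTruthy() is a hypothetical helper, not an IDL or PHP built-in, and it assumes the test looks at nothing but the lowest bit.

```php
<?php
// Hypothetical helper: truth test that inspects only the lowest bit of an integer.
function lowBitTruthy(int $n): bool
{
    return ($n & 1) === 1;
}

var_dump(lowBitTruthy(1));    // bool(true)
var_dump(lowBitTruthy(2));    // bool(false)
var_dump(lowBitTruthy(1000)); // bool(false): an even Christmas bonus
var_dump((bool) 2);           // bool(true): PHP's own rule, nonzero is true
```

The last line is there for contrast: the usual "nonzero means true" convention treats 2 as true.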

This has consequences that break perfectly sensible design patterns:

And standard library routines explicitly defy the linguistic definition of boolean:

And this is as good a place as any to mention that not setting a flag is not necessarily the same thing as setting the flag to false??? Apparently an example is /noclip in graph drawing. I dunno.

Procedures and Functions, Parameters and Keywords

IDL maintains a first-class distinction between procedures (doesn’t return a value) and functions (does return a value) which I think most people see as kind of pointless these days; even C doesn’t care very much. This in and of itself is just a quirk, but the syntax for calling them is completely different and in the case of procedures is just weird:

IDL> procedurename, argument, argument

It’s just like… a comma-delimited list, floating in space? The name of the procedure is not differentiated from its arguments except by virtue that it’s first. It’s gross and I don’t see any reason it should be structured differently from function calls, which take a more typical name(arg, arg) style.

IDL also has a first-class distinction between mandatory arguments, called parameters, and optional arguments, called keywords (in contrast to what “keyword” means in most other languages). “Mandatory” is apparently a bit too strong of a term because “a well-written procedure or function will check that any mandatory input arguments are defined before doing anything else.” An apparently intentional misfeature is you can pass non-existent variables as arguments and expect them to suddenly have meaningful contents in the caller’s scope as a side effect of the function.

Of course, the language contains both pass-by-value and pass-by-reference, and which applies when is of course entirely consistent and intuitive!

I mean, it’s obvious to everyone here that an array which is a subset of another array is a fundamentally different type of data than an array that isn’t, right? Of course such rules would be totally different. (I’m contractually obligated to tell you that my best friend wants you to know this is also how Python does it. Well, I never claimed to want to marry Python, now did I! Edit: Except in Numpy, apparently, where it works the way I think is Right and True, which is probably why I thought Python was Right and True, as I’ve used Numpy for something before.)

Since procedures and functions are completely not the same thing, of course the error messages for not being able to find one are completely different:

Read that second one carefully and let the horror sink in: it cannot distinguish between an invalid function name and an uninitialized array. Unless of course you happened to use a single keyword argument to your invalid function name, in which case you get a third unique error message:

Arrays

This sounds reasonable in isolation:

This sounds reasonable in isolation:

But these are both true in the same language. You see, bad indexes in an array are less bad than lone wolf bad indexes. The companionship tames them.

We already hinted at this one: shipping syntax ambiguity, waking up the next morning, and shipping both ambiguity and non-ambiguity going forward.

Pointers

Yes, it’s a high level language. Yes, there are pointers. I suppose they’re really handles or something.

Accessing undefined variables through pointers: a critical and useful feature and definitely not a cause of interesting bugs.

Quirky?

Yucky.

Assorted Brain Damage

Followed by a code sample that explodes for x = 0 due to lack of short circuiting, of course.
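The original sample is in the screenshot, so this is only a PHP sketch of the same class of bug, assuming PHP 7.0+ for intdiv() and DivisionByZeroError. The short-circuiting && never evaluates its right-hand side when $x is 0, while the bitwise & evaluates both sides no matter what.

```php
<?php
$x = 0;
$exploded = false;

// Short-circuiting: the right-hand side is never evaluated when $x is 0.
$safe = ($x !== 0) && (intdiv(10, $x) > 1);
var_dump($safe); // bool(false)

// Non-short-circuiting: bitwise & evaluates both operands, so the division runs anyway.
try {
    $unsafe = ($x !== 0) & (intdiv(10, $x) > 1);
} catch (DivisionByZeroError $e) {
    $exploded = true;
    echo "exploded for x = 0\n";
}
```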

The creat school of function naming thought - ie Ken Thompson’s Regret.

Excerpt from a much larger table – the implication being that there is no hexadecimal notation.

Wait, there are objects?! (And strings are limited to 32 kilobytes?!) THERE ARE OBJECTS?!?! AND YOU’RE JUST NOT GONNA MENTION THAT AGAIN IN OVER FIVE HUNDRED PAGES?!?!

"Don’t bother correctly specifying the expected input. That will just increase the rate at which malformed data is rejected instead of stuffed into places it doesn’t fit!"

And we run entire labs on this.

Nov 8

I Can’t Spell PBKDF

How much longer can a critique of a manual page run than the actual page itself? I hypothesize: quite a lot longer. (Edit: someone has submitted a patch to address some of these concerns. Jump to the bottom for expanded thoughts on why I don’t submit these myself.)

"PDKBF" stands for AUGH I screwed it up already. Let’s try again. "PBKDF" stands for Password-Based Key Derivation Function, which is basically the only real-world use case of deliberately slowing down your own computation. Here’s a crash course in the theory as it pertains to its use in webapps: we collectively made a huge mistake when we chose fast hashing algorithms such as MD5 and SHA1 as a basis for password security. Faster to calculate is faster to crack, and in particular they lend themselves well to GPU computing. A password hashing algorithm should be as slow as possible without interrupting the functionality of your login process. PB-whatever is a hacky but functional fix for this which is essentially just a wrapper that repeats hashing in a loop for a number of iterations under your control. Super.

I am totally 100% for including this in the standard library of PHP due to its, well, standardness. (That being said, I have been asked to point out that there is another new function which is the recommended way to hash passwords in PHP.) It was proposed last year, accepted by a vote of 9-0, and implemented in PHP 5.5. Unfortunately, many prebuilt PHPs in repositories are still on PHP 5.3 (which dulls the joy of hearing that some truly vile misfeatures are finally completely removed in 5.4) so if you don’t have full control of your environment this may not be available to you for a while yet. (This was actually the first time I ever had to compile PHP completely from scratch. It turns out to not be a horrible process; they have, at least, got this “deploying” thing nailed.)
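To show that "hashing in a loop" is close to literal, here is a sketch of the first output block of PBKDF2 as described in RFC 2898, checked against the built-in. pbkdf2FirstBlock() is a hypothetical name, the password, salt and iteration count are made-up example values, and the signatures use PHP 7.0+ type declarations. It is an illustration only; real code should call hash_pbkdf2().

```php
<?php
// Illustration only: the first output block of PBKDF2 (RFC 2898), written as the
// loop it really is. Use the built-in hash_pbkdf2() in real code.
function pbkdf2FirstBlock(string $algo, string $password, string $salt, int $iterations): string
{
    // U1 = HMAC(password, salt || 32-bit big-endian block index 1)
    $u = hash_hmac($algo, $salt . pack('N', 1), $password, true);
    $block = $u;
    for ($i = 1; $i < $iterations; $i++) {
        $u = hash_hmac($algo, $u, $password, true); // U(n) = HMAC(password, U(n-1))
        $block ^= $u;                               // XOR every round into the result
    }
    return $block;
}

$mine    = pbkdf2FirstBlock('sha256', 'hunter2', 'per-user-salt', 1000);
$builtin = hash_pbkdf2('sha256', 'hunter2', 'per-user-salt', 1000, 32, true);
var_dump(hash_equals($builtin, $mine)); // bool(true)
```

The iteration count is the knob: every extra round costs the attacker exactly as much as it costs you.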

So why are we here? Well, a faithful follower slipped me a tip to check out the documentation. It turned out I agreed: I don’t like it. It also turns out I am acquainted with the person who both proposed the RFC and implemented the actual code. Awk-ward. Can I be my usual cruel, demanding, and unforgiving self in the face of the actual hearts and souls of PHP developers? Dangit, I intend to try.

Actual footage of the author of PHP Manual Masterpieces.

Let’s be clear: I have read the backing C code of this feature and I see nothing wrong with the actual functionality. My issues are strictly with the documentation and the API, both of which are very PHP-ish in the sorts of ways that drive me to hateblog about a programming language on a Friday night. It turns out there are people who are totally okay with these design decisions, and I can’t help that their subjective tastes are wrong, but that’s just how it is.

Issue The First: Non-copypaste-safe cryptography

We all know that any and all example code will be used in production somewhere. If it doesn’t error-check, production won’t. If it uses unreasonable defaults, so will production. One can argue that it’s okay to have “some assembly required” example code if the documentation itself – that is, on the same page – clearly explains what assembly is required where. That’s not happening here.

In this case: the documentation shows $salt as a constant string, with no mention that this is bad, when the only safe thing to do in the common use case is absolutely not have one constant string. Setting a salt to a constant pretty much destroys the entire point of a salt; many people are under the impression that since a constant one will still defeat rainbow tables that it has done its job, but that’s living in the nineties. The real threat is massively parallel cracking. Having the same salt across all hashes does not do very much to stop that.

This is what the original RFC’s sample documentation used:

$salt = mcrypt_create_iv(16, MCRYPT_DEV_URANDOM);

And that’s GOOD! But it turns out that mcrypt absolutely cannot be relied on at all to be even probably present, so file that away under Issue The First Subpoint The First: the illusion of PHP having a reasonable supply of built-in cryptography functionality.

That’s an awful lot of words to say that what I want to see is:

$salt = PHP'S_BUILT_IN_SALT_GENERATOR(); 
// use a unique salt per hash. See [here] for details!

Whenever I say things like this, someone always pops up to say that it’s the consuming developer’s job to already know this stuff. Good thing PHP is a language explicitly targeted at seasoned, well-trained experts who have studied cryptography in university.
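For what it is worth, later PHP versions do ship something that fits the wish above: random_bytes(), available from PHP 7.0. A minimal sketch of a per-hash salt, where the password, iteration count and storage format are all assumptions made for the example:

```php
<?php
// Assumes PHP 7.0+ for random_bytes(); the iteration count is an arbitrary example.
$password = 'correct horse battery staple';
$salt = random_bytes(16); // fresh, unique salt for this one hash
$hash = hash_pbkdf2('sha256', $password, $salt, 100000, 64);

// Store the salt next to the hash; it is not a secret, it only has to be unique.
$record = bin2hex($salt) . ':' . $hash;
echo $record, "\n";
```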

Issue The Second: Catastrophically Fail and Carry On

PHP has a deep-seated obsession with never, ever terminating execution with an error, except for stupid reasons. If it’s anything short of the underlying computer physically exploding, PHP’s policy is to return a nonsensical answer and continue with execution. Compounding this problem is that it’s totally normal to disable displaying errors entirely. (Technically, PHP only calls it an error if it is fatal: otherwise it’s a warning, a notice, et cetera.) The result is quite foreseeable: any and all non-fatal errors will go unnoticed somewhere.

Sometimes this isn’t a big deal, but this is cryptography. This isn’t like mt_rand() which is documented as not cryptographic but is often abused for it: this is explicitly intended for cryptographic use. The stakes are high by default. My issue here is that there are multiple ways to cause hash_pbkdf2() to non-fatally return false when a cryptographically usable string is expected. False is of course a perfectly serviceable thing in PHP to use as a string with no explicit casting. Do you see the problem yet?

It is a little bit too easy for code to begin spitting out the exact same output for all inputs and have this go unnoticed. Maybe someone typos a hash name, or some function upstream spits out -1 as an error code and this gets used as the length, or in the future a new hash algorithm is added and someone runs it on an older version of PHP that doesn’t have that algo yet. The end result is that two completely different passwords would end up with the same meaningless “hash” (even in the face of the legendary TRIPLE EQUALITY operator) and cause a catastrophic failure of security.

It is my strongly-held opinion that all errors in cryptographic code should be fatal. If the intended results cannot be obtained then end the world rather than risk an empty string being treated as a meaningful result to be used in security decisions.
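A sketch of that failure mode, and of the "end the world" alternative. $softFailingKdf is a hypothetical stand-in for any derivation call that returns false on error, and deriveOrDie() is a hypothetical wrapper; neither is part of PHP. (PHP 8.0 and later reportedly throw a ValueError from hash_pbkdf2() on bad arguments instead of returning false, which is the behavior argued for here.)

```php
<?php
// Hypothetical stand-in for a key derivation call that fails softly
// (returns false instead of a string), whatever the password.
$softFailingKdf = function (string $password) {
    return false;
};

$a = $softFailingKdf('hunter2');
$b = $softFailingKdf('correct horse battery staple');
var_dump($a === $b);          // bool(true): two different passwords, "same hash"
var_dump((string) $a === ''); // bool(true): false silently becomes an empty string

// Hypothetical wrapper: refuse to continue unless a non-empty string came back.
function deriveOrDie(callable $kdf, string $password): string
{
    $derived = $kdf($password);
    if (!is_string($derived) || $derived === '') {
        throw new RuntimeException('key derivation failed');
    }
    return $derived;
}

$message = null;
try {
    deriveOrDie($softFailingKdf, 'hunter2');
} catch (RuntimeException $e) {
    $message = $e->getMessage();
    echo $message, "\n"; // key derivation failed
}
```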

To keep this focused on the manual: it does not explicitly say that you can get false as a result. It’s left to be the sort of inference made by the highly experienced programmers we already sarcastically established are PHP’s exclusive audience. If the designers do not want to change errors in this function to fatal, the documentation should be made much more explicit about failure modes.

Issue The Third: Metric or Imperial Bytes?

What’s the one thing your high school physics teacher always told you? Take a look at this documentation excerpt and see if you can recall.

length    

The length of the derived key to output. If 0, the length of the supplied algorithm is used.

Need a hint? ALWAYS WRITE DOWN YOUR UNITS!

Well, you might think, this isn’t that big a deal: run the function once and see how long a string it outputs and compare it to the length you passed. It’s gonna be either bits or bytes, right?

BZZT Wrong. The length is measured in characters of the final PHP string (which you might assume is the same as bytes but hold on). Well that’s not so bad, right? At least it’s consistent?

BZZT Double wrong! The length parameter has an undocumented interdependency with the raw output boolean parameter! If it’s false (the default), the hash is measured in hexadecimal digits aka nibbles converted to full characters whereas if it’s true it’s measured in bytes converted to characters. What this means is: the number of actual cryptographically significant bits in the result may be HALF or DOUBLE what you were expecting. The former in particular may be catastrophic.
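A quick way to see it for yourself: ask for a length of 32 both ways and count what comes back. The password, salt and iteration count are throwaway example values.

```php
<?php
$hex = hash_pbkdf2('sha256', 'password', 'salt', 1000, 32);       // default: hex output
$raw = hash_pbkdf2('sha256', 'password', 'salt', 1000, 32, true); // raw output

var_dump(strlen($hex));          // int(32): 32 hex digits
var_dump(strlen(hex2bin($hex))); // int(16): only 16 bytes (128 bits) of key
var_dump(strlen($raw));          // int(32): 32 bytes (256 bits) of key
var_dump(substr(bin2hex($raw), 0, 32) === $hex); // bool(true)
```

Same length argument, same strlen(), half the key material. The last line shows the hex result is just the first half of the raw one.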

In Conclusion: (╯°□°)╯︵ ┻━uoıʇɐʇuǝɯnɔop━┻

I feel like this function’s documentation and API are set up to cause issues in the usual PHP sorts of ways. Since it is cryptographic functionality designed to be used in security, I don’t feel bad making exacting demands for it being completely predictable and explicit.

Why do I post this stuff on a tumblr instead of trying to get involved with PHP’s documentation project? Because that would take all the joy out of being angry as a hobby.

Okay, actually, let’s expand on that: editing this one page to be more explicit on the correct use of salts doesn’t solve PHP’s documentation-of-dangerous-stuff problem. Editing this one page to remind that PHP functions like to return a meaningless answer alongside a soft error doesn’t solve PHP’s maddening tendency to sprinkle these everywhere. So on and so forth. (The length parameter one is at least an issue pretty specific to this page, I think? I hope.)

I’m not “involved with” PHP and I don’t want to be. It seems all the PHP contributors I know are ex-contributors because they got fed up with the community and bailed. Does complaining about their documentation and not personally editing it make me the world’s biggest prissy brat? Maybe. But I’ve got quite enough flamewars to juggle already on the feminism, secularism, and LGBT (add letters as necessary in context) fronts, and hobbies that very deeply personally matter to me using up my evenings and weekends. If people who choose to allocate their hobby points to working on PHP agree with me that they have a systematic design and documentation problem that needs a systematic answer, that’s super great. Otherwise, this is a monument of warning to passersby, and nothing more.

Nov 3

So PHP. Such Documented

So some infosec acquaintances of mine have dropped a random seed cracker for PHP’s mt_rand() - that is, given a good sample of the random output, determine what the original seed input was, and thereby recreate the entire random stream. Depending on the application, this could be used to crack encryption, cheat at games, et cetera. The fact that it is possible to do in a reasonable timeframe is kind of worrying in and of itself but that is beside the point. The point is PHP is a mess.
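The reason a recovered seed is game over is that the generator is fully deterministic. A minimal sketch, with 1234 as an arbitrary example seed:

```php
<?php
// Same seed in, same "random" stream out.
mt_srand(1234);
$first = [mt_rand(), mt_rand(), mt_rand()];

mt_srand(1234);
$second = [mt_rand(), mt_rand(), mt_rand()];

var_dump($first === $second); // bool(true)
```

Anyone who learns the seed can replay every value the application has produced or will produce from it.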

What is mt_rand()? The “mt” stands for "Mersenne Twister" because we should be building the name of the underlying algorithm directly into the namespace and forcing consumers of the API (you) to sit around saying “which version of rand() should I be using?” Fortunately, the documentation is here to help us decide!

mt_rand — Generate a better random value

Better than…?

Many random number generators of older libcs have dubious or unknown characteristics and are slow. By default, PHP uses the libc random number generator with the rand() function. The mt_rand() function is a drop-in replacement for this. It uses a random number generator with known characteristics using the Mersenne Twister, which will produce random numbers four times faster than what the average libc rand() provides.

This is such the most PHP thing. Most languages, I think, would do the exact opposite thing when faced with this problem: they would change the implementation of rand() to a known good one and shuffle off the old version to old_rand() if you absolutely needed to keep the old, platform dependent one (ie the seed stream was not portable between machines anyway). But no! Why do that when you can leave the bad one in place at the obvious name and implement the good one with an awkward name?

But of course, the documentation of rand() will clearly point out that it’s effectively deprecated and one should always use mt_rand(), right?

Nah. Sounds like a lot of work. Stuffing it in “See Also” should do the trick.

A quick look on github suggests that whether people use rand() or mt_rand() is about 50/50. And mt_rand() isn’t “cryptographically secure” anyway - for that you need OpenSSL! Github shows about ten thousand results for that versus about a million results for rand()/mt_rand().
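For comparison, a sketch of what asking for cryptographic randomness looks like. random_bytes() and random_int() assume PHP 7.0 or later, so they postdate this post; the OpenSSL call assumes that extension is loaded, hence the guard.

```php
<?php
// Assumes PHP 7.0+, where random_bytes() and random_int() are built in.
$token = bin2hex(random_bytes(16)); // 32 hex digits from the system CSPRNG
$roll  = random_int(1, 6);          // integer in the range 1 to 6 inclusive

// The OpenSSL route; only available when the extension is loaded.
if (function_exists('openssl_random_pseudo_bytes')) {
    $legacy = openssl_random_pseudo_bytes(16);
}

echo $token, " ", $roll, "\n";
```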

I literally almost fell out of my chair laughing when I saw the awkwardness of this design and the asymmetry of the documentation. This sort of namespace clutter is just the most PHP thing.