
Node.js For PHP Programmers #4: Streams

01.20.2013

After dealing with web applications a lot, you realize that it's all about a series of bytes transiting from one computer to another. You can accomplish tremendous things without ever understanding how this flow works. But when you want to go really fast, when you want to unlock the ultimate power hidden deep inside your server, then you have to speak like the computers do.

Which brings me to streams. This feature is greatly overlooked in PHP, and ubiquitous in Node. I'll explain how Node streams differ from PHP I/O manipulations, but first, I have a confession to make.

I'm Not a Geek

Because I've always felt at ease with computers, my friends take me for some kind of geek. Okay, I can use the OFFSET() function in Excel to amaze the accountants, but that doesn't make me comparable to cosplayers or government database hackers. It makes no difference to my friends, though, and so they always ask me for help when they struggle with computers.

About ten years ago, a couple of friends went to New Zealand for a year. Before leaving, they asked me if I could set up a photo blog for them to communicate with their friends (it was before Flickr and Picasa and Facebook and Instagram ever existed). So I downloaded an open-source photo album application (written in PHP, by the way) and uploaded it to a server I was renting for myself. They took a lot of pictures, published them regularly, and we were all very jealous. Then they came back, and we all forgot about the photo blog.

A few days ago, the same couple of friends reminded me of the photo blog, and they wanted to get the pictures back. Or rather, as we're now living in a different era, they asked me if I could transfer the pictures to Flickr. Why did they think I could do that? Do they really think I'm a geek?

Wasting Time

After thinking about it, the transfer doesn't sound too hard to do. I only need to upload a PHP script to the server that can browse the filesystem and make a POST HTTP request to the Flickr API. How hard could that be?

<?php
$filenames = scandir($path);
foreach ($filenames as $filename) {
    // get the image content
    $image = file_get_contents($path . '/' . $filename);
    // open an HTTP request to Flickr
    $fp = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);
    // send the request headers
    $out = "POST /services/upload/ HTTP/1.1\r\n";
    $out .= "Host: api.flickr.com\r\n";
    $out .= "Content-Disposition: attachment; filename=" . $filename . "\r\n";
    $out .= "Content-Type: application/octet-stream\r\n";
    $out .= "Content-Length: " . strlen($image) . "\r\n\r\n";
    fwrite($fp, $out);
    // send the image content in the body
    fwrite($fp, $image);
    fclose($fp);
    echo "Sent file " . $filename . "\n";
}
echo "Finished!\n";

You may wonder: why use fsockopen() instead of Guzzle, Buzz, or even Zend_Http_Client? Because they wouldn't change the outcome: the script is too slow. There are tons of images in the directory, and the execution never ends.

The problem is that PHP does one thing at a time, and Input/Output operations are blocking. In other words, when you execute a PHP I/O function, the PHP process waits until the I/O completes. And that can take a very long time. Here is what really happens in the central loop of the previous script:

<?php
$image = file_get_contents($path . '/' . $filename);
// wait until the file is loaded into memory
$fp = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);
// wait until the DNS is resolved and the Flickr server acknowledges the connection
// ...
fwrite($fp, $image);
// wait until the body is sent to Flickr and the Flickr server acknowledges the reception
fclose($fp);

The PHP process wastes a lot of time waiting. Even if you have a very fast CPU, file and network I/O make the script too slow to be really usable on a large number of files.

Tip: Of course, Flickr has an authentication system which grants a token that should be added to each API call. It has been removed from this example to keep it focused.

Streams To The Rescue

To exchange data between systems, the script uses files and requests, but these concepts are too high level to be truly efficient in heavy usage scenarios. There is another reality, at a lower level. It might be daunting at first, but once you've discovered it, you can never come back. Come on, take the red pill, and let me introduce you to streams.

A stream represents a flow of bytes between two containers. In the previous PHP script, the data flows first from the disk to memory (file_get_contents()), then from memory to the distant server (fwrite()). Wouldn't it be more efficient to start sending bytes from memory to the distant server before the initial disk read is finished? Using streams, it's possible: a script can send an image to Flickr while it is still reading this image from the filesystem.

PHP offers a Stream API with low-level I/O functions. Here is how to rewrite the Flickr upload script using streams:

<?php
$filenames = scandir($path);
foreach ($filenames as $filename) {
    // open an HTTP request to Flickr (returns a stream resource)
    $httpStream = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);
    // send the request headers
    $out = "POST /services/upload/ HTTP/1.1\r\n";
    $out .= "Host: api.flickr.com\r\n";
    $out .= "Content-Disposition: attachment; filename=" . $filename . "\r\n";
    $out .= "Content-Type: application/octet-stream\r\n";
    $out .= "Content-Length: " . filesize($path . '/' . $filename) . "\r\n\r\n";
    fwrite($httpStream, $out);
    // open a file stream on the local image
    $fileStream = fopen($path . '/' . $filename, 'r');
    // read from the file and write to the HTTP request
    while (!feof($fileStream)) { // while the fileStream is not finished
        // read 1024 bytes from the file and write them to the Flickr HTTP stream
        fwrite($httpStream, fread($fileStream, 1024));
    }
    fclose($fileStream);
    fclose($httpStream);
    echo "Sent file " . $filename . "\n";
}
echo "Finished!\n";

File reads are now less blocking: PHP only waits for 1024-byte chunks to be read from the disk before sending them over HTTP. Consequently, this second upload script is a bit faster than the first one.

But dealing with streams in PHP is painful. The API is purely functional, not object-oriented. There are tons of functions with opaque names, several ways to do the same simple operation, and patchy documentation. There must be a better way to do this.

Code-switching

Node.js comes with a native asynchronous stream API. In fact, most I/O operations in Node result in a stream by default. HTTP requests are network I/O, file reads are disk I/O, so Node naturally treats both as streams.

This means that streaming data from one source to another is really a breeze. The following script is the equivalent of the previous PHP script using Node.js:

var fs = require('fs');
var http = require('http');

var filenames = fs.readdirSync(path);
filenames.forEach(function(filename) {
  var postOptions = {
    host: 'api.flickr.com',
    port: '80',
    path: '/services/upload/',
    method: 'POST',
    headers: {
      'Content-Disposition': 'attachment; filename=' + filename,
      'Content-Type': 'application/octet-stream'
    }
  };
  // open an HTTP request to Flickr (returns a stream)
  var httpStream = http.request(postOptions, function(res) {
    // dispose of the response status and body
  });
  // open a file stream on the local image
  var fileStream = fs.createReadStream(path + '/' + filename);
  // read from the file and write to the HTTP request
  fileStream.pipe(httpStream);
  fileStream.on('end', function() {
    console.log('Sent file ' + filename);
  });
});
console.log('Finished!');

As you can see, HTTP is a first-class citizen in Node. If you compare the postOptions object with the $out string of the PHP example, where each new header of the HTTP request was appended to a string with \r\n as separator, the difference is striking. The http.request() API encourages you to check the HTTP response, while the PHP functions just write to a distant resource without worrying about possible errors in the process. Node builds on the HTTP protocol, and encourages its proper usage.

Also, Node lets you "pipe" two streams, just like you can pipe two commands in Linux. Here, the output of the fileStream becomes the input of the httpStream, in chunks of 64kB by default. All in one simple method call:

fileStream.pipe(httpStream);

Tip: Node uses a chunked transfer encoding by default on HTTP requests, so there is no need to specify the 'Content-Length' header.

The Node.js version is faster to execute, but not much easier to read or write than the PHP version. Let's find an even better tool for the job.

There Is An NPM Package For That

HTTP client requests with Node can be somewhat verbose. I recommend the request package, an npm module that simplifies HTTP requests. This package is so generic that chances are you'll end up using it in all your Node projects.

With the request package, the code to initialize a POST request is simply request.post(url), so the Flickr upload script reduces to:

var fs = require('fs');
var request = require('request');
var apiUrl = 'http://api.flickr.com/services/upload/';

var filenames = fs.readdirSync(path);
filenames.forEach(function(filename) {
  // open a file stream on the local image
  var fileStream = fs.createReadStream(path + '/' + filename);
  // read from the file and write to the HTTP request
  fileStream.pipe(request.post(apiUrl));
  fileStream.on('end', function() {
    console.log('Sent file ' + filename);
  });
});
console.log('Finished!');

If you take out console messages and temporary variables, the code becomes extremely concise:

var fs = require('fs');
var request = require('request');
var apiUrl = 'http://api.flickr.com/services/upload/';
fs.readdirSync(path).forEach(function(filename) {
  fs.createReadStream(path + '/' + filename).pipe(request.post(apiUrl));
});

Compare that to the first PHP script. Stunning, isn't it?

The Need For Speed

Instead of reading one file after the other, why not use the power of Node to do that asynchronously? This kind of asynchronous iteration requires a meeting point to make sure all the operations have completed. You could try redeveloping this logic from scratch, but someone already did it better (as often with Node.js), in an npm package called async. async.forEach(array, iterator) applies an iterator function to each item in an array, in parallel. Here is the Flickr upload script with parallel file reads:

var fs = require('fs');
var request = require('request');
var async = require('async');
var apiUrl = 'http://api.flickr.com/services/upload/';

var filenames = fs.readdirSync(path);
async.forEach(filenames, function(filename, callback) {
  // open a file stream on the local image
  var fileStream = fs.createReadStream(path + '/' + filename);
  // read from the file and write to the HTTP request
  fileStream.pipe(request.post(apiUrl));
  fileStream.on('end', callback);
}, function(err) {
  console.log(err ? err.message : 'Finished!');
});

But is this script really faster? The answer is yes and no. It is faster because you don't need to wait for the end of one file transfer to start the next. It is not faster because asking a disk to read several files at the same time can be more expensive than reading them sequentially (unless you have a RAID 10 array, a NAS, or an SSD, your hard drive has a single read/write head). The boost of the parallel HTTP streams probably outweighs the slowdown of the parallel disk reads. But the real problem is that the script opens a lot of simultaneous HTTP connections to Flickr, and Flickr will probably kick you out for that. So this is a bad enhancement.

Use asynchronous streams wisely, and always check the benefit they offer. Sometimes they can be counter-productive.
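For instance, the async package also provides forEachLimit(), which caps the number of parallel iterations. The idea behind it can be sketched in a few lines of plain Node (the eachLimit helper and the timer-based fake "uploads" below are illustrative, not the actual async implementation):

```javascript
// minimal concurrency limiter: run at most `limit` async tasks at a time
function eachLimit(items, limit, iterator, done) {
  var index = 0, active = 0, finished = 0;
  function next() {
    while (active < limit && index < items.length) {
      active++;
      iterator(items[index++], function() {
        active--;
        finished++;
        if (finished === items.length) return done();
        next(); // a slot freed up: start the next item
      });
    }
  }
  next();
}

// demo: six fake "uploads", never more than two in flight
var running = 0, peak = 0;
eachLimit([1, 2, 3, 4, 5, 6], 2, function(item, callback) {
  running++;
  peak = Math.max(peak, running);
  setTimeout(function() {
    running--;
    callback();
  }, 5);
}, function() {
  console.log('peak concurrency: ' + peak); // never exceeds 2
});
```

With a limit of two or three concurrent uploads, you keep most of the parallelism benefit without flooding the Flickr API.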

Conclusion

Streams are a geek feature in PHP. In Node.js, they are for everybody: the stream API is much easier to use, and sits at the heart of Node's core principles. Even if you don't need high-performance I/O, you should use Node streams as much as possible, as they bring to the front what used to be hidden behind a curtain of abstraction, without adding significant complexity to your programs.

As for the Flickr upload, I ended up zipping all the photos together on the server, transferring the archive to my desktop over FTP, and then bulk-uploading the photos to Flickr with their Desktop Uploadr. No PHP, no Node.js. No need to be a geek to talk to computers these days.

Published at DZone with permission of its author, Francois Zaninotto. (source)
