Here’s a fun bug report:
After talking further with @bluesunrise, it turns out that this error only appeared on his development machine, and even more specifically, only when the Chrome browser was open!
This is a problem with the node-resque project, which, among other things, is used by ActionHero to enqueue and work background tasks. On boot, Resque asks Redis (the backing store for this data) which workers it thinks are running on this host. We do this to check whether any old workers have crashed while working on a job… and if they have, we:
To check which workers this host can manage, all workers have the system’s “hostname” saved, and we can look at what PID they were running as. If that PID no longer exists on this host, we can assume the worker has crashed and clean up the data as described above.
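As a rough sketch of that check, assuming each worker record has already been read out of Redis into a plain object with `host` and `pid` fields (the record shape and the function name `findCrashedWorkers` are hypothetical, not node-resque's actual API):

```javascript
// Hypothetical worker record shape: { name, host, pid }.
// Returns the workers registered for `hostname` whose PID is no longer running.
function findCrashedWorkers(workers, hostname, runningPids) {
  const alive = new Set(runningPids);
  return workers
    .filter((w) => w.host === hostname)  // only workers this host can manage
    .filter((w) => !alive.has(w.pid));   // PID is gone: assume the worker crashed
}
```

Workers registered under another hostname are ignored, since this host has no way of knowing whether their PIDs are still alive.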
This means our Node.JS process needs to check on all the running PIDs on my system. Here’s how we used to do it (simplified):
    var exec = require('child_process').exec;

    worker.prototype.getPids = function(callback){
      var cmd = 'ps awx'; // list every process on the system
      var child = exec(cmd, function(error, stdout, stderr){
        var pids = [];
        stdout.split('\n').forEach(function(line){
          line = line.trim();
          if(line.length > 0){
            // the PID is the first column; the header line parses to NaN and is skipped
            var pid = parseInt(line.split(' ')[0], 10);
            if(!isNaN(pid)){ pids.push(pid); }
          }
        });
        if(!error && stderr){ error = stderr; }
        callback(error, pids);
      });
    };
Check out ps awx. We are asking the OS for the whole process list, and then extracting all the PIDs… which does accomplish our goal. To compare, check out how Ruby’s Resque does the same job:
    def linux_worker_pids
      `ps -A -o pid,command | grep -E "[r]esque:work|[r]esque:\sStarting|[r]esque-[0-9]" | grep -v "resque-web"`.split("\n").map do |line|
        line.split(' ')[0]
      end
    end
Ruby has the luxury of knowing that the process running this application will have “resque” in its name. The Node.JS version has no such guarantee: the process might be called “node”, but it also might be called “electron” or “iojs”. Since we can’t be sure of the process name, we need to look at all processes.
When you look at all processes on a system, there might be a lot of them… I learned that @bluesunrise had a *lot* of tabs open in Chrome, and every tab in Chrome counts as a process. Also, the process list contains a lot of data: the PID, the name, the path, etc. After about 10,000 characters, Node.JS’ Buffers start to get full, and apparently, in some cases, crash.
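One mechanism that behaves this way, though I can't confirm it is exactly what bit us here, is the `maxBuffer` option on Node's `child_process` functions: if the child writes more than that many bytes, Node kills it and reports an error. A minimal sketch (the helper name `captureWithLimit` is made up for illustration):

```javascript
const { execFileSync } = require('child_process');

// Run a command and capture stdout, allowing at most `maxBuffer` bytes of output.
// Returns { output } on success, or { error } if the child failed or
// produced more output than the limit allows.
function captureWithLimit(file, args, maxBuffer) {
  try {
    const output = execFileSync(file, args, { maxBuffer: maxBuffer, encoding: 'utf8' });
    return { output: output };
  } catch (error) {
    return { error: error };
  }
}
```

Something like `captureWithLimit('ps', ['awx'], 10 * 1024 * 1024)` would raise the ceiling, but that only moves the limit; the process list can still outgrow it.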
So now we know the source of the problem, how do we fix it? Since Node.JS’s parsing of the large string returned by the sub-process was the problem, can we off-load this work? We can!
    var cmd = "ps -ef | awk '{print $2}'";
Here, rather than load in all the data from ps, we are using AWK to return only the process IDs (the second column of `ps -ef`). AWK is a safe choice because it is a standard POSIX utility, and thus available on practically every Unix, Linux, and OS X system. This returns a far shorter string back to Node.JS to parse.
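Putting that together, a minimal sketch of the slimmed-down version might look like the following. The helper name `parsePidList` and the standalone `getPids` function are my own for illustration, not necessarily what ended up in node-resque:

```javascript
const { exec } = require('child_process');

// Turn the one-PID-per-line output into an array of integers.
// The header line ("PID") and blank lines fail the digits-only test and are skipped.
function parsePidList(stdout) {
  return stdout
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => /^\d+$/.test(line))
    .map((line) => parseInt(line, 10));
}

// Ask the OS for PIDs only, so Node.JS receives a short string.
function getPids(callback) {
  exec("ps -ef | awk '{print $2}'", (error, stdout) => {
    if (error) { return callback(error); }
    callback(null, parsePidList(stdout));
  });
}
```

Keeping the parsing in its own function means it can be checked without spawning a process, for example by feeding it the string `'PID\n1\n234\n'`.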
Hooray!