In my last job, a coworker and I were tasked with harvesting security data from approximately 500 servers each month, weeding out the initial garbage data, and producing one report per server containing the relevant information. As this was not our only responsibility, we had to approach the problem programmatically: we wrote a program to harvest the data for us. The first time it ran, it took just under 7 days to finish, running non-stop. Since this was a month-end task, starting it on the first of the month and waiting a week before we could even begin the analysis was unacceptable, so we had to find another way.
We were brainstorming in one of our managers' offices when it occurred to me that we could emulate clustering if we went about it the right way. Clustering, simply put, is using multiple computers at the same time to do work faster. Until then we had been harvesting serially; now we had the power of parallel processing. There were some unique challenges, as we were only allowed to use DOS batch scripting on these servers, but we got it to work. We also noticed that each session used so little processor time and memory that we could run multiple sessions of the program on one server, so we didn't have tons of resources tied up while it ran. Incidentally, the way we solved the clustering problem gave us collision protection as a freebie: no two sessions would ever do, or get hung up trying to do, the same work. We started 10 sessions of the program the first time we tried it and, voilà, the task was done overnight.
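The original was DOS batch, but the collision-protection idea translates to any language: have each session atomically claim a server before working on it, so two sessions can never grab the same one. Here's a minimal Python sketch of that idea (all names are hypothetical), using exclusive file creation as the atomic claim:

```python
import os
import tempfile

def claim(server: str, lock_dir: str) -> bool:
    """Try to claim a server by atomically creating a marker file.
    Mode 'x' fails if the file already exists, so only one session
    can ever succeed for a given server -- that's the collision
    protection."""
    try:
        with open(os.path.join(lock_dir, server + ".lock"), "x"):
            return True
    except FileExistsError:
        return False

def run_session(servers, lock_dir, harvest):
    """One session's loop: walk the shared queue, harvesting only
    the servers it manages to claim first."""
    for server in servers:
        if claim(server, lock_dir):
            harvest(server)

# Demo: two simulated sessions over the same queue never double-claim.
lock_dir = tempfile.mkdtemp()
servers = ["alpha", "bravo", "charlie", "delta"]
done = []
run_session(servers, lock_dir, done.append)   # session 1 claims everything
run_session(servers, lock_dir, done.append)   # session 2 finds nothing left
print(done)  # each server harvested exactly once
```

With real sessions running concurrently, the claims interleave, but the exclusive-create guarantee still means each server is harvested by exactly one session.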
This was a good solution to our problem, but we wanted more insight into what was going on, so the next month we added logging that recorded not only when each server began and ended, but also when the whole process began and ended. We bumped up to 20 sessions across 2 servers, and our second run finished in about 4 hours. That meant we could start our other work the same day we ran the harvesting program if we wanted to, but since we wanted to see just how fast we could make it, the next month we pushed it up a bit more.
On our third attempt, we ran 40 sessions on two servers, and the task completed in about 2 1/2 hours. At first glance that seems about right: we doubled the sessions and finished in roughly half the time. But something nagged at us. The last time we doubled, from 10 to 20 sessions, we saw over a 60% gain in efficiency; this time we saw only a 37.5% gain. And the 60% came from adding 10 sessions, while the 37.5% came from adding 20. Dividing it out, each additional session on our second run bought us a 6% efficiency jump, but each one on the third run bought only 1.875%. How did we get both 6% and 1.875% by doing the same thing? For a bit we were stymied.
We decided to test what was going on with our fourth run. So that we'd have more data to analyze, we ramped up to 80 sessions on 8 servers, and the task completed in 2 hours. That's only half an hour faster: a 20% gain spread across 40 added sessions, or 0.5% efficiency per session. We had gone from 6% to 1.875% and now to 0.5%. As Joe (my scripting partner) and I were talking about this over a coffee, I had an epiphany: I had seen something like this in a Calculus class I'd taken way back in 1994. I looked it up and, sure enough, we'd managed to bump into a Calculus principle without realizing it.
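The per-session numbers above fall out of a little arithmetic on the run times. Here's that arithmetic in Python; note that taking "overnight" as roughly 10 hours is an assumption, chosen because it matches the 60% figure:

```python
# (sessions, hours) for each monthly run; 10.0 for "overnight" is assumed.
runs = [(10, 10.0), (20, 4.0), (40, 2.5), (80, 2.0)]

for (s0, t0), (s1, t1) in zip(runs, runs[1:]):
    saved = (t0 - t1) / t0 * 100      # % of the previous run's time saved
    per_session = saved / (s1 - s0)   # spread across the sessions added
    print(f"{s0}->{s1} sessions: {saved:g}% faster, "
          f"{per_session:g}% per added session")
```

Running this reproduces the 6%, 1.875%, and 0.5% per-session figures from the story.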
The particular bit of Calculus that was hindering us was something called "The Limit of a Function," which examines what happens to a function's value as its variable changes. What we had was a finite amount of work divided among a variable number of helpers. It doesn't matter whether I count the work as 1 task or as 500 servers, the principle still applies: our function could be viewed as either 1/x or 500/x, and over the four months x moved from 1 up to 80. In Calculus terms, we were watching how the function behaved as x grew. Let's look at the problem with small numbers first and see how this was working.
If we go from 1 to 2 sessions, we should finish in 50% of the time; subtracting that from 100%, which is where we were at 1 session, gives a 50% time savings. Going from 2 to 3, though, we should finish in 33.3% of the time, and subtracting that from 50% gives a 16.7% time savings. From 3 to 4 we get an 8.3% savings, and from 4 to 5 we are down to 5%. Each added session buys less than the one before it, and the marginal gain keeps shrinking toward zero.
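A few lines of Python make the pattern explicit: going from n to n+1 sessions drops the run time from 1/n to 1/(n+1) of the single-session time, so the extra savings is 1/n - 1/(n+1), a quantity that shrinks toward zero as n grows:

```python
# Marginal time savings from adding one more session to n sessions.
for n in range(1, 5):
    savings = (1 / n - 1 / (n + 1)) * 100
    print(f"{n} -> {n + 1} sessions: {savings:.1f}% additional time saved")
```

This prints the 50%, 16.7%, 8.3%, and 5% figures above; algebraically, 1/n - 1/(n+1) simplifies to 1/(n(n+1)), which is why the gains collapse so quickly.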
Another thing we noticed was that about 1% of our servers took up to 2 hours each to complete. They were hard to access, being in South America on slower lines, but since they had to be done for the whole task to be complete, we always ended up waiting on them. We worked through the queue alphabetically, and these particular servers' names started with either a B or a C, so we only had to get through the beginning of the C's in our initial launch to start them, which took about 25 sessions. If we started fewer sessions than that, we had to wait for some of the sessions to move on to their next servers before these would get picked up, but even if we waited, the difference was negligible.
These five or so servers would take a couple of hours no matter what we did, so from then on we focused on finishing the rest of the work efficiently. We had a queue of servers to get to; as long as the queue held a server, each session stayed alive, and once it was empty, each session terminated after completing the server it was on. We could make the first session terminate sooner by adding sessions, but we couldn't actually finish the whole task any faster. Since finishing part of the task faster, but not all of it, did us no good, we instead tried to bring those two numbers together: we reduced sessions so that the queue emptied at about the same time the slow servers finished, balancing optimal efficiency against resource use. We determined that somewhere around 30 sessions was our sweet spot, where the work was spread evenly across as many sessions as we could get and all the sessions terminated as near to the end of the task as we could manage.
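A toy simulation shows the plateau. The workload here is made up (5 slow servers at about 2 hours each near the front of the queue, 495 quick ones at about 6 minutes), not our real data, but the shape matches what we saw: past roughly 30 sessions, the slow servers alone set the finish time:

```python
import heapq

def makespan(task_minutes, sessions):
    """Greedy queue simulation: each session pulls the next server off
    the queue the moment it frees up; return when the last finishes."""
    free_at = [0] * sessions           # min-heap of session free times
    heapq.heapify(free_at)
    for t in task_minutes:
        start = heapq.heappop(free_at)  # earliest-available session
        heapq.heappush(free_at, start + t)
    return max(free_at)

# Hypothetical workload: 5 slow servers (120 min each, early in the
# alphabetical queue) plus 495 quick ones (6 min each).
queue = [120] * 5 + [6] * 495

for n in (10, 20, 30, 50, 100):
    print(n, "sessions ->", makespan(queue, n), "minutes")
```

With these made-up numbers, the finish time flattens out at the slow servers' 120 minutes once there are about 30 sessions; adding more sessions past that point changes nothing, which is exactly the wall we ran into.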
So what we determined was that adding more than 30 sessions to our cluster made no significant difference. 100 sessions couldn't really do all the work any faster than 30 could, even though that seems illogical, but the operative word is all. What was the point of trying to squeak out a fraction of a percentage point of efficiency by adding more sessions? What was humorous about this situation was trying to explain the concept to managers who had never taken an advanced math class in their lives and believed with utmost faith that they would never need more than 10th grade math in the real world. Fortunately, some of the managers just trusted us, and the speed jump from 7 days to a couple of hours was enough for them, even if they couldn't figure out just why we could never seem to finish the task in less than 2 hours.