I began working for IBM in 1963, on Unit Record accounts, and there was as much fun and challenge in wiring a 604 or 403 board as trying to stuff a 13K program into a 12K 1401 or debug the various releases of software for the S/360. I spent 5 years at IBM, then 5 more years at a software house, a year freelancing, then 24 years at NYSE and its subsidiary SIAC. From 2000 to 2013 I worked at HealthQuest.
I have been fortunate that for much of my career, I have been able to work on projects and systems where the company simply wanted the best that could be built, regardless of cost. I realize this is not the norm in the business world, but it certainly brings out the best in the programmers and produces some remarkable code.
While working for a software house, did major design and much coding for a message-switching system. Originally built for St. Regis Paper, I modified it for NYSE to pass OddLot Orders from brokers to the OddLot dealer and Trade Reports back to the brokers when the trades were executed. Used just about every kind of communications protocol short of smoke signals and all driven via EXCP. Also found a way to run ‘non-resident’ command programs without giving up control to MVT Load SVC (which would have risked delay in handling interrupts from our TP devices). We built all ‘command’ code in 2048-byte CSECTS. At startup, did Load on 1st module which brought it in and did Relocation. Wrote the 2K block to a BDAM file, did Delete and Loaded 2nd module, etc….with nothing else going on, all modules loaded in the same location. When all were Loaded and written to BDAM, left the last one in core instead of deleting it, which provided a location I could use as a ‘buffer’ and overlay with any of the commands. That way, when we needed a Command module, the BDAM read was set up for the desired module and the ECB simply included in our multi-wait ECBLIST along with all the other ECBs for the TP activity…never gave up control, which was fortunate, as we tended to write very efficient channel programs with buffering controlled by the PCI bit, so interrupts had to be serviced very promptly. Barring hardware failures or Program Checks, once we started the application, we never lost control to MVT involuntarily.
Original system had data sent to disk. I modified it to send incoming data to OddLot Dealer, then queue in memory – overflowing to disk when memory filled – to await matching trade Report. This was a particularly felicitous design, since 70% of the incoming traffic occurred in the 1st hour and went into memory (fast). By the time memory filled, the transaction rate had dropped off, so using disk was okay – a neat mesh between software design and application behaviour. Anticipating possible future applications without this lucky circumstance, I later refined it to always queue to memory, with a lower-priority task continuously rolling memory to disk so there would always be room for incoming data in memory and incoming traffic was processed at memory speed.
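The memory-first, overflow-to-disk idea can be sketched in a few lines of modern code. This is an illustration only, not the original Assembler: the class name `PendingOrders`, the `memory_slots` parameter and the use of a second dictionary as a stand-in for the disk file are all assumptions made for the example.

```python
class PendingOrders:
    """Orders waiting for a matching trade report (illustrative sketch)."""

    def __init__(self, memory_slots: int):
        self.memory_slots = memory_slots   # hypothetical limit on in-memory entries
        self.in_memory = {}                # fast path
        self.on_disk = {}                  # stand-in for the overflow disk file

    def add(self, order_id, order) -> str:
        """Queue an order; return where it was placed."""
        if len(self.in_memory) < self.memory_slots:
            self.in_memory[order_id] = order
            return "memory"
        self.on_disk[order_id] = order     # memory is full, so overflow to disk
        return "disk"

    def match(self, order_id):
        """Remove and return the order matching a trade report, or None."""
        if order_id in self.in_memory:
            return self.in_memory.pop(order_id)
        return self.on_disk.pop(order_id, None)


queue = PendingOrders(memory_slots=2)
placements = [queue.add(n, f"order {n}") for n in (1, 2, 3)]
matched = queue.match(1)                   # frees one memory slot
later = queue.add(4, "order 4")
```

With early traffic landing in memory and only late traffic spilling over, most lookups stay on the fast path, which is the behaviour described above.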
One of my IBM customers had paper tape cut by a Friden Flexowriter doing shipping paperwork. This tape went into some IBM hardware which punched cards. The cards, in turn, went into a 403 Accounting machine to generate the reports. Problem was that the layout of the forms being used in the 403 and the layout of the cards did not match very well. The customer had bought a year’s supply of nine-part, floating-carbon forms, too significant an investment to allow redesigning the form. Redesigning the card layout was impractical without redesigning the Friden programming and probably the layout of the shipping forms, so the work had to be handled on the 403. We had to install so many extra Selectors on the 403 that IBM had to add a boost to the power supply.
It took me 3 weeks to wire the board. When I finished, only 6 of the holes on the control panel were unused and most wires were four- or five-tailed. I got it working, turned it over to the customer and went on my way on a Wednesday. On Friday, the customer decided to replace my ‘temporary’ wires with permanent wires and managed to screw it up completely. Scheduled to start an out-of-town class on Monday, I went to the customer’s office on Sunday, ripped out all the wiring and rewired it totally from memory. (In theory, I could have built a wiring diagram as documentation, but I never saw it done or even saw a blank diagram outside of Unit Record training). I gave the board back to the user and threatened him with hellfire if he touched it. He was suitably contrite, but a few weeks later called in great glee to report it was generating totals $6 too high – must be a mistake in my wiring.
Knowing better, I split his card deck in half and made two runs. One was good, the other $6 off. Repeated this until I got down to a small deck containing the error and eyeballed it. Found he had keypunched an O (letter) instead of a 0 (number). On BCD cards, read 9-edge first, the zero was a single punch in the zero row of the card, while an O (letter) was a punch in row 6 plus a punch in row B. The Card Reader picked up the 6-punch but ignored the B-punch since the field was numeric, so the value was brought in as six instead of zero, hence the $6 error. Showed the customer, corrected the mispunched card and everybody happy…and he never questioned me again.
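Splitting the deck in half and rerunning is a bisection search. A minimal sketch follows, assuming exactly one bad card and a hypothetical `run_is_good` check standing in for a 403 run whose totals can be verified; the function name and the sample deck are invented for illustration.

```python
from typing import Callable, Sequence


def isolate_bad_card(deck: Sequence[str],
                     run_is_good: Callable[[Sequence[str]], bool]) -> int:
    """Return the index of the single card that makes a run come out wrong.

    Assumes exactly one bad card, and that any sub-deck containing it
    fails the check while any sub-deck without it passes.
    """
    lo, hi = 0, len(deck)              # the bad card lies somewhere in deck[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_is_good(deck[lo:mid]):
            lo = mid                   # first half is clean, look in the second
        else:
            hi = mid                   # first half contains the error
    return lo


def all_numeric(cards: Sequence[str]) -> bool:
    """Stand-in for a run whose totals balance: every field is purely digits."""
    return all(card.isdigit() for card in cards)


deck = ["0010"] * 100
deck[37] = "0O10"                      # letter O keypunched in place of a zero
bad_index = isolate_bad_card(deck, all_numeric)
```

Each pass halves the suspect deck, so even a deck of several thousand cards needs only a dozen or so runs before the remainder is small enough to eyeball.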
Customer with a 1401 application which read in a deck of cards and spent several hours crunching numbers, then punched out a deck of cards. User would load it up at 8AM and come back at 2PM to retrieve output deck, only to find every 7th card missing. It only happened when the 1401 was unattended – if the operator stood by the reader/punch, it worked fine. Hardware was checked out and rigorously tested, software was examined with a fine-tooth comb and nobody could find the problem. They were beginning to think someone was sneaking in and stealing cards. I went to investigate. The place was a warehouse with no air-conditioning. They simply opened some huge windows and had several large rotating fans which swept back and forth enough to keep the 1401 from melting its circuits. I sat down 30 feet away to read a book while the application crunched. After several hours, the punch kicked on and I watched in awe as six cards dropped into the appropriate pocket and the seventh was magically plucked out of mid-air and swept out the open window. I went up to the machine and it suddenly behaved itself. The timing of the fans sweeping back and forth created an air current coinciding with every 7th card and tossed the card out the window! Standing beside the machine disturbed the air current and eliminated the problem. I took the manager to the window to show him an alley littered with dozens of his missing cards. We solved his ‘computer bug’ by moving a couple of his fans a few feet further from the 1401.
Customer had a Unit Record application wherein he sorted out a deck from a card file, ran a report and collated the cards back into the card file. One day, the collator tore a card in half. He repunched it and it happened again. After continuing to do what didn’t work a few more times, he called me. (The control board had been wired by the salesman, not a Systems Engineer). Found it was taking a pulse thru a Selector and using a pulse of the same duration to ‘pick’ the Selector instead of using a long pulse to ‘pick’ and a short pulse thru the Selector. Result was signal got to both posts of the Selector. One post controlled Stacker One, the other controlled Stacker Two. With both posts turned on, the collator let the card go past the proper location for Stacker One (since Stacker Two was selected), then belatedly discovered Stacker One was also selected and tried to drop it in Stacker One. Since the card was past the normal location for Stacker One, the little metal ‘finger’ dropped a fraction late. One half of the card went to Stacker One, the other half to Stacker Two.
Always keep technical work out of the hands of amateurs.
I began writing Assembler in college, with MAP on the IBM 7090/7094, then writing Autocoder for 1401/1460s. With the advent of the S/360 in 1964, began implementing systems on BPS, TOS (yes, there was a Tape Operating System for the S/360 – lasted only a few months, until DOS came out), DOS and all non-virtual versions of OS – PCP/MFT/MVT.
Had one customer installing a DOS system but knowing he would eventually migrate to OS. I wrote a bunch of ‘OS’ Macros so when he eventually converted, he only had to reassemble. These included ‘DCB’ macros which expanded into DOS DTFxx macros, ‘WTO/WTOR’ macros which expanded into DOS file I/O routines for Console communications and a macro to gather any parameters passed to the program at startup. This neat solution got his application running immediately and saved him a lot of time down the road.
While in IBM’s NY Field Systems Center, I was the customer’s last resort for all OS problems except compilers and Sort. If I didn’t have the answer, I went to the programmer in SDD who wrote the code. Will never forget the time an IBM SE called with a question about the Link Editor. By an incredible coincidence, the gal who answered the phone just happened to be reading the Link Editor PLM and just happened to be on the page which dealt with the caller’s question. Conversation went something like this:
“Hello, New York Field Systems Center, this is Miss Jones.”
A pause while the caller explained his problem.
“Link Editor PLM, page 62, left-hand column, figure 12-7.” CLICK.
…..always wondered what went through the poor SE’s mind…..
I was often called on to benchmark systems in competitive situations or when the sales team was pushing a user to upgrade. I recall one customer – a major oil company – running a Linear Programming application on a S/360 Mod 50 and having serious performance issues. They had a Mod 91 on order when IBM announced the Mod 85. They immediately wanted to cancel the 91 and order an 85 but the CEO demanded proof that the Mod 85 was necessary. Sales called on me. I found that what the customer really needed was to upgrade to a Mod 75, which supported more memory, since Linear Programming was constrained mostly by memory and the user’s limited Mod 50 memory forced the software to run at disk speed. My opinion was not welcome and I was ‘ordered’ to benchmark. Since there was no operational Mod 85 to test on, I got the user’s jobstream and a simulator package, then both simulated and tested his jobstream on Mods 50/65/75/91. This allowed me to tweak the simulations until I knew they reflected actual performance on the various models. I was then able to do a trustworthy simulation of the Mod 85.
Result? Application used 13% CPU on Mod 50, would use 8% CPU on Mod 85. Asked the salesman how he could justify trying to sell a Mod 85 (faster CPU) when using so little CPU in the first place. His response indicated why he – instead of me – was the salesman: “I’ll tell the customer he’ll get a 40% improvement in CPU utilization.” This was my introduction to Situational Ethics, and I didn’t like it much. To the horror of the sales team and our mutual management, I told the customer the truth. He still bought the Mod 85 because he wanted the newest toy, but at least he knew it was unnecessary and I had my self-respect.
Guess the story got around the business world, as some months later, I was called by another customer IBM was trying to upgrade. Suspecting the local sales team might just be trying to generate a commission, the user wanted me to evaluate his system. In fact, the local IBMers had done a fine job squeezing the last drop of performance from his system and he really was maxed out and needed to upgrade, which he did. I found it interesting that he was willing for me to say yea/nay, even though I worked for IBM.
In the mid 1960s, the New York Stock Exchange had an IBM system supporting the Trading Floor, running on 2nd generation hardware. By the late 1960s, this system was close to its maximum capacity and NYSE needed to replace it. For several million dollars (back when a million was a lot of money), IBM built a customized version of OS/MVT as ‘Market Data System II’. The IBM team on the project replaced the guts of Task Management, Abend Processing, added a ‘parallel’ IOS and associated Interrupt Handling and added fully duplexed disk processing and duplex CPUs, in addition to routines for non-standard TP devices and the infrastructure for controlling both the I/O environment and the data side of the application.
This system ran on a pair of S/360 Mod 50s, RPQd to share non-volatile LCS memory and with a primitive CTC, letting it provide virtually instant failover from hardware problems. It supported several TP devices unique to NYSE. Both the Mod 50s and the 2703s (communication front-ends) had hardware modifications to support the application. There were numerous pieces of the code which found their way into later versions of MVT and even into MVS. The Enq/Deq logic was better and more efficient than the normal OS routines (the dreaded ‘deadly embrace’ was impossible) and the commit/backout of transaction processing far more efficient than any other system around, better than anything around today, in fact. The reliability of this system became legendary: from the day it went live, it ran almost 18 months without any outage to the users – an unparalleled achievement for those days and far better than anything NASA or FAA could put together at the time.
The system had duplexed DASD with the ability to tolerate a disk failure and rebuild back to duplex mode on-the-fly; a ‘database’ cached in shared, duplexed, non-volatile memory; the ability to swap underlying CPUs on operator command or if the Active or Backup processor detected a failure on the Active system; a validity check of outbound traffic to make sure it was not garbled, misdelivered or incomplete; Abend processing which prevented individual transaction failures from crashing the system but provided excellent debugging facilities.
The system was based on MVT Release 11, a particular specialty of mine at the time. In 1974, I took over the ‘system’ side of MDS-II. At the time, it took 10-15 minutes to boot MVT, so my manager and I immediately implemented a ‘fast-IPL’ which allowed us to reboot in about 30 seconds. I then implemented migrating from 2314 DASD to 3330, which needed an upgrade of MVT. In Release 18, IBM repackaged IOS and Interrupt Handling into two pieces and it became my job to redesign IBM’s original modification of IOS to the new environment by extensively rewriting the ‘parallel IOS’. Aside from upgrading the underlying OS, I added support for several more non-standard I/O devices and improved performance and capacity. I/O coding was almost exclusively via EXCP, with non-standard devices sometimes driven by RPQ’d I/O commands unique to that system. By 1980, the system was running at 95% CPU, so we began looking for a replacement beyond simply upgrading from RPQ’d Mod 50s to some faster CPU which would also require the RPQs.
A first step in our upward migration was to replace the RPQ’d 2703s with 3705s running a modified EP. Eventually we also replaced the non-standard I/O devices with standard ones, eliminating the need for EP modifications. Once this had been done, I moved the application from the S/360 Mod 50s to a S/370 4341, running OS/MVT 21.8E, giving us support for 3330-2, which necessitated modification of the duplex-disk code for the expanded capacity of these DASD. The 4341 move also required us to Superzap (now there’s a piece of code I wish I had written!) the console routines to support a 20-line screen instead of a 24-line screen. I also had to drive an unsupported printer and when nobody at IBM could tell me the appropriate channel-programming to load UCS/FCB images, I realized that since VM drove the printer successfully, somebody had known the correct programming. I brought up a guest VM under VM and traced the guest VM. This let me discover the correct CCWs, code the proper channel programming and zap OS to use my code instead of the 3211 code the device was gen’d as.
Since the 4341 did not have non-volatile memory, and we did not have the same duplexed hardware as the pair of Mod 50s, the failover process needed to be totally redesigned and recoded. We put a CTC link between the MDS 4341 and another 4341 running VM. I then modified the MVT Machine Check handlers to tap the VM system on the shoulder in the event of failure, write the ‘database’ from high memory onto a tape if possible and at least to the CTC. I modified the VM system so that on being ‘tapped on the shoulder’, VM did a CP Shutdown, then executed an additional modification whereby the last thing it did before loading a Hardwait PSW was to hang a Read on the CTC. The data from the MDS system went into high memory on VM’s 4341, we rebooted MVT underneath it and came back up, good to point-of-failure.
In the mid 1980s, NYSE decided to migrate the application to a system that was designed from the ‘get-go’ for duplex processing, instead of having to depend on non-standard in-house programming. This meant reprogramming the application from scratch as MDS-III on Tandem hardware. This took scads of people many moons – probably 20+ man-years – to get the system running at all and another year to achieve the level of reliability the application had enjoyed on the IBM 4341.
Several systems at NYSE had been migrated to Tandems, but the Market Data System was still on the IBM 4341 in October 1987. When the stock market went crazy, all the systems crashed on the volume and transaction rate – all the systems except MDS-II, the oldest system in the house, generally considered obsolete. NYSE can tolerate failures on all the systems except MDS – without MDS, NYSE stops trading. Trading was so hectic it was scary, but MDS-II on the IBM 4341 kept on truckin’ through it all. The ticker was so far behind it was useless, but that happens during normal trading, since it has to be slow enough to read and trading is often too fast for that. They now have the ability to stop the ticker and restart it at a later timepoint, skipping over data which is too old to be of any value.
There were two reasons MDS-II survived. First, it was the best-designed and best-implemented piece of code I have ever seen (and I’ve seen a lot of good code and written some of it myself). Second, while NYSE was doing an average volume of about 125K transactions per day, the maximum capacity was 350K and was generally considered sufficient for the expected lifetime of MDS-II. However, I wanted to give MDS-II enough capacity to guarantee it would last until it was replaced by the Tandem MDS-III implementation. I didn’t believe the conversion from IBM to Tandem would be on time (it wasn’t – Murphy showed up) and I wanted to move on to other things and not have to touch MDS-II again. To universal derision on the part of everyone at NYSE and SIAC, I increased the capacity to 650K transactions (limited to 650K because of the size of a single disk pack). In October 1987, when all the other systems crashed, NYSE did 635K+ transactions, so my having expanded the capacity from 350K to 650K no longer seemed so ridiculous and a whole lot of people suddenly began thinking I was prescient instead of paranoid. (Wrong! I was paranoid, or at least very cautious).
The Tandem programmers on all the other NYSE systems spent the night expanding their systems. I simply redefined one of my files as being two 3330s instead of one, increased the disk blocksize to full-track and thereby created a system able to handle 1.3-million transactions, equivalent to about a 3-billion-share trading day. NYSE only approached that volume many years later – and I had built a system with that capacity on obsolete hardware and even more obsolete software. Maybe ‘obsolete’ is a relative term, but I’ve always felt that “if it ain’t broke, it don’t need to be fixed”.
Given the near-panic in the financial community and general population, it is interesting to consider what might have happened if MDS-II had crashed and NYSE had been forced to shut down. I have heard speculation that the continued ability of the NYSE to maintain market liquidity was the only thing that kept us from slipping into another Great Depression. Given the competition between the USA and the USSR at that time, a depression might well have eliminated our ability to pay for both Guns and Butter, which in turn might have allowed the USSR to survive. One financial/political pundit has suggested that if NYSE had been forced to shut down in 1987 for even a few days, the economy would have collapsed, we would still be in the Cold War, the Berlin Wall would still be standing and Eastern Europe would still be Communist.
WWI started because a young radical shot an unpopular and ineffectual nobleman. The Mongol Horde was set to take over Europe when Genghis Khan died and they all went home to select a new leader. All historic processes have to start somewhere, so maybe a middle-aged programmer also changed the course of history.
Sometimes the right person is in the right place at the right time.
“Don’t you just love it when a plan comes together?”
The Tandem group was very anti-IBM, just as there are those today who resent Microsoft because it is the big boy on the block. These folks pointed out that the 3330 disks were on their last legs and MVT would not support anything more modern. It was true we were buying old 3330s for spare parts, but I pointed out that we could always run MDS-II under VM on virtual 3330s, since the application only used about 40% of a 4341 CPU. The fact is that if they had wanted to do so, they could still be driving NYSE from the MDS-II, MVT-based software, many years after the original hardware and software was phased out and no longer supported. As it was, the system ran with the same basic design for 18 years. It remains to be seen if more modern systems will achieve the same longevity. We have certainly not seen anything which approaches that level of reliability.
The original contract for MDS-II between IBM and NYSE specified that IBM would support it forever. Forever turned out to be a relative term, since all the people who knew the system had died, retired, left IBM or moved up the corporate ladder far beyond any programming. I think NYSE really felt they had to migrate to a more modern, supported system because they knew they couldn’t run MDS-II without me and that made them very nervous – they had seen the way I ski, heard rumors about how I drive and knew I had recently taken up hang-gliding…
From 1974 thru 1988, I also handled all updates to the application code of MDS-II, bringing order to a system which had previously often resulted in one programmer’s changes being incompatible with another programmer’s changes or situations wherein Source did not match Load. Being lazy, I had programmers send their update files to a particular VM userid, which had an Exec running to wake up periodically and read in any files in its Reader. I was working from home at the time, so about 8PM, when no other programmers were accessing the Sourcelib/Maclib/Copylib, I logged onto the VM userid, terminated the looping Exec and triggered another Exec which created a third (temporary) Exec based on the updates that had accumulated during the day. This dynamically-built Exec fetched the current modules, applied the updates, assembled the updated source, then wrapped the output and updated source in JCL and sent it to a virtual MVT system, which read in the JCL and updated the libraries. It took some time to code the Execs, but once done, it was really pretty cool to just logon a VM userid, enter a couple of commands and watch TV for an hour while 30 or 40 updates were applied without me doing anything more at all.
Given the money changing hands each day, an outage at NYSE is a SERIOUS matter. One aspect of our yearly bonus hinged on meeting targeted up-time goals. Every year we met the goal and every year NYSE raised the bar. Last I knew, the req was 99.5%, but that didn’t really concern me, since over the last seven years when I ran the system, we had no software failures at all and only one hardware failure – a true, red-light, sparks-and-smoke, bells-ringing failure. Before the operators could hit the Emergency Power switch on the smoking 4341 and grab the Halon fire extinguishers, we were back up – less than 50 seconds – and NYSE didn’t even know we had taken a hit. Over seven years, that’s an up-time of 99.99997+%.
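The arithmetic behind that up-time figure can be checked with a short calculation. This is only a sketch: it assumes round-the-clock availability over seven calendar years of 365.25 days and treats the single outage as a full 50 seconds, neither of which is stated precisely above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # assumption: calendar time, not trading hours


def uptime_percent(outage_seconds: float, years: float) -> float:
    """Percentage of the period during which the system was available."""
    total_seconds = years * SECONDS_PER_YEAR
    return 100.0 * (1.0 - outage_seconds / total_seconds)


seven_year_uptime = uptime_percent(outage_seconds=50, years=7)
print(f"{seven_year_uptime:.6f}%")
```

Since the outage lasted less than 50 seconds, the true figure would be slightly higher than the one computed here, which is consistent with the ‘+’ in the number quoted.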
I was not the SysProg on the VM system nor do I consider myself an expert with VM, but I became fairly proficient at coding both Exec and Exec2. When REXX came along, I didn’t use it much because I already had my requirements coded in Exec2, but REXX would have made things a whole lot easier. I did code a Blackjack Exec and a Battleships Exec played between two VM users. On a more business-like note, I coded Execs and some small programs to do weekly full-pack dumps of VM Userspace packs and daily incremental dumps, plus Execs to recover entire packs, individual user spaces or particular files. Beyond some neat Assembler and Exec coding, this also involved cracking VM’s security to dynamically get the Read/Write passwords for the Users’ disk spaces. It did save the company thousands of dollars for backup software packages.
I found a way to crack VM security and changed the MAINT password just to jerk the chain of the VM SysProg (my boss). I also coded an Exec to present the VM Logo on a screen. I logged on my boss’s 3270 and ran the Exec, then sat back to watch the fun. He sat down and cleared the logo, then tried to logon, but instead of typing to VM, he was typing to my Exec, which naturally rejected his password several times. He then decided that some ratfink (glances in my direction) must have logged on as MAINT and changed his password. He therefore tried to logon as MAINT. I thought he would have a heart attack when my Exec informed him that user MAINT did not exist. That’s when I couldn’t keep a straight face any longer. It’s amazing how much coding technique one can learn if motivated by a sufficiently devious purpose…
Aside from major projects like MDS-II, I have written dozens of exits for both MVT/MVS and 3rd-party software and done a lot of modification of MVT for special circumstances, such as providing an airline with continuously-running channel programming to maximize DASD performance, built a lot of accounting software before OS had the built-in tools, was a major architect of a message-switching system that was better than anything I’ve seen since, including MQ. I also wrote what was essentially a database – before the word was invented – using BDAM and multiple ISAM files for an IBM customer. In the 1960s, I helped test and develop a lot of pre-release code at IBM.
Citibank had a Stock Registration application with a file so large it had to be split into four ISAM files, with a table telling the programs which range of ISAM keys were on which file. Every weekend, the ISAM files were offloaded and reloaded for backup and reorganization. Adding one widely-held stock was such a massive update to the file that it had to be done in pieces and the file reorganized in progress, all of which took 36 hours. In the process of replacing ISAM with a proprietary Access Method, I encountered the problem of reconstructing individual packs. My solution was that during the weekend offload/reload, I kept track of the keys on each tape and each pack. To recover a pack, I built a small file with only that one data pack. I grabbed a replacement pack and two temporary packs, one for Index and one for Overflow. If recovering pack 12, for example, I determined the key range for that pack, reloaded it from the appropriate offload tape(s), then processed the daily log of updates for records in pack 12’s key range. The Index and Overflow information on my two packs matched the information on the real file’s Index and Overflow packs. I then returned the two temporary packs to the scratch pool and my rebuilt pack replaced the failed pack 12.
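Both ideas in that paragraph, the key-range table and the single-pack rebuild, can be illustrated with a small modern sketch. Nothing here reflects the actual Citibank code: the key values, the file names and the helpers `file_for_key` and `rebuild_pack` are hypothetical, and dictionaries stand in for the offload tape and the disk pack.

```python
import bisect

# Hypothetical routing table: the highest key held on each file, ascending.
HIGH_KEYS = ["F999999", "M999999", "S999999", "Z999999"]
FILES = ["ISAM1", "ISAM2", "ISAM3", "ISAM4"]


def file_for_key(key: str) -> str:
    """Return the file whose key range contains `key`."""
    i = bisect.bisect_left(HIGH_KEYS, key)   # first file whose high key >= key
    if i == len(HIGH_KEYS):
        raise KeyError(key)
    return FILES[i]


def rebuild_pack(offload: dict, log: list, low: str, high: str) -> dict:
    """Rebuild one pack: reload its key range from the weekend offload,
    then reapply only the logged updates that fall inside that range."""
    pack = {k: v for k, v in offload.items() if low <= k <= high}
    for key, value in log:                   # log is in time order
        if low <= key <= high:
            pack[key] = value
    return pack


offload = {"A000001": "old", "G000001": "elsewhere"}
log = [("A000001", "new"), ("H000005", "ignored"), ("A000007", "added")]
rebuilt = rebuild_pack(offload, log, low="A000000", high="F999999")
```

Because only one key range is reloaded and replayed, the work is proportional to the failed pack rather than to the whole file.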
Conceptually, this was just dandy. The only problem was that after loading my ‘ISAM’ file and closing it, reopening it in Update mode should have positioned Index/Data/Overflow back as packs 1/2/3 in the TIOT, but a bug in OS screwed up the pointers. I reported this to IBM and they admitted it was a bug but refused to fix it. Evidently, nobody else ever ran into this problem and the bug was in some critical code they didn’t want to tinker with. I would have to get into Key Zero and fix the pointers myself. Good luck. Unfortunately, the Systems Programmers at Citibank had included no User SVCs in the Sysgen and would not do so. I therefore had to find a backdoor into the system. I’ve heard there were 20+ backdoors to get into Supervisor mode in MVT. I can’t vouch for that, but I did find one, got into Supervisor Mode & Key Zero, corrected the pointers in protected memory and returned to User Mode and Key. Hacking did not begin with Windows, did it?
One Monday, my boss gave me two weeks to determine what it would take to build a system which could monitor trades for rule violations. I immediately set about designing and coding such a system. The application would accept data from one TP application (which was to be the responsibility of the IBM team on the account), process the data and give any violation messages to another TP application. I began by coding three User SVCs: one to provide memory for message queues; one to feed the queues; one to service the queues. I next coded an Initialization program to build an in-memory data table from multiple source files at startup (necessary to determine rule violations); then a Checkpoint program to periodically ‘snap’ the message queues and the data table; then a Recovery program to restore the message queues and the data table from the ‘snaps’. I also included three testing programs: one to simulate the Inbound TP application by feeding the message queue from cards; one to simulate the Outbound TP application by sending to a printer; one to simulate the Processing application by pulling data from the inbound message queue and dropping it in the outbound message queue. This enabled us to test the Inbound routine, the Processing routine and the Outbound routine independently of each other, which greatly facilitated developing the real routines.
The end result was an application using an unsupported TP protocol to monitor stock trades for rule violations and generate violation notices as appropriate; able to use the Checkpoints and a Log file to recover to point-of-failure in less than 60 seconds. A week after being assigned the task, my boss asked me if I was making any progress. I told him the system was ready to go as soon as he could get the inbound TP application operational (which took IBM several weeks). Once IBM got their code working, my ‘test’ programs only had to change their I/O interface – the logic was already tested. The real kicker? I had designed, coded, keypunched and tested all my pieces – the three SVCs, the Initialization, Checkpoint, Recovery and three test modules – in three days and they all worked perfectly the first time!
I was often called on to speak to industry groups, sometimes on general issues facing IT departments at banks and brokerage houses; sometimes at vendor invitation to promote code I had written to enhance various software packages. I was always amused that when I walked into these affairs, I was invariably eagerly greeted and welcomed by the VPs and high-level executives who knew me. And on the periphery, I would see young strangers pointing at me and whispering to their companions, obviously asking who I was. When told, their eyes would get big, their mouths drop open and a look of amazement prevail. Evidently, I was a legend. Made me wonder what tales were being told about me in the community. 😀
Some checkpoint/recovery schemes restore a system to ‘start-of-day’, then reapply transactions from a log. Recovery is reasonably fast if the failure occurs early, when there is little log to process, but it’s slow when a whole day’s transactions need to be reprocessed. Other systems, such as CICS, try the approach of taking many checkpoints, but this significantly reduces the transaction rate. Some 24×7 systems have no real ‘start-of-day’, which complicates recovery even more. I once built an application with the data kept in memory (for speed) and designed a checkpoint/recovery mechanism in which I found EXCP to be too high-level and thus had to code the DASD I/O at the SIO level. It wasn’t pretty, but the result was that every 5 minutes, the application paused while the data in memory was dumped to disk. The checkpoint locked up the application for a little over 2 seconds running under VM (never timed it running native). As a result, recovery reloaded memory from the last checkpoint instead of from start-of-day and we only had to reprocess an average of 2-1/2 minutes – max of 5 – of logfile. This gave us a failure/recovery time of about 90 seconds and did not depend on ‘start-of-day’ conditions. Even in the virtual systems of today, if one is willing to put critical data in V=R, such a technique is feasible and I have seen no other approach which is as good at maximizing thruput and minimizing recovery time.
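The arithmetic behind those numbers is simple: if a failure is equally likely at any moment within a 5-minute checkpoint interval, the log to be replayed averages half the interval, or 2-1/2 minutes, and can never exceed 5. The logic itself can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SIO-level implementation described above: the state is a plain dictionary, the log is an in-memory list, and the names `Application`, `Checkpoint` and `recover` are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Transaction = Tuple[str, int]  # hypothetical record shape: (key, amount)


@dataclass
class Checkpoint:
    state: Dict[str, int]   # snapshot of the in-memory data
    log_position: int       # number of log records already reflected in it


@dataclass
class Application:
    state: Dict[str, int] = field(default_factory=dict)
    log: List[Transaction] = field(default_factory=list)
    last_checkpoint: Checkpoint = field(default_factory=lambda: Checkpoint({}, 0))

    def apply(self, txn: Transaction, write_log: bool = True) -> None:
        """Log the transaction first, then update the in-memory data."""
        if write_log:
            self.log.append(txn)
        key, amount = txn
        self.state[key] = self.state.get(key, 0) + amount

    def take_checkpoint(self) -> None:
        """Pause and 'snap' memory, remembering how much log it covers."""
        self.last_checkpoint = Checkpoint(copy.deepcopy(self.state), len(self.log))


def recover(checkpoint: Checkpoint, log: List[Transaction]) -> Tuple[Application, int]:
    """Reload memory from the last checkpoint, then replay only the log tail."""
    app = Application(
        state=copy.deepcopy(checkpoint.state),
        log=list(log),
        last_checkpoint=checkpoint,
    )
    replayed = 0
    for txn in log[checkpoint.log_position:]:
        app.apply(txn, write_log=False)  # already in the log; do not re-log
        replayed += 1
    return app, replayed
```

Recovery cost depends only on the length of the log tail since the last checkpoint, never on how long the application has been running, which is why no ‘start-of-day’ is needed.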
From 2000 to 2013, I babysat a hospital CICS application running on z/OS 1.7, which was replaced by servers. They may eventually realize their mistake, but will never admit it – they will simply find new jobs. I have seen very few applications that were actually better suited to Intel-based platforms. The reliability of mainframe hardware and software is vastly better than it was 40 years ago, but aside from Linux, the flexibility in software is less and there is almost no original work being done at the customer level. Why not? Because 98% of what businesses need has already been done – on the mainframe. What bothers (and amazes) me is to see the wheel constantly being re-invented in the Intel-based world as users try to accomplish in a Client/Server environment what has already been perfected in the mainframe environment. “Those who do not read history are condemned to repeat it”. They will eventually work out the bugs in Windows and Networks – it took IBM years to achieve the level of hardware and software reliability we enjoy today – and will discover that most applications are better off centralized than distributed. The result will be a need for more and more power in Servers, at which time they will find IBM Mainframes waiting patiently to be discovered – rock-solid stability, greater capacity and better security. (Every system is vulnerable to attack from trusted insiders, but if all the world’s TCP/IP servers were IBM mainframes, attacks from outside would be virtually impossible).
The ironic thing is that while short-sighted users compare the cost of a few servers against the cost of a mainframe, they do not consider (or do not really understand) scalability. Many of the people making the IT decisions lack a broad enough view to understand the full environment, which goes far beyond mere dollars spent on platforms and operating systems. If a company were to double its users, the total cost would more than double, probably 250% in a Client/Server setup but would increase much less, probably 120% in a mainframe setup. The bigger the shop, the better the mainframe edge and the lower the cost/user. One might thus assume that the mainframe is only economical in large shops, but you’d be surprised at both how small a shop can benefit from the mainframe and at some of the pricing options available from IBM, such as paying only for as much computer power as is used. For example, if a company needs mucho computer power for seasonal activity – Christmas season, summer tourism, etc. – it must install Client/Server equipment to meet its maximum requirements and waste (and pay for) that capacity during the off-season, but with a mainframe, the extra capacity is turned on and off as needed and only paid for as used. It’s rather like having a small, economical car for your daily commute, then being able to press a button and turn it into a sports car, 18-wheeler, pickup, SUV or van as needed.
I am obviously fond of the mainframe and with good reason. I have seen it develop from the beginning and I understand how much it can do. Most of its critics have very little real understanding of it and no idea what it can do. The instruction set on the mainframe, for example, completely beggars any other computer and this richness allows an elegance of programming unlike any other platform. When I speak to our own pro-PC, anti-mainframe folks, I find the only exposure they have to the mainframe is that they once took a COBOL course. In other words, they have the first day of the first week of Mainframe 101 and consider themselves qualified to pass judgement on what it can and cannot do. I have seen some good client/server work, but most applications are jury-rigged and clumsy. I have also seen the development of Windows, Client/Server architecture and the Internet (which is based on a buggy, extremely insecure underlying design, an interesting concept poorly implemented).
I understand and am excited about the potential of the new technology and have even done a lot of work on server-based applications, but I haven’t been impressed or pleased by the performance, reliability or (especially) security of that environment. The only way a PC beats a mainframe is in cost/unit, but whereas an entire business workload can be run on a mainframe, to do the equivalent requires a great many servers, with the associated problems of networking and synchronization, multiple points of failure, more personnel, etc.
My current shop has one z800, hundreds of PCs and printers and 300+ servers which are our responsibility plus another 30-40 servers belonging to other entities but whose availability is important to our operation. On any given day, from one to four of the servers have problems. We have several techs who spend all their time building, installing, configuring, loading, debugging, updating and supporting servers. The z800 and z/OS have me. Could the users do the work Client/Server? Yes, assuming the appropriate software is available. Cheaper? I doubt it. Double the load (and this customer has increased 50% in three years), and I guarantee the mainframe will be cheaper. More reliably? No way! Several months ago, our Microsoft Exchange server had a disk crash. It took them all day to get it operational again and another two days to get the old email reloaded (users left their email on the server instead of downloading and deleting). On the same day, the mainframe’s Shark had a disk failure which we only discovered when it ‘phoned home’ and the IBM FE arrived with a box in his hand. He replaced the disk on-the-fly, no downtime, no lost data, no hassle. There were two other mainframe disk failures – no outage, no data loss. There were several other server disk failures, each resulting in long outages. It is possible to also configure the servers with RAID-5 arrays of disk, but doing so will raise the cost of the client/server system and eliminate much of the supposed price advantage. We do have a couple of RAID-5 servers, but they are only four disks wide and can only tolerate a single disk failure. The Shark has eight-wide disk packs and can tolerate two disks failing without data loss. The server RAID cards are also significantly less robust than the IBM hardware.
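Why a single-parity array survives exactly one disk failure can be shown with a small Python sketch of XOR parity on a four-wide stripe (three data blocks plus one parity block). It illustrates the general RAID-5 principle only; it is not a model of the Shark or of any particular RAID card, the function names are hypothetical, and for simplicity the parity block always sits in the last position, whereas real RAID-5 rotates it across the disks.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: Sequence[bytes]) -> List[bytes]:
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe: Sequence[Optional[bytes]]) -> List[bytes]:
    """Rebuild a stripe in which at most one block (marked None) is lost."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot rebuild more than one lost block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks equals the lost block, data or parity.
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired
```

Lose one block and the XOR of the survivors reproduces it; lose two and there is not enough information left, which is the limit described above for the four-wide server arrays.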
My current employer has a Disaster Recovery site and we have done five tests so far. For the first test, we took the application down to make backups. The server folks snickered that they could make their backups without an outage. I replied that the mainframe could do the same if they would simply buy FlashCopy. In all of the tests, the mainframe z/OS has been brought up successfully, while in none of the tests were the server folks able to bring up more than a handful of the servers, and they have never been able to bring up what they admit are the critical ones. For one test, the mainframe application data was dumped without an outage, although they were warned this risked data corruption. Having been lucky the previous time, they chose to ignore my advice and although z/OS restored okay, the application data was corrupted. Given that a DR capability is legally mandated, that got their attention, particularly since the auditors were very unhappy. That forced them to admit I was right, and we subsequently purchased and implemented FlashCopy. Backups are now done without disruption or data corruption. I have no worries at all about laying down a viable system at the DR site. Some clever REXX programming processed the output of the disk dumps and created Restore JCL to accompany the tapes, so I pretty much automated the Restore process. For me, a Disaster Recovery is 95% a matter of responding to tape mounts. We have still not been able to get the critical servers operational. If the servers were Linux, running on the mainframe under VM, there would be no problem.
The whole distributed-processing argument assumes that 100 servers can outperform a mainframe application because each server is its own machine dedicated to that application. In fact, servers seldom run more than a single application and usually run about 20-30% CPU (and would fail if pushed beyond 50%). Mainframes, on the other hand, run multiple applications and not uncommonly at 90%+ CPU. (There is also a mindset that says the data ‘belongs’ to the user or the application. It doesn’t – it belongs to the company and companies are not democracies, are they?) The real quality of the software depends on the quality of the programmers. In the mid-1960s, an IBM VP said to me that by 1970, the country would require 50,000 programmers. My response was that there are not 50,000 people in the country with the capability of becoming good programmers. We were both right. The country got a few hundred good programmers and thousands of hacks. The Intel explosion simply gave us tens of thousands of hacks. I suspect there are probably fewer than 5,000 good programmers around – and 4,000 of them are working mainframes.
I may be a Grumpy Old Dinosaur, but Homo Sapiens Sapiens has only been around for about 150 thousand years.
(And only in the last few thousand have we stopped having to outrun our food).
Dinosaurs, on the other hand, ruled the world for about 180 million years – more than 1000 times longer.
(They were the most successful life form bigger than your fist).