How I Choked
Last year we were performing a major microwave backhaul upgrade for a customer. We were working to cut over the last of several hops of high-capacity (~1 Gbps) radio links. This hop was the most critical: it was the main path to half of their network, with no backup. The site we were at was the least accessible, the weather was bad, and we didn’t expect to be able to get back to it until the following spring. I had three guys freezing on the tower, and everyone was anxious to make the cut, button things up and get off the mountain. We were doing the work on a Tuesday, one of the most sensitive days of the week for this customer’s business data. To make things even more complicated, cellular coverage on the mountain was really poor, so communicating with the folks at the other end of the link was frustrating at best.
An outage for our customer of any noticeable length was out of the question. The pressure was palpable and coming from many directions. The customer’s employee in charge (EIC) was at the opposite end of the link with one of our guys awaiting our call and next steps.
We were within minutes of being ready to cut traffic over to the new link when the EIC indicated that we would have to delay the cutover, as he was being pulled in another direction. Given the circumstances we had on the mountaintop, I pushed hard, beyond my customer’s comfort level, and pressed the matter. He reluctantly gave me the green light to proceed and we made the cut. The transition to the new radios went off without incident. Data was moving well, so the EIC went about his other business and left us to start wrapping things up at the sites.
In the process of finishing up, I thought I’d make one more configuration change in the radios to prepare us for the next step of the system upgrade a few days later: taking this 1+0 system to 2+0 (utilizing XPIC). The parameter I was about to change would simplify the later turn-up of the second set of radios on the same dish. I felt like I knew the radios well enough to predict that the change would not have an impact on existing traffic. I was dead wrong.
Fifteen minutes after I made the last configuration change I received a call from a hot EIC. Half of their system had been down that long. It was not a good day for me.
I learned what I call some “life lessons” on the mountain that day. This is what we do differently now:
1) Bench test the exact configuration we plan for in the field.
2) Over-communicate our cutover plans with the respective EIC and discuss what we plan to do, expected behavior, potential risks, a “plan B” and, if possible, a “plan C”. We plan it down to the minute, literally.
3) Faithfully document and debrief the “lessons learned” of each project. This helps prevent making the same mistakes twice.
4) Maintain a web-based scheduling document that gives our customers near-real-time access to the overall project schedule, so that they can prepare for any service-impacting activities, schedule resources and generally stay aware of our progress.
It pains me to even think about what happened that day, but on the flip side it caused us to institute structure that will help prevent an issue like this from happening again. In this case the customer was gracious, and after a long conversation the next day our partnership grew stronger.
If you have it in you, let me know below if you’ve ever been in this spot.