[16:35:01] Traffic, Operations: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (faidon) Update: a few hours later, power seemingly got back, so at 2018-10-13 03:07 UTC @bblack repooled eqsin (logged at SAL). Unfortunately, power never got back to cr1-eqsin's PEM 0, asw1-eqsin's PEM 0 and th...
[16:38:00] Traffic, Operations: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (Volans) Once recovered we found that those hosts had a 5 minutes uptime: ``` dns5001.wikimedia.org lvs5001.eqsin.wmnet bast5001.wikimedia.org cp5011.eqsin.wmnet cp5009.eqsin.wmnet cp5007.eqsin.wmnet ``` Looking...
[16:47:25] Traffic, Operations: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (faidon) Respond from Equinix: > With regards to this Trouble ticket, we went onsite and observed the following, > R0604 A Feed is still on live and all equipment are still powered up > R0603- A Feed in-rack break...
[17:20:38] Traffic, Operations: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (Volans) On `bast5001` ferm failed to start at reboot due to failed DNS resolution query. The next puppet runs didn't restart it. I had to manually start it. The host have been 55 minutes without ferm rules applie...
[17:40:04] Traffic, Operations, monitoring: Icinga: check_confd_vcl_reload unknown when file is missing - https://phabricator.wikimedia.org/T206950 (Volans) p:Triage>Normal
[17:44:06] Traffic, Operations: Puppet doesn't restart ferm on failure - https://phabricator.wikimedia.org/T206951 (Volans) p:Triage>Normal
[17:50:21] Traffic, Operations: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (Volans) Current status recap: - Maintenance on one power line is still ongoing, all servers are reported up and running, without icinga alarms but the loss of power redundancy. - JNX_ALARMS WARNING - 0 red alarms...
[19:24:12] Traffic, Operations: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (Volans) It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.
[19:57:57] Traffic, Operations: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (faidon)
[19:59:35] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (faidon) @RobH ping? This has been pending since July, with the last update being Aug 27(!?)
[20:03:20] Traffic, Operations: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (faidon) >>! In T206861#4664738, @Volans wrote: > It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online. That's great! Note that the maintenance...