[08:09:21] netops, Operations: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (elukey)
[08:29:53] netops, Operations: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (elukey)
[09:47:17] netops, Operations: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (elukey) Went down again, but this time no recovery..
[10:15:07] netops, Operations: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (elukey)
[10:15:25] netops, Operations: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (elukey) p:Triage→High
[10:18:29] netops, Operations: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (elukey) Current impact: * purged on some cp2/cp4 nodes got stuck while connecting to kafka-main2003, a manual restart was needed. * the kafka-main cluster is currently in reduced capacity (2 nodes instead...
[10:32:44] Traffic, Operations: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (elukey)
[10:33:47] Traffic, Operations: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (elukey)
[10:41:06] netops, Operations: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (ayounsi) Down around Nov 15 09:28:34 UTC. Console is unresponsive. Opening JTAC case for RMA.
[10:50:17] netops, Operations, ops-codfw: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (ayounsi) We also have spares QFX5100, so on monday we can swap the dead one.
[11:04:56] netops, Operations, ops-codfw: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (ayounsi) Netbox device and list of connected servers: https://netbox.wikimedia.org/dcim/devices/1892/
[11:16:03] netops, Operations, ops-codfw: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (Vgutierrez) switching over to lvs2010 as it will allow us to recover cp2035, only losing cp2037 on text and cp2038 on upload VS losing cp2035 and cp2037 on text with lvs2007
[23:57:19] HTTPS, Traffic, Operations, Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (Seb35) FYI I opened [[https://github.com/certbot/certbot/issues/8456|a feature request on certbot]] to propose a delay before deployment as stated here, and will soo...