[08:51:17] netops, Operations, SRE-swift-storage: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) It is very strange since from /var/log/swift I see the host logging requests, and pings to other ms-be in codfw work, but TCP conns to the puppet master for example fail: ` eluk...
[08:58:31] netops, Operations, SRE-swift-storage: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Something might be messed up in the network config, I see a strange routing for v6 (no G flags for example): ` elukey@ms-be2050:~$ sudo route -n -6 Kernel IPv6 routing table Des...
[09:26:08] netops, Operations, SRE-swift-storage: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Something changed: * puppet now runs on ipv4 * swift container availability [[ https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=8&orgId=1&var-DC=codfw&var-prometheus=co...
[15:27:05] Acme-chief, LDAP, cloud-services-team (Kanban): acme-chief just generated invalid ldap certs - https://phabricator.wikimedia.org/T271063 (Andrew)
[15:27:52] vgutierrez: That ticket ^ suggests that acme-chief may be on the verge of breaking everything everywhere
[15:33:36] :?
[15:33:46] ah! there you are :)
[15:34:10] vgutierrez: let me know when you've caught up
[15:34:15] I'm not sure that ticket is the clearest
[15:34:49] and let's move this to _security in case more people are around
[15:41:37] Acme-chief, LDAP, cloud-services-team (Kanban): acme-chief just generated invalid ldap certs - https://phabricator.wikimedia.org/T271063 (Andrew) Note that during this outage, I was also unable to log in to the icinga web UI, and users reported gerrit issues. So there were several systems that faile...
[15:41:41] Acme-chief, LDAP, cloud-services-team (Kanban): acme-chief just generated invalid ldap certs - https://phabricator.wikimedia.org/T271063 (dcaro) Some more semi-random info: * A run of openssl from my laptop (through an ssh tunnel -L 127.0.0.1:1234:ldap-labs.eqiad.wikimedia.org:389): https://phabricat...
[16:18:22] Acme-chief, LDAP, cloud-services-team (Kanban): acme-chief ldap certs required chained (with intermediate CA) versions suddenly - https://phabricator.wikimedia.org/T271063 (Bstorm)
[16:19:10] Acme-chief, LDAP, cloud-services-team (Kanban): acme-chief ldap certs required chained (with intermediate CA) versions suddenly - https://phabricator.wikimedia.org/T271063 (Vgutierrez) acme-chief generated a valid certificate, the main difference between the current and the previous one is the interm...
[18:33:32] netops, Operations, SRE-swift-storage: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Some notes after tests: 1) I don't see Router Advertisements using tcpdumps on ms-be2050, but I see them on all other nodes. I don't recall if the default gw settings are set vi...
[18:49:10] netops, Operations, SRE-swift-storage: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) I am out of ideas, the next thing that I'd check is if the fiber between the switch and the host needs to be replaced..