[00:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[02:52:49] FIRING: DiskSpace: Disk space install1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:49:40] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on install1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[06:52:49] FIRING: DiskSpace: Disk space install1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:28:01] ^ install1005 should recover soonish
[07:32:35] RESOLVED: [2x] DiskSpace: Disk space install1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:34:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on install1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:21] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[07:49:05] RESOLVED: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti01.svc.eqiad.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[13:51:14] CAS-SSO, Infrastructure-Foundations: Enable self-service IDP two-factor authentication management - https://phabricator.wikimedia.org/T359552#11353722 (kostajh) I would love to see this move forward, especially in the era of [[ https://www.troyhunt.com/inside-the-synthient-threat-data/ | increasingly lar...
[16:25:40] mutante: I think I have a fix for the routed ganeti issues with v6, should know in half an hour or so
[16:26:24] also TIL SPICE and it looks cool https://en.wikipedia.org/wiki/Simple_Protocol_for_Independent_Computing_Environments
[16:27:19] Oh yeah! I remember using that when I got my Red Hat Virtualization cert back in the day
[16:28:08] apparently it's supported by qemu and thus ganeti as well, with some changes
[16:28:37] cdanis: cool, thank you.
[16:29:20] semi-relatedly though it does not seem clear now if this should really be in all POPs. but we need that fix regardless, so great
[16:29:24] Nice! And according to your link, oVirt is still a thing™
[16:30:03] mutante: I kinda think it should be, it gives us a lot more resilience under adverse conditions
[16:30:28] but I can see the argument for sure
[16:31:09] talked a bit to traffic and we got to a "we have to talk about this"
[16:31:13] ack
[16:31:27] should we all make some synchronous time next week?
[16:31:33] meanwhile I did assign VIPs for eqiad/codfw
[16:31:40] that does sound like a good idea, yes
[16:31:49] ok I'll set something up
[16:31:58] :) !
[16:32:26] also have a patch for service catalog.. but yea.. got some questions. thank you
[16:40:51] ` up ip addr add 2a02:ec80:300:103:10:80:2:9/128 dev ens13`
[16:40:53] :)
[16:43:12] yay:)
[16:43:31] should I do reimages in magru or is there more to it
[16:44:16] well right now it works because puppet is disabled on apt1002 and i livehacked /srv/autoinstall/scripts
[16:44:29] I'll send a patch once tcp-proxy3001 finishes
[16:49:13] gotcha!
[17:05:37] mutante: I'm reimaging the other three as well now, will save us waiting for the puppet patch
[17:05:56] :) cool
[18:04:22] all good
[18:07:54] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11354607 (RobH) I accidentally pasted the Day 1 update on a subtask: >>! In T405945#11351182, @RobH wrote: > Day 1 of migrations update: > > * 58 hosts mo...
[18:38:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11354665 (RobH) Day 2 update: * 73 servers moved today, 169 servers remain. * We (again) focused on moving hosts that did not require any specific scheduli...
[18:46:07] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11354723 (Jclark-ctr) @cmooney Few things we ran into an-worker1136 Failed to ping after migration. changed cable port old and new showed link moved ba...
[21:38:11] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11355132 (RobH)