[03:03:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:28] ^ this was some leftover of manual tests done for dnsmasq support for the DHCP bridge support needed in routed Ganeti, I've removed it for now
[07:13:25] RESOLVED: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:06] netops, Infrastructure-Foundations, SRE: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558 (cmooney) NEW p:Triage→Medium
[09:33:40] netops, Infrastructure-Foundations, SRE: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11213293 (cmooney)
[09:41:35] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560 (cmooney) NEW
[09:41:50] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11213333 (cmooney)
[09:44:37] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562 (cmooney) NEW p:Triage→Medium
[09:44:50] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11213369 (cmooney)
[09:52:23] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11213413 (cmooney)
[09:55:02] moritzm: slyngs: see the backlog of this channel from yesterday evening - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1190997 somehow broke my access to log in to netbox, do you have thoughts on what's the best way to fix that?
[09:55:51] Just a sec
[09:57:08] taavi: Oooh, I see
[09:57:48] Let's just put the NDA back for now.
[09:58:18] We do have a plan for that, I just failed to check your LDAP groups
[10:02:57] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191317
[10:04:33] thanks. if the reality is that we no longer grant read access to everyone in nda ( :( ), then I guess that list could have ops instead of nda to cover this edge case
[10:35:42] Yeah, ops make more sense: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191317
[11:19:16] taavi: You should be allowed back into netbox
[11:22:00] slyngs: indeed, thank you!
[11:37:38] netbox, Infrastructure-Foundations, Regression: after logging into Netbox, NDAs see an empty dashboard - https://phabricator.wikimedia.org/T404494#11213794 (SLyngshede-WMF) @Novem_Linguae I've updated the description to be a bit more descriptive. Please feel free to request the permission now.
[11:50:56] FIRING: MaxConntrack: Max conntrack at 84.12% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[12:00:56] RESOLVED: MaxConntrack: Max conntrack at 83.08% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[12:10:15] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11214003 (cmooney)
[12:34:27] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11214082 (cmooney) @Jclark-ctr @VRiley-WMF I may have missed to check we have the cables needed for these already. We're re-using existing...
[12:34:55] CAS-SSO, netbox, Infrastructure-Foundations, Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#11214085 (Novem_Linguae) We might want to re-scope this ticket to be more specific than "Unable to log in to Netbox". Hard to tell if it can be resolved or not. Ma...
[12:39:26] netbox, Infrastructure-Foundations, Patch-For-Review, Regression: after logging into Netbox, NDAs see an empty dashboard - https://phabricator.wikimedia.org/T404494#11214094 (Novem_Linguae) With netbox-readonly-access, I was able to log in to https://netbox.wikimedia.org/ . It still generated a...
[13:10:22] ok so, puppetserver1001 :D
[13:10:39] it seems that `puppetserver ca list` doesn't work if it doesn't find /var/lib/puppet/ssl/certs/puppetserver1001.eqiad.wmnet.pem
[13:10:49] that is a nice circular dependency :D
[13:18:18] if anybody has ideas I am all ears :) cc taavi (offered help on #sre, not dragging people arbitrarily in this mess :D)
[13:20:00] pretty sure it won't work, but I'd try `puppetserver ca clean --certname puppetserver1001.eqiad.wmnet`
[13:20:25] uh so what exactly is broken atm? the puppet host cert for 1001 was lost?
[13:21:07] yeah, puppet ssl clean was run aiming to another target, but it was discarded and it cleaned up puppetserver1001
[13:21:11] I am reading https://www.puppet.com/docs/puppet/7/ssl_regenerate_certificates.html#regenerate_agent_certs_and_add_dns_alt_names
[13:22:13] it seems that with puppet 7 they made things a little bit more spicy
[13:23:56] but that article mentions the CA as well
[13:24:03] not really what we need/want
[13:25:08] I think it is probably good to move the conversation to #sre, so others can chime in
[13:25:16] * elukey moves to #sre
[13:40:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:29] !log restart upload_puppet_facts on puppetserver1001 - T405580
[13:47:29] elukey: Not expecting to hear !log here
[13:47:30] T405580: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580
[13:50:25] RESOLVED: SystemdUnitFailed: upload_puppet_facts.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:33:50] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602 (cmooney) NEW p:Triage→Medium
[14:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.98%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[15:00:25] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11214910 (cmooney)
[15:11:03] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609 (cmooney) NEW p:Triage→Medium
[15:11:23] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11214968 (cmooney)
[15:11:24] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11214967 (cmooney)
[16:02:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618 (Papaul) NEW
[16:31:20] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11215392 (cmooney)
[17:18:32] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628 (cmooney) NEW p:Triage→Medium
[17:18:53] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11215779 (cmooney)
[17:18:56] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215780 (cmooney)
[17:19:41] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11215792 (cmooney)
[17:25:47] netops, Infrastructure-Foundations, SRE, Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630 (cmooney) NEW p:Triage→Medium
[17:26:02] netops, Infrastructure-Foundations, SRE, Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215830 (cmooney)
[17:26:04] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215831 (cmooney)
[17:27:00] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215832 (cmooney)
[17:27:01] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215833 (cmooney)
[17:33:08] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632 (cmooney) NEW p:Triage→Medium
[17:33:22] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11215881 (cmooney)
[17:33:24] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215882 (cmooney)
[17:37:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11215886 (cmooney)
[17:39:08] netops, Infrastructure-Foundations, SRE, Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215905 (cmooney)
[17:40:15] netops, Infrastructure-Foundations, SRE, Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215913 (cmooney)
[17:41:23] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11215927 (cmooney)
[17:42:34] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215944 (cmooney)
[17:42:37] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215945 (cmooney)
[17:42:55] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215946 (cmooney)
[17:42:58] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215947 (cmooney)
[17:43:21] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215950 (cmooney)
[17:49:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11215963 (cmooney)
[17:50:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11215967 (cmooney)
[17:52:05] netops, Infrastructure-Foundations, SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#11215973 (cmooney) Open→Declined Gonna close this one for now. Doing it in our YAML data for the occasional virtual-chassis...
[17:56:34] netops, Infrastructure-Foundations, SRE: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637 (cmooney) NEW p:Triage→Medium
[17:57:12] netops, Infrastructure-Foundations, SRE: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637#11216041 (cmooney)
[17:57:16] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216042 (cmooney)
[18:02:29] netops, Infrastructure-Foundations, SRE: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640 (cmooney) NEW p:Triage→Medium
[18:04:16] netops, Infrastructure-Foundations, SRE: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11216198 (cmooney)
[18:04:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216197 (cmooney)
[18:04:51] netops, Infrastructure-Foundations, SRE: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11216214 (cmooney)
[18:04:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216215 (cmooney)
[18:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (97.36%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[20:21:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11217124 (Papaul)
[21:24:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[22:04:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[22:37:55] FIRING: MaxConntrack: Max conntrack at 80.6% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:42:55] RESOLVED: MaxConntrack: Max conntrack at 83.22% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:59:25] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (98%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure