[03:22:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:30] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [03:29:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (94.27%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:10:26] 10SRE-tools, 06Infrastructure-Foundations, 10observability: Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#10907774 (10Volans) Just to clarify expectations here, while #sre-tools is happy to be included in the discussion/design, we think that this requ... [07:16:18] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#10907781 (10Volans) Is this still relevant or superseded by more recent development/plans in this area? [07:18:10] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: Decommission script race condition - https://phabricator.wikimedia.org/T206448#10907782 (10Volans) 05Open→03Declined The script hasn't existed for a long time; it was replaced by the related cookbook. [07:19:31] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#10907787 (10jcrespo) In terms of features requests, I think it is still relevant. 
If they want to merge its requirements into another ticket, that's ok (I don't think all of... [07:21:30] 10SRE-tools, 06Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315#10907807 (10Volans) @wiki_willy is this something still needed or the current workflow doesn't need it anymore? [07:21:35] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#10907808 (10jcrespo) [07:22:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:28:30] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [07:29:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (94.24%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:57:48] 10netops, 10SRE-tools, 06Infrastructure-Foundations: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712 (10Volans) 03NEW p:05Triage→03Medium [08:02:13] 10netops, 10SRE-tools, 06Infrastructure-Foundations: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712#10907935 (10Volans) I've run some custom code with spicerack-shell and get the audit data for the whole fleet and comparing the MAC address retrieved from... 
[08:02:15] XioNoX: ^^^ for you :) [08:02:47] volans: I have wikibug in my ignore list if that's what you're pointing to :) [08:03:54] yes, and I knew :D [08:03:57] pointing anyway [08:04:07] T396712#10907935 [08:04:08] T396712: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712 [08:04:41] looking, thx! [08:09:10] volans: an-worker1065 - Purchase date 2018 [08:09:49] yes, there are a bunch of very old hosts that I would not bother to care or for which it's totally ok to store the MAC on netbox [08:10:00] volans: that's great thanks, and looks quite promising! [08:10:14] the most annoying part are the supermicro [08:10:19] they are all new and have some issues [08:10:32] what kind? [08:11:04] beside checking that ganeti works despite the mismatch with puppetdb I think all not-too-old Dell's work [08:11:07] I think we can open tasks for DCops to start fixing stuff, like "more than 1 NIC with PXE enabled", "No NIC with PXE enabled" [08:11:27] that's usually just to run the provision cookbook again, we can do that [08:11:50] might need to coordinate with service owner in case a reboot is triggered (maybe because other things are misconfigured) [08:11:56] to me PXE level stuff should be in DCops realm [08:12:05] NIC/PXE [08:12:11] "iDRAC not working" too [08:12:36] not up to me to decide :) [08:13:59] volans: can I run your script as a one-off on some servers? for example to troubleshoot "MAC mismatch between PuppetDB and Redfish for other reasons" [08:14:20] when did you run this script, less than 10 days ago? it shows ganeti7001/7002, but these are using routed Ganeti currently [08:15:09] moritzm: yesterday but the exported data from puppetdb was indeed older, I got a newer one, I can double check they work with the updated data [08:15:12] give me a sec [08:16:10] yep they match [08:17:00] ok, thanks. 
I was just worried he had missed something during the magru migration :-) [08:17:06] s/he/we [08:17:20] sorry for the worry :) [08:17:54] XioNoX: I have the data live in an interactive shell, I can give you the data if you want to check [08:19:00] moritzm: just 7001/2 right? [08:19:40] volans: if we take the first one an-conf1004, what's the reported mac from redfish? [08:20:07] ignore the supermicro for now [08:20:13] take this one [08:20:13] Mac(hostname='db2241', bmc=, puppetdb='04:32:01:db:fa:80', redfish='04:32:01:DB:A5:D0', error='', match=False) [08:20:52] volans: btw pc2011 is because we moved it to 10G yesterday, so it's probably just stale puppetdb data [08:21:20] volans: yes, 7001/7002 should have vanished from the list since they are now using routed Ganeti, 7003 will be reimaged to routed later in the day and 7004 sticks around until the Bird work is completed by NIC.cz [08:21:36] great, thanks. Updated [08:21:50] db2241 is weird, the redfish-reported MAC address is not any of the current interfaces.. [08:22:22] XioNoX: when moving hosts to 10GB is the provision cookbook run again to update the PXE setting? [08:22:41] otherwise those hosts will fail next reimage [08:22:50] volans: updating PXE is one of the steps, dunno how DCops does it [08:22:58] but yeah it's accounted for [08:23:02] ack [08:23:24] if it's done manually it's error prone, I would probably use the cookbook :) [08:23:48] of course, I forgot about it, will update the steps [08:24:34] volans: done https://phabricator.wikimedia.org/T378715#10524038 [08:25:09] for example for cloudcephosd1048 puppetdb has enp10s0f0np0, redfish eno8303 [08:25:23] possible pxe misconfiguration, to be checked [08:25:30] yeah I'd bet on that [08:25:41] and puppetdb is the correct active one ofc [08:25:42] puppetdb is 10G, redfish the old 1G [08:27:00] but for db2241 I have no idea on what the redfish reported MAC is... 
[08:27:57] checking [08:29:45] MAC mismatch between PuppetDB and Redfish for other reasons -> I bet many of them are where the PXE NIC is the wrong one [08:31:17] yes, most likely and many are supermicro so can be either that and/or the bug [08:31:26] so db2241, NIC.Integrated.1-1-1 reports that MAC on redfish [08:32:32] but it's none of the ones reported by the OS... [08:32:36] so that's clearly a bug [08:33:06] it has 2 Embedded ports and 2 integrated ones [08:34:01] 2: eno8303: ac:b4:80:23:56:71 [08:34:01] 3: eno12399np0: 04:32:01:db:fa:80 [08:34:01] 4: eno8403: ac:b4:80:23:56:72 [08:34:01] 5: eno12409np1: 04:32:01:db:fa:81 [08:34:07] from the OS [08:34:30] also the embedded ones don't match [08:34:31] sigh [08:35:13] question is, only 1 stray host, or many of them ? :) [08:35:32] '04:32:01:DB:A5:D0' is reported as VirtMacAddr in the dump and as MACAddress and PermanentMACAddress in the nic specific get [08:35:50] cosmic ray? [08:35:55] :-P [08:36:02] eh [08:36:04] https://phabricator.wikimedia.org/T396717 [08:36:27] <3 [08:36:35] I'm tempted to restart idrac there [08:36:42] checking which host it is [08:36:52] volans: probably a DB master :) [08:37:05] x3 codfw master ofc [08:37:09] :) [08:37:15] but secondary dc :D [08:38:45] for "iDRAC not working" only cirrussearch2079 is not too old, all the others are 2018, 2020 or test [08:41:11] https://phabricator.wikimedia.org/T396718 [08:46:46] so what's left are the 20 "MAC mismatch between PuppetDB and Redfish for other reasons", I can probably sort them out manually (eg. 
a pile of PXE on the wrong interface) if you give me the list of redfish exposed MACs [08:47:04] and the more annoying "find_supermicro_mac() can't find MAC (supermicro bug to investigate, escalated to supermicro support)" [08:47:27] but a bunch of the mismatch category are also supermicro [08:47:34] so might be related to the second [08:47:35] not sure yet [08:52:17] good point, I can sort that too [08:52:31] don't do manual work, I have all the data :D [08:52:42] tell me how you want them split and I can update the paste with the relevant data [08:54:12] volans: a supermicro pile and a dell pile [08:54:13] and/or a pile where the redfish mac still matches another NIC present on the host, vs. it's a weird one like db2241 [08:54:31] basically is it just PXE set on the wrong NIC? [08:55:22] that I don't have the data to split, but I can do supermicro/dell [08:56:12] the nokiatest are special in some ways? [08:56:56] they're old hosts I think [08:57:18] yeah, they can be ignored [08:58:07] for the 2nd split I can do it manually :) I just need the redfish reported data [09:03:03] XioNoX: paste updated [09:04:15] XioNoX: I noticed that db2241 and db2242 are inverted [09:04:26] so probably they have the bmc IP inverted [09:08:45] fixing it [09:15:37] ohhhh [09:15:45] good catch [09:16:01] could have been an outage [09:23:51] for sure [09:24:01] reboot one via mgmt and will reboot the other :facepalm: [11:22:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (94.34%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:29:50] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [13:23:06] 10netops, 06cloud-services-team, 06Infrastructure-Foundations, 10Toolforge: [infra] Reports of slow connectivity from APAC - https://phabricator.wikimedia.org/T395135#10909085 (10cmooney) The latency is also reduced when I check for it here (there are no manual overrides of the traffic path in place either... [13:28:19] 10netops, 06Infrastructure-Foundations, 06SRE: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10909127 (10cmooney) 05Open→03Resolved [13:43:07] 10netops, 06Infrastructure-Foundations, 06SRE: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10909207 (10cmooney) p:05Medium→03Low > We could (ab)use the FHRP group feature to group members of a MC-LAG and add common variables... [13:45:13] 10netops, 06Infrastructure-Foundations, 06SRE: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635#10909214 (10cmooney) 05Open→03Resolved [13:48:53] 10netops, 06Infrastructure-Foundations, 06SRE: Sub-optimal cloud routing for WMCS in eqiad when link fails - https://phabricator.wikimedia.org/T367203#10909233 (10cmooney) 05Open→03Resolved This problem is now resolved as we are using IBGP with next-hops announced as the loopbacks of each switch, and... 
[13:51:56] 10netops, 06Infrastructure-Foundations, 06SRE: Create cookbook to set up ganeti host network - https://phabricator.wikimedia.org/T378346#10909245 (10cmooney) 05Open→03Declined [15:04:33] routed ganeti and full katran, magru is living in the future :) [15:05:06] magru is the new ulsfo (guinea pig) [15:16:01] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10909646 (10RobH) ` Management Password: db1253.eqiad.wmnet (Gen 15): starting db1253.eqiad.wmnet (SSD): update db1253.eqiad.wmnet (SSD): current version:... [15:25:10] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:13] * moritzm quickly reimages all of magru with trixie rc1 [15:29:50] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [15:30:57] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10909802 (10RobH) Not sure what I'm doing wrong: > robh@cumin2002:~$ sudo cookbook sre.hardware.upgrade-firmware -c ssd "db1253.*" > Acquired lock for... 
[15:34:02] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (94.3%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [15:35:07] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10909840 (10RobH) Bah, fixed, was SSD directory not STORAGE, thanks Riccardo! [15:42:24] 10netops, 10SRE-tools, 06Infrastructure-Foundations: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712#10909876 (10ayounsi) [17:25:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:29:50] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. 
- https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [19:34:02] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (94.35%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [19:46:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [20:05:12] 10SRE-tools, 06Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315#10910987 (10wiki_willy) Hey @Volans - I think we've come up with a couple solutions since this task was created. One is providing a monthly Netbox dump to the Account... [21:30:10] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:50] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts [23:34:02] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (94.34%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure