[00:25:25] <icinga-wm_>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:33] <wikibugs>	 (CR) Raymond Ndibe: [C: +2] kubernetes: set NO_HOME for bulidservice and unset workingDir [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro)
[00:29:19] <wikibugs>	 (Merged) jenkins-bot: kubernetes: set NO_HOME for bulidservice and unset workingDir [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro)
[00:30:08] <wikibugs>	 (CR) Cwhite: [C: +1] Make deploy-tag compulsory [alerts] - https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[00:32:24] <wikibugs>	 (CR) Cwhite: [C: +1] sre: mute puppet-ca pint checks for missing series [alerts] - https://gerrit.wikimedia.org/r/906533 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[00:35:54] <wikibugs>	 (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[00:39:26] <urandom>	 !log rebooting sessionstore1001 — T327954
[00:39:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906688
[00:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:31] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[00:39:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906688 (owner: TrainBranchBot)
[00:55:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906688 (owner: TrainBranchBot)
[01:02:39] <urandom>	 !log rebooting sessionstore1001 — T327954
[01:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:43] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[01:10:44] <urandom>	 !log rebooting sessionstore1001 — T327954
[01:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:48] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[01:17:35] <urandom>	 !log rebooting sessionstore1001 — T327954
[01:17:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:39] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:11] <icinga-wm_>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:05] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:51] <icinga-wm_>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:17] <icinga-wm_>	 PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:33] <icinga-wm_>	 RECOVERY - Host blog.wikimedia.org is UP: PING WARNING - Packet loss = 90%, RTA = 4.01 ms
[06:00:07] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230407T0600)
[06:49:03] <icinga-wm_>	 PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[06:52:19] <icinga-wm_>	 RECOVERY - Host blog.wikimedia.org is UP: PING WARNING - Packet loss = 90%, RTA = 4.54 ms
[06:55:37] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:31] <icinga-wm_>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:07] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230407T0700)
[07:23:11] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netbox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Volans) p:Triage→High a:Volans I've updated on IRC while I was working on this, this is my backlog :)   > So for Netbox, we do have hourly backup, we're jus...
[07:24:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (ayounsi) FYI, the above patch caused the following outstanding diff on some cloudsw switches:  `lang=diff Changes for 1 devices: ['cloudsw1-c8-eqiad.mgmt.eqiad.wmn...
[07:56:20] <wikibugs>	 SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Lionel_Scheepmans) Hello. I still have a problem with the display of a PDF on this page :  https://fr.wikibooks.org/wiki/Utilisateur:Lionel_Scheepmans/...
[07:57:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: mute puppet-ca pint checks for missing series [alerts] - https://gerrit.wikimedia.org/r/906533 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[07:59:15] <wikibugs>	 SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Lionel_Scheepmans) Is ther some news about this bug ?
[08:16:31] <icinga-wm_>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Yup, set is an action directive and afaik (I know of no docs clarifying this) does not inherit to child contexts. https://blog.martinfjord" [puppet] - https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: Dduvall)
[08:51:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "I should add this "set" directive comes from the stream module, not the rewrite module. They both define it, but the rewrite one isn't ena" [puppet] - https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: Dduvall)
[08:55:16] <wikibugs>	 (PS4) Filippo Giunchedi: Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: Cathal Mooney)
[08:55:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "I have resolved the conflicts and removed a now-unused define, LGTM!" [puppet] - https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: Cathal Mooney)
[09:02:53] <wikibugs>	 (PS1) Filippo Giunchedi: data-engineering: use generic eventgate HTTP error alert name [alerts] - https://gerrit.wikimedia.org/r/906710 (https://phabricator.wikimedia.org/T309009)
[09:11:31] <icinga-wm_>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:51] <wikibugs>	 (PS1) Filippo Giunchedi: data-engineering: disable missing metrics pint check for validation errors [alerts] - https://gerrit.wikimedia.org/r/906711 (https://phabricator.wikimedia.org/T309009)
[09:20:38] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:22:41] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sonicmgmt - ayounsi@cumin1001"
[09:23:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sonicmgmt - ayounsi@cumin1001"
[09:23:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:30:45] <wikibugs>	 (PS1) Filippo Giunchedi: data-engineering: refactor eventgate validation alerts [alerts] - https://gerrit.wikimedia.org/r/906712 (https://phabricator.wikimedia.org/T309009)
[09:31:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] data-engineering: use generic eventgate HTTP error alert name [alerts] - https://gerrit.wikimedia.org/r/906710 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[09:32:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] data-engineering: disable missing metrics pint check for validation errors [alerts] - https://gerrit.wikimedia.org/r/906711 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[09:32:37] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] data-engineering: refactor eventgate validation alerts [alerts] - https://gerrit.wikimedia.org/r/906712 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[10:03:59] <wikibugs>	 (PS1) Filippo Giunchedi: New upstream release [debs/karma] - https://gerrit.wikimedia.org/r/906716
[10:34:34] <aqu>	 !log About to deploy analytics/refinery in test cluster
[10:34:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:31] <icinga-wm_>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:40:24] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@eb4c2b2] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST [analytics/refinery@eb4c2b2]
[10:40:30] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@eb4c2b2] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST [analytics/refinery@eb4c2b2] (duration: 00m 06s)
[10:57:35] <icinga-wm_>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:04] <wikibugs>	 (PS2) David Caro: maintain_dbusers: move all the files under service [puppet] - https://gerrit.wikimedia.org/r/906637
[11:01:06] <wikibugs>	 (PS12) David Caro: maintain-dbusers: use click for cli definition [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[11:03:56] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@e70da10] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST 2nd try [analytics/refinery@e70da10]
[11:05:30] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@e70da10] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST 2nd try [analytics/refinery@e70da10] (duration: 01m 33s)
[12:09:01] <wikibugs>	 ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:09:13] <wikibugs>	 ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:10:28] <wikibugs>	 ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:10:46] <wikibugs>	 ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:20:54] <wikibugs>	 (Abandoned) Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[12:21:28] <wikibugs>	 ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH) Created ticket CS1020330 with the following:    > Support, >  > We would like remote hands to receive in our shipment of PDUs on ticket DEL0126279 & install them into our new racks being built out by Interxio...
[12:31:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) p:Triage→Low
[12:34:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney)
[12:40:37] <wikibugs>	 (PS1) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[12:40:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloud-vps: only backup toolforge things every other day [puppet] - https://gerrit.wikimedia.org/r/906579 (owner: Andrew Bogott)
[12:41:12] <wikibugs>	 (CR) CI reject: [V: -1] Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[12:43:44] <wikibugs>	 (PS2) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[12:51:00] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727
[12:51:39] <wikibugs>	 (PS2) Andrew Bogott: cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727
[12:54:38] <wikibugs>	 (PS3) Andrew Bogott: cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727
[12:55:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney)
[12:55:05] <wikibugs>	 (PS3) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[12:55:30] <wikibugs>	 SRE, SRE-Access-Requests, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (Jgreen) Sudo access to sre.dns.netbox makes sense to me.
[12:58:09] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727 (owner: Andrew Bogott)
[12:58:51] <wikibugs>	 (PS1) Volans: tox.ini: make it compatible with tox 4.x [software/homer] - https://gerrit.wikimedia.org/r/906728
[12:58:53] <wikibugs>	 (PS1) Volans: tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729
[12:58:58] <wikibugs>	 (CR) CI reject: [V: -1] tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[12:59:32] <wikibugs>	 (CR) Volans: "recheck" [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[13:00:36] <wikibugs>	 (CR) Andrew Bogott: "Is this just to week us off of the old link name or is there a specific reason why this is bad on bookworm?" [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah)
[13:01:44] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!  Thanks for making sure to update this to deal with the snag :)" [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[13:01:50] <wikibugs>	 (CR) CI reject: [V: -1] tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[13:02:03] <wikibugs>	 (CR) Majavah: cinderutils: stop provisioning old filename on bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah)
[13:02:11] <wikibugs>	 (CR) Andrew Bogott: "*ween" [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah)
[13:03:29] <wikibugs>	 (CR) Volans: [C: +2] tox.ini: make it compatible with tox 4.x [software/homer] - https://gerrit.wikimedia.org/r/906728 (owner: Volans)
[13:03:36] <wikibugs>	 (PS6) Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)
[13:05:19] <wikibugs>	 (Merged) jenkins-bot: tox.ini: make it compatible with tox 4.x [software/homer] - https://gerrit.wikimedia.org/r/906728 (owner: Volans)
[13:06:27] <wikibugs>	 (PS2) Volans: tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729
[13:06:44] <wikibugs>	 (CR) Cathal Mooney: [C: +2] deployment-prep: update prometheus host to prometheus05 [puppet] - https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) (owner: Samtar)
[13:09:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "https://puppet-compiler.wmflabs.org/output/903258/40558/" [puppet] - https://gerrit.wikimedia.org/r/903258 (owner: Majavah)
[13:09:56] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack: remove osmdb dns records [puppet] - https://gerrit.wikimedia.org/r/892903 (https://phabricator.wikimedia.org/T323159) (owner: Majavah)
[13:10:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:wmcs: remove osmdb classes [puppet] - https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) (owner: Majavah)
[13:11:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] osm: remove unuseud shapefile_import class [puppet] - https://gerrit.wikimedia.org/r/892905 (owner: Majavah)
[13:11:49] <wikibugs>	 (Abandoned) Andrew Bogott: wikireplica_dns.yaml: remove osm entry [puppet] - https://gerrit.wikimedia.org/r/905759 (owner: Andrew Bogott)
[13:12:10] <wikibugs>	 (Abandoned) Andrew Bogott: wikireplica_dns.yaml: make legacy tools-db names cnames for the wmcloud domain [puppet] - https://gerrit.wikimedia.org/r/905760 (https://phabricator.wikimedia.org/T333471) (owner: Andrew Bogott)
[13:17:16] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netbox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (ssingh) >>! In T334253#8764695, @Volans wrote: > I've updated on IRC while I was working on this, this is my backlog :)  Many thanks for taking care of this, @volans!...
[13:25:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Open→Resolved Indeed Arzhel thanks, my bad I had forgot those were present.  I'll close this one, the static's have been (rather laboriously) deal...
[13:40:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:45:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:13] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:50:37] <icinga-wm_>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:52:17] <icinga-wm_>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:55:15] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:56:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (RobH)
[13:56:54] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (RobH)
[14:07:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:08:22] <dcausse>	 looking ^
[14:12:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (9) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:13:28] <dcausse>	 huge load increase on all wdqs servers
[14:15:44] <inflatador>	 dcausse that's not good
[14:16:24] <dcausse>	 it's decreasing, what's surprising is that it impacts both eqiad & codfw, generally load spikes related to traffic affect only one DC
[14:17:53] <inflatador>	 Is it because of the DC failovers? I think most stuff is running out of CODFW only, although WDQS is still active in both DCs
[14:17:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (9) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:18:33] <dcausse>	 inflatador: we keep both DCs active I think
[14:20:48] <inflatador>	 dcausse we did, but I wonder if the front-end being stuck to CODFW means traffic that would normally only be routed to WDQS in the same DC as the front-end goes to all now? Just a weak guess
[14:21:26] <dcausse>	 I think it's geo based so I doubt it, but I could be wrong
[14:22:03] <inflatador>	 FWiW it looks like the throttling filter size only increased in codfw
[14:24:05] <dcausse>	 indeed I see a slight increase in eqiad too but not as significant
[14:53:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:56:26] <godog>	 mhh wikibugs isn't here anymore
[14:58:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:59:29] <wikibugs>	 SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (Eevans) >>! In T310980#8758772, @MoritzMuehlenhoff wrote: >>>! In T310980#8724070, @MoritzMuehlenhoff wrote: >> Looking at https://cassandra.apache.org/doc/latest/cassandra/getting_started/java11...
[15:00:03] <godog>	 ah there we go, welcome back wikibugs 
[15:00:31] <godog>	 :(
[15:00:33] <taavi>	 hm I just restarted it, but apparently it fixed itself just before
[15:00:42] <godog>	 ah got it, thanks taavi 
[15:00:59] <taavi>	 it's back on -cloud, it'll join back here once it has something new to say
[15:02:13] <godog>	 makes sense, IIRC jinxer-wm behaves the same
[15:24:58] <wikibugs>	 (CR) Herron: [C: +1] alertmanager: sink notifications for dev/test hosts [puppet] - https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: Filippo Giunchedi)
[15:26:45] <wikibugs>	 (CR) Dzahn: [C: +1] "omg yes, I remember soooo many times clicking "ACK" in Icinga web UI with the message "dev / test host, why does it have monitoring"" [puppet] - https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: Filippo Giunchedi)
[15:27:35] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:28:15] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:29:51] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 5.757 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:47] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.573 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:51] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netbox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Volans) Open→Resolved Perfect, resolving.
[15:43:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:45:55] <wikibugs>	 (CR) FNegri: [C: +2] "I think that we can now get rid of this record, osmdb is unlikely to be resurrected :)" [dns] - https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159) (owner: Majavah)
[15:48:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:51:45] <wikibugs>	 SRE, Traffic, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall)
[15:53:43] <wikibugs>	 SRE, Traffic, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) In progress→Resolved I went ahead and struck lists off of the.... list since it s...
[16:05:19] <wikibugs>	 SRE, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn)
[16:08:10] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn)
[16:09:25] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) @Aklapper SRE would be happy about advice from you...
[16:17:08] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) @Ladsgroup How do you feel about https://github.co...
[16:25:35] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) The current tags that the bot adds the SRE tag to...
[16:30:29] <wikibugs>	 SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Aklapper) @Lionel_Scheepmans In general in all tickets, all known news can be found in the ticket itself. Thus no need to ever ask. Thanks!
[16:45:31] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:46:05] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:49:15] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.396 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:50:19] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.378 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:57:07] <wikibugs>	 (PS1) Eevans: Revert "sessionstore: make native transport (intentionally) unreachable" [puppet] - https://gerrit.wikimedia.org/r/906743
[16:59:09] <wikibugs>	 (CR) Eevans: [C: +2] Revert "sessionstore: make native transport (intentionally) unreachable" [puppet] - https://gerrit.wikimedia.org/r/906743 (owner: Eevans)
[17:02:06] <urandom>	 !log restart Cassandra, sessionstore1001-a (re-enabling CQL) — T327954
[17:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:11] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[17:10:43] <wikibugs>	 (PS1) Xcollazo: structured-data: Temporarily remove ImageSuggestionsPushFailure alert. [alerts] - https://gerrit.wikimedia.org/r/906744 (https://phabricator.wikimedia.org/T328789)
[17:12:13] <wikibugs>	 (CR) Xcollazo: "We will hold deploying this one. See details at https://phabricator.wikimedia.org/T328789#8765465." [alerts] - https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) (owner: Xcollazo)
[17:16:03] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Aklapper) >>! In T334294#8765358, @Dzahn wrote: > @Aklapp...
[17:20:13] <wikibugs>	 SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn)
[17:21:25] <wikibugs>	 SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn) Almost 10 years later, making ticket public :) There are still a bunch of imported RT tickets here in Phabricator that are set to WMF-NDA but don't have to be.   And yes, I actually refe...
[17:22:39] <wikibugs>	 SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn)
[17:41:11] <wikibugs>	 SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn) removed process monitoring in https://gerrit.wikimedia.org/r/c/operations/puppet/+/904856  we have prometheus blackbox http check now: https://gerrit.wikimedia.org/r/c/operations/puppet/...
[17:51:03] <wikibugs>	 SRE, ops-requests: Enable access to Gerrit on port 22. - https://phabricator.wikimedia.org/T84713 (Dzahn)
[18:01:59] <wikibugs>	 ops-codfw: ms-be2013-2015 setup mgmt and bios - https://phabricator.wikimedia.org/T84611 (Dzahn)
[18:03:58] <wikibugs>	 SRE, ops-requests: Set up regular backups for Graphite data in tungsten:/var/lib/carbon/whisper - https://phabricator.wikimedia.org/T84511 (Dzahn)
[18:05:22] <wikibugs>	 SRE, ops-requests: Upload Jenkins 1.565.2 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T84462 (Dzahn)
[18:18:46] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@5c4ebda]: (no justification provided)
[18:19:22] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@5c4ebda]: (no justification provided) (duration: 00m 35s)
[18:40:22] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Znuny, serviceops-collab, Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Dzahn) also see T334250#8765659 where this is about converting existing NRPE checks we have on VRTS hosts (checks if c...
[18:43:45] <wikibugs>	 SRE: monitoring of phabricator - https://phabricator.wikimedia.org/T957 (Dzahn) In T334250 I am wondering if we should remove the process monitoring part of this and only keep https monitoring.
[18:48:04] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (Dzahn) If we had this we could maybe remove the Icinga process checks we have for zuul and zuul-merger (...
[19:48:27] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:48:28] <wikibugs>	 SRE, Commons, MediaWiki-File-management, Traffic: Why do Wikimedia Commons SVGs sometimes not update? - https://phabricator.wikimedia.org/T334303 (bd808)
[19:58:34] <wikibugs>	 (PS4) Aklapper: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748)
[20:03:09] <wikibugs>	 SRE, Commons, MediaWiki-File-management, Traffic: PNG thumbnail of Wikimedia Commons SVG file sometimes not updated - https://phabricator.wikimedia.org/T334303 (Aklapper)
[20:31:36] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Ladsgroup) The switch from herald to maint bot was done b...
[21:06:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:11:10] <wikibugs>	 SRE, Abstract Wikipedia team, Anti-Harassment, Cloud-Services, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (thcipriani)
[21:11:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:14:43] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:13] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:25:07] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) Thank you for all the details @Ladsgroup , it's mu...
[21:36:02] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Ladsgroup) >>! In T334294#8765857, @Dzahn wrote: > Thank...
[21:40:33] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) >>! In T334294#8765862, @Ladsgroup wrote: > defini...
[21:46:27] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) {F36942751} see lower left corner in the "new proj...
[21:47:42] <wikibugs>	 SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) {F36942753}
[22:48:47] <wikibugs>	 SRE, Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall) I've found that 2.84.84.84 (released just last month) is a non-functional version, causing the web interface to break. At first I thought it was enforcing host names (per https://www.dell.com/support/kbdoc...
[22:50:55] <wikibugs>	 SRE, Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall) Oh, and some of the older idrac firmware enforced :443 redirects, so an SSH tunnel on a different port would cause connectivity issues.  I worked around it by using socat:   ` socat TCP-LISTEN:443,fork TCP...
[23:49:41] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale