[00:25:25] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:33] (CR) Raymond Ndibe: [C: +2] kubernetes: set NO_HOME for bulidservice and unset workingDir [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro)
[00:29:19] (Merged) jenkins-bot: kubernetes: set NO_HOME for bulidservice and unset workingDir [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro)
[00:30:08] (CR) Cwhite: [C: +1] Make deploy-tag compulsory [alerts] - https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[00:32:24] (CR) Cwhite: [C: +1] sre: mute puppet-ca pint checks for missing series [alerts] - https://gerrit.wikimedia.org/r/906533 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[00:35:54] (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe)
[00:39:26] !log rebooting sessionstore1001 — T327954
[00:39:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906688
[00:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:31] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[00:39:34] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906688 (owner: TrainBranchBot)
[00:55:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/906688 (owner: TrainBranchBot)
[01:02:39] !log rebooting sessionstore1001 — T327954
[01:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:43] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[01:10:44] !log rebooting sessionstore1001 — T327954
[01:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:48] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[01:17:35] !log rebooting sessionstore1001 — T327954
[01:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:39] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:11] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:05] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:51] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:17] PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:33] RECOVERY - Host blog.wikimedia.org is UP: PING WARNING - Packet loss = 90%, RTA = 4.01 ms
[06:00:07] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230407T0600)
[06:49:03] PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[06:52:19] RECOVERY - Host blog.wikimedia.org is UP: PING WARNING - Packet loss = 90%, RTA = 4.54 ms
[06:55:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:31] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:07] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230407T0700)
[07:23:11] SRE, Infrastructure-Foundations, Traffic, netbox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Volans) p:Triage→High a:Volans I've updated on IRC while I was working on this, this is my backlog :) > So for Netbox, we do have hourly backup, we're jus...
[07:24:35] SRE, Infrastructure-Foundations, netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (ayounsi) FYI, the above patch caused the following outstanding diff on some cloudsw switches: `lang=diff Changes for 1 devices: ['cloudsw1-c8-eqiad.mgmt.eqiad.wmn...
[07:56:20] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Lionel_Scheepmans) Hello. I still have a problem with the display of a PDF on this page : https://fr.wikibooks.org/wiki/Utilisateur:Lionel_Scheepmans/...
[07:57:57] (CR) Filippo Giunchedi: [C: +2] sre: mute puppet-ca pint checks for missing series [alerts] - https://gerrit.wikimedia.org/r/906533 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[07:59:15] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Lionel_Scheepmans) Is ther some news about this bug ?
[08:16:31] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:20] (CR) Alexandros Kosiaris: [C: +2] "Yup, set is an action directive and afaik (I know of no docs clarifying this) does not inherit to child contexts. https://blog.martinfjord" [puppet] - https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: Dduvall)
[08:51:55] (CR) Alexandros Kosiaris: [C: +2] "I should add this "set" directive comes from the stream module, not the rewrite module. They both define it, but the rewrite one isn't ena" [puppet] - https://gerrit.wikimedia.org/r/904241 (https://phabricator.wikimedia.org/T322453) (owner: Dduvall)
[08:55:16] (PS4) Filippo Giunchedi: Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: Cathal Mooney)
[08:55:49] (CR) Filippo Giunchedi: [C: +1] "I have resolved the conflicts and removed a now-unused define, LGTM!" [puppet] - https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: Cathal Mooney)
[09:02:53] (PS1) Filippo Giunchedi: data-engineering: use generic eventgate HTTP error alert name [alerts] - https://gerrit.wikimedia.org/r/906710 (https://phabricator.wikimedia.org/T309009)
[09:11:31] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:51] (PS1) Filippo Giunchedi: data-engineering: disable missing metrics pint check for validation errors [alerts] - https://gerrit.wikimedia.org/r/906711 (https://phabricator.wikimedia.org/T309009)
[09:20:38] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:22:41] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sonicmgmt - ayounsi@cumin1001"
[09:23:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sonicmgmt - ayounsi@cumin1001"
[09:23:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:30:45] (PS1) Filippo Giunchedi: data-engineering: refactor eventgate validation alerts [alerts] - https://gerrit.wikimedia.org/r/906712 (https://phabricator.wikimedia.org/T309009)
[09:31:44] (CR) Filippo Giunchedi: [C: +2] data-engineering: use generic eventgate HTTP error alert name [alerts] - https://gerrit.wikimedia.org/r/906710 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[09:32:04] (CR) Filippo Giunchedi: [C: +2] data-engineering: disable missing metrics pint check for validation errors [alerts] - https://gerrit.wikimedia.org/r/906711 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[09:32:37] (CR) Filippo Giunchedi: [C: +2] data-engineering: refactor eventgate validation alerts [alerts] - https://gerrit.wikimedia.org/r/906712 (https://phabricator.wikimedia.org/T309009) (owner: Filippo Giunchedi)
[10:03:59] (PS1) Filippo Giunchedi: New upstream release [debs/karma] - https://gerrit.wikimedia.org/r/906716
[10:34:34] !log About to deploy analytics/refinery in test cluster
[10:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:31] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:40:24] !log aqu@deploy2002 Started deploy [analytics/refinery@eb4c2b2] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST [analytics/refinery@eb4c2b2]
[10:40:30] !log aqu@deploy2002 Finished deploy [analytics/refinery@eb4c2b2] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST [analytics/refinery@eb4c2b2] (duration: 00m 06s)
[10:57:35] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:04] (PS2) David Caro: maintain_dbusers: move all the files under service [puppet] - https://gerrit.wikimedia.org/r/906637
[11:01:06] (PS12) David Caro: maintain-dbusers: use click for cli definition [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[11:03:56] !log aqu@deploy2002 Started deploy [analytics/refinery@e70da10] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST 2nd try [analytics/refinery@e70da10]
[11:05:30] !log aqu@deploy2002 Finished deploy [analytics/refinery@e70da10] (hadoop-test): Deploy analytics_refinery including last webrquest load scripts in TEST 2nd try [analytics/refinery@e70da10] (duration: 01m 33s)
[12:09:01] ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:09:13] ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:10:28] ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:10:46] ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH)
[12:20:54] (Abandoned) Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[12:21:28] ops-knams, DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (RobH) Created ticket CS1020330 with the following: > Support, > > We would like remote hands to receive in our shipment of PDUs on ticket DEL0126279 & install them into our new racks being built out by Interxio...
[12:31:25] SRE, Infrastructure-Foundations, netops: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) p:Triage→Low
[12:34:08] SRE, Infrastructure-Foundations, netops: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney)
[12:40:37] (PS1) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[12:40:39] (CR) Andrew Bogott: [C: +2] cloud-vps: only backup toolforge things every other day [puppet] - https://gerrit.wikimedia.org/r/906579 (owner: Andrew Bogott)
[12:41:12] (CR) CI reject: [V: -1] Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: Cathal Mooney)
[12:43:44] (PS2) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[12:51:00] (PS1) Andrew Bogott: cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727
[12:51:39] (PS2) Andrew Bogott: cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727
[12:54:38] (PS3) Andrew Bogott: cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727
[12:55:00] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney)
[12:55:05] (PS3) Cathal Mooney: Add generic way to create static routes on switches [homer/public] - https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281)
[12:55:30] SRE, SRE-Access-Requests, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (Jgreen) Sudo access to sre.dns.netbox makes sense to me.
[12:58:09] (CR) Andrew Bogott: [C: +2] cloud-vps tools cinder: limit volumes to backup [puppet] - https://gerrit.wikimedia.org/r/906727 (owner: Andrew Bogott)
[12:58:51] (PS1) Volans: tox.ini: make it compatible with tox 4.x [software/homer] - https://gerrit.wikimedia.org/r/906728
[12:58:53] (PS1) Volans: tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729
[12:58:58] (CR) CI reject: [V: -1] tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[12:59:32] (CR) Volans: "recheck" [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[13:00:36] (CR) Andrew Bogott: "Is this just to week us off of the old link name or is there a specific reason why this is bad on bookworm?" [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah)
[13:01:44] (CR) Cathal Mooney: [C: +1] "LGTM! Thanks for making sure to update this to deal with the snag :)" [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[13:01:50] (CR) CI reject: [V: -1] tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729 (owner: Volans)
[13:02:03] (CR) Majavah: cinderutils: stop provisioning old filename on bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah)
[13:02:11] (CR) Andrew Bogott: "*ween" [puppet] - https://gerrit.wikimedia.org/r/906034 (owner: Majavah)
[13:03:29] (CR) Volans: [C: +2] tox.ini: make it compatible with tox 4.x [software/homer] - https://gerrit.wikimedia.org/r/906728 (owner: Volans)
[13:03:36] (PS6) Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)
[13:05:19] (Merged) jenkins-bot: tox.ini: make it compatible with tox 4.x [software/homer] - https://gerrit.wikimedia.org/r/906728 (owner: Volans)
[13:06:27] (PS2) Volans: tests: check also a special syntax for quotes [software/homer] - https://gerrit.wikimedia.org/r/906729
[13:06:44] (CR) Cathal Mooney: [C: +2] deployment-prep: update prometheus host to prometheus05 [puppet] - https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) (owner: Samtar)
[13:09:02] (CR) Andrew Bogott: [C: +2] "https://puppet-compiler.wmflabs.org/output/903258/40558/" [puppet] - https://gerrit.wikimedia.org/r/903258 (owner: Majavah)
[13:09:56] (CR) Andrew Bogott: [C: +2] openstack: remove osmdb dns records [puppet] - https://gerrit.wikimedia.org/r/892903 (https://phabricator.wikimedia.org/T323159) (owner: Majavah)
[13:10:34] (CR) Andrew Bogott: [C: +2] P:wmcs: remove osmdb classes [puppet] - https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) (owner: Majavah)
[13:11:07] (CR) Andrew Bogott: [C: +2] osm: remove unuseud shapefile_import class [puppet] - https://gerrit.wikimedia.org/r/892905 (owner: Majavah)
[13:11:49] (Abandoned) Andrew Bogott: wikireplica_dns.yaml: remove osm entry [puppet] - https://gerrit.wikimedia.org/r/905759 (owner: Andrew Bogott)
[13:12:10] (Abandoned) Andrew Bogott: wikireplica_dns.yaml: make legacy tools-db names cnames for the wmcloud domain [puppet] - https://gerrit.wikimedia.org/r/905760 (https://phabricator.wikimedia.org/T333471) (owner: Andrew Bogott)
[13:17:16] SRE, Infrastructure-Foundations, Traffic, netbox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (ssingh) >>! In T334253#8764695, @Volans wrote: > I've updated on IRC while I was working on this, this is my backlog :) Many thanks for taking care of this, @volans!...
[13:25:05] SRE, Infrastructure-Foundations, netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Open→Resolved Indeed Arzhel thanks, my bad I had forgot those were present. I'll close this one, the static's have been (rather laboriously) deal...
[13:40:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:45:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:13] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:50:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:52:17] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:55:15] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:56:46] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (RobH)
[13:56:54] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (RobH)
[14:07:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:08:22] looking ^
[14:12:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (9) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:13:28] huge load increase on all wdqs servers
[14:15:44] dcausse that's not good
[14:16:24] it's decreasing, what's surprising is that it impacts both eqiad & codfw, generally load spikes related to traffic affects only one DC
[14:17:53] Is it because of the DC failovers? I think most stuff is running out of CODFW only, although WDQS is still active in both DCs
[14:17:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (9) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:18:33] inflatador: we keep both DC actives I think
[14:20:48] dcausse we did, but I wonder if the front-end being stuck to CODFW means traffic that would normally only be routed to WDQS in the same DC as the front-end goes to all now? Just a weak guess
[14:21:26] I think it's geo based so I doubt but I could be wrong
[14:22:03] FWiW it looks like the throttling filter size only increased in codfw
[14:24:05] indeed I see a slight increase in eqiad too but not as significant
[14:53:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:56:26] mhh wikibugs isn't here anymore
[14:58:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:59:29] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (Eevans) >>! In T310980#8758772, @MoritzMuehlenhoff wrote: >>>! In T310980#8724070, @MoritzMuehlenhoff wrote: >> Looking at https://cassandra.apache.org/doc/latest/cassandra/getting_started/java11...
[15:00:03] ah there we go, welcome back wikibugs
[15:00:31] :(
[15:00:33] hm I just restarted it, but apparently it fixed itself just before
[15:00:42] ah got it, thanks taavi
[15:00:59] it's back on -cloud, it'll join back here once it has something new to say
[15:02:13] makes sense, IIRC jinxer-wm behaves the same
[15:24:58] (CR) Herron: [C: +1] alertmanager: sink notifications for dev/test hosts [puppet] - https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: Filippo Giunchedi)
[15:26:45] (CR) Dzahn: [C: +1] "omg yes, I remember soooo many times clicking "ACK" in Icinga web UI with the message "dev / test host, why does it have monitoring"" [puppet] - https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: Filippo Giunchedi)
[15:27:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:28:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:29:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 5.757 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.573 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:51] SRE, Infrastructure-Foundations, Traffic, netbox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Volans) Open→Resolved Perfect, resolving.
[15:43:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:45:55] (CR) FNegri: [C: +2] "I think that we can now get rid of this record, osmdb is unlikely to be resurrected :)" [dns] - https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159) (owner: Majavah)
[15:48:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:51:45] SRE, Traffic, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall)
[15:53:43] SRE, Traffic, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) In progress→Resolved I went ahead and struck lists off of the.... list since it s...
[16:05:19] SRE, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn)
[16:08:10] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn)
[16:09:25] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) @Aklapper SRE would be happy about advice from you...
[16:17:08] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) @Ladsgroup How do you feel about https://github.co...
[16:25:35] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) The current tags that the bot adds the SRE tag to...
[16:30:29] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Aklapper) @Lionel_Scheepmans In general in all tickets, all known news can be found in the ticket itself. Thus no need to ever ask. Thanks!
[16:45:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:46:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:49:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.396 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:50:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.378 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:57:07] (PS1) Eevans: Revert "sessionstore: make native transport (intentionally) unreachable" [puppet] - https://gerrit.wikimedia.org/r/906743
[16:59:09] (CR) Eevans: [C: +2] Revert "sessionstore: make native transport (intentionally) unreachable" [puppet] - https://gerrit.wikimedia.org/r/906743 (owner: Eevans)
[17:02:06] !log restart Cassandra, sessionstore1001-a (re-enabling CQL) — T327954
[17:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:11] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[17:10:43] (PS1) Xcollazo: structured-data: Temporarily remove ImageSuggestionsPushFailure alert. [alerts] - https://gerrit.wikimedia.org/r/906744 (https://phabricator.wikimedia.org/T328789)
[17:12:13] (CR) Xcollazo: "We will hold deploying this one. See details at https://phabricator.wikimedia.org/T328789#8765465." [alerts] - https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) (owner: Xcollazo)
[17:16:03] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Aklapper) >>! In T334294#8765358, @Dzahn wrote: > @Aklapp...
[17:20:13] SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn)
[17:21:25] SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn) Almost 10 years later, making ticket public :) There are still a bunch of imported RT tickets here in Phabricator that are set to WMF-NDA but don't have to be. And yes, I actually refe...
[17:22:39] SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn)
[17:41:11] SRE, ops-requests: etherpad.wikimedia.org could use an icinga check - https://phabricator.wikimedia.org/T82936 (Dzahn) removed process monitoring in https://gerrit.wikimedia.org/r/c/operations/puppet/+/904856 we have prometheus blackbox http check now: https://gerrit.wikimedia.org/r/c/operations/puppet/...
[17:51:03] SRE, ops-requests: Enable access to Gerrit on port 22. - https://phabricator.wikimedia.org/T84713 (Dzahn)
[18:01:59] ops-codfw: ms-be2013-2015 setup mgmt and bios - https://phabricator.wikimedia.org/T84611 (Dzahn)
[18:03:58] SRE, ops-requests: Set up regular backups for Graphite data in tungsten:/var/lib/carbon/whisper - https://phabricator.wikimedia.org/T84511 (Dzahn)
[18:05:22] SRE, ops-requests: Upload Jenkins 1.565.2 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T84462 (Dzahn)
[18:18:46] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@5c4ebda]: (no justification provided)
[18:19:22] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@5c4ebda]: (no justification provided) (duration: 00m 35s)
[18:40:22] SRE-Sprint-Week-Sustainability-March2023, Znuny, serviceops-collab, Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Dzahn) also see T334250#8765659 where this is about converting existing NRPE checks we have on VRTS hosts (checks if c...
[18:43:45] SRE: monitoring of phabricator - https://phabricator.wikimedia.org/T957 (Dzahn) In T334250 I am wondering if we should remove the process monitoring part of this and only keep https monitoring.
[18:48:04] SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (Dzahn) If we had this we could maybe remove the Icinga process checks we have for zuul and zuul-merger (...
[19:48:27] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:48:28] SRE, Commons, MediaWiki-File-management, Traffic: Why do Wikimedia Commons SVGs sometimes not update? - https://phabricator.wikimedia.org/T334303 (bd808)
[19:58:34] (PS4) Aklapper: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748)
[20:03:09] SRE, Commons, MediaWiki-File-management, Traffic: PNG thumbnail of Wikimedia Commons SVG file sometimes not updated - https://phabricator.wikimedia.org/T334303 (Aklapper)
[20:31:36] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Ladsgroup) The switch from herald to maint bot was done b...
[21:06:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:11:10] SRE, Abstract Wikipedia team, Anti-Harassment, Cloud-Services, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (thcipriani)
[21:11:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:14:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:25:07] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) Thank you for all the details @Ladsgroup , it's mu...
[21:36:02] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Ladsgroup) >>! In T334294#8765857, @Dzahn wrote: > Thank...
[21:40:33] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) >>! In T334294#8765862, @Ladsgroup wrote: > defini...
[21:46:27] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) {F36942751} see lower left corner in the "new proj...
[21:47:42] SRE, Phabricator, phabricator maintenance bot, serviceops-collab, Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (Dzahn) {F36942753}
[22:48:47] SRE, Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall) I've found that 2.84.84.84 (released just last month) is a non-functional version, causing the web interface to break. At first I thought it was enforcing host names (per https://www.dell.com/support/kbdoc...
[22:50:55] 10SRE, 10Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (10BCornwall) Oh, and some of the older idrac firmware enforced :443 redirects, so an SSH tunnel on a different port would cause connectivity issues. I worked around it by using socat: ` socat TCP-LISTEN:443,fork TCP... [23:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale