[00:07:49] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:08:39] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:08:49] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:15:03] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:15:39] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:29] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:23:27] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:24:17] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:24:25] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:32:49] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:33:39] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:33:45] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:38:21] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:38:27] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986834
[00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986834 (owner: TrainBranchBot)
[00:39:07] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:43:49] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:44:35] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:44:41] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:58:41] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:59:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest
[core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986834 (owner: TrainBranchBot)
[01:00:17] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:01:06] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:37:10] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:07] (CR) Gergő Tisza: add foundationwiki to the list of central auth login wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: ArielGlenn)
[03:06:42] SRE, ops-codfw, serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Papaul) ` Your dispatch shipped on 1/3/2024 4:20 PM `
[03:08:57] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:33] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (Papaul) Request replacement ` Create Dispatch: Success You have successfully submitted request SR182683531.
[03:49:56] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[04:18:53] RECOVERY - MD RAID on ganeti1031 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[05:41:33] (PS1) Jdlrobson: Revise logic for creating compact links button on Vector 2022 [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987473 (https://phabricator.wikimedia.org/T353850)
[05:45:57] SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (Aklapper) @Dima: Hi! //In case// you are WMDE staff/contractor, could you please state so on https://phabricator.wikimedia.org/p/Dima/ for transparency? Thanks! :)
[05:46:06] (CR) Hashar: [C: -1] "The contint* and doc* hosts already have php 7.4 and I don't understand the intent of this change." [puppet] - https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[06:07:17] (PS1) Marostegui: db2[096-120]: Add hosts [puppet] - https://gerrit.wikimedia.org/r/987517 (https://phabricator.wikimedia.org/T354210)
[06:07:59] (CR) Marostegui: [C: +2] db2[096-120]: Add hosts [puppet] - https://gerrit.wikimedia.org/r/987517 (https://phabricator.wikimedia.org/T354210) (owner: Marostegui)
[06:10:41] (PS1) Marostegui: db2143: Migrare to 10.6 [puppet] - https://gerrit.wikimedia.org/r/987518 (https://phabricator.wikimedia.org/T353499)
[06:11:20] (CR) Marostegui: [C: +2] db2143: Migrare to 10.6 [puppet] - https://gerrit.wikimedia.org/r/987518 (https://phabricator.wikimedia.org/T353499) (owner: Marostegui)
[06:14:09] (CR) Marostegui: [C: +2] update_zarcillo: Push to the repo [software] - https://gerrit.wikimedia.org/r/987445 (owner: Marostegui)
[06:14:47] (Merged) jenkins-bot: update_zarcillo: Push to the repo [software] - https://gerrit.wikimedia.org/r/987445 (owner: Marostegui)
[06:28:05] !log
marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1151.eqiad.wmnet with OS bookworm
[06:40:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1151.eqiad.wmnet with reason: host reimage
[06:42:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1151.eqiad.wmnet with reason: host reimage
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T0700)
[07:00:05] kormat, marostegui, and Amir1: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T0700).
[07:00:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1151.eqiad.wmnet with OS bookworm
[07:16:15] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987153 (owner: Muehlenhoff)
[07:17:31] SRE, ops-eqiad: Degraded RAID on ganeti1031 - https://phabricator.wikimedia.org/T354251 (MoritzMuehlenhoff)
[07:17:45] SRE, ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (MoritzMuehlenhoff)
[07:19:46] (PS3) Muehlenhoff: nftables: On Buster install nftables and libnftnl from backports [puppet] - https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279)
[07:22:05] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) (owner: Muehlenhoff)
[07:51:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchOtherIndex lag is too high: 6h 1m 15s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchOtherIndex - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[07:51:54] (PS4) Muehlenhoff: nftables: On
Buster install nftables and libnftnl from backports [puppet] - https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279)
[07:56:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987439 (https://phabricator.wikimedia.org/T354279) (owner: Muehlenhoff)
[07:56:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchOtherIndex lag is too high: 6h 3m 37s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchOtherIndex - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[08:00:04] Amir1, apergos, and jnuche: Time to snap out of that daydream and deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T0800).
[08:00:37] morning! no trainees wish to learn the joys of deployment and no one has a patch to send around to production anyways, so there we have it. wishing you all a 2024 that is better in all ways than last year, and see you next time!
[08:00:51] (CR) Muehlenhoff: [C: +2] Switch peopleweb to nftables [puppet] - https://gerrit.wikimedia.org/r/984161 (owner: Muehlenhoff)
[08:14:14] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/987465 (owner: Andrew Bogott)
[08:17:39] (PS10) Ladsgroup: Add compare tables periodic job [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[08:17:47] (CR) Ladsgroup: Add compare tables periodic job (2 comments) [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup)
[08:21:54] (PS2) Ladsgroup: Add virtual domain config for reading lists extension [mediawiki-config] - https://gerrit.wikimedia.org/r/985160 (https://phabricator.wikimedia.org/T353948)
[08:22:19] (CR) Ladsgroup: [C: +2] Add virtual domain config for reading lists extension [mediawiki-config] - https://gerrit.wikimedia.org/r/985160 (https://phabricator.wikimedia.org/T353948) (owner: Ladsgroup)
[08:22:58] (Merged) jenkins-bot: Add virtual domain config for reading lists extension [mediawiki-config] - https://gerrit.wikimedia.org/r/985160 (https://phabricator.wikimedia.org/T353948) (owner: Ladsgroup)
[08:24:28] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/985160 (https://phabricator.wikimedia.org/T353948) (owner: Ladsgroup)
[08:25:23] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:985160|Add virtual domain config for reading lists extension (T353948)]]
[08:25:30] T353948: Migrate reading lists to use a virtual database domain - https://phabricator.wikimedia.org/T353948
[08:27:02] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:985160|Add virtual domain config for reading lists extension (T353948)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:28:18] !log ladsgroup@deploy2002
ladsgroup: Continuing with sync
[08:30:02] (PS1) Muehlenhoff: idm: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/987656
[08:33:36] (PS1) Ladsgroup: Set commonswiki pagelinks migration stage to READ NEW [mediawiki-config] - https://gerrit.wikimedia.org/r/987657 (https://phabricator.wikimedia.org/T351237)
[08:33:48] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987656 (owner: Muehlenhoff)
[08:34:29] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:985160|Add virtual domain config for reading lists extension (T353948)]] (duration: 09m 05s)
[08:34:33] T353948: Migrate reading lists to use a virtual database domain - https://phabricator.wikimedia.org/T353948
[08:34:47] (CR) Jgiannelos: [C: -1] "This is ready for review. Blocking so we don't deploy before apps teams approval." [deployment-charts] - https://gerrit.wikimedia.org/r/987158 (owner: Jgiannelos)
[08:35:38] (PS2) Ladsgroup: Update virtual domain for url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/987134
[08:35:41] (CR) Ladsgroup: [C: +2] Update virtual domain for url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/987134 (owner: Ladsgroup)
[08:35:53] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/987134 (owner: Ladsgroup)
[08:36:28] (Merged) jenkins-bot: Update virtual domain for url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/987134 (owner: Ladsgroup)
[08:36:53] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:987134|Update virtual domain for url shortener]]
[08:38:24] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:987134|Update virtual domain for url shortener]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:41:58] (PS1) Slyngshede: C:raid::perccli Support compression of output.
[puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254)
[08:43:40] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[08:49:28] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:987134|Update virtual domain for url shortener]] (duration: 12m 35s)
[08:56:38] (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987692
[08:56:51] (CR) Peter Fischer: [C: +2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987692 (owner: Peter Fischer)
[08:57:12] (CR) Alexandros Kosiaris: "Indeed they do. https://w.wiki/8j4j for a quick view on anyone with a grafana account that wants to take a peek. There are some interestin" [deployment-charts] - https://gerrit.wikimedia.org/r/984482 (owner: JMeybohm)
[08:57:14] (CR) Alexandros Kosiaris: [C: +2] Bump memory for calico-node on wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/984482 (owner: JMeybohm)
[08:58:12] (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987692 (owner: Peter Fischer)
[09:00:42] (Merged) jenkins-bot: Bump memory for calico-node on wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/984482 (owner: JMeybohm)
[09:04:44] (CR) Muehlenhoff: Package Debmonitor server as .deb (3 comments) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede)
[09:07:25] (CR) Filippo Giunchedi: [C: +1] pybal: Disable Pint promql/series checks [alerts] - https://gerrit.wikimedia.org/r/987499 (https://phabricator.wikimedia.org/T353760) (owner: BCornwall)
[09:07:47] (CR) Filippo Giunchedi: [C: +1] prometheus::migration: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987153 (owner: Muehlenhoff)
[09:09:45] !log pfischer@deploy2002 helmfile [staging]
START helmfile.d/services/cirrus-streaming-updater: apply
[09:11:21] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:12:44] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:13:04] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:13:21] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:13:32] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:22:08] (PS6) Brouberol: spark3: enable event logging and history server integration for all spark jobs [puppet] - https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849)
[09:22:15] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:22:49] !log bump memory limits for calico-node in wikikube codfw/eqiad by 25% (i.e from 400Mi to 500Mi)
[09:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:31:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:34:01] (CR) Muehlenhoff: [C: +2] prometheus::migration: Switch rsync
service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987153 (owner: Muehlenhoff)
[09:34:18] (CR) Hashar: [C: +1] releases: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/987436 (owner: Muehlenhoff)
[09:36:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:36:28] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:36:36] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:38:22] !log bump memory limits for calico-node in wikikube codfw/eqiad by 25% (i.e from 400Mi to 500Mi) take #2
[09:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:40] !log delete mw1377-mw1383 from eqiad wikikube nodes
[09:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:12] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (MatthewVernon) @Papaul Thanks for the quick swap :)
[09:39:42] (CR) MVernon: [C: +2] swift: add new storagehosts [puppet] - https://gerrit.wikimedia.org/r/987448 (https://phabricator.wikimedia.org/T353149) (owner: MVernon)
[09:41:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:46:09] (CR)
Volans: [C: +2] raid handler: fix broken cases [puppet] - https://gerrit.wikimedia.org/r/987409 (owner: Volans)
[09:53:35] (CR) Volans: "The compression looks ok. One comment on the other change." [puppet] - https://gerrit.wikimedia.org/r/987659 (https://phabricator.wikimedia.org/T354254) (owner: Slyngshede)
[09:54:40] (PS1) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987707 (https://phabricator.wikimedia.org/T352583)
[09:54:48] (CR) CI reject: [V: -1] research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987707 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[09:55:19] (PS2) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987707 (https://phabricator.wikimedia.org/T352583)
[09:55:24] (CR) Muehlenhoff: [C: +2] profile::prometheus::rsyncd: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984808 (owner: Muehlenhoff)
[09:57:09] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:57:53] (PS3) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/987707 (https://phabricator.wikimedia.org/T352583)
[09:58:03] SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (Dima) @Aklapper : Hi Andre! Sure thing, I added that.
:)
[10:01:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:01:19] (PS1) Alexandros Kosiaris: Calico: Bump timeout to 1H [deployment-charts] - https://gerrit.wikimedia.org/r/987708
[10:03:07] SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (LSobanski) There are 31 HP servers and 1 storage array remaining (https://netbox.wikimedia.org/dcim/manufacturers/6/), excluding the Swift hosts the majority remaining are DBs. Looking at the purchase dates m...
[10:03:09] (PS2) Alexandros Kosiaris: Calico: Bump timeout to 1H [deployment-charts] - https://gerrit.wikimedia.org/r/987708
[10:06:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:07:29] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:08:13] (CR) Hashar: [C: +2] Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) (owner: Hashar)
[10:08:34] (PS1) Muehlenhoff: rsyslog::receiver: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/987709
[10:08:36] (PS1) Muehlenhoff: rsyslog::receiver: Switch rsync service to use
firewall::service [puppet] - https://gerrit.wikimedia.org/r/987710
[10:08:38] (CR) Alexandros Kosiaris: [C: +2] Calico: Bump timeout to 1H [deployment-charts] - https://gerrit.wikimedia.org/r/987708 (owner: Alexandros Kosiaris)
[10:08:54] (CR) JMeybohm: [C: +2] pki::multirootca: Merge custom profiles on top of default_profiles [puppet] - https://gerrit.wikimedia.org/r/982854 (https://phabricator.wikimedia.org/T353314) (owner: JMeybohm)
[10:09:01] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:09:10] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987709 (owner: Muehlenhoff)
[10:09:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/987710 (owner: Muehlenhoff)
[10:10:00] (HelmReleaseBadStatus) firing: Helm release kube-system/calico on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:11:28] (Merged) jenkins-bot: Calico: Bump timeout to 1H [deployment-charts] - https://gerrit.wikimedia.org/r/987708 (owner: Alexandros Kosiaris)
[10:11:43] SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (Marostegui) Fine by me!
[10:14:51] (Merged) jenkins-bot: Merge tag 'v3.6.8' into wmf/stable-3.6 [software/gerrit] (wmf/stable-3.6) - https://gerrit.wikimedia.org/r/987438 (https://phabricator.wikimedia.org/T309870) (owner: Hashar)
[10:16:55] (CR) Slyngshede: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/987656 (owner: Muehlenhoff)
[10:17:02] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:17:19] !log bump memory limits for calico-node in wikikube codfw/eqiad by 25% (i.e from 400Mi to 500Mi) take #3
[10:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:00] (HelmReleaseBadStatus) resolved: Helm release kube-system/calico on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:26:09] SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (WMDE-leszek) FWIW, I confirm @Dima is WMDE software engineer, and approve this request on WMDE's behalf. Thank you.
[10:31:30] (HelmReleaseBadStatus) firing: (2) Helm release kube-system/calico on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:32:37] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:33:40] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:36:30] (HelmReleaseBadStatus) resolved: (2) Helm release kube-system/calico on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:37:16] (PS1) Zabe: Revert "Support new block schema" [extensions/CentralAuth] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987482 (https://phabricator.wikimedia.org/T354298)
[10:37:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:38:04] (PS1) Zabe: Revert "Get blocks from DatabaseBlockStore instead of doing our own query" [extensions/CheckUser] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987483 (https://phabricator.wikimedia.org/T353620)
[10:39:41] SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (MatthewVernon) Hm, actually, that list from netbox includes servers not in the description of this task (ah, and they have manufacturer = HPE not HP) and the necessary binary is now /usr/sbin/ssacli. So the c...
[10:42:05] (PS3) Ladsgroup: snapshot: Improve border of dumps cards [puppet] - https://gerrit.wikimedia.org/r/986181
[10:42:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 298, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:43:12] jouncebot: nowandnext
[10:43:12] No deployments scheduled for the next 0 hour(s) and 16 minute(s)
[10:43:13] In 0 hour(s) and 16 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1100)
[10:43:13] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1100)
[10:43:23] (CR) Ladsgroup: snapshot: Improve border of dumps cards (1 comment) [puppet] - https://gerrit.wikimedia.org/r/986181 (owner: Ladsgroup)
[10:45:02] (CR) Zabe: [C: +2] Revert "Get blocks from DatabaseBlockStore instead of doing our own query" [extensions/CheckUser] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987483 (https://phabricator.wikimedia.org/T353620) (owner: Zabe)
[10:48:30] (HelmReleaseBadStatus) firing: (2) Helm release kube-system/calico on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:51:10] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1027/co" [puppet] - https://gerrit.wikimedia.org/r/983363 (https://phabricator.wikimedia.org/T353314) (owner: JMeybohm)
[10:51:14] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:53:05] (CR) JMeybohm: [V: +1 C: +2] pki::multirootca: Override the server profiles expiry for k8s staging [puppet] - https://gerrit.wikimedia.org/r/983363 (https://phabricator.wikimedia.org/T353314) (owner: JMeybohm)
[10:53:30] (HelmReleaseBadStatus) resolved: Helm release kube-system/calico on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1100).
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1100)
[11:01:43] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1379.eqiad.wmnet with OS bullseye
[11:15:23] (CR) Muehlenhoff: [C: +2] idm: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/987656 (owner: Muehlenhoff)
[11:16:07] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage
[11:19:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1379.eqiad.wmnet with reason: host reimage
[11:21:31] (PS6) EoghanGaffney: [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741)
[11:27:10] (PS1) Muehlenhoff: idm: Configure tlsproxy::envoy::firewall_srange [puppet] - https://gerrit.wikimedia.org/r/987715
[11:27:18] (PS2) Muehlenhoff: idm: Configure tlsproxy::envoy::firewall_srange [puppet] - https://gerrit.wikimedia.org/r/987715
[11:32:15] (CR) Muehlenhoff: "check experimental" [puppet] -
10https://gerrit.wikimedia.org/r/987715 (owner: 10Muehlenhoff) [11:34:07] (03PS1) 10MVernon: swift: add new nodes, drain old nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/987718 (https://phabricator.wikimedia.org/T353149) [11:35:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1379.eqiad.wmnet with OS bullseye [11:37:43] (03PS1) 10Kamila Součková: Revert "Set MW API servers to insetup to fix failed reimage" [puppet] - 10https://gerrit.wikimedia.org/r/987484 [11:38:05] (03PS2) 10Kamila Součková: Revert "Set MW API servers to insetup to fix failed reimage" [puppet] - 10https://gerrit.wikimedia.org/r/987484 [11:38:43] (03PS1) 10Dreamy Jazz: Attempt to send original file to PhotoDNA if no thumbnail [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987485 (https://phabricator.wikimedia.org/T353854) [11:40:43] (03PS1) 10Dreamy Jazz: Attempt to send original file to PhotoDNA if no thumbnail [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987726 (https://phabricator.wikimedia.org/T353854) [11:41:21] (03CR) 10Kamila Součková: [C: 03+2] Revert "Set MW API servers to insetup to fix failed reimage" [puppet] - 10https://gerrit.wikimedia.org/r/987484 (owner: 10Kamila Součková) [11:42:14] (03CR) 10Btullis: "Instead of putting these options in to all of the role definition files where `profile::hadoop::spark3` is applied, we could instead make " [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [11:50:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/981418 (owner: 10Majavah) [11:52:12] !log installing libde265 security updates [11:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:28] (03CR) 10Slyngshede: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/987715 (owner: 10Muehlenhoff) [12:04:14] !log installing lua5.3 security updates [12:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:13] (03PS1) 10Muehlenhoff: Add Cumin alias for cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/987748 [12:10:14] !log kamila@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw1377.eqiad.wmnet [12:10:26] (03PS2) 10Muehlenhoff: Add Cumin alias for cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/987748 [12:16:25] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/987748 (owner: 10Muehlenhoff) [12:19:09] (03CR) 10Ayounsi: [C: 03+1] rancid/librenms: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987403 (owner: 10Muehlenhoff) [12:20:41] 10SRE, 10Infrastructure-Foundations, 10netops: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 (10cmooney) p:05Triage→03Medium [12:21:20] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:22:23] 10SRE, 10Infrastructure-Foundations, 10netops: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 (10cmooney) [12:23:41] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/986835 [12:23:52] (03CR) 10Phuedx: [C: 04-1] "AIUI we're only adding this proxy to support existing MediaWiki installations submitting pingbacks as the information is valuable (though " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [12:25:27] (03PS1) 10Jelto: miscweb: add design.wikimedia.org services [deployment-charts] - 10https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) [12:30:31] PROBLEM - Host mw1377 is 
DOWN: PING CRITICAL - Packet loss = 100% [12:32:12] (03CR) 10Muehlenhoff: [C: 03+2] idm: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/987715 (owner: 10Muehlenhoff) [12:38:03] jouncebot: nowandnext [12:38:03] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:03] In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1300) [12:38:23] (03CR) 10Zabe: [C: 03+2] Revert "Get blocks from DatabaseBlockStore instead of doing our own query" [extensions/CheckUser] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987483 (https://phabricator.wikimedia.org/T353620) (owner: 10Zabe) [12:38:30] (03CR) 10Zabe: [C: 03+2] Revert "Support new block schema" [extensions/CentralAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987482 (https://phabricator.wikimedia.org/T354298) (owner: 10Zabe) [12:43:19] (03PS1) 10Ayounsi: Depool drmrs for cr2 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987767 (https://phabricator.wikimedia.org/T354340) [12:45:36] (03PS1) 10Muehlenhoff: idm-test: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/987768 [12:45:58] (03PS2) 10Muehlenhoff: idm-test: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/987768 [12:51:48] (03PS1) 10Muehlenhoff: Switch idm-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/987769 [12:52:38] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/987767 (https://phabricator.wikimedia.org/T354340) (owner: 10Ayounsi) [12:52:44] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 63296 [12:53:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63296 [12:56:19] (03Merged) 10jenkins-bot: Revert "Get blocks from DatabaseBlockStore instead of doing our own query" [extensions/CheckUser] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987483 (https://phabricator.wikimedia.org/T353620) (owner: 10Zabe) [12:56:22] (03Merged) 10jenkins-bot: Revert "Support new block schema" [extensions/CentralAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987482 (https://phabricator.wikimedia.org/T354298) (owner: 10Zabe) [12:56:52] !log zabe@deploy2002 Started scap: Backport for [[gerrit:987483|Revert "Get blocks from DatabaseBlockStore instead of doing our own query" (T353620)]], [[gerrit:987482|Revert "Support new block schema" (T354298)]] [12:56:57] T353620: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'commonswiki' - https://phabricator.wikimedia.org/T353620 [12:56:57] T354298: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'xxxwiki' - https://phabricator.wikimedia.org/T354298 [12:59:05] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog::receiver: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987710 (owner: 10Muehlenhoff) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1300) [13:00:07] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog::receiver: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/987709 (owner: 10Muehlenhoff) [13:00:29] !log zabe@deploy2002 zabe: Backport for 
[[gerrit:987483|Revert "Get blocks from DatabaseBlockStore instead of doing our own query" (T353620)]], [[gerrit:987482|Revert "Support new block schema" (T354298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:01:06] !log zabe@deploy2002 zabe: Continuing with sync [13:01:43] (03CR) 10Ayounsi: [C: 03+2] Depool drmrs for cr2 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987767 (https://phabricator.wikimedia.org/T354340) (owner: 10Ayounsi) [13:02:29] !log depool drmrs for router work - T354340 [13:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:33] T354340: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 [13:02:54] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mw1377.eqiad.wmnet [13:05:20] (03CR) 10Muehlenhoff: [C: 03+2] rsyslog::receiver: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/987709 (owner: 10Muehlenhoff) [13:06:58] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:987483|Revert "Get blocks from DatabaseBlockStore instead of doing our own query" (T353620)]], [[gerrit:987482|Revert "Support new block schema" (T354298)]] (duration: 10m 06s) [13:07:13] T353620: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'commonswiki' - https://phabricator.wikimedia.org/T353620 [13:07:14] T354298: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'xxxwiki' - https://phabricator.wikimedia.org/T354298 [13:09:06] (03CR) 10Muehlenhoff: [C: 03+2] rsyslog::receiver: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987710 (owner: 10Muehlenhoff) [13:11:03] (03PS1) 10Ayounsi: Repool drmrs after cr2 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987727 
(https://phabricator.wikimedia.org/T354340) [13:11:31] 10SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10fgiunchedi) >>! In T141756#9434526, @LSobanski wrote: > There are 31 HP servers and 1 storage array remaining (https://netbox.wikimedia.org/dcim/manufacturers/6/), excluding the Swift hosts the majority remai... [13:13:40] Hi, we're seeing critical alerts about wdqs1019 being 503: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=wikidata&q=alertname%21%3DCirrusSearchIndexTooOld&q=instance%21%3Dwikidata-analytics-1 --- I'm trying to understand if that is something we should worry about or whether there is anything needed from our side? [13:15:06] (03CR) 10Cathal Mooney: [C: 03+1] Repool drmrs after cr2 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987727 (https://phabricator.wikimedia.org/T354340) (owner: 10Ayounsi) [13:16:41] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo:rsync: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984805 (owner: 10Muehlenhoff) [13:18:21] 10SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10LSobanski) 05Open→03Resolved a:03LSobanski [13:18:31] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 (10LSobanski) [13:18:35] 10SRE: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10LSobanski) a:05LSobanski→03None [13:20:03] (03CR) 10DCausse: [C: 03+1] enable page_rerender for 2nd batch: dewiki, frwiktionary, and kuwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984810 (owner: 10Peter Fischer) [13:20:44] MichaelG_WMDE: looking [13:21:08] thank you 🙏 [13:24:37] !log restarting blazegraph on wdqs1019 (stuck with high thread count) [13:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] RECOVERY - Query 
Service HTTP Port on wdqs1019 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:25:22] lag alerts will fire for this host but can be ignored (it's depooled) [13:26:05] RECOVERY - WDQS SPARQL on wdqs1019 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:27:30] dcausse Ok, good to know. Thank you for taking care of it [13:28:24] np! thanks for raising this here :) [13:30:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:32:43] (03CR) 10Muehlenhoff: [C: 03+2] rancid/librenms: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987403 (owner: 10Muehlenhoff) [13:37:26] PROBLEM - Host cr2-drmrs #page is DOWN: PING CRITICAL - Packet loss = 100% [13:37:47] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:37:52] XioNoX:, topranks ^^? [13:37:54] expired downtime? 
[13:37:54] <_joe_> uh [13:38:00] drmrs is depooled [13:38:03] <_joe_> ok [13:38:05] ack [13:38:05] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Active - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:07] * Emperor twitches [13:38:15] PROBLEM - Host cr2-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:16] * vgutierrez sends a kill -SIGHUP to his heart attack [13:38:20] <_joe_> !incidents [13:38:20] 4371 (ACKED) Host cr2-drmrs (paged) - PING - Packet loss = 100% [13:38:29] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Active - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:37] ^^ this was due to maintenance we were doing which hasn't gone as smoothly as hoped [13:38:49] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:52] rebooting device now - site is depooled [13:38:55] At least we know the batphone is working ;-) [13:39:15] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:39:21] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:37] it's also only one router, so even if some users are sticking to drmrs due to bad DNS config, it would still work for them [13:43:17] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:43:25] (03CR) 10Muehlenhoff: [C: 03+2] idm-test: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/987768 (owner: 10Muehlenhoff) [13:44:19] RECOVERY - BGP status on
asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:44:41] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:44:48] RECOVERY - Host cr2-drmrs #page is UP: PING OK - Packet loss = 0%, RTA = 86.02 ms [13:44:53] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:07] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:45:33] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:45:37] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:45:37] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:35] RECOVERY - Host cr2-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.71 ms [13:50:01] (03PS2) 10Muehlenhoff: Switch idm-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/987769 [13:51:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987769 (owner: 10Muehlenhoff) [13:53:13] (KubernetesCalicoDown) firing: mw1377.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=mw1377.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:53:29] !log installing libssh security updates [13:53:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:51] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 2686 [13:57:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2686 [13:57:34] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987773 [13:57:44] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987773 (owner: 10Peter Fischer) [13:58:34] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987773 (owner: 10Peter Fischer) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1400). [14:00:05] Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:23] \o [14:00:41] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:00:45] if someone else could deploy, that'd be great :-) [14:00:55] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:01:07] My backports cannot be tested on-wiki as the only changes are to maintenance scripts. These are currently run manually, so if there are problems I will see them once the script is run manually. [14:01:24] *to maintenance scripts and code that interacts only with maintenance scripts. 
[14:01:28] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:01:30] (03CR) 10Ayounsi: [C: 03+2] Repool drmrs after cr2 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987727 (https://phabricator.wikimedia.org/T354340) (owner: 10Ayounsi) [14:02:02] Need to backport as I want to run another scan on testwiki later today. [14:03:10] !log repool drmrs - T354340 [14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:14] T354340: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 [14:04:53] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (10ayounsi) 05Open→03Resolved a:03ayounsi Depooled esams for 1h and everything went well. [14:06:04] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:07:42] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/984815 (owner: 10Muehlenhoff) [14:08:24] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:08:43] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:08:51] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:09:03] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:09:07] (03CR) 10Brouberol: spark3: enable event logging and history server integration for all spark jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [14:09:24] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:09:34] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:10:20] * 
James_F waves. [14:10:20] Dreamy_Jazz, I can deploy the MediaModeration fixes along with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/987483 [14:10:40] Sure. [14:10:49] Has that CheckUser change already been backported? [14:10:57] Just as zabe gave it a +2 [14:11:05] I think so. [14:11:10] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:11:47] Based on https://phabricator.wikimedia.org/T353620#9435093 I think that the CheckUser backport has already been scap backported [14:11:56] Yup. [14:12:01] Also https://sal.toolforge.org/production?p=0&q=987483&d= [14:12:07] However, my MediaModeration changes have not :) [14:12:10] Dreamy_Jazz, OK just MediaModeration then [14:12:18] 👍 [14:12:31] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:12:43] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:13:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tchanders@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987485 (https://phabricator.wikimedia.org/T353854) (owner: 10Dreamy Jazz) [14:14:02] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/987769 (owner: 10Muehlenhoff) [14:16:02] (03CR) 10Muehlenhoff: [C: 03+2] Switch idm-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/987769 (owner: 10Muehlenhoff) [14:16:04] (03Merged) 10jenkins-bot: Attempt to send original file to PhotoDNA if no thumbnail [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987485 
(https://phabricator.wikimedia.org/T353854) (owner: 10Dreamy Jazz) [14:16:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:16:27] !log tchanders@deploy2002 Started scap: Backport for [[gerrit:987485|Attempt to send original file to PhotoDNA if no thumbnail (T353854)]] [14:16:37] T353854: Attempt to send original file to PhotoDNA for the scan if thumbnail fails - https://phabricator.wikimedia.org/T353854 [14:20:04] !log tchanders@deploy2002 dreamyjazz and tchanders: Backport for [[gerrit:987485|Attempt to send original file to PhotoDNA if no thumbnail (T353854)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:20:10] !log tchanders@deploy2002 dreamyjazz and tchanders: Continuing with sync [14:25:51] !log tchanders@deploy2002 Finished scap: Backport for [[gerrit:987485|Attempt to send original file to PhotoDNA if no thumbnail (T353854)]] (duration: 09m 24s) [14:25:55] T353854: Attempt to send original file to PhotoDNA for the scan if thumbnail fails - https://phabricator.wikimedia.org/T353854 [14:27:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tchanders@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987726 (https://phabricator.wikimedia.org/T353854) (owner: 10Dreamy Jazz) [14:28:01] (03PS7) 10Brouberol: spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) [14:28:05] (03CR) 10Muehlenhoff: [C: 03+2] keystone: Switch rsync service to use firewall::service [puppet] - 
10https://gerrit.wikimedia.org/r/984815 (owner: 10Muehlenhoff) [14:28:32] (03CR) 10CI reject: [V: 04-1] spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [14:29:51] (03PS8) 10Brouberol: spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) [14:30:22] (03Merged) 10jenkins-bot: Attempt to send original file to PhotoDNA if no thumbnail [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987726 (https://phabricator.wikimedia.org/T353854) (owner: 10Dreamy Jazz) [14:30:49] !log tchanders@deploy2002 Started scap: Backport for [[gerrit:987726|Attempt to send original file to PhotoDNA if no thumbnail (T353854)]] [14:34:27] !log tchanders@deploy2002 tchanders and dreamyjazz: Backport for [[gerrit:987726|Attempt to send original file to PhotoDNA if no thumbnail (T353854)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:34:31] T353854: Attempt to send original file to PhotoDNA for the scan if thumbnail fails - https://phabricator.wikimedia.org/T353854 [14:34:32] Hi, am I too late for the current UTC afternoon backport window? I just added a config change. 
[14:34:34] !log tchanders@deploy2002 tchanders and dreamyjazz: Continuing with sync [14:34:35] (03PS1) 10Muehlenhoff: mwmaint: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987778 [14:35:30] (03PS2) 10Ssingh: hiera: dnsbox: remove anycast-hc dependency on pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/979159 (https://phabricator.wikimedia.org/T347054) [14:36:37] (03PS1) 10Muehlenhoff: deployment servers: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987779 [14:36:55] pfischer, I'll do that one next [14:37:11] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:16] (03PS1) 10Muehlenhoff: toolforge::docker::registry: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987781 [14:38:34] (03PS1) 10Ayounsi: Depool esams for cr1 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987782 (https://phabricator.wikimedia.org/T346779) [14:38:43] Tchanders: thanks! [14:39:12] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." 
[dns] - 10https://gerrit.wikimedia.org/r/987782 (https://phabricator.wikimedia.org/T346779) (owner: 10Ayounsi) [14:40:14] !log tchanders@deploy2002 Finished scap: Backport for [[gerrit:987726|Attempt to send original file to PhotoDNA if no thumbnail (T353854)]] (duration: 09m 25s) [14:40:18] T353854: Attempt to send original file to PhotoDNA for the scan if thumbnail fails - https://phabricator.wikimedia.org/T353854 [14:40:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987778 (owner: 10Muehlenhoff) [14:41:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984810 (owner: 10Peter Fischer) [14:41:31] (03Abandoned) 10Phuedx: ext-EventStreamConfig: Add eventlogging_MediaWikiPingback stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981446 (https://phabricator.wikimedia.org/T323828) (owner: 10Phuedx) [14:41:48] (03Merged) 10jenkins-bot: enable page_rerender for 2nd batch: dewiki, frwiktionary, and kuwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984810 (owner: 10Peter Fischer) [14:42:13] !log tchanders@deploy2002 Started scap: Backport for [[gerrit:984810|enable page_rerender for 2nd batch: dewiki, frwiktionary, and kuwiktionary]] [14:42:42] 10SRE, 10Infrastructure-Foundations, 10netops: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 (10cmooney) 05Open→03Resolved Problem has resolved following device reboot. It looked like killing the mgd processes in "lockf" state was working, but I made an error a... [14:45:17] (03PS1) 10Peter Fischer: Search update pipeline: 2nd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/987783 (https://phabricator.wikimedia.org/T351503) [14:45:27] Thanks for doing my backports Tchanders! 
[14:45:48] !log tchanders@deploy2002 pfischer and tchanders: Backport for [[gerrit:984810|enable page_rerender for 2nd batch: dewiki, frwiktionary, and kuwiktionary]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:46:01] (03CR) 10Peter Fischer: "MW config change is already on its way…" [deployment-charts] - 10https://gerrit.wikimedia.org/r/987783 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [14:46:14] pfischer, how does that look? [14:46:21] (03CR) 10Dzahn: ""values-landing-page.yaml" - wouldn't that be "values-design-landing-page.yaml" to match the other 2 yaml file names?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:46:26] Tchanders: just a sec. [14:46:28] (03PS1) 10Aklapper: phabricator: quarterly_metrics.sh: Correct Year output [puppet] - 10https://gerrit.wikimedia.org/r/987784 [14:47:21] (03PS2) 10Jelto: miscweb: add design.wikimedia.org services [deployment-charts] - 10https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) [14:48:08] Dreamy_Jazz, no problem!
[14:49:39] (03CR) 10Jelto: miscweb: add design.wikimedia.org services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [14:51:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987781 (owner: 10Muehlenhoff) [14:53:39] Tchanders: it’s alright, we can continue [14:54:09] pfischer, OK [14:54:13] !log tchanders@deploy2002 pfischer and tchanders: Continuing with sync [14:55:48] (03CR) 10Milimetric: [C: 03+1] webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [14:56:08] !log volans@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on mw1378.eqiad.wmnet with reason: WIP hosts to be setup [14:56:22] !log volans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw1378.eqiad.wmnet with reason: WIP hosts to be setup [14:56:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987779 (owner: 10Muehlenhoff) [14:57:10] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:21] (03CR) 10Btullis: "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [14:57:50] (03CR) 10DCausse: [C: 03+1] Search update pipeline: 2nd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/987783 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [14:59:38] !log rebooting mw1378 (downtimed and depooled) to debug reboot issues after reimage - T351074 [14:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:42] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:00:08] !log tchanders@deploy2002 Finished scap: Backport for [[gerrit:984810|enable page_rerender for 2nd batch: dewiki, frwiktionary, and kuwiktionary]] (duration: 17m 55s) [15:00:55] (03CR) 10Ayounsi: [C: 03+2] Depool esams for cr1 maintenance [dns] - 10https://gerrit.wikimedia.org/r/987782 (https://phabricator.wikimedia.org/T346779) (owner: 10Ayounsi) [15:00:57] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: 2nd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/987783 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [15:01:28] !log depool esams for router work - T346779 [15:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:31] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [15:02:17] (03Merged) 10jenkins-bot: Search update pipeline: 2nd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/987783 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [15:02:25] (03PS1) 10Ayounsi: Repool esams after maintenance [dns] - 10https://gerrit.wikimedia.org/r/987732 (https://phabricator.wikimedia.org/T346779) [15:04:43] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:04:44] !log pfischer@deploy2002 helmfile [staging] DONE
helmfile.d/services/cirrus-streaming-updater: apply [15:05:02] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:05:03] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:07:16] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:07:22] (03PS9) 10Brouberol: spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) [15:07:31] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:07:48] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:07:58] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:08:06] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:08:11] !log rebooting mw1378 (downtimed and depooled) to debug reboot issues after reimage - T351074 [15:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:15] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:08:18] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:08:27] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:08:53] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:12:16] !log installing curl security updates [15:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:36] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:13:44] !log pfischer@deploy2002 helmfile [staging] 
DONE helmfile.d/services/cirrus-streaming-updater: apply [15:13:59] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:14:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10Dreamy_Jazz) Are there any updates on this request? [15:14:37] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:15:32] (03PS1) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [15:15:34] (03PS1) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [15:16:09] !log drain esams-eqiad transport - T346779 [15:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:16] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [15:17:17] (03CR) 10Dzahn: [C: 03+1] "lgtm, afaict:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/987758 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [15:18:35] (03CR) 10Btullis: spark3: enable event logging and history server integration for all spark jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [15:19:06] (03CR) 10Dzahn: "the intent of this change is to remove one blocker for "contint on bullseye". 
with an "eq buster" the php74 APT repo won't be installed on" [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [15:19:16] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host mw1378.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:19:54] (03PS4) 10Dzahn: contint: use php7.4 on bullseye just like on buster [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) [15:19:54] !log running sre.hosts.provision for mw1378 - T351074 [15:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:04] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:21:50] (03CR) 10Dzahn: "the intent is to keep using the wmf php packages, so that there is no change for CI besides the rest of the OS upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [15:26:06] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Eevans) [15:26:55] !log disable peering/transit on cr1-esams for linecard reboot - T346779 [15:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:58] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [15:29:23] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw1378.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:30:22] !log reboot fpc0 on cr1-esams - T346779 [15:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:34:29] PROBLEM - Router 
interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 59, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:34:44] expected ^ [15:34:51] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:35:25] !log volans@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1378.eqiad.wmnet [15:36:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:36:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:36:25] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:37:10] !log re-enable peering/transit on cr1-esams - T346779 [15:37:11] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 292 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:26] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [15:37:37] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 320 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:38:57] !log undrain esams-eqiad transport - T346779 
[15:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:06] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [15:42:50] (03PS2) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [15:42:52] (03PS2) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [15:43:11] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 75 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:43:33] (03CR) 10CI reject: [V: 04-1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [15:43:39] (03CR) 10CI reject: [V: 04-1] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) (owner: 10Btullis) [15:46:04] !log volans@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw1378.eqiad.wmnet [15:46:28] (03CR) 10Ayounsi: [C: 03+2] Repool esams after maintenance [dns] - 10https://gerrit.wikimedia.org/r/987732 (https://phabricator.wikimedia.org/T346779) (owner: 10Ayounsi) [15:47:58] !log repool esams - T346779 [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:02] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [15:48:19] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 8 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:51:16] !log rolling restart of FPM/apache on mw canaries to pick up curl updates [15:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 (10ayounsi) a:03ayounsi Error logs stopped showing up after the linecard reboot. Monitoring it for a bit before closing the task. [15:53:32] (03PS1) 10Filippo Giunchedi: prometheus: validate check Prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/987789 [15:58:42] !log installing libdatetime-timezone-perl updates [15:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:21] (03CR) 10Andrew Bogott: [C: 03+2] wmcs admin scripts: run everything through Black [puppet] - 10https://gerrit.wikimedia.org/r/987465 (owner: 10Andrew Bogott) [15:59:32] !log volans@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1378.eqiad.wmnet [16:00:03] !log volans@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1378.eqiad.wmnet [16:12:42] (03PS1) 10Muehlenhoff: Switch role::test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/987791 [16:17:29] (03PS10) 10Brouberol: spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) [16:20:08] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [16:22:07] (03PS3) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [16:22:09] (03PS3) 10Btullis: Add helmfile 
deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [16:25:43] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [16:25:56] !log volans@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1378.eqiad.wmnet [16:26:50] (03PS11) 10Brouberol: spark3: enable event logging and history server integration for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) [16:28:17] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984132 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol) [16:35:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:35:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) (owner: 10Majavah) [16:36:02] !log volans@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw1378.eqiad.wmnet [16:39:00] (03Merged) 10jenkins-bot: cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) (owner: 10Majavah) [16:39:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tox: show black diff on failure [software/conftool] - 10https://gerrit.wikimedia.org/r/987170 (owner: 10Majavah) [16:41:10] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: 
apply [16:41:25] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:41:58] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:42:29] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:42:52] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:42:56] (03Merged) 10jenkins-bot: tox: show black diff on failure [software/conftool] - 10https://gerrit.wikimedia.org/r/987170 (owner: 10Majavah) [16:43:12] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:45:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:50:14] (03PS1) 10Giuseppe Lavagetto: requestctl: ensure no irc logging happens [software/conftool] - 10https://gerrit.wikimedia.org/r/987792 (https://phabricator.wikimedia.org/T354209) [16:50:16] (03PS1) 10Giuseppe Lavagetto: Release 2.3.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/987793 [16:51:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:53:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: ensure no irc logging happens [software/conftool] - 10https://gerrit.wikimedia.org/r/987792 
(https://phabricator.wikimedia.org/T354209) (owner: 10Giuseppe Lavagetto) [16:56:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:56:33] (03Merged) 10jenkins-bot: requestctl: ensure no irc logging happens [software/conftool] - 10https://gerrit.wikimedia.org/r/987792 (https://phabricator.wikimedia.org/T354209) (owner: 10Giuseppe Lavagetto) [16:57:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Release 2.3.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/987793 (owner: 10Giuseppe Lavagetto) [16:58:04] thanks _joe_ [17:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:11] <_joe_> taavi: I'm in the process of building it [17:03:03] (03PS4) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [17:03:05] (03PS4) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [17:03:52] (03CR) 10CI reject: [V: 04-1] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) (owner: 10Btullis) [17:03:54] (03CR) 10CI reject: [V: 04-1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [17:09:05] RECOVERY - Host mw1377 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:10:36] !log oblivian@puppetmaster2001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=mw1377.* [17:10:46] <_joe_> taavi: ^^ [17:10:48] <_joe_> :) [17:12:28] (03PS5) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [17:12:30] (03PS5) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [17:13:08] (03CR) 10CI reject: [V: 04-1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [17:13:12] (03CR) 10CI reject: [V: 04-1] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) (owner: 10Btullis) [17:14:49] PROBLEM - Host mw1377 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:07] PROBLEM - Host mw1380 is DOWN: PING CRITICAL - Packet loss = 
100% [17:16:13] PROBLEM - Host mw1379 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:29] RECOVERY - Host mw1380 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:16:43] RECOVERY - Host mw1379 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:16:49] (03PS6) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [17:16:51] (03PS6) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [17:17:05] RECOVERY - Host mw1377 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [17:17:29] (03CR) 10CI reject: [V: 04-1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [17:17:34] (03CR) 10CI reject: [V: 04-1] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) (owner: 10Btullis) [17:22:44] (03PS7) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [17:22:46] (03PS7) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [17:24:32] (03CR) 10CI reject: [V: 04-1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [17:24:34] (03CR) 10CI reject: [V: 04-1] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) (owner: 10Btullis) [17:24:49] (03PS1) 10Kamila Součková: mw1377: change role to insetup for debugging [puppet] - 10https://gerrit.wikimedia.org/r/987797 
(https://phabricator.wikimedia.org/T351074) [17:25:55] (03CR) 10Kamila Součková: [C: 03+2] mw1377: change role to insetup for debugging [puppet] - 10https://gerrit.wikimedia.org/r/987797 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [17:26:29] (03PS1) 10RobH: updated for energy star sku [software] - 10https://gerrit.wikimedia.org/r/987798 [17:28:46] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1377.eqiad.wmnet with OS bullseye [17:30:25] (03CR) 10RobH: [C: 03+2] updated for energy star sku [software] - 10https://gerrit.wikimedia.org/r/987798 (owner: 10RobH) [17:30:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:31:25] (03PS1) 10Andrew Bogott: mwopenstackclients: allow passing the 'edit-managed' flag to designate [puppet] - 10https://gerrit.wikimedia.org/r/987799 (https://phabricator.wikimedia.org/T354365) [17:31:27] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: use upstream designate clients [puppet] - 10https://gerrit.wikimedia.org/r/987800 (https://phabricator.wikimedia.org/T354365) [17:34:59] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:35:15] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:42:42] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:42:57] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:43:42] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [17:46:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [17:52:56] (03CR) 10Xcollazo: [C: 03+1] "Fair enough." [puppet] - 10https://gerrit.wikimedia.org/r/986181 (owner: 10Ladsgroup) [17:54:09] (MXQueueHigh) firing: MX host mx2001:9100 has many queued messages: 4843 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [17:55:01] looking [17:55:07] * Emperor summoned by phone [17:55:28] taking a look as well [17:56:39] looks like a lot of mails to one gmail user (and thus gmail is rate-limiting) [17:57:00] see -security [17:57:53] !log mx2001: exiqgrep -i -r w*@gmail.com | xargs exim -Mrm [17:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1800). 
nyaa~ [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1800) [18:01:40] nothing for me today [18:02:30] (03PS8) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [18:02:32] (03PS8) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [18:03:12] (03CR) 10CI reject: [V: 04-1] Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [18:03:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1377.eqiad.wmnet with OS bullseye [18:03:20] (03CR) 10CI reject: [V: 04-1] Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) (owner: 10Btullis) [18:04:09] (MXQueueHigh) resolved: MX host mx2001:9100 has many queued messages: 4902 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [18:22:40] (03PS9) 10Btullis: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) [18:22:42] (03PS9) 10Btullis: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353790) [18:27:58] (03PS1) 10Dreamy Jazz: Check for invalid JSON on a good response from PhotoDNA [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987734 (https://phabricator.wikimedia.org/T354370) [18:28:10] (03PS1) 10Dreamy Jazz: Check for invalid JSON on a good response from PhotoDNA [extensions/MediaModeration] 
(wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987735 (https://phabricator.wikimedia.org/T354370) [18:42:10] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:46:54] !log [second time] mx2001: exiqgrep -i -r w*@gmail.com | xargs exim -Mrm [18:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:10] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:05] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T1900). [19:04:55] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:04:59] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:06:15] dancy, Jdlrobson: o/ [19:06:31] o/ [19:06:38] let's start with the backport for https://phabricator.wikimedia.org/T353850 [19:07:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dduvall@deploy2002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987473 (https://phabricator.wikimedia.org/T353850) (owner: 10Jdlrobson) [19:11:26] random thought: scap backport could probably query gerrit for the check (ci job) progress and display it [19:11:58] It could definitely do that. 
[19:13:16] (03CR) 10Dzahn: [C: 03+2] releases: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987436 (owner: 10Muehlenhoff) [19:13:16] (KubernetesRsyslogDown) firing: rsyslog on mw1381:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1381 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:18:16] (KubernetesRsyslogDown) firing: (6) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:22:45] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:22:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:24:17] (03Merged) 10jenkins-bot: Revise logic for creating compact links button on Vector 2022 [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987473 (https://phabricator.wikimedia.org/T353850) (owner: 10Jdlrobson) [19:24:40] !log dduvall@deploy2002 Started scap: Backport for [[gerrit:987473|Revise logic for creating compact links button on Vector 2022 (T353850)]] [19:24:44] T353850: UniversalLanguageSelector compact links and @wikimedia/codex loads on page load - https://phabricator.wikimedia.org/T353850 [19:25:02] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:25:10] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[19:26:12] !log dduvall@deploy2002 jdlrobson and dduvall: Backport for [[gerrit:987473|Revise logic for creating compact links button on Vector 2022 (T353850)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:26:42] !log dduvall@deploy2002 jdlrobson and dduvall: Continuing with sync [19:26:58] continuing the sync since this is only to testwikis [19:28:49] dduvall: let me know when synced and i can test! [19:29:15] Jdlrobson: 👍 [19:30:10] (03CR) 10Dzahn: [C: 03+2] "private.pp and security.pp are affecting deployment servers, common.pp affects the actual releases server(s)" [puppet] - 10https://gerrit.wikimedia.org/r/987436 (owner: 10Muehlenhoff) [19:32:39] !log dduvall@deploy2002 Finished scap: Backport for [[gerrit:987473|Revise logic for creating compact links button on Vector 2022 (T353850)]] (duration: 07m 58s) [19:32:43] T353850: UniversalLanguageSelector compact links and @wikimedia/codex loads on page load - https://phabricator.wikimedia.org/T353850 [19:32:49] Jdlrobson: ^ [19:32:54] (03PS1) 10Ebernhardson: cirrus-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/987812 [19:37:23] dduvall: it worked \o/ [19:40:18] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/987812 (owner: 10Ebernhardson) [19:41:09] (03Merged) 10jenkins-bot: cirrus-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/987812 (owner: 10Ebernhardson) [19:41:27] Jdlrobson: yay. 
thanks so much for the fix [19:41:34] rolling group0 [19:41:48] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987813 (https://phabricator.wikimedia.org/T350088) [19:41:50] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987813 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [19:42:34] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987813 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [19:47:08] (03CR) 10Ryan Kemper: [C: 03+2] elastic: test out elastic2087 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/987189 (owner: 10Ryan Kemper) [19:47:12] (03CR) 10Ryan Kemper: [C: 03+2] elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188 (https://phabricator.wikimedia.org/T353878) (owner: 10Ryan Kemper) [19:47:24] !log deploy1002 - systemctl start rsync-patches_module after gerrit:987436 [19:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:50] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.12 refs T350088 [19:49:54] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [19:51:51] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:52:02] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:53:45] i'll wait until the hour and then roll group1 [19:55:59] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:57:11] !log repooling wdqs1019 [19:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:07] !log releases1003 - systemctl start rsync-srv-patches-releases-primary after gerrit:987436 [19:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:47] !log restarting pybal on lvs5006 for testing purposes - T353760 [19:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:54] T353760: pybal_monitor_down_results_total metric only created when PyBal goes down - https://phabricator.wikimedia.org/T353760 [20:01:03] !log releases2003 - systemctl start rsync-srv-patches-releases2003.codfw.wmnet after gerrit:987436 [20:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:39] !log releases2003 - systemctl status rsync-srv-org-wikimedia-releases-releases2003.codfw.wmnet after gerrit:987436 [20:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:18] (03CR) 10Dzahn: [C: 03+2] "did some tests:" [puppet] - 10https://gerrit.wikimedia.org/r/987436 (owner: 10Muehlenhoff) [20:04:41] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987816 (https://phabricator.wikimedia.org/T350088) [20:04:43] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987816 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [20:06:05] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987816 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [20:07:11] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:47] (03PS4) 10Eevans: restbase: set production role and add config for restbase2034 [puppet] - 10https://gerrit.wikimedia.org/r/981608 (https://phabricator.wikimedia.org/T352468) [20:08:48] (03PS4) 10Eevans: restbase: set production role and add config for restbase2035 [puppet] - 10https://gerrit.wikimedia.org/r/981609 (https://phabricator.wikimedia.org/T352468) [20:13:56] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.12 refs T350088 [20:14:07] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [20:16:11] (03CR) 10Dzahn: [C: 03+2] mwmaint: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987778 (owner: 10Muehlenhoff) [20:20:06] !log dduvall@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.12 refs T350088 (duration: 06m 09s) [20:20:13] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [20:20:32] (03PS1) 10Dreamy Jazz: Ensure all non-okay statuses from ::getImageContents have a message [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987737 (https://phabricator.wikimedia.org/T354374) [20:21:07] (03PS1) 10Dreamy Jazz: Ensure all non-okay statuses from ::getImageContents have a message [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987738 (https://phabricator.wikimedia.org/T354374) [20:21:35] (03CR) 10Dr0ptp4kt: "Hopefully quick question." 
[puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [20:27:10] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:30:05] !log mwmaint2002 - /usr/local/sbin/sync-home-mwmaint after gerrit:987778 [20:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:28] (03CR) 10Dzahn: [C: 03+2] "this unified rules into a single file /etc/ferm/conf.d/10_rsyncd_access_home_mwmaint on mwmaint1002. mwmaint2002 is allowed to connect to " [puppet] - 10https://gerrit.wikimedia.org/r/987778 (owner: 10Muehlenhoff) [20:31:05] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987819 (https://phabricator.wikimedia.org/T350088) [20:31:07] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987819 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [20:31:10] (03CR) 10Dzahn: [C: 03+1] deployment servers: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987779 (owner: 10Muehlenhoff) [20:31:52] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987819 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [20:32:10] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:32:38] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2034 [puppet] - 
10https://gerrit.wikimedia.org/r/981608 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [20:34:07] (03CR) 10Dzahn: [C: 03+2] "no automatic sync / timer here. since I started the sync manually it's now copying many gigabytes from /home/ladsgroup/moveToExternal/ an" [puppet] - 10https://gerrit.wikimedia.org/r/987778 (owner: 10Muehlenhoff) [20:39:05] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.12 refs T350088 [20:39:08] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [20:41:11] !log [apifeatureusage] T350703 Restarted `logstash` on `apifeatureusage[1,2]001` [20:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:15] T350703: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 [21:00:04] brennen and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240104T2100). [21:00:04] Dreamy_Jazz: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] \o [21:00:58] My backports cannot be tested as they backport code only used by a maintenance script that is run manually. The script will be tested once the scan is re-started on testwiki either later today or tommorrow. [21:01:18] o/ [21:02:09] Dreamy_Jazz: we're canceling the regular backport training sessions at this time, but i'm around atm and can deploy. [21:02:22] Sure. Thanks. [21:04:20] i think we can skip the .10 backports here? looks like train has reached all wikis. 
[21:04:33] (03CR) 10Dzahn: [C: 04-1] "compiler shows only on gerrit1003 the rsync service and config gets installed, while all other things happen on both servers." [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [21:05:16] We can if you prefer [21:05:44] I had made them just in case the train rolls back to wmf.10, but don't need to do them necessarily. [21:06:10] I'll go ahead and abandon the wmf.10 backports. [21:06:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987734 (https://phabricator.wikimedia.org/T354370) (owner: 10Dreamy Jazz) [21:06:44] (03Abandoned) 10Dreamy Jazz: Check for invalid JSON on a good response from PhotoDNA [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987735 (https://phabricator.wikimedia.org/T354370) (owner: 10Dreamy Jazz) [21:06:51] (03Abandoned) 10Dreamy Jazz: Ensure all non-okay statuses from ::getImageContents have a message [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/987737 (https://phabricator.wikimedia.org/T354374) (owner: 10Dreamy Jazz) [21:08:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10thcipriani) Approved! Sorry for the delay! [21:08:39] (03CR) 10Brennen Bearnes: [C: 03+2] Ensure all non-okay statuses from ::getImageContents have a message [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987738 (https://phabricator.wikimedia.org/T354374) (owner: 10Dreamy Jazz) [21:08:45] (03CR) 10Dzahn: [C: 04-1] "btw: cool how you fixed the previous CI issue! 
didn't know" [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [21:08:51] (03Merged) 10jenkins-bot: Check for invalid JSON on a good response from PhotoDNA [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987734 (https://phabricator.wikimedia.org/T354370) (owner: 10Dreamy Jazz) [21:09:08] !log brennen@deploy2002 Started scap: Backport for [[gerrit:987734|Check for invalid JSON on a good response from PhotoDNA (T354370)]] [21:09:12] T354370: Argument 1 passed to MediaWiki\Extension\MediaModeration\PhotoDNA\Response::newFromArray() must be of the type array, null given - https://phabricator.wikimedia.org/T354370 [21:10:18] (03CR) 10Dzahn: [C: 04-1] [gerrit] Add rsync job for lfs sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [21:10:43] !log brennen@deploy2002 brennen and dreamyjazz: Backport for [[gerrit:987734|Check for invalid JSON on a good response from PhotoDNA (T354370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:57] (03Merged) 10jenkins-bot: Ensure all non-okay statuses from ::getImageContents have a message [extensions/MediaModeration] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987738 (https://phabricator.wikimedia.org/T354374) (owner: 10Dreamy Jazz) [21:11:14] !log brennen@deploy2002 brennen and dreamyjazz: Continuing with sync [21:11:40] PROBLEM - cassandra-a CQL 10.192.48.234:9042 on restbase2034 is CRITICAL: connect to address 10.192.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [21:12:40] (03CR) 10Dzahn: [C: 03+2] research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/987707 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [21:17:06] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:987734|Check for invalid JSON on a 
good response from PhotoDNA (T354370)]] (duration: 07m 57s) [21:17:14] T354370: Argument 1 passed to MediaWiki\Extension\MediaModeration\PhotoDNA\Response::newFromArray() must be of the type array, null given - https://phabricator.wikimedia.org/T354370 [21:18:04] !log brennen@deploy2002 Started scap: Backport for [[gerrit:987738|Ensure all non-okay statuses from ::getImageContents have a message (T354374)]] [21:18:08] T354374: Internal error: MediaWiki\Status\StatusFormatter::getWikiText: Invalid result object: no error text but not OK in output of scanning script - https://phabricator.wikimedia.org/T354374 [21:19:51] !log brennen@deploy2002 brennen and dreamyjazz: Backport for [[gerrit:987738|Ensure all non-okay statuses from ::getImageContents have a message (T354374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:20:16] Dreamy_Jazz: ok, going ahead with this last one [21:20:19] Thanks! [21:20:20] !log brennen@deploy2002 brennen and dreamyjazz: Continuing with sync [21:22:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10Dreamy_Jazz) >>! In T353735#9436330, @thcipriani wrote: > Approved! Sorry for the delay! Thanks! [21:26:06] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:987738|Ensure all non-okay statuses from ::getImageContents have a message (T354374)]] (duration: 08m 01s) [21:26:10] T354374: Internal error: MediaWiki\Status\StatusFormatter::getWikiText: Invalid result object: no error text but not OK in output of scanning script - https://phabricator.wikimedia.org/T354374 [21:26:22] (03CR) 10Dzahn: contint: use php7.4 on bullseye just like on buster (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [21:26:35] Thanks for deploying. 
[21:27:05] sure thing [21:27:11] !log end of utc late backport window [21:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:37] (03CR) 10Dzahn: contint: use php7.4 on bullseye just like on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [21:38:24] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:38:34] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:38:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:43:09] (03PS1) 10BCornwall: Add new release of wmf-debci images for building [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987837 (https://phabricator.wikimedia.org/T352003) [21:45:42] (03PS2) 10BCornwall: Add new release of wmf-debci images for building [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987837 (https://phabricator.wikimedia.org/T352003) [21:46:51] (03CR) 10BCornwall: [C: 03+1] Add new release of wmf-debci images for building [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987837 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [21:47:48] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Add new release of wmf-debci images for building [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/987837 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [21:51:39] (03PS1) 10RobH: updated config F skus [software] - 10https://gerrit.wikimedia.org/r/987841 [21:52:01] (03CR) 10RobH: [C: 03+2] updated config F skus [software] - 10https://gerrit.wikimedia.org/r/987841 (owner: 10RobH) [21:52:33] (03Merged) 
10jenkins-bot: updated config F skus [software] - 10https://gerrit.wikimedia.org/r/987841 (owner: 10RobH) [22:00:22] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [22:00:26] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [22:03:37] (03PS1) 10RobH: updates for Config G [software] - 10https://gerrit.wikimedia.org/r/987843 [22:03:59] (03CR) 10RobH: [C: 03+2] updates for Config G [software] - 10https://gerrit.wikimedia.org/r/987843 (owner: 10RobH) [22:04:30] (03Merged) 10jenkins-bot: updates for Config G [software] - 10https://gerrit.wikimedia.org/r/987843 (owner: 10RobH) [22:21:09] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:21:29] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:21:34] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:22:01] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:22:08] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:22:38] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:24:06] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [22:24:14] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[22:25:30] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:25:39] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:27:11] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:29:11] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:29:14] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:29:47] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:29:58] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:31:31] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:31:40] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:32:11] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:33:35] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:33:49] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:33:53] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:34:26] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[22:36:36] (03PS2) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) [22:36:53] (03PS1) 10RobH: R750xs sku updates [software] - 10https://gerrit.wikimedia.org/r/987845 [22:37:51] (03CR) 10BCornwall: "Hello again! We've accumulated more data and a preview of the dashboard can be found at https://grafana-rw.wikimedia.org/dashboard/snapsho" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [22:38:28] (03CR) 10RobH: [C: 03+2] R750xs sku updates [software] - 10https://gerrit.wikimedia.org/r/987845 (owner: 10RobH) [22:39:58] (03CR) 10BCornwall: "Hello again! We've accumulated more data and a preview of the dashboard can be found at https://grafana-rw.wikimedia.org/dashboard/snapsho" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [22:48:17] (03PS2) 10Andrew Bogott: mwopenstackclients: allow passing the 'edit-managed' flag to designate [puppet] - 10https://gerrit.wikimedia.org/r/987799 (https://phabricator.wikimedia.org/T354365) [22:48:19] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: use upstream designate clients [puppet] - 10https://gerrit.wikimedia.org/r/987800 (https://phabricator.wikimedia.org/T354365) [22:48:21] (03PS1) 10Andrew Bogott: Rename wmcs-novastats-dnsleaks to wmcs-dnsleaks [puppet] - 10https://gerrit.wikimedia.org/r/987851 (https://phabricator.wikimedia.org/T354365) [22:48:23] (03PS1) 10Andrew Bogott: wmcs-dnsleaks.py: add the --to-prometheus flag [puppet] - 10https://gerrit.wikimedia.org/r/987852 (https://phabricator.wikimedia.org/T354365) [22:48:25] (03PS1) 10Andrew Bogott: wmcs-dnsleaks: Add prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/987853 (https://phabricator.wikimedia.org/T354365) [22:49:26] (03CR) 10CI reject: [V: 04-1] wmcs-dnsleaks.py: add the 
--to-prometheus flag [puppet] - 10https://gerrit.wikimedia.org/r/987852 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [22:50:43] (03PS1) 10RobH: config I skus [software] - 10https://gerrit.wikimedia.org/r/987855 [22:50:51] (03CR) 10CI reject: [V: 04-1] config I skus [software] - 10https://gerrit.wikimedia.org/r/987855 (owner: 10RobH) [22:57:46] (03PS1) 10RobH: update sku for g15 restbase [software] - 10https://gerrit.wikimedia.org/r/987857 [22:58:09] (03CR) 10CI reject: [V: 04-1] update sku for g15 restbase [software] - 10https://gerrit.wikimedia.org/r/987857 (owner: 10RobH) [22:58:30] (03CR) 10RobH: [C: 03+2] config I skus [software] - 10https://gerrit.wikimedia.org/r/987855 (owner: 10RobH) [22:58:51] (03CR) 10RobH: [V: 03+2 C: 03+2] config I skus [software] - 10https://gerrit.wikimedia.org/r/987855 (owner: 10RobH) [22:59:13] (03PS1) 10Andrew Bogott: team-wmcs: alert when stray DNS records appear in designate. [alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) [22:59:20] (03CR) 10RobH: [C: 03+2] update sku for g15 restbase [software] - 10https://gerrit.wikimedia.org/r/987857 (owner: 10RobH) [22:59:54] (03Merged) 10jenkins-bot: update sku for g15 restbase [software] - 10https://gerrit.wikimedia.org/r/987857 (owner: 10RobH) [23:00:09] (03CR) 10Andrew Bogott: "That {{ $value }} bit is cribbed from another alert here but I'm not sure I understand what it actually does." [alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [23:01:00] (03CR) 10CI reject: [V: 04-1] team-wmcs: alert when stray DNS records appear in designate. 
[alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [23:01:28] (03PS1) 10Bking: aptrepo: add Elastic-related components to bookworm repo [puppet] - 10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) [23:02:02] (03PS2) 10Andrew Bogott: wmcs-dnsleaks.py: add the --to-prometheus flag [puppet] - 10https://gerrit.wikimedia.org/r/987852 (https://phabricator.wikimedia.org/T354365) [23:02:04] (03PS2) 10Andrew Bogott: wmcs-dnsleaks: Add prometheus metric [puppet] - 10https://gerrit.wikimedia.org/r/987853 (https://phabricator.wikimedia.org/T354365) [23:05:55] (03PS2) 10Andrew Bogott: team-wmcs: alert when stray DNS records appear in designate. [alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) [23:07:08] (03CR) 10CI reject: [V: 04-1] team-wmcs: alert when stray DNS records appear in designate. [alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [23:10:25] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [23:10:32] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:16:41] (03CR) 10Andrew Bogott: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/987858 (https://phabricator.wikimedia.org/T354365) (owner: 10Andrew Bogott) [23:17:21] (03PS1) 10VolkerE: styles: Replace obsolete WikimediaUI Base var with Codex alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987861 [23:18:16] (KubernetesRsyslogDown) firing: (6) rsyslog on mw1378:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown