[00:14:10] (03PS1) 10Jdlrobson: Wordmark for blk wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966314 (https://phabricator.wikimedia.org/T341257) [00:31:09] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965574 (owner: 10TrainBranchBot) [00:38:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965577 [00:38:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965577 (owner: 10TrainBranchBot) [00:50:16] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:54:31] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965577 (owner: 10TrainBranchBot) [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T349047 (10phaultfinder) [01:24:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (10phaultfinder) [01:24:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T343198)', diff saved to https://phabricator.wikimedia.org/P52978 and previous config saved to /var/cache/conftool/dbconfig/20231017-012459-arnaudb.json [01:25:05] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [01:40:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P52979 and previous config saved to /var/cache/conftool/dbconfig/20231017-014005-arnaudb.json [01:55:12] !log 
arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P52980 and previous config saved to /var/cache/conftool/dbconfig/20231017-015511-arnaudb.json [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T0200) [02:05:51] 10SRE-swift-storage, 10TimedMediaHandler, 10MW-1.41-notes (1.41.0-wmf.30; 2023-10-10), 10MW-1.42-notes (1.42.0-wmf.1; 2023-10-17), and 2 others: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.w... - https://phabricator.wikimedia.org/T348753 [02:07:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.1 [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/965578 (https://phabricator.wikimedia.org/T348354) [02:07:45] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.1 [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/965578 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [02:10:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T343198)', diff saved to https://phabricator.wikimedia.org/P52981 and previous config saved to /var/cache/conftool/dbconfig/20231017-021018-arnaudb.json [02:10:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [02:10:24] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [02:10:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [02:10:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T343198)', diff saved to https://phabricator.wikimedia.org/P52982 and previous config saved to 
/var/cache/conftool/dbconfig/20231017-021040-arnaudb.json [02:21:04] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.1 [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/965578 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [02:38:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:17] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T0300) [03:01:40] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966320 (https://phabricator.wikimedia.org/T348354) [03:01:42] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966320 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [03:02:27] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966320 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [03:02:55] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.1 refs T348354 [03:03:00] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [03:03:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:39:57] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (10phaultfinder) [03:53:11] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.1 refs T348354 (duration: 50m 15s) [03:53:15] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [03:55:28] !log mwpresync@deploy2002 Pruned MediaWiki: 1.41.0-wmf.29 (duration: 02m 15s) [04:50:16] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:05:05] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/965579 (https://phabricator.wikimedia.org/T349053) [05:05:10] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/965580 (https://phabricator.wikimedia.org/T349053) [05:16:15] * kart_ deploying MinT.. 
[05:16:21] (03CR) 10Marostegui: [C: 03+2] check_private_data: Add Arnaud [puppet] - 10https://gerrit.wikimedia.org/r/966222 (owner: 10Marostegui) [05:16:49] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-10-16-101614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/966170 (https://phabricator.wikimedia.org/T333969) (owner: 10KartikMistry) [05:17:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T349053 [05:17:11] T349053: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T349053 [05:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2165 with weight 0 T349053', diff saved to https://phabricator.wikimedia.org/P52983 and previous config saved to /var/cache/conftool/dbconfig/20231017-051723-root.json [05:17:39] (03Merged) 10jenkins-bot: Update MinT to 2023-10-16-101614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/966170 (https://phabricator.wikimedia.org/T333969) (owner: 10KartikMistry) [05:17:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T349053 [05:19:07] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/965579 (https://phabricator.wikimedia.org/T349053) (owner: 10Gerrit maintenance bot) [05:19:10] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [05:21:33] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [05:24:19] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [05:29:51] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [05:31:28] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [05:33:31] !log kartik@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/machinetranslation: apply [05:34:55] ah. Seems failing to upgrade. [05:36:48] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [05:36:50] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [05:41:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:42:15] (03PS2) 10Ilias Sarantopoulos: ml-services: deploy new Bullseye version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) [05:42:28] (03PS3) 10Ilias Sarantopoulos: service: Add entry for llm langid for Lift Wing in the api-gw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) [05:44:00] (HelmReleaseBadStatus) firing: Helm release machinetranslation/production on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:54:49] (03CR) 10Ilias Sarantopoulos: [C: 03+2] service: Add entry for llm langid for Lift Wing in the api-gw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [05:55:37] (03Merged) 10jenkins-bot: service: Add entry for llm langid for Lift Wing in the api-gw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [05:56:29] hmm. Not sure why it went to pending-upgrade status for machinetranslation. 
[05:59:44] !log Update MinT to 2023-10-16-101614-production (T333969, T336683, T348097) [05:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:51] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969 [05:59:52] T348097: Twi is listed as Akan in the MinT translation interface - https://phabricator.wikimedia.org/T348097 [05:59:52] T336683: Enable MinT support for languages with no Wikipedia yet - https://phabricator.wikimedia.org/T336683 [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T0600) [06:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T0600). [06:00:10] !log Starting s8 codfw failover from db2161 to db2165 - T349053 [06:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s8 codfw as read-only for maintenance - T349053', diff saved to https://phabricator.wikimedia.org/P52984 and previous config saved to /var/cache/conftool/dbconfig/20231017-060021-root.json [06:00:26] T349053: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T349053 [06:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2165 to s8 primary and set section read-write T349053', diff saved to https://phabricator.wikimedia.org/P52985 and previous config saved to /var/cache/conftool/dbconfig/20231017-060047-root.json [06:01:34] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/965580 (https://phabricator.wikimedia.org/T349053) (owner: 10Gerrit maintenance bot) [06:02:01] !log isaranto@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: 
sync [06:02:13] !log isaranto@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [06:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2161 T349053', diff saved to https://phabricator.wikimedia.org/P52986 and previous config saved to /var/cache/conftool/dbconfig/20231017-060214-root.json [06:04:35] !log isaranto@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [06:05:04] !log isaranto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [06:06:35] !log isaranto@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [06:06:54] !log isaranto@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [06:08:34] Anyone know how to get detailed log of 'pending-upgrade' status of MinT? https://grafana.wikimedia.org/d/UT4GtK3nz/helm-releases?var-site=codfw&var-cluster=k8s&var-namespace=machinetranslation&orgId=1 - I did see some connection errors but didn't note that logs :/ [06:11:06] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:13:44] (03PS1) 10Marostegui: production-m5.sql.erb: Remove testreduce grants [puppet] - 10https://gerrit.wikimedia.org/r/966327 (https://phabricator.wikimedia.org/T345831) [06:15:11] (03CR) 10Marostegui: [C: 03+2] Add Auto-Submitted: auto-generated to check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/963953 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [06:22:23] (03PS1) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [06:22:48] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) 
[06:23:20] (03PS2) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [06:23:46] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [06:25:44] (03PS3) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [06:26:11] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [06:26:21] (03PS4) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [06:31:37] (03PS5) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [06:32:33] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [06:35:09] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [06:40:46] (03PS6) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [06:42:17] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:42:19] (03CR) 10Marostegui: "puppet looks good https://puppet-compiler.wmflabs.org/output/966329/44079/" [puppet] - 
10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [06:43:26] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [06:46:09] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10ayounsi) The issue is that MSS and MTU are tightly coupled. If we increase the MTU on the realservers to allow for the encapsulation overhead (eg. to 9000), the realservers will ad... [06:58:14] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) @RobH I just want to make sure you saw Alex's message to you above. iirc you took care of some of the other moves. [07:00:05] Amir1, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:20] (03CR) 10Jelto: [C: 03+2] gitlab: add hardware 2fa issues to gitlab-replica banner [puppet] - 10https://gerrit.wikimedia.org/r/966255 (https://phabricator.wikimedia.org/T330639) (owner: 10Jelto) [07:03:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:05:25] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) [07:05:57] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/966279 (https://phabricator.wikimedia.org/T348987) (owner: 10AOkoth) [07:07:30] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:monitoring Puppet runs are now monitored by Prometheus. [puppet] - 10https://gerrit.wikimedia.org/r/964869 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [07:09:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Marostegui) [07:13:13] good morning [07:13:27] Good morning :-) [07:18:31] 10SRE, 10SRE-Access-Requests: New SSH key for Jeff Green - https://phabricator.wikimedia.org/T348981 (10SLyngshede-WMF) p:05Triage→03Medium a:03SLyngshede-WMF [07:21:56] (03CR) 10Hashar: [C: 03+2] "I have tested it locally using the production Gerrit as a backend and https://gerrit.wikimedia.org/r/c/integration/config/+/965869 as chan" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/966178 (https://phabricator.wikimedia.org/T348920) (owner: 10Hashar) [07:22:31] (03Merged) 10jenkins-bot: wm-checks-api: filter out Zuul start messages [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/966178 (https://phabricator.wikimedia.org/T348920) (owner: 10Hashar) [07:22:58] !log hashar@deploy2002 Started 
deploy [gerrit/gerrit@1153a16]: wm-checks-api: filter out Zuul start messages | T348920 [07:23:02] T348920: [wm-checks-api] Shows 'Starting gate-and-submit jobs.' as a run attempt - https://phabricator.wikimedia.org/T348920 [07:23:03] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@1153a16]: wm-checks-api: filter out Zuul start messages | T348920 (duration: 00m 05s) [07:24:13] (03PS1) 10Slyngshede: data.yaml: Update SSH key for user jgreen. [puppet] - 10https://gerrit.wikimedia.org/r/966446 [07:25:01] (03CR) 10CI reject: [V: 04-1] data.yaml: Update SSH key for user jgreen. [puppet] - 10https://gerrit.wikimedia.org/r/966446 (owner: 10Slyngshede) [07:25:04] I forgot to rebase :-( [07:26:03] (03PS2) 10Slyngshede: data.yaml: Update SSH key for user jgreen. [puppet] - 10https://gerrit.wikimedia.org/r/966446 [07:26:05] !log hashar@deploy2002 Started deploy [gerrit/gerrit@578be93]: wm-checks-api: filter out Zuul start messages | T348920 [07:26:12] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@578be93]: wm-checks-api: filter out Zuul start messages | T348920 (duration: 00m 07s) [07:26:55] (03PS3) 10Slyngshede: data.yaml: Update SSH key for user jgreen. [puppet] - 10https://gerrit.wikimedia.org/r/966446 (https://phabricator.wikimedia.org/T348981) [07:27:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: New SSH key for Jeff Green - https://phabricator.wikimedia.org/T348981 (10SLyngshede-WMF) No change for access right. Waiting for out of band verification of SSH key. [07:51:44] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [07:55:27] (03PS6) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [07:55:31] 10SRE-swift-storage, 10Move-Files-To-Commons, 10WMDE-TechWish-Maintenance, 10Patch-For-Review, 10Wikimedia-production-error: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 (10thiemowmde) [07:55:42] (03PS1) 10Elukey: Add Lift Wing LangId SLO definition [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/966486 (https://phabricator.wikimedia.org/T340507) [07:56:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44081/console" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [08:01:37] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/965162 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [08:01:45] (03CR) 10Btullis: Remove kafka-jumbo100[1-6] from puppet [puppet] - 10https://gerrit.wikimedia.org/r/965162 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [08:02:15] (03CR) 10Brouberol: [C: 03+2] Remove kafka-jumbo100[1-6] from puppet [puppet] - 10https://gerrit.wikimedia.org/r/965162 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [08:02:46] (03CR) 10Btullis: "Hang on, I wonder if we should put them into a different role like spare instead of just removing them from site pp" [puppet] - 10https://gerrit.wikimedia.org/r/965162 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [08:06:12] RECOVERY - Check systemd state on kubernetes2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:42] (03CR) 10Btullis: Remove kafka-jumbo100[1-6] from puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965162 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) 
[08:07:55] (03CR) 10Elukey: "Preview in git fetch https://gerrit.wikimedia.org/r/operations/grafana-grizzly refs/changes/86/966486/1 && git cherry-pick FETCH_HEAD" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/966486 (https://phabricator.wikimedia.org/T340507) (owner: 10Elukey) [08:10:47] (03PS1) 10Brouberol: Revert "Remove kafka-jumbo100[1-6] from puppet" [puppet] - 10https://gerrit.wikimedia.org/r/966233 [08:11:00] (03CR) 10Klausman: [C: 03+1] Add Lift Wing LangId SLO definition [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/966486 (https://phabricator.wikimedia.org/T340507) (owner: 10Elukey) [08:11:13] (03CR) 10CI reject: [V: 04-1] Revert "Remove kafka-jumbo100[1-6] from puppet" [puppet] - 10https://gerrit.wikimedia.org/r/966233 (owner: 10Brouberol) [08:12:03] (03PS2) 10Brouberol: Revert "Remove kafka-jumbo100[1-6] from puppet" [puppet] - 10https://gerrit.wikimedia.org/r/966233 [08:12:39] (03CR) 10Btullis: [C: 03+1] Revert "Remove kafka-jumbo100[1-6] from puppet" [puppet] - 10https://gerrit.wikimedia.org/r/966233 (owner: 10Brouberol) [08:12:59] (03CR) 10Brouberol: [C: 03+2] Revert "Remove kafka-jumbo100[1-6] from puppet" [puppet] - 10https://gerrit.wikimedia.org/r/966233 (owner: 10Brouberol) [08:16:43] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add Lift Wing LangId SLO definition [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/966486 (https://phabricator.wikimedia.org/T340507) (owner: 10Elukey) [08:21:38] (03PS1) 10Volans: dhcp: adapt to new Spicerack's dhcp() API [cookbooks] - 10https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) [08:23:48] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:26:14] (03CR) 10CI reject: [V: 04-1] dhcp: adapt to new Spicerack's dhcp() API [cookbooks] - 10https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) (owner: 
10Volans) [08:26:58] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:32:30] !log push pfw policies - T348576 [08:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:49] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [08:38:15] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [08:39:00] (HelmReleaseBadStatus) resolved: Helm release machinetranslation/production on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:39:17] (03CR) 10JMeybohm: [C: 03+1] mw-api-ext, mw-web: Raise replicas 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/964457 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [08:41:26] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/961823 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [08:43:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:20] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:36] (03CR) 10Ayounsi: [C: 03+2] Add "Auto-Submitted" header to dbbackup scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/962940 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [08:49:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:55:58] (03PS1) 10David Caro: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) [08:58:25] (03PS1) 10Volans: 
check_icinga: add Auto-Submitted header to emails [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/966493 (https://phabricator.wikimedia.org/T347835) [08:59:22] (03CR) 10Ayounsi: [C: 03+1] check_icinga: add Auto-Submitted header to emails [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/966493 (https://phabricator.wikimedia.org/T347835) (owner: 10Volans) [09:01:01] (03PS1) 10Majavah: P:openstack::pdns::auth: make pdns web server listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) [09:02:21] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44082/console" [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:02:47] (03CR) 10CI reject: [V: 04-1] elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [09:03:29] (03CR) 10Volans: [C: 04-1] "Please get in touch with the Data Platform SRE and SRE Observability teams to define a common strategy." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [09:05:03] (03PS2) 10Majavah: P:openstack::pdns::auth: make pdns web server listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) [09:05:51] (03CR) 10Volans: [C: 03+2] check_icinga: add Auto-Submitted header to emails [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/966493 (https://phabricator.wikimedia.org/T347835) (owner: 10Volans) [09:05:59] (03CR) 10Ladsgroup: [C: 03+1] production-m5.sql.erb: Remove testreduce grants [puppet] - 10https://gerrit.wikimedia.org/r/966327 (https://phabricator.wikimedia.org/T345831) (owner: 10Marostegui) [09:06:24] (03Merged) 10jenkins-bot: check_icinga: add Auto-Submitted header to emails [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/966493 (https://phabricator.wikimedia.org/T347835) (owner: 10Volans) [09:06:30] (03PS3) 10Majavah: P:openstack::pdns::auth: make pdns web server listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) [09:09:18] (03PS4) 10Majavah: P:openstack::pdns::auth: make pdns web server listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) [09:10:11] (03PS2) 10David Caro: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) [09:10:37] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44085/console" [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:12:22] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:12:32] !log btullis@cumin1001 START - 
Cookbook sre.hosts.downtime for 0:20:00 on an-airflow[1002,1004-1006].eqiad.wmnet,an-launcher1002.eqiad.wmnet with reason: Rebooting Airflow instances for T344671 [09:12:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on an-airflow[1002,1004-1006].eqiad.wmnet,an-launcher1002.eqiad.wmnet with reason: Rebooting Airflow instances for T344671 [09:13:14] (03PS5) 10Majavah: P:openstack::pdns::auth: make pdns web server listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) [09:13:16] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1006.eqiad.wmnet [09:14:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44086/console" [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:14:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:14:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:17:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1006.eqiad.wmnet [09:19:22] (03CR) 10David Caro: elasticsearch: move to opensearch client (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 
(https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [09:19:48] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:20:28] (03PS3) 10David Caro: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) [09:20:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1004.eqiad.wmnet [09:20:52] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:20:52] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:21:08] (03CR) 10Jbond: [C: 03+1] "lgtm once authenticate OOB" [puppet] - 10https://gerrit.wikimedia.org/r/966446 (https://phabricator.wikimedia.org/T348981) (owner: 10Slyngshede) [09:21:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:21:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:24:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1004.eqiad.wmnet [09:24:44] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1005.eqiad.wmnet [09:25:04] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961823 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [09:25:34] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10dcaro) I have created a patch to replace elasticsearch with opensearchpy: 
https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/966492 Following @col... [09:25:46] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:26:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:28:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1005.eqiad.wmnet [09:28:26] (03CR) 10Volans: [C: 03+2] dhcp: acquire exclusive per-DC lock on write [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:28:39] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1002.eqiad.wmnet [09:30:47] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:31:52] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10CodeReviewBot) joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/519 Update analytics druid netfl... 
[09:33:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1002.eqiad.wmnet [09:33:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1007.eqiad.wmnet [09:34:00] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/965582 (https://phabricator.wikimedia.org/T349077) [09:35:01] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10dcaro) Related task {T345900} [09:35:12] (03Merged) 10jenkins-bot: dhcp: acquire exclusive per-DC lock on write [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:35:43] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@b010dae]: (no justification provided) [09:36:29] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@b010dae]: (no justification provided) (duration: 00m 46s) [09:37:10] (03CR) 10Jbond: [C: 03+2] docker_registry_ha::web: drop support for using puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/961823 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [09:39:49] (03CR) 10Brouberol: "I used the `role::insetup::data_engineering` role as a way to make sure puppet would still run on the node but would do almost nothing. 
I'" [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [09:42:27] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet [09:42:39] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host an-airflow1007.eqiad.wmnet [09:43:12] (03PS2) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) [09:46:48] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [09:47:59] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on an-airflow[1002,1004-1006].eqiad.wmnet,an-launcher1002.eqiad.wmnet with reason: Rebooting Airflow instances for T344671 [09:48:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on an-airflow[1002,1004-1006].eqiad.wmnet,an-launcher1002.eqiad.wmnet with reason: Rebooting Airflow instances for T344671 [09:48:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet [09:51:04] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [09:58:04] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for 
an-airflow[1002,1004-1006].eqiad.wmnet,an-launcher1002.eqiad.wmnet [09:58:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-airflow[1002,1004-1006].eqiad.wmnet,an-launcher1002.eqiad.wmnet [09:58:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [09:58:17] (03PS1) 10Joal: Add `forwarded` field to turnilo netflow config [puppet] - 10https://gerrit.wikimedia.org/r/966499 (https://phabricator.wikimedia.org/T331707) [09:58:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [09:59:41] !log Deleted operations-puppet-catalog-compiler Jenkins job to replace it with a new job letting one pick the Puppet version(s) to compile against | T236373 [09:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:45] T236373: update pcc with puppet 7 support - https://phabricator.wikimedia.org/T236373 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1000) [10:01:35] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Mvolz) [10:02:28] (03CR) 10Btullis: [C: 03+2] Add `forwarded` field to turnilo netflow config [puppet] - 10https://gerrit.wikimedia.org/r/966499 (https://phabricator.wikimedia.org/T331707) (owner: 10Joal) [10:15:56] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) [10:16:06] (03PS5) 10Jbond: puppet: add support for puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 [10:16:29] (03PS6) 10Jbond: puppet: add support for
puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 [10:16:43] (03CR) 10Jbond: puppet: add support for puppetserver returning nonzero rc (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [10:18:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [10:21:46] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: fix remaining rewrite path [deployment-charts] - 10https://gerrit.wikimedia.org/r/965478 (https://phabricator.wikimedia.org/T347027) (owner: 10Hnowlan) [10:22:45] (03Merged) 10jenkins-bot: rest-gateway: fix remaining rewrite path [deployment-charts] - 10https://gerrit.wikimedia.org/r/965478 (https://phabricator.wikimedia.org/T347027) (owner: 10Hnowlan) [10:23:28] (03CR) 10Jbond: [C: 03+2] puppet: add support for puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [10:28:04] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:28:14] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:30:06] (03Merged) 10jenkins-bot: puppet: add support for puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [10:30:31] (03PS2) 10Hnowlan: rest-gateway: route API specs for AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966148 (https://phabricator.wikimedia.org/T343268) [10:34:08] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: route API specs for AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966148 (https://phabricator.wikimedia.org/T343268) (owner: 10Hnowlan) [10:35:30] (03Merged) 10jenkins-bot: rest-gateway: route API specs for AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966148 (https://phabricator.wikimedia.org/T343268) (owner: 
10Hnowlan) [10:42:17] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:46:25] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T349077 [10:46:30] T349077: Switchover s8 master (db1126 -> db1209) - https://phabricator.wikimedia.org/T349077 [10:46:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T349077 [10:48:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db1209 with weight 0 T349077', diff saved to https://phabricator.wikimedia.org/P52987 and previous config saved to /var/cache/conftool/dbconfig/20231017-104839-arnaudb.json [10:49:59] (03CR) 10Elukey: "My 2c: I think it may be safer to first remove the config from common.yaml in a separate change, and let puppet restart services etc.. Onc" [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [10:50:32] (03CR) 10Elukey: "Also let's run pcc on nodes that use kafka_config() or similar to figure out what changes will be made." [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [10:59:55] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:00:06] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:01:22] (03CR) 10Btullis: [C: 03+2] "This looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/966499 (https://phabricator.wikimedia.org/T331707) (owner: 10Joal) [11:03:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:06:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T343198)', diff saved to https://phabricator.wikimedia.org/P52988 and previous config saved to /var/cache/conftool/dbconfig/20231017-110658-arnaudb.json [11:07:04] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:11:37] (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/965582 (https://phabricator.wikimedia.org/T349077) (owner: 10Gerrit maintenance bot) [11:12:24] !log Starting s8 eqiad failover from db1126 to db1209 - T349077 [11:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] T349077: Switchover s8 master (db1126 -> db1209) - https://phabricator.wikimedia.org/T349077 [11:17:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Promote db1209 to s8 primary T349077', diff saved to https://phabricator.wikimedia.org/P52989 and previous config saved to /var/cache/conftool/dbconfig/20231017-111720-arnaudb.json [11:18:21] (03PS1) 10Hnowlan: rest-gateway: strictly order service configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/966512 [11:22:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P52990 and previous config saved to /var/cache/conftool/dbconfig/20231017-112204-arnaudb.json [11:22:41] (03PS1) 10Filippo Giunchedi: otel-coll: bump resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/966514 
(https://phabricator.wikimedia.org/T345712) [11:27:36] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:29:24] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:33:54] (03PS1) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966516 (https://phabricator.wikimedia.org/T336044) [11:34:20] (03CR) 10CI reject: [V: 04-1] Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966516 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [11:34:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db1126 with weight 275 T349077', diff saved to https://phabricator.wikimedia.org/P52991 and previous config saved to /var/cache/conftool/dbconfig/20231017-113432-arnaudb.json [11:34:37] T349077: Switchover s8 master (db1126 -> db1209) - https://phabricator.wikimedia.org/T349077 [11:35:41] (03Abandoned) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966516 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [11:37:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P52992 and previous config saved to /var/cache/conftool/dbconfig/20231017-113711-arnaudb.json [11:37:50] (03PS3) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) [11:38:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depool db1126 T349077', diff saved to https://phabricator.wikimedia.org/P52993 and previous config saved to /var/cache/conftool/dbconfig/20231017-113809-arnaudb.json [11:38:28] (03CR) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966497 
(https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [11:39:30] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [11:39:34] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:41:47] (03CR) 10Kamila Součková: [C: 03+1] rest-gateway: strictly order service configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/966512 (owner: 10Hnowlan) [11:47:39] (03PS4) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) [11:47:57] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [11:48:49] (03Abandoned) 10EoghanGaffney: [ci/firewall] Add cumin+deploy hosts to CI http allow list [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: 10EoghanGaffney) [11:51:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/6/console" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:51:28] (03CR) 10EoghanGaffney: "After discussing this in the team meeting yesterday we agreed to abandon this for the moment and discuss in a few weeks if there's a bette" [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: 10EoghanGaffney) [11:52:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T343198)', diff saved to https://phabricator.wikimedia.org/P52994 and previous config saved to /var/cache/conftool/dbconfig/20231017-115217-arnaudb.json [11:52:23] T343198: Add pl_target_id column to pagelinks in production - 
https://phabricator.wikimedia.org/T343198 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1200) [12:01:48] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: 10WMDE-Fisch) [12:15:32] (03CR) 10Brouberol: "This indeed looks like what I had in mind! I didn't undertake it at the time, as we found a way to avoid it, but if this works as intended" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [12:18:26] (03PS1) 10Awight: Workaround to center search terms label [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966238 (https://phabricator.wikimedia.org/T252346) [12:19:27] (03CR) 10WMDE-Fisch: [C: 03+2] Workaround to center search terms label [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966238 (https://phabricator.wikimedia.org/T252346) (owner: 10Awight) [12:21:55] (03CR) 10CI reject: [V: 04-1] Workaround to center search terms label [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966238 (https://phabricator.wikimedia.org/T252346) (owner: 10Awight) [12:23:13] (03Merged) 10jenkins-bot: Workaround to center search terms label [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966238 (https://phabricator.wikimedia.org/T252346) (owner: 10Awight) [12:25:02] (03CR) 10Elukey: "Looks good! One extra nit before proceeding. The Fundraising SREs use kafkatee from their infrastructure, that has a different puppet etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:28:23] !log Starting Cassandra decommission(s) of restbase1017 — [12:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:55] (03CR) 10Joal: Add `forwarded` field to turnilo netflow config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966499 (https://phabricator.wikimedia.org/T331707) (owner: 10Joal) [12:29:02] (03CR) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from the inventory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966497 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:29:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:31:29] (03PS1) 10Elukey: d-i: Fix retrieval of reuse-parts-test.sh for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/966524 (https://phabricator.wikimedia.org/T339835) [12:34:05] (03CR) 10Brouberol: [C: 03+2] Remove the need for the analytics-meta database to require java [puppet] - 10https://gerrit.wikimedia.org/r/965761 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:34:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:34:18] (03CR) 10Brouberol: [C: 03+1] Remove the need for the analytics-meta database to require java [puppet] - 10https://gerrit.wikimedia.org/r/965761 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:34:49] (03CR) 10Elukey: [C: 03+1] install_server: create aqs reuse partition reuse recipe (032 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [12:36:34] (03PS1) 10Jbond: pcc: update towork with new matrix based jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/966527 [12:37:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/11/console" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [12:38:13] (03PS1) 10Jdrewniak: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966528 (https://phabricator.wikimedia.org/T347208) [12:38:24] (03PS2) 10Jdrewniak: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966528 (https://phabricator.wikimedia.org/T347208) [12:41:03] (03CR) 10Jbond: [C: 03+2] pcc: update towork with new matrix based jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/966527 (owner: 10Jbond) [12:42:57] (03CR) 10Elukey: "Fine for me, but I'd check with Aiko what to do with Readability and RR-MultiLingual, since the new Docker image should in theory carry th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) (owner: 10Ilias Sarantopoulos) [12:44:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: New SSH key for Jeff Green - https://phabricator.wikimedia.org/T348981 (10SLyngshede-WMF) [12:44:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: New SSH key for Jeff Green - https://phabricator.wikimedia.org/T348981 (10SLyngshede-WMF) Key has been verified via email [12:44:35] (03CR) 10Slyngshede: [C: 03+2] data.yaml: Update SSH key for user jgreen. 
[puppet] - 10https://gerrit.wikimedia.org/r/966446 (https://phabricator.wikimedia.org/T348981) (owner: 10Slyngshede) [12:46:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: New SSH key for Jeff Green - https://phabricator.wikimedia.org/T348981 (10SLyngshede-WMF) 05Open→03Resolved [12:47:02] (03PS1) 10Hashar: logging: reorder wmgMonologProcessors entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) [12:49:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1119 T339185', diff saved to https://phabricator.wikimedia.org/P52995 and previous config saved to /var/cache/conftool/dbconfig/20231017-124916-root.json [12:49:21] T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185 [12:50:14] (03CR) 10Hashar: "That broke with 1.41.0-wmf.30 which is now deployed on all wikis. Since we no more run 1.41.0-wmf.29 anywhere we don't need to keep back c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: 10Hashar) [12:52:21] (03PS2) 10Hashar: logging: reorder wmgMonologProcessors entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) [12:53:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:17] (03CR) 10Hashar: logging: reorder wmgMonologProcessors entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: 10Hashar) [12:54:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:55:01] 
(NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:55:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/16/console" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [12:56:01] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Cleanup Kartographer Nearby flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: 10WMDE-Fisch) [12:56:04] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/17/console" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [12:58:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:55] (03CR) 10Brouberol: "Could we include a pcc run output?" 
[puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:59:06] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) Stab in the dark guessing what commands are needed in codfw, based on man page and some guides (including info Artur... [12:59:17] (03PS1) 10Slyngshede: P:monitoring remove remnants of checkpuppetrun [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) [12:59:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bookworm [12:59:41] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [13:00:07] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1300). nyaa~ [13:00:07] jdrewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:40] (can't deploy, sorry!) 
[13:00:51] (03PS2) 10Slyngshede: P:monitoring remove remnants of checkpuppetrun [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) [13:01:50] me neither sorry [13:02:43] (03CR) 10Btullis: [V: 03+1] Create a new role for analytics_cluster::mariadb and assign it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [13:03:19] jan_drewniak: I can deploy :) [13:03:30] or I can assist you in deploying it :D [13:03:33] (03PS1) 10Btullis: Fix typo on yarn-spark-shuffle package name [puppet] - 10https://gerrit.wikimedia.org/r/966533 (https://phabricator.wikimedia.org/T344910) [13:04:25] (03CR) 10Btullis: [C: 03+2] Fix typo on yarn-spark-shuffle package name [puppet] - 10https://gerrit.wikimedia.org/r/966533 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:08:55] (03PS1) 10Btullis: Fix the name of the spark yarn shuffler packages [puppet] - 10https://gerrit.wikimedia.org/r/966534 (https://phabricator.wikimedia.org/T344910) [13:09:02] Who's the backport team this morning (EST)/afternoon (UTC)? [13:09:37] I added a last-minute entry to fix a potential parsercache issue w/ discussiontools: https://gerrit.wikimedia.org/r/c/966525/ [13:09:55] (03CR) 10Btullis: [C: 03+2] Fix the name of the spark yarn shuffler packages [puppet] - 10https://gerrit.wikimedia.org/r/966534 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:10:17] hashar: oh hi! sorry lost track of time... 
[13:10:59] hashar: thanks, it's just a config change, I'm fine deploying it myself [13:11:07] +1 [13:11:32] (03CR) 10Hashar: [C: 03+1] Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966528 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [13:12:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966528 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [13:13:17] (03Merged) 10jenkins-bot: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966528 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [13:14:36] (03CR) 10DCausse: "Unless something has changed recently the wdqs flink job in wikikube@staging used to connect to kafka-main properly." [puppet] - 10https://gerrit.wikimedia.org/r/966308 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [13:15:54] !log jdrewniak@deploy2002 Backport cancelled. [13:18:30] scap backport is telling me: [13:18:30] The following are unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.42.0-wmf.1 [13:18:30] commit 8d669455d76328d3ed30b1d52eedf83df7e66bef [13:18:30] Author: Adam Wight [13:18:30] Update git submodules [13:18:31] ... [13:18:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Papaul) @Jclark-ctr this is now fixed. You can try running the re-image again [13:20:52] so that's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AdvancedSearch/+/966238 [13:21:11] * jan_drewniak Can I continue the config backport with the unmerged changes? 
It looks like the undeployed changes are for an unrelated repo https://gerrit.wikimedia.org/r/q/Id5d110ffe467f8388bf5373caa940a8b28cc363afor [13:21:14] neither Adam nor Fisch are here, so I would just revert the backport [13:22:00] (03PS1) 10Slyngshede: P:monitoring absent Incinga check_eth. [puppet] - 10https://gerrit.wikimedia.org/r/966535 (https://phabricator.wikimedia.org/T332764) [13:24:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/23/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966535 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [13:24:34] (03PS1) 10Jdrewniak: Revert "Workaround to center search terms label" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966239 [13:24:56] * jan_drewniak Yeah, I'm going to revert the backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AdvancedSearch/+/966239 [13:25:24] (03CR) 10Majavah: [C: 03+1] Revert "Workaround to center search terms label" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966239 (owner: 10Jdrewniak) [13:25:35] sgtm [13:25:53] (03CR) 10Jdrewniak: [C: 03+2] Revert "Workaround to center search terms label" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966239 (owner: 10Jdrewniak) [13:26:09] !log jdrewniak@deploy2002 Backport cancelled. 
[13:29:37] (03Merged) 10jenkins-bot: Revert "Workaround to center search terms label" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966239 (owner: 10Jdrewniak) [13:30:01] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: strictly order service configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/966512 (owner: 10Hnowlan) [13:30:16] (03CR) 10Eevans: [C: 03+1] d-i: Fix retrieval of reuse-parts-test.sh for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/966524 (https://phabricator.wikimedia.org/T339835) (owner: 10Elukey) [13:30:27] jan_drewniak you're still working on deploying 966528 right?  can you or taavi do 966525 when you're done? [13:30:41] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:966528|Enable Vector readability survey on select wikis (T347208)]] [13:30:48] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208 [13:31:00] (03Merged) 10jenkins-bot: rest-gateway: strictly order service configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/966512 (owner: 10Hnowlan) [13:32:18] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:966528|Enable Vector readability survey on select wikis (T347208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:48] (03PS1) 10Jbond: puppet-diffs: hiera update to with defaults [puppet] - 10https://gerrit.wikimedia.org/r/966538 [13:33:19] (03CR) 10Jbond: [C: 03+2] puppet-diffs: hiera update to with defaults [puppet] - 10https://gerrit.wikimedia.org/r/966538 (owner: 10Jbond) [13:34:57] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [13:35:37] * jan_drewniak cscott: I can do that, but the change still has to be cherry-picked. What branch should the change be backported to?1.41.0-wmf.30, 1.42.0-wmf.1 or both? [13:35:49] 1.42.0-wmf.1 [13:36:13] 1.41 is fine, doesn't need it. [13:36:28] sorry, should have done the cherry pick for you, i'm a bit rusty. 
[13:36:36] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10CodeReviewBot) tchin merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/519 Update analytics druid netf... [13:36:52] (03PS1) 10Jdrewniak: ParserOutputAccess: Fix local cache when page is edited within the process [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966241 (https://phabricator.wikimedia.org/T349033) [13:37:50] Ok I just cherry-picked it: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/966241 I'll run the deploy after mine is done [13:39:09] (03CR) 10Eevans: install_server: create aqs reuse partition reuse recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [13:39:23] jan_drewniak: thanks! [13:40:31] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:966528|Enable Vector readability survey on select wikis (T347208)]] (duration: 09m 50s) [13:40:38] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208 [13:40:56] !log tchin@deploy2002 Started deploy [analytics/refinery@0d09fbd]: Regular analytics weekly train [analytics/refinery@0d09fbdc] [13:41:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966241 (https://phabricator.wikimedia.org/T349033) (owner: 10Jdrewniak) [13:46:29] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:46:41] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:47:01] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2160.codfw.wmnet with OS bookworm [13:48:04] (03PS11) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 
10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) [13:48:20] !log tchin@deploy2002 Finished deploy [analytics/refinery@0d09fbd]: Regular analytics weekly train [analytics/refinery@0d09fbdc] (duration: 07m 24s) [13:48:26] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:48:30] (03CR) 10CI reject: [V: 04-1] profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [13:48:41] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:49:10] !log tchin@deploy2002 Started deploy [analytics/refinery@0d09fbd] (thin): Regular analytics weekly train THIN [analytics/refinery@0d09fbdc] [13:49:18] !log tchin@deploy2002 Finished deploy [analytics/refinery@0d09fbd] (thin): Regular analytics weekly train THIN [analytics/refinery@0d09fbdc] (duration: 00m 07s) [13:49:33] !log tchin@deploy2002 Started deploy [analytics/refinery@0d09fbd] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0d09fbdc] [13:49:36] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:49:48] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:49:53] (03PS12) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) [13:50:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1225.eqiad.wmnet with reason: db1225 downtime for restoration [13:50:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1225.eqiad.wmnet with reason: db1225 downtime for restoration [13:52:23] (03PS1) 10C. 
Scott Ananian: DNM: null edit CI test [extensions/DiscussionTools] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966242 [13:52:33] !log tchin@deploy2002 Finished deploy [analytics/refinery@0d09fbd] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0d09fbdc] (duration: 02m 59s) [13:53:29] oops sorry looks like my irc connection dropped.  Did I miss anything? [13:54:45] (03Merged) 10jenkins-bot: ParserOutputAccess: Fix local cache when page is edited within the process [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966241 (https://phabricator.wikimedia.org/T349033) (owner: 10Jdrewniak) [13:55:09] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:966241|ParserOutputAccess: Fix local cache when page is edited within the process (T349033)]] [13:55:13] T349033: DiscussionTools CI is failing - https://phabricator.wikimedia.org/T349033 [13:56:29] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:966241|ParserOutputAccess: Fix local cache when page is edited within the process (T349033)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:57:02] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:57:34] (03CR) 10CI reject: [V: 04-1] DNM: null edit CI test [extensions/DiscussionTools] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966242 (owner: 10C. Scott Ananian) [13:57:37] cscott: ok it's up on mwdebug [13:58:17] (03CR) 10Jbond: [C: 04-1] "are you sure this has been ported to alertmanager. 
i just had an alert from icinga on puppetboard2003 which did not have an equivalent al" [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [13:58:36] ^ i verified the issue with CI on the 1.42.0-wmf.1 which was the principal reason for the backport.  now that this is merged on the 1.42.0-wmf.1 branch that CI issue should go away.  doing a recheck now. [13:58:58] (03CR) 10C. Scott Ananian: "recheck" [extensions/DiscussionTools] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966242 (owner: 10C. Scott Ananian) [13:59:23] but i'll also smoke test it on mwdebug testwiki while i'm waiting [13:59:40] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@fae5764]: (no justification provided) [14:01:03] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@fae5764]: (no justification provided) (duration: 01m 22s) [14:01:39] cscott: ok, let me know when it's good to sync [14:01:54] looks good on testwiki, at least no smoke coming out [14:02:20] ETA 3 min on jenkins, that's the 'real' test. [14:03:12] !log tchin@deploy2002 Started deploy [airflow-dags/analytics_test@be05071]: Regular analytics weekly train [14:03:18] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics_test@be05071]: Regular analytics weekly train (duration: 00m 06s) [14:04:12] jan_drewniak ok everything looks good, go ahead and sync.  thanks again! 
[14:05:56] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [14:07:33] (03PS13) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) [14:08:16] (03PS1) 10Volans: documentation: add section for distributed locking [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) [14:08:18] (03PS1) 10Volans: netbox: remove deprecated methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/966548 [14:10:22] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10jijiki) [14:11:06] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:966241|ParserOutputAccess: Fix local cache when page is edited within the process (T349033)]] (duration: 15m 56s) [14:11:13] T349033: DiscussionTools CI is failing - https://phabricator.wikimedia.org/T349033 [14:11:24] cscott: ok finally done! 
[14:16:29] (03PS1) 10Majavah: nginx: make /etc/nginx depend on the package [puppet] - 10https://gerrit.wikimedia.org/r/966549 [14:23:26] (03CR) 10Jbond: "see inline for comments," [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:24:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966548 (owner: 10Volans) [14:24:43] !log denisse@deploy2002 Started deploy [performance/navtiming@2e17c67]: (no justification provided) [14:24:49] !log denisse@deploy2002 Finished deploy [performance/navtiming@2e17c67]: (no justification provided) (duration: 00m 05s) [14:27:23] (03CR) 10Hnowlan: [C: 03+1] wikifunctions: Use ClusterIP services for evaluators (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [14:27:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/966535 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [14:28:02] (03CR) 10Krinkle: [C: 03+1] "moving psr last LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: 10Hashar) [14:28:34] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:28:45] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:31:17] jan_drewniak thanks again! 
[14:31:22] (03PS1) 10Hnowlan: rest-gateway: move specs to first URLs to be matched on AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966550 (https://phabricator.wikimedia.org/T343268) [14:31:57] (03PS3) 10JMeybohm: wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) [14:33:10] (03CR) 10JMeybohm: wikifunctions: Use ClusterIP services for evaluators (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [14:33:47] (03PS2) 10Hnowlan: rest-gateway: move specs to first URLs to be matched on AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966550 (https://phabricator.wikimedia.org/T343268) [14:34:40] (03CR) 10Alex Paskulin: [C: 03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966550 (https://phabricator.wikimedia.org/T343268) (owner: 10Hnowlan) [14:37:21] (03PS2) 10Ottomata: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [14:37:26] (03CR) 10Ottomata: "Rebased with v1.27.0" [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [14:37:40] (03PS3) 10Ottomata: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [14:38:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:55] (03CR) 10Btullis: [C: 03+1] Bump MW Page content change app version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [14:39:12] (03CR) 10Ottomata: "Ah interesting! I noticed this when I was deving EventGate locally too, but I thought it must have been MacOS or something. Good to know" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965481 (https://phabricator.wikimedia.org/T347477) (owner: 10Elukey) [14:42:09] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/output/954114/26/" [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [14:43:23] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) Ok we seem to have muddled through, for the record commands needed as follows: ` wmcs-openstack port unset 1290224c-... [14:44:57] (03PS2) 10Volans: documentation: add section for distributed locking [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) [14:44:59] (03PS2) 10Volans: netbox: remove deprecated methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/966548 [14:45:01] (03PS1) 10Volans: tests: remove unneded vulture allow list [software/spicerack] - 10https://gerrit.wikimedia.org/r/966552 [14:45:03] (03CR) 10Volans: "addressed comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:46:19] (03CR) 10Ottomata: [C: 03+1] "This is fine with me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/966308 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [14:50:45] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) a:03ssingh [14:51:18] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: move specs to first URLs to be matched on AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966550 (https://phabricator.wikimedia.org/T343268) (owner: 10Hnowlan) [14:52:07] (03Merged) 10jenkins-bot: rest-gateway: move specs to first URLs to be matched on AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966550 (https://phabricator.wikimedia.org/T343268) (owner: 10Hnowlan) [14:52:53] (03PS2) 10Jdlrobson: Wordmark for blk wiktionary and got wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966314 (https://phabricator.wikimedia.org/T341253) [14:53:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:08] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T349047 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [14:57:40] (03PS1) 10Slyngshede: puppet-agent-fail: enable check for all clusters. 
[alerts] - 10https://gerrit.wikimedia.org/r/966554 [14:58:25] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) Note that we have to merge and deploy this first: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965708 [14:59:00] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:59:14] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:59:40] (03CR) 10Slyngshede: "I can tell if there's a reason why this was only enabled on the alerting cluster." [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [15:00:05] eoghan, jelto, and arnoldokoth: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1500). [15:00:20] 10SRE, 10ops-codfw, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10nskaggs) @Papaul Is it possible there is 1 more rack that could be dedicated in the current setup (so 2 total WMCS racks, 1 existing... 
[15:01:39] (03CR) 10Slyngshede: P:monitoring remove remnants of checkpuppetrun (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [15:02:57] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:03:07] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:03:25] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator maintenance [15:03:31] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10jhathaway) @ayounsi thanks for detailed replied and the linked blog posts. Given that additional data, I am substantially less concerned about using MSS clamping. [15:03:39] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator maintenance [15:03:56] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator maintenance [15:04:11] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator maintenance [15:05:08] (03PS13) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [15:05:35] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (10Sreejithk2000) Another file with the same issue https://commons.wikimedia.org/wiki/Special:Undelete/File:BBC2_striped_ident_1.jpg Some or all of the undeletion fa... 
[15:06:43] !log brennen@deploy2002 Started deploy [phabricator/deployment@745d703]: test deploy to phab2002 for T349038 [15:06:47] T349038: Deploy Phabricator/Phorge 2023-10-17 - https://phabricator.wikimedia.org/T349038 [15:07:03] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [15:07:16] !log brennen@deploy2002 Finished deploy [phabricator/deployment@745d703]: test deploy to phab2002 for T349038 (duration: 00m 33s) [15:07:26] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:07:37] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:08:32] !log brennen@deploy2002 Started deploy [phabricator/deployment@745d703]: deploy to phab1004 for T349038 [15:09:07] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10KStoller-WMF) [15:09:19] 10SRE, 10MediaWiki-Core-AuthManager, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, and 2 others: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out state - https://phabricator.wikimedia.org/T335125 (10KStoller... 
[15:09:30] !log brennen@deploy2002 Finished deploy [phabricator/deployment@745d703]: deploy to phab1004 for T349038 (duration: 00m 57s) [15:09:41] (03PS14) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [15:10:19] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [15:11:38] (03CR) 10Marostegui: [C: 03+1] d-i: Fix retrieval of reuse-parts-test.sh for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/966524 (https://phabricator.wikimedia.org/T339835) (owner: 10Elukey) [15:12:02] (NodeTextfileStale) resolved: Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:12:53] (03CR) 10Brouberol: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:16:50] (03CR) 10Elukey: [C: 03+2] d-i: Fix retrieval of reuse-parts-test.sh for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/966524 (https://phabricator.wikimedia.org/T339835) (owner: 10Elukey) [15:19:00] (03PS15) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [15:19:20] (03CR) 10Brouberol: "This is the content of the ssl.prom file generated by a local run of the script on a skein.crt file:" [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:23:14] (03PS6) 10Brouberol: Publish metrics reflecting skein certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) [15:23:42] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:23:44] (03PS1) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/966560 (https://phabricator.wikimedia.org/T348883) [15:24:13] (03CR) 10Elukey: [C: 03+1] install_server: create aqs reuse partition reuse recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [15:24:53] 10SRE, 10Growth-Team, 10MediaWiki-Core-AuthManager, 10MediaWiki-User-login-and-signup, and 2 others: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out state - https://phabricator.wikimedia.org/T335125 (10KStoller-WMF) [15:25:15] (03PS2) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/966560 (https://phabricator.wikimedia.org/T348883) [15:26:24] 
(03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/28/console" [puppet] - 10https://gerrit.wikimedia.org/r/966560 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [15:27:21] (03PS3) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/966560 (https://phabricator.wikimedia.org/T348883) [15:28:34] (03PS4) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/966560 (https://phabricator.wikimedia.org/T348883) [15:31:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/30/console" [puppet] - 10https://gerrit.wikimedia.org/r/966560 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [15:33:02] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:33:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Gehel) [15:33:20] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10Gehel) 05Open→03Resolved a:03Gehel [15:33:50] (03CR) 10Ryan Kemper: [C: 03+2] icinga: round elasticsearch shard size check to 2 decimal places [puppet] - 10https://gerrit.wikimedia.org/r/962243 (https://phabricator.wikimedia.org/T327218) (owner: 10Cwhite) [15:38:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:38:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966552 
(owner: 10Volans) [15:40:15] (03CR) 10Filippo Giunchedi: puppet-agent-fail: enable check for all clusters. (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [15:41:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/966549 (owner: 10Majavah) [15:43:32] (03CR) 10Volans: [C: 03+2] documentation: add section for distributed locking [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:43:38] (03CR) 10Volans: [C: 03+2] netbox: remove deprecated methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/966548 (owner: 10Volans) [15:43:46] (03CR) 10Volans: [C: 03+2] tests: remove unneded vulture allow list [software/spicerack] - 10https://gerrit.wikimedia.org/r/966552 (owner: 10Volans) [15:44:21] (03CR) 10Dr0ptp4kt: [C: 03+1] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [15:46:22] (03PS1) 10Ssingh: wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) [15:46:46] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:47:14] (03CR) 10BBlack: [C: 03+1] wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:47:19] (03CR) 10BCornwall: [C: 03+1] wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:47:22] (03CR) 10CI reject: [V: 04-1] wmfusercontent: add TXT record for cert validation [dns] - 
10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:47:24] (03CR) 10Vgutierrez: [C: 03+1] wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:48:30] (03PS2) 10Ssingh: wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) [15:49:06] (03PS3) 10Ssingh: wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) [15:50:11] (03Merged) 10jenkins-bot: documentation: add section for distributed locking [software/spicerack] - 10https://gerrit.wikimedia.org/r/966547 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:50:18] (03CR) 10BBlack: [C: 03+1] wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:50:20] (03CR) 10Vgutierrez: [C: 03+1] wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:50:28] (03CR) 10Ssingh: [C: 03+2] wmfusercontent: add TXT record for cert validation [dns] - 10https://gerrit.wikimedia.org/r/966564 (https://phabricator.wikimedia.org/T339267) (owner: 10Ssingh) [15:50:52] (03Merged) 10jenkins-bot: netbox: remove deprecated methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/966548 (owner: 10Volans) [15:50:52] !log running authdns-update for CR 966564 [15:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:56] (03Merged) 10jenkins-bot: tests: remove unneded vulture allow list [software/spicerack] - 10https://gerrit.wikimedia.org/r/966552 (owner: 10Volans) [15:52:22] (03CR) 10Brouberol: "Thanks Filippo for the improvement pointers! 
Next version addresses them all" [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:52:34] (03PS7) 10Brouberol: Publish metrics reflecting skein certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) [15:52:50] (03PS1) 10Ssingh: Revert "wmfusercontent: add TXT record for cert validation" [dns] - 10https://gerrit.wikimedia.org/r/966243 [15:56:07] (03PS8) 10Ryan Kemper: snapshot: Remove absented cirrus dump job [puppet] - 10https://gerrit.wikimedia.org/r/856655 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [15:56:28] 10SRE, 10Infrastructure-Foundations, 10netops: Improve network BGP group definition and automation templates - https://phabricator.wikimedia.org/T349116 (10cmooney) p:05Triage→03Low [15:57:17] (03Abandoned) 10Ebernhardson: kafka-main: Allow connections from wikikube-staging [puppet] - 10https://gerrit.wikimedia.org/r/966308 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [15:58:19] 10SRE, 10Infrastructure-Foundations, 10netops: Improve network BGP group definition and automation templates - https://phabricator.wikimedia.org/T349116 (10cmooney) [15:58:53] 10SRE, 10Infrastructure-Foundations, 10netops: Improve Homer BGP group definition and automation templates - https://phabricator.wikimedia.org/T349116 (10cmooney) [16:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:05:05] (03PS8) 10Brouberol: Publish metrics reflecting skein certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) [16:05:32] RECOVERY - Juniper alarms on lsw1-e5-eqiad.mgmt is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:06:54] (03CR) 10Ayounsi: [C: 03+1] "Very nice!" [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:08:28] (03CR) 10Jbond: [C: 04-1] P:monitoring remove remnants of checkpuppetrun (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [16:08:34] PROBLEM - Host lsw1-e7-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:46] PROBLEM - BFD status on ssw1-f1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:08:56] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:00] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) Based on the discussion with @Lhiraide, @NMariano-WMF and Greenhouse, we will be using the `gh-mail.wikimedia.org` domain instead, so the DNS records for that will be updated. 
[16:09:04] PROBLEM - BGP status on ssw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:32] RECOVERY - Host lsw1-e7-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [16:09:38] PROBLEM - BFD status on ssw1-e1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:10:18] RECOVERY - Juniper alarms on lsw1-e7-eqiad.mgmt is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:11:34] RECOVERY - BFD status on ssw1-f1-eqiad.mgmt is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:11:46] RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:52] RECOVERY - BGP status on ssw1-f1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:54] (03PS1) 10Brennen Bearnes: WIP: phabricator: remove redirector & config [puppet] - 10https://gerrit.wikimedia.org/r/966568 (https://phabricator.wikimedia.org/T344884) [16:12:28] RECOVERY - BFD status on ssw1-e1-eqiad.mgmt is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:14:28] (03PS1) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) [16:14:45] (Emergency syslog message) firing: Alert for device lsw1-e7-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:15:58] (03CR) 10Jforrester: [C: 04-2] [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 
(https://phabricator.wikimedia.org/T349054) (owner: 10Jforrester) [16:17:50] (03PS1) 10Ebernhardson: cirrus updater: Configure kafka communication to use SSL [deployment-charts] - 10https://gerrit.wikimedia.org/r/966571 [16:18:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [16:19:26] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Configure kafka communication to use SSL [deployment-charts] - 10https://gerrit.wikimedia.org/r/966571 (owner: 10Ebernhardson) [16:19:45] (Emergency syslog message) resolved: Device lsw1-e7-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:20:27] (03Merged) 10jenkins-bot: cirrus updater: Configure kafka communication to use SSL [deployment-charts] - 10https://gerrit.wikimedia.org/r/966571 (owner: 10Ebernhardson) [16:22:13] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:22:13] (03CR) 10Btullis: [C: 03+2] Add `forwarded` field to turnilo netflow config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966499 (https://phabricator.wikimedia.org/T331707) (owner: 10Joal) [16:22:47] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:23:45] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) [16:24:05] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) [16:24:54] (03PS1) 10Ssingh: wikimedia.org: update DNS records for Greenhouse [dns] - 10https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335) [16:25:34] (03PS2) 10Slyngshede: puppet-agent-fail: enable check for all clusters. 
[alerts] - 10https://gerrit.wikimedia.org/r/966554 [16:26:47] (03CR) 10CI reject: [V: 04-1] puppet-agent-fail: enable check for all clusters. [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [16:28:13] (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [16:31:51] (03PS3) 10Slyngshede: puppet-agent-fail: enable check for all clusters. [alerts] - 10https://gerrit.wikimedia.org/r/966554 [16:34:30] (03CR) 10Cathal Mooney: [C: 03+2] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:35:08] (03Merged) 10jenkins-bot: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:38:06] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) >>! In T290750#9252559, @elukey wrote: > Eventstreams has been ported to nodejs18, the last LTS. I am working on doi... 
[16:50:02] 10SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120 (10BCornwall) [16:50:29] 10SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120 (10BCornwall) p:05Triage→03Low [16:52:48] (03PS2) 10Brennen Bearnes: phabricator: remove redirector & config [puppet] - 10https://gerrit.wikimedia.org/r/966568 (https://phabricator.wikimedia.org/T344884) [16:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1700) [17:05:15] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [17:05:18] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [17:12:08] 10SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120 (10BCornwall) [17:12:27] 10SRE, 10Infrastructure-Foundations, 10netops: Tighter control on exported BGP routes from MRs - https://phabricator.wikimedia.org/T348739 (10cmooney) 05Open→03Resolved [17:14:33] (03PS1) 10Cathal Mooney: Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) [17:18:56] 10SRE, 10ops-eqiad: 1 PSU down on both lsw1-e5-eqiad and lsw1-e7-eqiad - 
https://phabricator.wikimedia.org/T349002 (10cmooney) 05Open→03Resolved a:03VRiley-WMF All good now @VRiley-WMF was able to fix just some loose cables. [17:28:16] (03PS1) 10Brennen Bearnes: Pass full content to Parsoid for redirect pages [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966245 (https://phabricator.wikimedia.org/T349087) [17:33:05] (03CR) 10Subramanya Sastry: "Let's hold for a bit while we verify this fix in beta cluster. Waiting for zuul to merge my patch and for this to land on beta before we c" [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966245 (https://phabricator.wikimedia.org/T349087) (owner: 10Brennen Bearnes) [17:37:11] (03CR) 10Brennen Bearnes: Pass full content to Parsoid for redirect pages (031 comment) [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966245 (https://phabricator.wikimedia.org/T349087) (owner: 10Brennen Bearnes) [17:46:11] 10SRE, 10Infrastructure-Foundations, 10netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10cmooney) p:05Triage→03Medium [17:46:22] 10SRE, 10Infrastructure-Foundations, 10netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10cmooney) [17:46:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [17:49:49] (03CR) 10Subramanya Sastry: [C: 03+1] "Looks fixed in beta (I added the report on the phab task)." 
[core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966245 (https://phabricator.wikimedia.org/T349087) (owner: 10Brennen Bearnes) [17:50:26] jouncebot nowandnext [17:50:26] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1700) [17:50:26] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1800) [17:51:55] I am glad we got all the risky patch issues fixed before the train rollout. :) [17:53:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966245 (https://phabricator.wikimedia.org/T349087) (owner: 10Brennen Bearnes) [17:53:10] subbu: indeed [17:53:20] much appreciated. :) [17:54:03] :) ty to everyone who helped. I am leaving a status update on the task. [17:54:17] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) [18:00:06] brennen and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1800). 
[18:08:15] (03Merged) 10jenkins-bot: Pass full content to Parsoid for redirect pages [core] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966245 (https://phabricator.wikimedia.org/T349087) (owner: 10Brennen Bearnes) [18:08:36] !log brennen@deploy2002 Started scap: Backport for [[gerrit:966245|Pass full content to Parsoid for redirect pages (T349087)]] [18:08:54] T349087: Redirects on RESTBase testsuite are failing - https://phabricator.wikimedia.org/T349087 [18:09:56] !log brennen@deploy2002 brennen: Backport for [[gerrit:966245|Pass full content to Parsoid for redirect pages (T349087)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:11:02] !log brennen@deploy2002 brennen: Continuing with sync [18:11:25] (03CR) 10Fabfur: "This change is ready for review." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [18:11:35] (03PS4) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [18:13:23] (03PS1) 10Ladsgroup: Set wikidatawiki to write both for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) [18:14:41] jouncebot: nowandnext [18:14:42] For the next 1 hour(s) and 45 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T1800) [18:14:42] In 1 hour(s) and 45 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T2000) [18:16:18] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:966245|Pass full content to Parsoid for redirect pages (T349087)]] (duration: 07m 42s) [18:16:26] T349087: Redirects on RESTBase testsuite are failing - https://phabricator.wikimedia.org/T349087 [18:16:45] (03PS2) 10Ladsgroup: wikireplicas: Allow pagelinks.pl_target_id to be replicated 
to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) [18:16:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [18:18:00] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:14] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:18:31] !log train 1.42.0-wmf.1 (T348354): blockers resolved, rolling to group0 [18:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:37] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [18:18:38] PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [18:18:52] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [18:18:59] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966594 (https://phabricator.wikimedia.org/T348354) [18:19:01] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966594 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [18:19:07] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:24] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:40] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:19:48] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966594 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [18:19:51] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Aklapper) [18:19:54] RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:20:08] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.348 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:24:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:34] (03PS1) 10Herron: import upstream 0.7.1 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/966595 [18:25:36] (03PS1) 10Herron: Merge tag 'upstream/0.7.1' [debs/pyrra] - 10https://gerrit.wikimedia.org/r/966596 [18:25:56] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.1 refs T348354 [18:26:01] T348354: 1.42.0-wmf.1 deployment 
blockers - https://phabricator.wikimedia.org/T348354 [18:32:53] brennen: good morning :) [18:33:45] earlier today I found out the error log entries are no more deduplicated. I crafted a patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/966529/ [18:34:04] and Timo (who wrote the related mediawiki/core) +1 ed it [18:34:10] so I am willing to push the hotfix [18:34:24] hashar: ah, i missed the +1. yeah, deploying seems like a good idea. [18:34:43] I was puzzled during the log triage on thursday [18:34:48] (03PS3) 10Jdlrobson: Wordmark for blk wiktionary and got wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966314 (https://phabricator.wikimedia.org/T341253) [18:34:58] and eventually this morning I decided to track it down (initially blaming kibana ) [18:35:06] may I push it now? [18:37:07] (03PS1) 10Majavah: Add virtual domain mapping for OATHAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966598 (https://phabricator.wikimedia.org/T348484) [18:37:09] hashar: please do [18:37:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: 10Hashar) [18:37:28] thanks for tracking that down. 
[18:37:43] I should have caught it on thursday or even earlier [18:38:02] then I felt confused assuming something got changed on purpose in Kibana [18:38:02] :D [18:38:08] (03Merged) 10jenkins-bot: logging: reorder wmgMonologProcessors entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: 10Hashar) [18:38:34] !log hashar@deploy2002 Started scap: Backport for [[gerrit:966529|logging: reorder wmgMonologProcessors entries (T349086)]] [18:38:39] T349086: MediaWiki normalized_message field has placeholders replaced since October 12th - https://phabricator.wikimedia.org/T349086 [18:39:54] !log hashar@deploy2002 hashar: Backport for [[gerrit:966529|logging: reorder wmgMonologProcessors entries (T349086)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:41:29] MainWANObjectCache using store {class} [18:41:34] and the message is MainWANObjectCache using store MemcachedPeclBagOStuff [18:41:38] so looks good on mwdebug [18:41:41] !log hashar@deploy2002 hashar: Continuing with sync [18:42:37] of course all log entries created in between have the wrong normalized_message [18:46:44] poor fpm restart :/ [18:46:48] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:966529|logging: reorder wmgMonologProcessors entries (T349086)]] (duration: 08m 14s) [18:49:24] and for rest of prod I can see an exception having `labels.normalized_message:[{reqId}] {exception_url} ` [18:50:09] brennen: also there are some errors in Cite with: TypeError: Argument 1 passed to Cite\AnchorFormatter::refKey() must be of the type string, null given [18:50:19] which I have filed earlier while I was looking at that message normalization issue [18:50:40] T349068 [18:50:41] T349068: Cite: Argument 1 passed to Cite\AnchorFormatter::refKey() must be of the type string, null given - https://phabricator.wikimedia.org/T349068 [18:51:05] I haven't investigated much, it is probably a user input issue, but maybe the 
code is misbehaving [18:53:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:07] I am so happy I have managed to track down that bug and to have made a fix for it [18:55:29] (03PS1) 10Andrew Bogott: codfw1dev: update Horizon version [puppet] - 10https://gerrit.wikimedia.org/r/966605 (https://phabricator.wikimedia.org/T348885) [18:56:33] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: update Horizon version [puppet] - 10https://gerrit.wikimedia.org/r/966605 (https://phabricator.wikimedia.org/T348885) (owner: 10Andrew Bogott) [19:06:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:06:37] (03PS1) 10Andrew Bogott: eqiad1: update horizon version [puppet] - 10https://gerrit.wikimedia.org/r/966628 (https://phabricator.wikimedia.org/T348885) [19:08:46] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1: update horizon version [puppet] - 10https://gerrit.wikimedia.org/r/966628 (https://phabricator.wikimedia.org/T348885) (owner: 10Andrew Bogott) [19:11:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:22:39] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (10TheDJ) >>! 
In T348937#9252275, @TheresNoTime wrote: > - Feels like there's been a spike in swift issues lately (i.e. T348688, T348586, T328872) Did we switch DCs r... [19:27:06] (03PS1) 10Jdrewniak: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966607 (https://phabricator.wikimedia.org/T347208) [19:48:12] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966634 [19:54:47] (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v8.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966634 (owner: 10Volans) [19:55:45] (03PS2) 10Volans: CHANGELOG: add changelogs for release v8.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966634 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231017T2000). [20:00:06] jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:23] I can deploy [20:01:40] looks like it's just my patches, they can be deployed together [20:02:02] here not sure where my entry went.. [20:02:04] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:33] One of Jan's patches was actually authored by you, so maybe that's it? 
[20:03:15] looks like Jan removed me haha https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2120651&oldid=2120608 [20:03:32] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966634 (owner: 10Volans) [20:04:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966314 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:04:09] ah i see Jan just changed the name [20:04:22] so yeh https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/966314 is the one im here for :) [20:04:34] Jdlrobson: yes, the patch is still there :P [20:04:53] (03Merged) 10jenkins-bot: Wordmark for blk wiktionary and got wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966314 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:05:16] !log catrope@deploy2002 Started scap: Backport for [[gerrit:966314|Wordmark for blk wiktionary and got wikipedia (T341253 T341257)]] [20:05:33] T341257: Design: Provide wordmarks/taglines for Wiktionary projects - https://phabricator.wikimedia.org/T341257 [20:05:34] T341253: Provide wordmark and tagline for Gothic Wikipedia - https://phabricator.wikimedia.org/T341253 [20:05:42] (03PS1) 10Jdlrobson: Fixes incorrect Hebrew logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966636 (https://phabricator.wikimedia.org/T341251) [20:06:34] !log catrope@deploy2002 catrope and jdlrobson: Backport for [[gerrit:966314|Wordmark for blk wiktionary and got wikipedia (T341253 T341257)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:58] Jdlrobson: Your patch is on mwdebug, please test [20:08:52] RoanKattouw: LGTM. 
Looks like I forgot to run the build script for gotwiki but I can do that in the follow up [20:09:26] (03PS2) 10Jdlrobson: Fixes incorrect Hebrew logo and applies gotwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966636 (https://phabricator.wikimedia.org/T341253) [20:09:33] ^ RoanKattouw follow up. Please sync the current one [20:10:45] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966634 (owner: 10Volans) [20:11:24] !log catrope@deploy2002 catrope and jdlrobson: Continuing with sync [20:11:28] Syncing [20:14:11] (03PS1) 10Volans: Upstream release v8.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966639 [20:14:40] (03CR) 10Eevans: [C: 03+2] echostore: update Kask image to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966301 (https://phabricator.wikimedia.org/T348647) (owner: 10Eevans) [20:15:00] (03PS3) 10Catrope: Fixes incorrect Hebrew logo and applies gotwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966636 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:15:34] Then I'll pick up our fix (966636) next, and then Jan's patch [20:15:56] sounds good [20:16:33] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:966314|Wordmark for blk wiktionary and got wikipedia (T341253 T341257)]] (duration: 11m 17s) [20:16:45] T341257: Design: Provide wordmarks/taglines for Wiktionary projects - https://phabricator.wikimedia.org/T341257 [20:16:45] T341253: Provide wordmark and tagline for Gothic Wikipedia - https://phabricator.wikimedia.org/T341253 [20:17:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966636 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:18:45] (03Merged) 10jenkins-bot: Fixes incorrect Hebrew logo and applies gotwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/966636 (https://phabricator.wikimedia.org/T341253) (owner: 10Jdlrobson) [20:19:09] !log catrope@deploy2002 Started scap: Backport for [[gerrit:966636|Fixes incorrect Hebrew logo and applies gotwiki (T341253 T341251)]] [20:19:15] T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251 [20:20:30] !log catrope@deploy2002 jdlrobson and catrope: Backport for [[gerrit:966636|Fixes incorrect Hebrew logo and applies gotwiki (T341253 T341251)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:00] Jdlrobson: OK fix is on mwdebug, please test [20:21:11] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [20:21:30] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [20:21:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:23:32] RoanKattouw: on it [20:23:44] RoanKattouw: and all good! [20:23:45] please sync! [20:24:02] !log catrope@deploy2002 jdlrobson and catrope: Continuing with sync [20:24:08] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply [20:24:28] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [20:26:20] (and thank you!) 
[20:26:34] (03CR) 10Volans: [C: 03+2] Upstream release v8.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966639 (owner: 10Volans) [20:26:59] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/echostore: apply [20:27:13] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [20:29:09] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:966636|Fixes incorrect Hebrew logo and applies gotwiki (T341253 T341251)]] (duration: 09m 59s) [20:29:14] T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251 [20:29:15] T341253: Provide wordmark and tagline for Gothic Wikipedia - https://phabricator.wikimedia.org/T341253 [20:29:42] (03CR) 10Eevans: [C: 03+2] sessionstore: update Kask image to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966302 (https://phabricator.wikimedia.org/T348647) (owner: 10Eevans) [20:31:00] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [20:31:13] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [20:33:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966607 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [20:33:26] (03Merged) 10jenkins-bot: Upstream release v8.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966639 (owner: 10Volans) [20:34:28] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [20:34:55] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [20:36:32] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [20:36:52] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [20:36:59] !log uploaded spicerack_8.0.0 to apt.wikimedia.org bullseye-wikimedia 
[20:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:40] (Merged) jenkins-bot: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/966607 (https://phabricator.wikimedia.org/T347208) (owner: Jdrewniak)
[20:52:02] !log catrope@deploy2002 Started scap: Backport for [[gerrit:966607|Add language prefix to Readability survey (T347208)]]
[20:52:06] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208
[20:52:43] Hi. Apologies for the last minute request. Is there someone still available for a backport
[20:52:54] !log bking@cumin1001 depool wdqs eqiad due to rdf-streaming-updater failure
[20:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:21] !log catrope@deploy2002 catrope and jdrewniak: Backport for [[gerrit:966607|Add language prefix to Readability survey (T347208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:55:16] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:58:46] jan_drewniak: Your change is now on mwdebug, please test
[20:59:22] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:52] RoanKattouw: looks good
[21:00:01] !log catrope@deploy2002 catrope and jdrewniak: Continuing with sync
[21:01:25] RoanKattouw: I added a last minute request. If it's too late, np. Let me know.
Thanks
[21:01:32] No worries, I can do it
[21:01:39] thanks so much
[21:02:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:02:54] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (wiki_willy) @Jclark-ctr or @VRiley-WMF - can one of you follow up on Ben's question above on an-tool1010, along with Alex's comment on deploy1102? Thanks, Willy
[21:05:05] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:966607|Add language prefix to Readability survey (T347208)]] (duration: 13m 03s)
[21:05:16] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208
[21:06:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[21:10:50] !log bking@cumin1001 repool wdqs eqiad after rdf-streaming-updater fix
[21:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[21:16:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater -
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[21:32:21] !log catrope@deploy2002 backport Cancelled
[21:33:13] kimberly_sarabia: Sorry I saw too late that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/966630/ needs a cherry-pick, and I have to leave the house soon. Could you reschedule it for a future deployment window?
[21:35:15] RoanKattouw: sure will do
[21:45:23] (CR) BCornwall: "I'm not entirely sure why switching this causes us to lose a decimal place: I'm noticing that instead of e.g. 99.84 it's displaying 99.8" [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[21:45:30] (PS3) BCornwall: slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606)
[21:48:21] (CR) BCornwall: [C: +1] Revert "wmfusercontent: add TXT record for cert validation" [dns] - https://gerrit.wikimedia.org/r/966243 (owner: Ssingh)
[22:03:21] !log pyrra.wm.o upgraded to 0.7.1 T302995
[22:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:38] T302995: Explore dedicated (non-grafana) SLO Visualization and Management - https://phabricator.wikimedia.org/T302995
[22:04:40] (PS1) BCornwall: mtail: Condense varnish SLI counter logic [puppet] - https://gerrit.wikimedia.org/r/966644
[22:07:12] (CR) CI reject: [V: -1] mtail: Condense varnish SLI counter logic [puppet] - https://gerrit.wikimedia.org/r/966644 (owner: BCornwall)
[22:15:42] (PS2) BCornwall: mtail: Add comment warning of dragons [puppet] - https://gerrit.wikimedia.org/r/966644
[22:16:27] (CR) BBlack: [C: +1] ":P" [puppet] - https://gerrit.wikimedia.org/r/966644 (owner: BCornwall)
[22:18:48] (CR) BCornwall: [C: +2] mtail: Add comment warning of
dragons [puppet] - https://gerrit.wikimedia.org/r/966644 (owner: BCornwall)
[22:19:30] (CR) Herron: "Nice! Minor fix needed inline, but LGTM overall" [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[22:21:31] (PS4) BCornwall: slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606)
[22:21:46] (CR) BCornwall: slo_definitions: Switch to using varnish_sli_bad (2 comments) [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[22:24:17] (CR) Herron: [C: +1] "LGTM! 📈" [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[22:30:31] (Abandoned) Herron: import upstream 0.7.1 [debs/pyrra] - https://gerrit.wikimedia.org/r/966595 (owner: Herron)
[22:30:34] (Abandoned) Herron: Merge tag 'upstream/0.7.1' [debs/pyrra] - https://gerrit.wikimedia.org/r/966596 (owner: Herron)
[22:53:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:00:24] (CR) Subramanya Sastry: Enable Parsoid interal REST API only on Parsoid cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: C. Scott Ananian)