[00:01:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1030559 (owner: TrainBranchBot)
[00:06:18] (PS1) Tim Starling: ext.CodeMirror.visualEditor: don't load on RTL pages [extensions/CodeMirror] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031069 (https://phabricator.wikimedia.org/T363752)
[00:06:54] (CR) Tim Starling: [C:+2] ext.CodeMirror.visualEditor: don't load on RTL pages [extensions/CodeMirror] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031069 (https://phabricator.wikimedia.org/T363752) (owner: Tim Starling)
[00:08:48] FIRING: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:10:03] (PS1) Tim Starling: Fix exception when creating an election with the OpenSSL encryption type [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031070 (https://phabricator.wikimedia.org/T209892)
[00:10:14] (CR) Tim Starling: [C:+2] Fix exception when creating an election with the OpenSSL encryption type [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031070 (https://phabricator.wikimedia.org/T209892) (owner: Tim Starling)
[00:15:47] (Merged) jenkins-bot: ext.CodeMirror.visualEditor: don't load on RTL pages [extensions/CodeMirror] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031069 (https://phabricator.wikimedia.org/T363752) (owner: Tim Starling)
[00:15:49] (Merged) jenkins-bot: Fix exception when creating an election with the OpenSSL encryption type [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031070 (https://phabricator.wikimedia.org/T209892) (owner: Tim Starling)
[00:19:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db2152.codfw.wmnet with reason: Maintenance
[00:19:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[00:19:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T364299)', diff saved to https://phabricator.wikimedia.org/P62372 and previous config saved to /var/cache/conftool/dbconfig/20240514-001956-marostegui.json
[00:20:01] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[00:20:37] !log tstarling@deploy1002 Started scap: Fix SecurePoll exception T209892 and CodeMirror 5 RTL T363752
[00:20:42] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[00:20:42] T363752: CodeMirror shouldn't load in the 2017 editor on RTL pages - https://phabricator.wikimedia.org/T363752
[00:34:52] (PS1) Scott French: DNM: ipiod: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638)
[00:35:34] !log tstarling@deploy1002 Finished scap: Fix SecurePoll exception T209892 and CodeMirror 5 RTL T363752 (duration: 14m 56s)
[00:35:40] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[00:35:41] T363752: CodeMirror shouldn't load in the 2017 editor on RTL pages - https://phabricator.wikimedia.org/T363752
[00:38:16] ops-codfw, SRE, Infrastructure-Foundations, netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9792816 (Papaul) Open→Resolved All the old mgmt switches are back in place
[00:41:50] (PS2) Scott French: ipiod: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638)
[00:53:02] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:07:59] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.5 [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1030560 (https://phabricator.wikimedia.org/T361399)
[01:08:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.5 [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1030560 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[01:13:48] RESOLVED: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:28:28] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.5 [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1030560 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[01:47:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T352010)', diff saved to https://phabricator.wikimedia.org/P62373 and previous config saved to /var/cache/conftool/dbconfig/20240514-014753-ladsgroup.json
[01:48:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:53:34] (PS1) BCornwall: testing, please ignore [dns] - https://gerrit.wikimedia.org/r/1031071
[01:55:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:55:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0200)
[02:02:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:02:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.318 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P62374 and previous config saved to /var/cache/conftool/dbconfig/20240514-020301-ladsgroup.json
[02:08:23] (PS1) BCornwall: testing, please ignore [dns] - https://gerrit.wikimedia.org/r/1031072
[02:18:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P62375 and previous config saved to /var/cache/conftool/dbconfig/20240514-021809-ladsgroup.json
[02:33:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T352010)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240514-023316-ladsgroup.json
[02:33:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:33:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:34:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:34:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:38:02] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:41:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0300)
[03:01:43] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1031127 (https://phabricator.wikimedia.org/T361399)
[03:01:46] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031127 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[03:02:24] (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031127 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[03:02:52] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.5 refs T361399
[03:03:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:02] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:42] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[03:05:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:10:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:29:12] PROBLEM - Improperly owned -0:0- files in 
/srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:39:12] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:46:34] PROBLEM - Check whether ferm is active by checking the default input chain on mw1349 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:47:06] PROBLEM - Check whether ferm is active by checking the default input chain on parse1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:47:16] PROBLEM - Check whether ferm is active by checking the default input chain on parse1023 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:52:06] (CR) Subramanya Sastry: [C:+1] Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) (owner: C. 
Scott Ananian)
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0400)
[04:00:38] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.5 refs T361399 (duration: 57m 45s)
[04:00:42] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[04:05:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:10:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:11:21] (CR) KartikMistry: [C:+2] Update MinT to 2024-03-28-061726-production [deployment-charts] - https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[04:12:04] Deploying MinT ^^
[04:12:08] (Merged) jenkins-bot: Update MinT to 2024-03-28-061726-production [deployment-charts] - https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[04:14:22] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[04:16:34] RECOVERY - Check whether ferm is active by checking the default input chain on mw1349 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:17:06] RECOVERY - Check whether ferm is active by checking the default input chain on 
parse1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:17:16] RECOVERY - Check whether ferm is active by checking the default input chain on parse1023 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:18:43] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[04:23:02] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:25:47] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[04:33:48] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[04:45:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:45:49] FIRING: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:48:02] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:50:49] RESOLVED: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:59:02] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[04:59:42] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364809 (phaultfinder) NEW
[04:59:45] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364810 (phaultfinder) NEW
[05:08:57] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[05:15:22] !log Updated MinT to 2024-03-28-061726-production (T333969)
[05:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:28] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[05:16:09] (PS3) KartikMistry: Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263)
[05:17:19] (CR) KartikMistry: [C:+2] Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263) (owner: KartikMistry)
[05:18:22] (Merged) jenkins-bot: Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263) (owner: KartikMistry)
[05:19:21] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply 
[05:19:42] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:22:09] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:22:40] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:24:46] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:24:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:25:21] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:25:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:26:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:26:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:31:21] !log Updated cxserver to 2024-04-23-221507-production (T363263, T333969, T360303, T360310)
[05:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:31] T363263: Post-creation work for iglwiki - https://phabricator.wikimedia.org/T363263
[05:31:32] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[05:31:33] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303
[05:31:33] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310
[05:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:49:32] ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364810#9793105 (phaultfinder)
[05:50:27] (PS1) Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1030561 (https://phabricator.wikimedia.org/T364814)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0600)
[06:00:04] kormat, marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0600). Please do the needful.
[06:05:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:02] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:18:05] (CR) Marostegui: Enable section-wide circuit breaking (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: Ladsgroup)
[06:18:54] (CR) Marostegui: Enable section-wide circuit breaking (1 comment) 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: Ladsgroup)
[06:26:18] (PS1) Marostegui: es1022: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031275
[06:26:45] (CR) Marostegui: [C:+2] es1022: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031275 (owner: Marostegui)
[06:33:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2185.codfw.wmnet with OS bookworm
[06:33:56] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db2185.codfw.wmnet with OS bookworm
[06:35:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2185.codfw.wmnet with OS bookworm
[06:36:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:41:01] SRE, Wikimedia-Mailing-lists: Mailing list for English Wiktionary admins - https://phabricator.wikimedia.org/T364731#9793207 (Vininn126) Thank you very much! 
[06:41:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:45:47] (PS1) Marostegui: db2185: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031293 (https://phabricator.wikimedia.org/T364296)
[06:48:02] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:54:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2185.codfw.wmnet with reason: host reimage
[06:56:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2185.codfw.wmnet with reason: host reimage
[07:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0700).
[07:00:05] Msz2001 and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:04:26] * kart_ is here
[07:04:46] !log installing glib2.0 security updates
[07:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:56] Msz2001: around?
[07:04:59] yes
[07:05:30] Are you going to self deploy or looking for the deployer?
[07:05:45] I'm looking for one
[07:07:01] OK. I can deploy. 
[07:07:13] Ok, thanks
[07:07:53] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1030978 (https://phabricator.wikimedia.org/T364769) (owner: Msz2001)
[07:08:31] (Merged) jenkins-bot: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1030978 (https://phabricator.wikimedia.org/T364769) (owner: Msz2001)
[07:09:14] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1030978|Set $wgSignatureValidation to 'disallow' on Polish Wikipedia (T364769)]]
[07:09:19] T364769: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia - https://phabricator.wikimedia.org/T364769
[07:09:37] Msz2001: I'll ping you when patch is available to test on mwdebug servers.
[07:09:45] (CR) Muehlenhoff: [C:+1] "LGTM. The new access group has been approved in yesterday's SRE IF meeting." [puppet] - https://gerrit.wikimedia.org/r/1027052 (https://phabricator.wikimedia.org/T364494) (owner: Dzahn)
[07:09:48] ok
[07:11:11] (CR) Marostegui: [C:+2] db2185: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031293 (https://phabricator.wikimedia.org/T364296) (owner: Marostegui)
[07:11:35] Hi, what's going on with refreshing special pages? https://sr.wikipedia.org/wiki/Special:BrokenRedirects wasn't updated since 7th of May.
[07:12:22] !log kartik@deploy1002 kartik and msz2001: Backport for [[gerrit:1030978|Set $wgSignatureValidation to 'disallow' on Polish Wikipedia (T364769)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:12:44] Msz2001: Please test.
[07:13:11] Can confirm the patch works as intended
[07:14:18] Kizule: We have a similar problem on plwiki, eg. https://pl.wikipedia.org/wiki/Specjalna:Statystyki_oznaczania (8th May), I haven't checked if it's already filed
[07:14:45] Msz2001: cool. Deploying. 
[07:15:01] !log kartik@deploy1002 kartik and msz2001: Continuing with sync
[07:15:46] o/ I added a patch to the backport window (hope it's OK), I can deploy it once you're done with the scheduled patches
[07:16:13] (CR) KartikMistry: [C:+2] CX: Add mw.cx.UserPermissionChecker [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959) (owner: KartikMistry)
[07:16:55] (+2 my next patch for reducing CI wait time)
[07:17:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2185.codfw.wmnet with OS bookworm
[07:17:44] PROBLEM - Check whether ferm is active by checking the default input chain on mw1493 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:18:26] dcausse: sure.
[07:19:30] PROBLEM - Check whether ferm is active by checking the default input chain on mw2384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:20:04] PROBLEM - Check whether ferm is active by checking the default input chain on parse1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:21:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw1356 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:22:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:22:20] PROBLEM - Check whether ferm is active 
by checking the default input chain on kubemaster1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:22:32] (PS1) Marostegui: site.pp: Regex for es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1031302
[07:23:12] (CR) Marostegui: [C:+2] site.pp: Regex for es6 and es7 [puppet] - https://gerrit.wikimedia.org/r/1031302 (owner: Marostegui)
[07:27:43] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1030978|Set $wgSignatureValidation to 'disallow' on Polish Wikipedia (T364769)]] (duration: 18m 28s)
[07:27:46] T364769: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia - https://phabricator.wikimedia.org/T364769
[07:28:05] Thanks for deploying!
[07:28:23] Msz2001: You're welcome!
[07:28:43] can i also add a patch to the list for this morning? O:-) (i'd need a deployer)
[07:28:53] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959) (owner: KartikMistry)
[07:29:48] FIRING: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:31:35] kart_ i added one (1031067), if that works for you great and grateful, if not it'll wait a later window
[07:31:39] ihurbain: It seems we will run out of the window, but let's see! 
[07:31:44] thank you :)
[07:32:12] CI will take most of time in my patch :/
[07:32:37] yeah :/ (and there's David in the queue before mine, so i'm not holding my breath)
[07:33:34] ihurbain: mine might be more complicated than I initially thought so I might just reschedule it for this afternoon
[07:34:53] (afternoon also looks quite full fwiw)
[07:35:18] yes just saw that :/
[07:36:09] Backport/config window should be 2 hours :)
[07:36:25] :)
[07:36:52] jouncebot: next
[07:36:52] In 0 hour(s) and 23 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800)
[07:37:13] I doubt we can deploy more than 3 patches given CI+deployment is taking time for each patch. Add testing in that.
[07:37:21] ah and we have early train too
[07:38:36] (Merged) jenkins-bot: CX: Add mw.cx.UserPermissionChecker [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959) (owner: KartikMistry)
[07:39:07] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1030325|CX: Add mw.cx.UserPermissionChecker (T349959)]]
[07:39:11] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959
[07:42:54] !log kartik@deploy1002 kartik: Backport for [[gerrit:1030325|CX: Add mw.cx.UserPermissionChecker (T349959)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:43:36] ihurbain: moved my patch out of the way, feel free to +2 your patch while kart_ is deploying
[07:44:25] mmh do i actually want to do that if there's a chance the train rolls before my patch is deployed? (genuine question, i don't know)
[07:44:41] !log kartik@deploy1002 kartik: Continuing with sync
[07:45:15] (and can i actually do that if i'm not deploying myself, process-wise?) 
[07:45:41] ihurbain: it might possibly take a bit of the train deploy window indeed [07:46:39] !log installing libgd2 security updates [07:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:55] re +2 I guess it does not matter as long as the patch gets deployed soon after it's merged [07:47:44] RECOVERY - Check whether ferm is active by checking the default input chain on mw1493 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:49:32] which i can't really guarantee considering timings :/ [07:50:04] RECOVERY - Check whether ferm is active by checking the default input chain on parse1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:51:35] ihurbain: yes now it's unlikely we'll have enough time :( [07:51:56] RECOVERY - Check whether ferm is active by checking the default input chain on mw1356 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:52:14] (03PS1) 10Hashar: Revert "Gerrit: update mail soy templates to match upstream" [puppet] - 10https://gerrit.wikimedia.org/r/1031172 (https://phabricator.wikimedia.org/T364484) [07:52:16] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:52:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubemaster1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:52:28] i'll drop mine in late utc and it'll work too. 
:) [07:52:37] o/ [07:52:47] dcausse: ihurbain: you can extend the backport window if you want [07:52:52] jouncebot: next [07:52:52] In 0 hour(s) and 7 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800) [07:53:16] the next one is the train and I am running this week with andre as the backup [07:53:23] with the train rolling at that time? (fwiw: i'm not pushing back, i'm just trying really hard to not step on anyone's toes :D ) [07:53:25] o/ [07:53:35] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 215887 [07:53:37] hashar: I cancelled mine, but happy to help deploy the one from Isabelle [07:53:38] but there is not much happening on Tuesday beside waiting :) [07:53:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 215887 [07:54:34] !log installing PHP 7.3 security updates [07:54:35] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [07:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:06] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [07:56:08] anyway. if we can extend & deploy, then i appreciate it (it'll fix DT on wikitech :P ), if not it can wait until this evening. [07:56:59] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1030325|CX: Add mw.cx.UserPermissionChecker (T349959)]] (duration: 17m 52s) [07:57:03] T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959 [07:57:23] OK. My patch is done. [07:57:45] kart_: ack [07:58:15] hashar: do we have enough time for a backport on wmf/1.43.0-wmf.4? 
[08:00:04] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800) [08:03:06] I'm cool with waiting a bit but up to hashar [08:03:26] dcausse: depends? ;) [08:03:38] yes please go ahead [08:03:42] ok [08:03:47] \o/ [08:04:18] (03CR) 10DCausse: [C:03+2] Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) (owner: 10C. Scott Ananian) [08:07:35] (03CR) 10Muehlenhoff: [C:03+2] Revert "Gerrit: update mail soy templates to match upstream" [puppet] - 10https://gerrit.wikimedia.org/r/1031172 (https://phabricator.wikimedia.org/T364484) (owner: 10Hashar) [08:09:52] (03CR) 10Filippo Giunchedi: [C:04-1] "See my comment on related task" [puppet] - 10https://gerrit.wikimedia.org/r/1031050 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [08:11:55] (03PS1) 10Jcrespo: dbbackups: Start monitoring es6, es7 for regular backups produced [puppet] - 10https://gerrit.wikimedia.org/r/1031387 (https://phabricator.wikimedia.org/T363812) [08:12:05] (03CR) 10Hashar: [C:03+1] ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:12:28] (03CR) 10Hashar: [C:03+1] "I guess am still confused by the Docker/containerd model :-]" [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:12:43] 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAp servers to Bullseye - https://phabricator.wikimedia.org/T364823 (10MoritzMuehlenhoff) 03NEW [08:13:48] 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9793379 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:15:37] !log jayme@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host kubestagemaster2005.codfw.wmnet with OS bullseye [08:15:47] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9793393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaste... [08:17:40] (03CR) 10Btullis: [V:03+1 C:03+2] Add the airflow profile to the statistics::explorer role [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) (owner: 10Btullis) [08:19:13] (03CR) 10Marostegui: [C:03+1] dbbackups: Start monitoring es6, es7 for regular backups produced [puppet] - 10https://gerrit.wikimedia.org/r/1031387 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [08:19:30] RECOVERY - Check whether ferm is active by checking the default input chain on mw2384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:21:09] (03PS1) 10Muehlenhoff: Remove access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1031391 [08:21:54] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bdgreenlee out of all services on: 2208 hosts [08:22:35] (03CR) 10Muehlenhoff: [C:03+2] Remove access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1031391 (owner: 10Muehlenhoff) [08:22:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bdgreenlee out of all services on: 2208 hosts [08:25:30] (03Merged) 10jenkins-bot: Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) (owner: 10C. 
Scott Ananian) [08:25:46] (03PS2) 10Klausman: role::ml_cache::storage: Add staging and cross-DC IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1031390 (https://phabricator.wikimedia.org/T360428) [08:26:53] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1031067|Fix the loss of ParserOutput pointer in ContentDOMTransformStages (T364597)]] [08:26:58] T364597: Missing content on discussion tools on Parsoid - https://phabricator.wikimedia.org/T364597 [08:27:07] (03CR) 10Jcrespo: [C:03+2] dbbackups: Start monitoring es6, es7 for regular backups produced [puppet] - 10https://gerrit.wikimedia.org/r/1031387 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [08:27:53] ihurbain: started to deploy, is this something you can test on debug servers? [08:27:59] there is [08:28:31] should i look into that now? are we on eqiad? [08:29:04] ihurbain: it's not yet there, but you mentionned wikitech and I'm not sure wikitech is run from test servers [08:29:19] it's not, but it's testable in other places too [08:29:23] ok [08:29:32] it's just more visible on wikitech because we run parsoid by default there on DT :) [08:29:35] !log dcausse@deploy1002 dcausse and cscott: Backport for [[gerrit:1031067|Fix the loss of ParserOutput pointer in ContentDOMTransformStages (T364597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:29:46] ihurbain: now it's there ^ [08:30:30] aaaaand it works \o/ [08:30:33] ship it! [08:30:47] shipping! 
[08:30:49] !log dcausse@deploy1002 dcausse and cscott: Continuing with sync [08:31:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1031390 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [08:31:49] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9793453 (10JMeybohm) 05Open→03Resolved It's not clear to me what happened here. The makevm call was unable to... [08:31:55] (03CR) 10Klausman: [V:03+1 C:03+2] role::ml_cache::storage: Add staging and cross-DC IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1031390 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [08:33:41] (03CR) 10Hashar: [C:03+1] "We got that due notably to Pytorch which is a fairly large installation (14G iirc), the context is T338317#9623848 and T364773 + this chan" [puppet] - 10https://gerrit.wikimedia.org/r/1031045 (https://phabricator.wikimedia.org/T364773) (owner: 10Ahmon Dancy) [08:34:42] (03CR) 10JMeybohm: [C:03+2] Fix all-etcd, wikikube-master and wikikube-etcd aliases [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:34:49] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster100[345] [puppet] - 10https://gerrit.wikimedia.org/r/1030996 (https://phabricator.wikimedia.org/T364746) (owner: 10JMeybohm) [08:37:58] PROBLEM - Check whether ferm is active by checking the default input chain on mw1393 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:39:48] RESOLVED: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:41:05] !log 
jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1003.eqiad.wmnet [08:41:06] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:43:11] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1031067|Fix the loss of ParserOutput pointer in ContentDOMTransformStages (T364597)]] (duration: 16m 17s) [08:43:15] T364597: Missing content on discussion tools on Parsoid - https://phabricator.wikimedia.org/T364597 [08:43:18] ihurbain: done [08:43:23] yay! [08:43:38] hashar, andre: we're done :) [08:43:48] dcausse: thank you very much; thank you hashar and andre too for accepting the train delay! [08:44:19] thanks [08:44:32] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002" [08:44:34] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1004.eqiad.wmnet [08:45:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002" [08:45:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:45:21] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster1003.eqiad.wmnet on all recursors [08:45:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1003.eqiad.wmnet on all recursors [08:45:33] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:46:32] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002" [08:47:05] back with a coffee [08:47:11] andre: wanna do it over a google meet? 
[08:47:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002" [08:48:09] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1005.eqiad.wmnet [08:48:14] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002" [08:48:41] hashar, would be nice to refresh my memories, feel free to join the one in your calendar [08:49:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002" [08:49:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:49:05] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster1004.eqiad.wmnet on all recursors [08:49:08] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1004.eqiad.wmnet on all recursors [08:49:26] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:49:32] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1003.eqiad.wmnet with OS bullseye [08:49:33] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002" [08:49:47] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793530 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast... 
[08:50:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002" [08:52:40] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1004.eqiad.wmnet with OS bullseye [08:52:51] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002" [08:52:52] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793542 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast... [08:54:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002" [08:54:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:54:09] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster1005.eqiad.wmnet on all recursors [08:54:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1005.eqiad.wmnet on all recursors [08:54:40] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002" [08:57:00] (03CR) 10JMeybohm: [C:03+1] Service mesh: rename local_service cluster (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030220 (owner: 10CDanis) [08:57:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002" [08:58:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on serpens.wikimedia.org with reason: OS update [08:58:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on serpens.wikimedia.org with reason: OS update [08:58:27] 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9793550 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=34ac3b76-436c-436c-afc2-20387cde43fb) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with... [09:02:10] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage [09:02:19] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage [09:04:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage [09:05:46] (03PS3) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [09:06:29] (03CR) 10CI reject: [V:04-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [09:06:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage [09:07:59] RECOVERY - Check whether ferm is active by checking the default input chain on mw1393 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:09:03] (03PS1) 10TrainBranchBot: 
group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031396 (https://phabricator.wikimedia.org/T361399) [09:09:05] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031396 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot) [09:09:27] (03PS4) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [09:09:43] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031396 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot) [09:11:55] (03PS7) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) [09:13:02] FIRING: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:13:27] (03CR) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service (032 comments) [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [09:14:09] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1005.eqiad.wmnet with OS bullseye [09:14:22] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast... 
[09:17:01] (03PS3) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) [09:18:02] RESOLVED: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:18:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw1435 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:18:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1003.eqiad.wmnet with OS bullseye [09:18:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1003.eqiad.wmnet [09:18:51] FIRING: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:19:04] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793580 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10... 
[09:19:59] (03CR) 10CI reject: [V:04-1] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [09:20:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1004.eqiad.wmnet with OS bullseye [09:20:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1004.eqiad.wmnet [09:20:15] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10... [09:21:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:22:19] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:23:02] RESOLVED: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:23:05] (03PS1) 10Jcrespo: dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) [09:24:07] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2421/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 
(https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [09:24:20] (03PS4) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) [09:24:45] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.5 refs T361399 [09:24:49] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399 [09:25:50] (03CR) 10Vgutierrez: "Idea looks good. PCC run needs to be adjusted, cp hosts aren't involved here and acme-chief ones are missing" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [09:26:03] (03CR) 10CI reject: [V:04-1] dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [09:26:48] 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9793591 (10MoritzMuehlenhoff) serpens has been migrated to Bullseye, seaborgium to follow in a few days. 
[09:27:01] (03PS3) 10JMeybohm: Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) [09:27:11] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage [09:27:19] (03CR) 10CI reject: [V:04-1] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [09:27:30] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [09:28:17] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:28:36] 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9793592 (10ayounsi) @cmooney what do you think of duplicating the other POPs allocation scheme? For example looking at eqiad as example, keep 2a02:ec80:a000::/40 as "reserved for future growth" Then... 
[09:31:25] (03CR) 10Marostegui: db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui) [09:31:32] jouncebot: now [09:31:32] For the next 0 hour(s) and 28 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800) [09:31:35] jouncebot: next [09:31:35] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1000) [09:31:41] (03PS3) 10Filippo Giunchedi: jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) [09:31:41] (03PS3) 10Filippo Giunchedi: jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) [09:31:41] (03PS3) 10Filippo Giunchedi: jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) [09:31:47] marostegui: we have finished promoting the train :) [09:31:55] and currently browsing the log spam with andre [09:31:55] hashar: thanks :) [09:31:58] so feel free to deploy [09:31:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage [09:32:03] hashar: <3 [09:32:08] I have a couple database that have vanished though [09:32:14] uh? 
[09:32:21] Unknown database 'wikishared' (db1223) [09:32:32] (03CR) 10CI reject: [V:04-1] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [09:32:35] Error 1049: Unknown database 'cognate_wiktionary' [09:32:35] :) [09:32:36] (03CR) 10CI reject: [V:04-1] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [09:32:46] hashar: I’m looking at the cognate_wiktionary rn [09:32:46] (03CR) 10CI reject: [V:04-1] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [09:33:03] hashar: db1223 isn't supposed to have wikishared [09:33:06] they happen from time to time, I guess cause the code paths are not hit that often [09:33:09] so something might be wrong with the code [09:33:16] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:33:20] (03PS2) 10JMeybohm: Add kubestagemaster2004 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1030958 (https://phabricator.wikimedia.org/T363307) [09:33:28] * hashar points at DNS [09:33:29] err [09:33:31] PHP [09:33:39] (03CR) 10Ladsgroup: [C:03+1] db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui) [09:33:42] db1223 is s3 and wikishared lives in x1 [09:33:44] Lucas_WMDE: danke schon! 
[09:33:47] (03CR) 10Marostegui: [C:03+2] db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui) [09:34:30] (03Merged) 10jenkins-bot: db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui) [09:35:01] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:1030918|db-production.php: Make es4 and es5 RO (T364447)]] [09:35:02] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:35:06] T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447 [09:35:21] marostegui: I will investigate a bit. Maybe it is known [09:35:29] hashar: let me know if I can help [09:36:14] I suspect it is a regression with this week code [09:36:18] I am filing a task [09:37:25] hashar: I just filed T364827 for the cognate task [09:37:26] T364827: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'cognate_wiktionary' - https://phabricator.wikimedia.org/T364827 [09:37:30] no idea if wikishared is related though [09:37:38] I am digging into it [09:37:52] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1030918|db-production.php: Make es4 and es5 RO (T364447)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:37:55] apparently comes from a post job from LoginNotify [09:37:58] !log marostegui@deploy1002 marostegui: Continuing with sync [09:39:59] Lucas_WMDE: just commented there [09:40:20] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2004 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1030958 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:40:32] Both things are related, they are both looking for databases in s3, but those two 
databases: cognate_wiktionary and wikishared live in x1 [09:41:39] Is testwiki.ce_question_answers in that case as well? [09:41:52] mediawiki_job_campaignevents-aggregateparticipantanswers-testwiki is failing due to not finding this table [09:42:12] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:42:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:42:15] claime: test wiki does live in s3 [09:42:30] claime: let me check if that table exists [09:42:50] claime: cumin2024@db1175.eqiad.wmnet[testwiki]> show tables like 'ce_qu%'; [09:42:50] Empty set (0.001 sec) [09:43:28] So there is no such table in testwiki [09:44:35] The timing feels weird though. The 3AM run went fine, the 6AM run failed [09:45:07] Ah, the train ran in between [09:45:13] I am going to rollback the train [09:45:16] in a few minutes [09:45:20] hashar: I don’t think it’s the train [09:45:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1005.eqiad.wmnet with OS bullseye [09:45:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1005.eqiad.wmnet [09:45:24] since I can also reproduce it on dewiktionary [09:45:28] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:45:29] (I’ll leave a comment on the task in a sec) [09:45:37] ah [09:45:56] there is also https://phabricator.wikimedia.org/T364828 which is not able to find the wikishared database due to a misconfig [09:46:06] so I suspect maybe the database layer might be confused / wrong [09:46:27] that other task is for the LoginNotify extension and I guess that breaks its feature [09:46:45] might be worth trying a train rollback anyway, I guess [09:47:33] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1003.eqiad.wmnet to plain [09:47:37] !log jayme@cumin1002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of kubestagemaster1003.eqiad.wmnet to plain [09:47:40] (03CR) 10Volans: [C:03+2] external clouds: allow to get prefixes from RIPE [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans) [09:47:46] hashar: scratch all that, I fail at testing [09:47:50] dewiktionary not affected AFAICT [09:47:52] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1003.eqiad.wmnet to plain [09:47:54] so probably train after all [09:48:04] (03CR) 10Ladsgroup: Enable section-wide circuit breaking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [09:48:20] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793708 (10ops-monitoring-bot) VM kubestagemaster1003.eqiad.wmnet switching disk type to plain [09:48:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1003.eqiad.wmnet to plain [09:48:32] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of
kubestagemaster1004.eqiad.wmnet to plain [09:48:37] (03CR) 10Marostegui: [C:03+1] "Sounds good, thanks a lot for working on this. This is super great" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [09:48:58] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793711 (10ops-monitoring-bot) VM kubestagemaster1004.eqiad.wmnet switching disk type to plain [09:49:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1004.eqiad.wmnet to plain [09:49:14] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1005.eqiad.wmnet to plain [09:49:38] Lucas_WMDE: yeah I think that issue with Cognate is similar to the one with LoginNotify [09:49:45] and probably share the same cause [09:49:52] I have marked both UBN / Blockers [09:50:03] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793712 (10ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to plain [09:50:08] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1005.eqiad.wmnet to plain [09:50:08] and they should be reproducible on the test wikis [09:50:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org [09:50:14] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 [09:50:19] I'll file a bug for CampaignEvents, since it's a missing table and not a db not found, seems like a different issue [09:50:29] !log marostegui@deploy1002 
Finished scap: Backport for [[gerrit:1030918|db-production.php: Make es4 and es5 RO (T364447)]] (duration: 15m 28s) [09:50:33] T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447 [09:51:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:05:00 on 6 hosts with reason: Primary switchover es4 T364451 [09:51:48] T364451: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T364451 [09:52:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on 6 hosts with reason: Primary switchover es4 T364451 [09:52:29] (03CR) 10Ladsgroup: "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [09:52:32] (03PS3) 10Ladsgroup: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) [09:52:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:05:00 on 6 hosts with reason: Checking RO status [09:52:47] I am rolling back now [09:52:52] (03CR) 10CI reject: [V:04-1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 (owner: 10Muehlenhoff) [09:53:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on 6 hosts with reason: Checking RO status [09:53:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [09:54:11] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031404 (https://phabricator.wikimedia.org/T361399) [09:54:12] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031404 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot) [09:54:36] (03Abandoned) 10Hashar: 
testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031404 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot) [09:55:41] (03PS4) 10Filippo Giunchedi: jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) [09:55:41] (03PS4) 10Filippo Giunchedi: jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) [09:56:00] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793752 (10JMeybohm) 05Open→03Resolved [09:56:03] (03PS1) 10Hashar: Revert "group0 wikis to 1.43.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031405 (https://phabricator.wikimedia.org/T361399) [09:56:04] (03CR) 10Hashar: [C:03+2] Revert "group0 wikis to 1.43.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031405 (https://phabricator.wikimedia.org/T361399) (owner: 10Hashar) [09:56:15] andre: ^ [09:56:42] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.43.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031405 (https://phabricator.wikimedia.org/T361399) (owner: 10Hashar) [09:58:02] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:22] (03PS1) 10Muehlenhoff: Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 [09:58:32] (03PS2) 10Muehlenhoff: Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 [09:58:51] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:19] (03CR) 10Filippo Giunchedi: "The CI failure is expected, I kept the change in aux values.yaml separate (Ie5b4213379b) to highlight what makes CI pass. Though I can mer" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1000) [10:00:38] (train window not done yet) [10:00:53] (03CR) 10Alexandros Kosiaris: [C:03+1] Service mesh: rename local_service cluster (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030220 (owner: 10CDanis) [10:01:26] (03CR) 10CI reject: [V:04-1] Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 (owner: 10Muehlenhoff) [10:01:29] (03PS1) 10Marostegui: site.pp: Clarify the status of each section [puppet] - 10https://gerrit.wikimedia.org/r/1031408 (https://phabricator.wikimedia.org/T364447) [10:01:57] FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:02:35] (03PS2) 10Wargo: Assign applychangetags right to group "all" on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031174 (https://phabricator.wikimedia.org/T363638) [10:03:02] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:10] it is rolling back [10:04:05] (03CR) 10Marostegui: [C:03+2] site.pp: Clarify the status of each section [puppet] - 10https://gerrit.wikimedia.org/r/1031408 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui) [10:04:11] (03CR) 10Alexandros Kosiaris: [C:03+1] Service mesh: rename local_service cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030221 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [10:05:07] (03CR) 10Marostegui: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [10:05:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw2383 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:05:15] (03PS3) 10Muehlenhoff: Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 [10:05:34] (03CR) 10Marostegui: "Thanks for working on this guys!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [10:07:06] (03PS3) 10Gmodena: EventStreamConfig: Add webrequest.frontend.v1. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) [10:07:55] (03CR) 10MVernon: [C:03+1] profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:08:31] (03CR) 10MVernon: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:11:36] (03CR) 10Muehlenhoff: [C:03+2] ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:11:38] (03CR) 10Kamila Součková: [C:03+1] benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [10:11:58] (03CR) 10Majavah: [C:03+1] Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 (owner: 10Muehlenhoff) [10:12:13] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:12:14] (03CR) 10Muehlenhoff: [C:03+2] Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 (owner: 10Muehlenhoff) [10:12:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:12:28] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.43.0-wmf.5" - T361399 [10:12:35] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399 [10:13:30] (03PS2) 10Wargo: Add alias for NS_PROJECT for Multilingual Wikisource 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031175 (https://phabricator.wikimedia.org/T363904) [10:13:50] (03PS2) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 [10:16:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:17:38] hashar: I can get the fix out of the door soon, wanna retry again soon? [10:17:43] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2004.codfw.wmnet [10:17:44] !log jayme@cumin1002 conftool action : set/weight=10; selector: name=kubestagemaster2004.codfw.wmnet [10:18:02] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw1435 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:21:07] (03PS2) 10Muehlenhoff: profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) [10:21:33] PROBLEM - MariaDB Replica Lag: s8 on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 36093.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:49] Amir1: well train is rolled back so we are all set :] There is time to polish the patch, but potentially maybe the code should be reverted since it might 
have other faults and lack tests [10:22:08] or maybe it is trivial code that is already covered by tests, then one test is surely missing [10:22:10] anyway [10:22:31] it is covered by tests, I'll add regression test soon [10:22:38] yeah that would be great :) [10:22:50] there is no rush. Train got rolled back [10:22:56] db2152 is expected [10:22:59] I am off for lunch break with kids :) [10:23:01] But I thought I downtimed it [10:23:41] (03PS1) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) [10:23:43] ACKNOWLEDGEMENT - MariaDB Replica Lag: s8 on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 36213.57 seconds Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:24:05] (03CR) 10CI reject: [V:04-1] external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [10:24:20] Amir1: there is another blocker for VisualEditor which needs a backport as well.
I will give it a try after lunch [10:24:32] so essentially: no rush [10:25:42] (03PS2) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) [10:26:34] (03PS5) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) [10:26:35] (03CR) 10JMeybohm: [C:04-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [10:29:03] (03PS2) 10Wargo: $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) [10:33:00] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9793897 (10MoritzMuehlenhoff) But isn't it simpler to just grep in the output of a single cookbook as opposed to grep the output of multiple tools?
[10:33:22] (03CR) 10Muehlenhoff: [C:03+2] profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:35:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw2383 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:35:58] (03PS2) 10Muehlenhoff: Inline profile::swift::proxy_tls [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) [10:38:27] (03PS1) 10Ladsgroup: rdbms: Fix picking the database from the LB domain [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) [10:38:39] (03PS1) 10Klausman: ml-services: Change references to cassandra clusters from using _ to - [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) [10:38:40] (03CR) 10Ladsgroup: [C:03+2] rdbms: Fix picking the database from the LB domain [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) (owner: 10Ladsgroup) [10:39:18] (03PS1) 10Btullis: Improve dumps::web::rsync::nginxlogs management [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) [10:39:19] (03PS1) 10Btullis: Manage the directory for dumps.wikimedia.org logs on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820) [10:40:55] (03PS3) 10Wargo: $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) [10:43:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:44:50] (03CR) 10Volans: "I 
didn't check if the ASN correspond to the existing IPs in requestctl, but they are matching the names." [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [10:45:20] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2422/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis) [10:46:42] (03PS3) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) [10:47:07] (03CR) 10CI reject: [V:04-1] external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [10:48:02] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:48:45] (03CR) 10Ladsgroup: "I'd say keep it simple. We don't need to introduce too many functions. I just deploy this then." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [10:49:00] (03PS4) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) [10:49:43] (03PS2) 10Btullis: Improve dumps::web::rsync::nginxlogs management [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) [10:49:43] (03PS2) 10Btullis: Manage the directory for dumps.wikimedia.org logs on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820) [10:51:21] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis) [10:52:29] (03CR) 10Jelto: "I added all related ASNs I could find as discussed in IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [10:56:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) (owner: 10Ladsgroup) [10:59:36] 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9793972 (10cmooney) >>! In T187929#9793592, @ayounsi wrote: > @cmooney what do you think of duplicating the other POPs allocation scheme? > For example looking at eqiad as example, keep 2a02:ec80:a00... 
[11:00:04] (03PS2) 10Jcrespo: dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) [11:00:23] (03PS1) 10Muehlenhoff: gerrit::migration: Let rsync handle the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1031423 [11:02:24] (03Merged) 10jenkins-bot: rdbms: Fix picking the database from the LB domain [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) (owner: 10Ladsgroup) [11:02:33] (03PS5) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) [11:02:53] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1031177|rdbms: Fix picking the database from the LB domain (T364827)]] [11:02:56] T364827: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'cognate_wiktionary' - https://phabricator.wikimedia.org/T364827 [11:03:17] (03CR) 10Jelto: "let's start with a small set of ASNs first and expand if needed" [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [11:03:22] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [11:03:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031423 (owner: 10Muehlenhoff) [11:03:27] (03CR) 10MVernon: [C:03+1] "Looks reasonable to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [11:04:33] (03CR) 10Jcrespo: [C:03+2] dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [11:05:31] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1031177|rdbms: Fix picking the database from the LB domain (T364827)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:05:57] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [11:06:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [11:06:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [11:07:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T364299)', diff saved to https://phabricator.wikimedia.org/P62378 and previous config saved to /var/cache/conftool/dbconfig/20240514-110704-marostegui.json [11:07:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:10:17] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:11:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:12:54] (03CR) 10Btullis: [V:03+1 C:03+2] Improve dumps::web::rsync::nginxlogs management [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis) [11:13:02] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P62379 and previous config saved to /var/cache/conftool/dbconfig/20240514-111302-root.json [11:13:41] RECOVERY - MariaDB Replica Lag: s8 on db2152 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:14:31] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9794015 (10WDoranWMF) Approved! [11:16:13] PROBLEM - dump of es6 in eqiad on backupmon1001 is CRITICAL: Last dump for es6 at eqiad (es1036) taken on 2024-05-14 06:09:20 is 1.7 GiB, but the previous one was 328 KiB, a change of +544922.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:18:38] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! One open-question in line but I am happy either way. Nice one :)" [puppet] - 10https://gerrit.wikimedia.org/r/1030185 (https://phabricator.wikimedia.org/T363702) (owner: 10Bking) [11:18:40] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1031177|rdbms: Fix picking the database from the LB domain (T364827)]] (duration: 15m 47s) [11:18:46] T364827: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'cognate_wiktionary' - https://phabricator.wikimedia.org/T364827 [11:19:13] PROBLEM - dump of es7 in codfw on backupmon1001 is CRITICAL: Last dump for es7 at codfw (es2040) taken on 2024-05-14 05:35:25 is 1.7 GiB, but the previous one was 329 KiB, a change of +533277.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:21:58] (03CR) 10Btullis: [C:03+2] Manage the directory for dumps.wikimedia.org logs on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis) [11:22:13] PROBLEM - dump of es7 in eqiad on backupmon1001 is CRITICAL: Last dump 
for es7 at eqiad (es1040) taken on 2024-05-14 06:10:27 is 1.7 GiB, but the previous one was 329 KiB, a change of +543674.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:22:21] (03CR) 10Muehlenhoff: [C:03+2] Inline profile::swift::proxy_tls [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [11:23:02] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:19] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:24:44] (03PS5) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [11:28:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P62381 and previous config saved to /var/cache/conftool/dbconfig/20240514-112807-root.json [11:29:24] (03PS2) 10Ladsgroup: etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) [11:29:29] (03CR) 10Ladsgroup: [C:03+2] etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [11:29:43] jouncebot: nowandnext [11:29:43] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [11:29:43] In 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1200) [11:29:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [11:30:05] (03Merged) 10jenkins-bot: etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup) [11:30:35] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1030938|etcd: Ignore parsercache clusters in externalLoads (T362786)]] [11:30:41] T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786 [11:31:49] (03CR) 10JMeybohm: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [11:32:45] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9794071 (10KOfori) a:05KOfori→03Eevans Approved. [11:33:12] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1030938|etcd: Ignore parsercache clusters in externalLoads (T362786)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:35:05] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [11:38:03] (03CR) 10Hnowlan: [C:03+1] "lgtm. 
In future it would be nice to use either the external-services-networkpolicy module or a shared approach for sessionstore and echost" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [11:39:56] jouncebot: nowandnext [11:39:56] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [11:39:56] In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1200) [11:40:17] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1030 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:43:13] PROBLEM - dump of es6 in codfw on backupmon1001 is CRITICAL: Last dump for es6 at codfw (es2036) taken on 2024-05-14 05:33:02 is 1.7 GiB, but the previous one was 328 KiB, a change of +534582.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:43:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P62382 and previous config saved to /var/cache/conftool/dbconfig/20240514-114314-root.json [11:47:58] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1030938|etcd: Ignore parsercache clusters in externalLoads (T362786)]] (duration: 17m 22s) [11:48:02] T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786 [11:58:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P62383 and previous config saved to /var/cache/conftool/dbconfig/20240514-115820-root.json [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1200) [12:01:49] (03CR) 10Ladsgroup: [C:03+2] Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 
(https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [12:02:28] (03Merged) 10jenkins-bot: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [12:03:18] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] [12:03:23] T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930 [12:06:00] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:08:02] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:06] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9794186 (10Clement_Goubert) We are currently holding at 85% of global traffic, and as such not reimaging anymore serv... 
[12:11:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:11:53] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [12:12:00] (03PS1) 10Muehlenhoff: Zookeeper: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1031429 [12:12:27] (03CR) 10Btullis: [C:03+2] Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [12:13:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P62384 and previous config saved to /var/cache/conftool/dbconfig/20240514-121326-root.json [12:14:24] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 (owner: 10Muehlenhoff) [12:15:48] (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 (owner: 10Muehlenhoff) [12:16:15] PROBLEM - Check whether ferm is active by checking the default input chain on parse1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:16:36] (03PS1) 10Clément Goubert: mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) [12:16:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw1387 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:16:41] PROBLEM - Check whether ferm is 
active by checking the default input chain on mw1352 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:17:07] PROBLEM - Check whether ferm is active by checking the default input chain on mw2320 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:18:02] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031429 (owner: 10Muehlenhoff) [12:20:07] PROBLEM - Check whether ferm is active by checking the default input chain on mw2381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:23:19] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:31] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] (duration: 21m 12s) [12:24:34] T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930 [12:24:53] (03CR) 10Vgutierrez: [C:03+2] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:24:58] (03PS2) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) [12:25:29] 
(03CR) 10Jelto: [C:03+2] external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto) [12:26:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host serpens.wikimedia.org [12:27:55] (03PS1) 10Muehlenhoff: Switch serpens to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031431 (https://phabricator.wikimedia.org/T349619) [12:29:45] (03CR) 10Brouberol: [C:03+1] "LG thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:30:29] is CI doing ok? I've been waiting ~6 minutes already for a CI check on https://gerrit.wikimedia.org/r/c/operations/dns/+/1030939 [12:30:58] vgutierrez: https://integration.wikimedia.org/zuul/ [12:31:15] that gives you an overview [12:31:34] hashar: weirdly I don't see an operations/dns patch there at all [12:31:49] yeah.. it doesn't seem to be queued [12:31:52] then without looking, we are reimaging one of the servers and currently run with half the capacity [12:32:00] though in practice it is rarely used fully [12:32:02] operations/dns is not configured to run any gate-and-submit jobs on a +2 [12:32:22] taavi: but it should be triggered on the rebase?
[12:32:35] the bottleneck would be the `zuul-merger` process which picks the proposed patch, merges it against the tip of the branch and the result is used by CI to run the tests [12:32:49] oh [12:33:01] not if there's a +2 applied at that point [12:33:07] uh [12:33:21] (03CR) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:33:48] (03CR) 10CDanis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:33:50] that repo is broken [12:33:59] the rebase should have cleared the +2 [12:34:15] hashar: it cleared the V:+2 not the C:+2 [12:34:56] (03CR) 10Vgutierrez: [C:03+2] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:34:57] * hashar shakes fist at CopyOnTrivialRebase [12:35:14] historically ops/dns had a simpler config (no gate-and-submit, etc) because we hoped it might still function when some things are broken :) [12:35:17] !log depool upload@ulsfo before enabling IPIP encapsulation - T357257 [12:35:20] + it is fast forward only [12:35:28] when in most cases we migrated repositories to use rebase if necessary [12:36:03] ff-only kinda makes sense there too, IMHO [12:36:42] possibly yes :) [12:38:05] but a lot of this is fear-driven engineering, and we've never had a compelling story for how an SRE deploys an emergency DNS change when all the things are broken (other than a very manual and tedious way) [12:38:47] if we ever "solve" the latter with something slightly-more-elegant, maybe we can care less about the dependencies involved in the "normal" flow [12:39:04] OH I FOUND OUT [12:39:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host serpens.wikimedia.org [12:39:17] the Code-Review label got
copied cause that is just a rebase [12:39:21] bblack: IME, the only way you ever work your way out of that cycle is by having a process and exercising it semi-regularly (even 1x-2x/yr is enough) [12:39:33] and usually we do want to copy the Code-Review+1 and carry it between rebases [12:40:12] when there is a CR+2, that is usually intended to trigger a submit/merge so it is unlikely a rebase follows and even if that is the case, I guess we might still want to carry the CR+2 [12:40:19] but for operations/dns the CR+2 does nothing yeah [12:40:56] and there is an optimization in CI to not bother running tests from the `test` or `test-prio` pipelines for a change that already has a CR+2 since they will be run by the `gate-and-submit` pipeline [12:41:00] so yeah that is "normal" [12:41:06] (sorry I am thinking out loud) [12:41:38] (FWIW, in “normal” / extension repos I find CR+2 surviving a rebase to be a useful feature ^^) [12:41:48] yeah [12:41:55] Lucas_WMDE: sure, when it would trigger anything at all :) [12:42:02] yeah ^^ [12:42:09] but operations/dns does not have CI merging changes on a CR+2 [12:42:31] I think the reason was that at the time we did not want CI to submit changes to sensitive repositories such as operations/dns and operations/puppet [12:42:54] so those two repos have a different process [12:43:40] (03PS1) 10LSobanski: Filter out addresses handled by gsuite that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) [12:44:37] also operations/puppet used to have "fast forward only" merge strategy which caused SRE to spend their time racing to have their change rebased on tip of the branch https://phabricator.wikimedia.org/T224033 [12:45:17] that got solved by changing the strategy to "rebase if necessary" which is that Gerrit rebases it under the hood and keeps a linear history [12:45:45] yeah, ff-only isn't really sustainable at a higher commit rate [12:46:01] but luckily ops/dns is relatively-slow
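(Editorial note on the 12:30-12:42 exchange.) The behaviour worked out above can be summarised as a small model. This is only a sketch of what the chat describes, not the actual Zuul or Gerrit configuration: the function names, the `GATED_REPOS` set and its example entry are hypothetical, and it assumes, as the 12:40:56 message implies, that a change carrying a CR+2 is left to `gate-and-submit` wherever that pipeline exists.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical: repos where a CR+2 is picked up by a gate-and-submit pipeline.
# Per the chat, operations/dns is NOT one of them.
GATED_REPOS: FrozenSet[str] = frozenset({"mediawiki/extensions/Echo"})


def votes_after_trivial_rebase(votes: Dict[str, int]) -> Dict[str, int]:
    """Model of the rebase described at 12:34:15 and 12:39:17:
    Code-Review is copied to the new patchset, Verified is cleared."""
    return {label: value for label, value in votes.items()
            if label == "Code-Review"}


def expected_pipeline(repo: str, votes: Dict[str, int],
                      gated_repos: FrozenSet[str] = GATED_REPOS) -> Optional[str]:
    """Return the pipeline expected to test a new patchset, or None.

    A change that already has CR+2 skips the `test` pipeline because
    `gate-and-submit` is assumed to cover it; in a repo without
    `gate-and-submit`, nothing runs until someone comments "recheck".
    """
    if votes.get("Code-Review", 0) >= 2:
        return "gate-and-submit" if repo in gated_repos else None
    return "test"


# The operations/dns case from the chat: rebase with a sticky C:+2.
rebased = votes_after_trivial_rebase({"Code-Review": 2, "Verified": 2})
print(expected_pipeline("operations/dns", rebased))  # None
```

In this model the rebased operations/dns change has no pipeline at all, which matches what was observed in the chat: the patch never showed up in the Zuul queue until a "recheck" comment was posted.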
[12:46:15] RECOVERY - Check whether ferm is active by checking the default input chain on parse1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:46:26] operations/dns is still using "fast forward only". I think the only advantage for it is that if the branch received a change to one of the files touched by the change, Gerrit will mark it as being in conflict in the web ui [12:46:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw1387 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:46:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw1352 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:47:03] (03CR) 10Ladsgroup: "This is making the data flow a bit unclear to me. I prefer all etcd value overrides be set in one place. It involves setting global variab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [12:47:07] RECOVERY - Check whether ferm is active by checking the default input chain on mw2320 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:48:47] RECOVERY - snapshot of s8 in eqiad on backupmon1001 is OK: Last snapshot for s8 at eqiad (db1171) taken on 2024-05-14 11:16:50 (1594 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:50:07] RECOVERY - Check whether ferm is active by checking the default input chain on mw2381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:54:35] (03PS23) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [12:55:00] (03CR) 10Brennen 
Bearnes: [V:03+2 C:03+2] Make Translations extension work with upstream Phorge [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: 10Aklapper) [12:55:59] (03PS1) 10Lucas Werkmeister (WMDE): admin: Update deployment description [puppet] - 10https://gerrit.wikimedia.org/r/1031435 [12:56:08] (03CR) 10Lucas Werkmeister (WMDE): "Just a little suggestion :)" [puppet] - 10https://gerrit.wikimedia.org/r/1031435 (owner: 10Lucas Werkmeister (WMDE)) [12:57:02] (03PS1) 10Vgutierrez: lvs: Skip ferm rules if firewall provider is none [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) [12:57:37] RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2198) taken on 2024-05-14 11:51:06 (1632 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:58:41] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [12:59:28] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [12:59:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 12:00:00 on db2114.codfw.wmnet,db1125.eqiad.wmnet with reason: Testing [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300). [13:00:04] MatmaRex and Jdlrobson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 12:00:00 on db2114.codfw.wmnet,db1125.eqiad.wmnet with reason: Testing [13:00:14] o/ [13:00:26] hi [13:00:50] uhh, let's skip my first patch again, i'm looking at comments in slack now that say it might not be correct [13:01:00] ok, sure [13:01:01] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2426/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [13:01:30] the other two are good to go (and should have no effect for users, but should reduce database load a tiny bit) [13:01:49] is it okay to deploy them together? [13:01:59] to save a bit of time [13:02:14] (03PS4) 10Bartosz Dziewoński: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) [13:02:19] (03PS4) 10Bartosz Dziewoński: Use ConditionalUserOptions for "discussiontools-autotopicsub" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) [13:02:23] yeah [13:03:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński) [13:03:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński) [13:04:04] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2427/console" [puppet] - 
10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:04:12] (03Merged) 10jenkins-bot: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński) [13:04:15] (03Merged) 10jenkins-bot: Use ConditionalUserOptions for "discussiontools-autotopicsub" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński) [13:04:24] (03CR) 10Jelto: [V:03+1] gitlab: enable custom exporter on all instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [13:04:31] * Lucas_WMDE subscribes to the slack thread [13:04:43] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1030532|Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (T357221)]], [[gerrit:1030535|Use ConditionalUserOptions for "discussiontools-autotopicsub" (T357221)]] [13:04:48] T357221: Handle preferences for new users using "ConditionalUserOptions" config instead of "LocalUserCreated" hook inserting preference rows - https://phabricator.wikimedia.org/T357221 [13:05:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2152.codfw.wmnet [13:05:56] (03PS3) 10Jelto: prometheus::ops: scrape custom gitlab exporter [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) [13:06:37] (03PS1) 10Muehlenhoff: Switch db2152 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031438 (https://phabricator.wikimedia.org/T349619) [13:07:17] !log lucaswerkmeister-wmde@deploy1002 matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:1030532|Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (T357221)]], [[gerrit:1030535|Use ConditionalUserOptions for 
"discussiontools-autotopicsub" (T357221)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:15] (03CR) 10Muehlenhoff: [C:03+2] Switch db2152 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031438 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:08:26] !log ayounsi@cumin1002 START - Cookbook sre.hosts.dhcp for host netmon2002.wikimedia.org [13:08:31] MatmaRex: can you test the two conditional options? [13:08:52] or do they not make a difference until https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/1030533/1/includes/Hooks/PreferenceHooks.php is merged? [13:08:54] (03CR) 10Klausman: [C:03+2] ml-services: Change references to cassandra clusters from using _ to - [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:09:13] (03PS1) 10Jforrester: Convert function to arrow function to fix context [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031180 (https://phabricator.wikimedia.org/T364783) [13:09:19] Lucas_WMDE: not really, the extensions/DiscussionTools code redundantly does the same thing [13:09:45] (03Merged) 10jenkins-bot: ml-services: Change references to cassandra clusters from using _ to - [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:09:51] ok [13:10:01] well I am back [13:10:14] !log lucaswerkmeister-wmde@deploy1002 matmarex and lucaswerkmeister-wmde: Continuing with sync [13:10:18] I have made the mistake to open Slack and attempt to catch up with 2 weeks worth of backlog [13:10:23] oh no [13:10:39] (03PS1) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) [13:10:53] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:11:10] (03PS2) 10Vgutierrez: lvs: Skip ferm rules if firewall provider is not ferm [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) [13:11:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2152.codfw.wmnet [13:12:23] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2429/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:12:44] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:15:10] (03PS2) 10Muehlenhoff: Zookeeper: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1031429 [13:15:10] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2428/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [13:15:16] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on upload and upload-https services [puppet] - 10https://gerrit.wikimedia.org/r/1030022 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:16:13] (03PS2) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) [13:16:51] (03CR) 10Jelto: [V:03+1] prometheus::ops: scrape custom gitlab exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [13:17:36] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:17:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:18:19] (03CR) 10Vgutierrez: [V:03+1 C:03+2] "Thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:18:20] (03PS1) 10DCausse: cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) [13:18:29] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:18:44] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:19:00] (03CR) 10Muehlenhoff: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1031435 (owner: 10Lucas Werkmeister (WMDE)) [13:19:19] (03CR) 10Kamila Součková: [C:03+1] mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [13:19:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031429 (owner: 10Muehlenhoff) [13:19:29] (03PS1) 10Peter Fischer: Search update pipeline: fix for long rev IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031442 [13:19:30] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP encapsulation on high-traffic2@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030021 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:19:55] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: fix for long rev IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031442 (owner: 10Peter Fischer) [13:19:56] (03PS1) 10Clément Goubert: kubernetes: Space out ferm icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) [13:19:58] moritzm: ok to merge Moritz Mühlenhoff: admin: Update deployment description (af05b685a9) :? 
[13:20:30] (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [13:20:51] (03Merged) 10jenkins-bot: Search update pipeline: fix for long rev IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031442 (owner: 10Peter Fischer) [13:21:34] (03Merged) 10jenkins-bot: mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [13:22:15] andre: so A.mir fixed the database issue :) [13:22:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:22:37] hashar, yay [13:22:43] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1030532|Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (T357221)]], [[gerrit:1030535|Use ConditionalUserOptions for "discussiontools-autotopicsub" (T357221)]] (duration: 17m 59s) [13:22:46] T357221: Handle preferences for new users using "ConditionalUserOptions" config instead of "LocalUserCreated" hook inserting preference rows - https://phabricator.wikimedia.org/T357221 [13:22:54] and it looks like the train blocker was ... not a train blocker :) [13:23:06] hashar: second deployment attempt? 
:) (should probably move to releng) [13:23:14] (03CR) 10Filippo Giunchedi: Move Swift on thanos-fe1001 to PKI TLS cert (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:24:03] andre: we do the mediawiki train sync up here :] [13:24:07] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [13:24:08] (03PS4) 10Vgutierrez: cache: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) [13:24:18] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [13:24:19] hashar, ah, alright. I'll shut up and watch. [13:24:21] albeit it is really spammy nowadays with all those bots :-\ [13:24:25] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host netmon2002.wikimedia.org [13:24:46] jouncebot: now [13:24:46] For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300) [13:25:02] (03PS6) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [13:25:11] Jdlrobson: around? 
[13:25:18] (03CR) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [13:25:22] (03CR) 10Muehlenhoff: kubernetes: Space out ferm icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [13:25:32] otherwise I would be done with the window at the moment (fyi hashar) [13:25:36] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [13:25:42] but I’d wait a few minutes to see if jon shows up [13:25:43] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [13:25:52] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2433/co" [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:25:53] (03CR) 10Elukey: [V:03+1] "Just for precaution, I checked the list of IPs in thanos-fe1001 hitting the 443 port, and compared them with k8s eqiad IPs. 
This is the li" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:26:14] the depends-on is not even correct [13:26:44] anyway I digress [13:26:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski) [13:27:51] James_F seemed like he wanted to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1031180 [13:28:07] vgutierrez: sorry, yes please [13:28:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2154.codfw.wmnet [13:28:49] (03PS7) 10Effie Mouzeli: flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) [13:29:02] (03PS3) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) [13:29:37] (03PS2) 10Clément Goubert: kubernetes: Space out ferm icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) [13:29:37] @seen James_F [13:29:54] meh, I don’t remember the magic IRC incantation to see if it’s his working time or not ^^ [13:30:04] but yeah MatmaRex that looks like a reasonable change to backport [13:30:21] hey Lucas_WMDE [13:30:22] (03PS3) 10Hashar: Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [13:30:23] (03CR) 10Bartosz Dziewoński: [C:03+1] "De-scheduled, I'm no longer sure that this is correct. 
See discussion in https://wikimedia.slack.com/archives/C01R06P8D1B/p171564938410743" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [13:30:27] (03CR) 10Bartosz Dziewoński: [C:04-1] Update wgCdnMaxAge value and documentation to match Varnish [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [13:30:29] sorry i got the time wrong by an hour [13:30:41] (and relying on jetlag haha) [13:31:09] (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2434/" [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [13:31:44] (03CR) 10Clément Goubert: [V:03+1] kubernetes: Space out ferm icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [13:31:59] hi! [13:32:00] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "kicking off gate-and-submit" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [13:32:09] (03CR) 10Hashar: "I have removed the `Depends-On` which I guess was for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1031068 and then if th" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [13:32:20] (03Abandoned) 10Hashar: Suppress phan errors caused by UserMerge undeploy [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031068 (https://phabricator.wikimedia.org/T364610) (owner: 10Jdlrobson) [13:32:40] Lucas_WMDE: I removed the depends-on on that patch [13:32:42] hashar: thanks! 
[13:32:43] it was confusing [13:32:43] !log disable puppet on A:cp before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030051 - T357257 [13:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:47] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [13:32:47] and would have prevented the merge I believe [13:32:49] that was driving me mad yesterday [13:33:05] + that was unrelated to Echo or the proposed patch but an issue in CI configuration :) [13:33:35] then [13:34:06] I’m confused by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1030200 [13:34:09] we have code in production still relying on the undeployed UserMerge [13:34:25] my guess is that those code paths are never reached in prod :) [13:34:42] there are some differences between what https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1030200/3/wmf-config/InitialiseSettings.php removes and what https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1030289/2/skin.json does in Vector [13:34:57] do we no longer need the Special:UserLogin / Special:CreateAccount part? [13:34:59] looking [13:35:06] (03CR) 10Effie Mouzeli: [C:03+1] "please give me a shout when you merge this" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:35:10] (03CR) 10Vgutierrez: [V:03+1 C:03+2] cache: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:35:35] andre: I will run the train immediately after the backport window has completed [13:35:45] ok [13:36:05] Lucas_WMDE: oh it looks like Kim changed the patchset. Let's revert back to patchset 2. 
[13:36:07] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:36:08] thanks for checking that [13:36:13] RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.83 ms [13:36:18] !log re-enable puppet on A:cp-text - T357257 [13:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:37:03] (03PS4) 10Jdlrobson: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [13:37:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [13:37:10] (03PS5) 10Jdlrobson: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [13:37:21] Lucas_WMDE: amended [13:37:24] (03PS1) 10Muehlenhoff: Switch db2154 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031445 (https://phabricator.wikimedia.org/T349619) [13:38:19] (03PS2) 10Muehlenhoff: Switch db2154 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031445 (https://phabricator.wikimedia.org/T349619) [13:38:24] Jdlrobson: is it okay to deploy both config changes at once? 
[13:38:33] yep [13:38:43] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.26 ms [13:39:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [13:39:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031047 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [13:39:22] (03CR) 10Muehlenhoff: [C:03+2] Switch db2154 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031445 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:39:50] vgutierrez: ok to merge the upload@ulsfo ipip patch along? [13:40:01] (03Merged) 10jenkins-bot: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [13:40:01] moritzm: errr I merged both [13:40:05] (03Merged) 10jenkins-bot: Phase 5: Vector-2022.js should no longer load legacy Vector code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031047 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [13:40:18] moritzm: or not... [13:40:21] it's still being shown to me? 
[13:40:22] moritzm: yes please :) [13:40:24] I'll merge [13:40:38] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1030200|Deploy disabled limited width on main page (T357706)]], [[gerrit:1031047|Phase 5: Vector-2022.js should no longer load legacy Vector code (T301212)]] [13:40:42] T357706: [config] Disable limited width on the main page and associated history page - https://phabricator.wikimedia.org/T357706 [13:40:43] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [13:40:57] (03CR) 10Jdlrobson: [C:03+1] "Kim: I Reverted to PS2 as deleting the config here as it had other consequences." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia) [13:41:18] vgutierrez: puppet merge complete [13:41:19] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:41:22] moritzm: thx [13:42:28] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [13:42:35] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [13:43:17] !log lucaswerkmeister-wmde@deploy1002 jdlrobson and ksarabia and lucaswerkmeister-wmde: Backport for [[gerrit:1030200|Deploy disabled limited width on main page (T357706)]], [[gerrit:1031047|Phase 5: Vector-2022.js should no longer load legacy Vector code (T301212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:20] Jdlrobson: both should be ready to test with WikimediaDebug now [13:43:24] thanks checking [13:43:26] Lucas_WMDE: Oh, hey, sorry, wasn't looking at IRC. [13:43:34] hi! [13:43:41] Yes, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1031180 would be nice to reduce logspam and make Jdlrobson happy. [13:43:43] the backport window is looking fuller now than a few minutes ago, I’m afraid [13:43:47] No worries. 
[13:43:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2154.codfw.wmnet [13:44:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2161.codfw.wmnet [13:44:07] and I don’t want to overrun too much today as hashar is waiting ^^ [13:44:14] but I guess I could deploy it out-of-window after the train is done [13:44:27] PROBLEM - MariaDB Replica Lag: s8 on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 9378.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:44:29] if at all possible yeah [13:44:29] Sure, no rush on my part. [13:44:35] I need to leave early today [13:44:39] hm, ok [13:44:42] Lucas_WMDE: both look great! please sync [13:44:44] then maybe I should remove my +2 on the Echo patch [13:44:46] !log lucaswerkmeister-wmde@deploy1002 jdlrobson and ksarabia and lucaswerkmeister-wmde: Continuing with sync [13:44:50] and postpone that too [13:44:59] (03PS1) 10Muehlenhoff: Switch 2161 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031447 (https://phabricator.wikimedia.org/T349619) [13:45:00] James_F: I can also deploy it later today if that's helpful [13:45:07] (I do like being happy! haha) [13:45:18] Which I'll need to do if the Echo one doesn't merge [13:45:19] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Rescinding +2 – let’s delay this a bit so the train isn’t postponed even more than necessary. We can deploy it later." [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [13:46:09] Lucas_WMDE: so later will be 1pm PST (UTC late backport window) or were you thinking an out of window after the train? 
[13:46:14] !log re-enable puppet on A:cp-upload - T357257 [13:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:20] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [13:46:23] I was thinking out of window [13:46:43] maybe already before the SRE Collaboration Services office hours [13:47:03] okay cool. I just need to grab some breakfast but I'll be back in 2hrs [13:47:08] to me both fixes seem obvious enough that I’d be okay syncing them without a test [13:47:30] or see how much I can test myself, maybe [13:47:41] I'll be around no problem. Thanks for your help this morning, the config catch, and for waiting for me! [13:48:12] (03PS4) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) [13:49:01] (03CR) 10Filippo Giunchedi: [C:03+1] Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:49:11] Jdlrobson: have a good breakfast! :) [13:50:17] i think it went down and came back up [13:50:29] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364809#9794704 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated. pings on mgmt. 
[13:50:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1425 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:50:52] (03PS1) 10Vgutierrez: Revert "hiera: Enable IPIP encapsulation on high-traffic2@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1031182 [13:51:14] (03PS2) 10Vgutierrez: Revert "hiera: Enable IPIP encapsulation on high-traffic2@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1031182 (https://phabricator.wikimedia.org/T357257) [13:51:34] (03PS1) 10Elukey: Revert "services: move Tegola's Swift config in staging to local envoy proxy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031183 [13:52:17] (03CR) 10Muehlenhoff: [C:03+2] Switch 2161 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031447 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:52:39] (03PS1) 10Vgutierrez: hiera: Disable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031450 (https://phabricator.wikimedia.org/T357257) [13:53:08] (03CR) 10Krinkle: db-production: Generate sectionsByDB on the fly (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe) [13:53:41] (03CR) 10Elukey: [C:03+2] Revert "services: move Tegola's Swift config in staging to local envoy proxy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031183 (owner: 10Elukey) [13:53:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw1451 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:54:17] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2435/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031450 
(https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:54:35] PROBLEM - Check whether ferm is active by checking the default input chain on mw1382 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:54:39] (03PS1) 10Muehlenhoff: Remove obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/1031451 [13:54:45] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Enable IPIP encapsulation on high-traffic2@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1031182 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:56:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [13:56:26] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Disable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031450 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:57:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1030200|Deploy disabled limited width on main page (T357706)]], [[gerrit:1031047|Phase 5: Vector-2022.js should no longer load legacy Vector code (T301212)]] (duration: 16m 32s) [13:57:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2161.codfw.wmnet [13:57:15] T357706: [config] Disable limited width on the main page and associated history page - https://phabricator.wikimedia.org/T357706 [13:57:15] !log UTC afternoon backport+config window done [13:57:15] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [13:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:20] hashar: all yours [13:57:24] \o/ [13:57:33] (03CR) 10Majavah: [C:03+1] Remove obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/1031451 
(owner: 10Muehlenhoff) [13:57:36] andre: I am running the train [13:57:37] (03CR) 10Ladsgroup: db-production: Generate sectionsByDB on the fly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe) [13:57:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2162.codfw.wmnet [13:58:05] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031452 (https://phabricator.wikimedia.org/T361399) [13:58:08] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031452 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot) [13:58:43] (03PS1) 10Muehlenhoff: Switch db2162 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031453 (https://phabricator.wikimedia.org/T349619) [13:58:56] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031452 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot) [13:59:55] (03CR) 10Muehlenhoff: [C:03+2] Switch db2162 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031453 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:59:58] hashar: yay! 
A bit of unexpected real-life interference over here but I'm gonna check logstash too [14:00:31] PROBLEM - Check whether ferm is active by checking the default input chain on parse1018 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:01:53] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T364863 (10phaultfinder) 03NEW [14:03:17] scap is restart fpm [14:03:21] ing [14:03:22] oh my [14:04:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2162.codfw.wmnet [14:04:17] (03CR) 10Volans: [C:04-1] "Many functionalities are available via wmflib that is already installed in all systems." [puppet] - 10https://gerrit.wikimedia.org/r/1030185 (https://phabricator.wikimedia.org/T363702) (owner: 10Bking) [14:05:03] (03PS1) 10JMeybohm: Add kubestagemaster2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1031457 (https://phabricator.wikimedia.org/T363307) [14:05:53] (03PS1) 10Jdlrobson: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212) [14:05:54] (03PS1) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [14:06:09] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:06:47] (03PS1) 10Vgutierrez: Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184 [14:06:51] (03PS1) 10JMeybohm: Add kubestagemaster2005 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1031460 (https://phabricator.wikimedia.org/T363307) [14:06:53] 
!log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [14:07:06] (03CR) 10Ssingh: [C:03+1] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184 (owner: 10Vgutierrez) [14:07:18] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [14:07:22] (03CR) 10Clément Goubert: [V:03+1 C:03+2] kubernetes: Space out ferm icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [14:08:45] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1031457 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [14:09:31] (03PS2) 10Vgutierrez: Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184 [14:10:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2163.codfw.wmnet [14:11:22] (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2005 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1031460 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [14:11:35] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184 (owner: 10Vgutierrez) [14:11:40] (03PS1) 10Muehlenhoff: Switch db2163 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031461 (https://phabricator.wikimedia.org/T349619) [14:12:25] !log repool upload@ulsfo IPIP encapsulation NOT enabled - T357257 [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:28] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:14:18] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.5 refs T361399 [14:14:19] is scap still 
restarting php-fpm? [14:14:22] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399 [14:14:25] ah ^^ [14:14:32] impeccable timing [14:15:19] (03CR) 10Muehlenhoff: [C:03+2] Switch db2163 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031461 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:16:37] yeah [14:16:57] Lucas_WMDE: yeah we still restart php-fpm on baremetal hosts [14:17:04] I guess because PHP 7.4 still gets some opcache corruption [14:17:07] or to clear some cache [14:17:09] or whatever [14:17:14] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9794867 (10Jclark-ctr) [14:17:18] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9794868 (10Jclark-ctr) @Volans i also see this as a learning opportunity most of these are just logs. Some dcops members are very light on linux and we could be expanding knowledge and cou... [14:17:27] yeah, I remember we disabled automatically rereading PHP files based on mtime or something like that [14:17:32] it just took longer than I expected [14:17:45] and there is some stuff being off by one [14:18:08] like a class magically changing from, say, Vector2022 to Uector2022 [14:18:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw2428 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:18:44] I don't think we ever tried to reproduce the issue or investigated the root cause [14:18:53] given restarting / clearing the cache fixes it [14:19:25] I thought our best guess was cosmic rays [14:19:28] but I might be imagining that [14:19:52] are there more things to deploy for the train or could I do some more backports now? 
[14:20:06] (also happy to wait if you want to verify first whether a rollback is needed or not) [14:20:10] (03PS1) 10Filippo Giunchedi: utils: use HEAD for get_config7.sh [puppet] - 10https://gerrit.wikimedia.org/r/1031462 [14:20:10] (03PS1) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463 [14:20:10] (03PS1) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 [14:20:11] (03PS1) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465 [14:20:46] (03CR) 10CI reject: [V:04-1] zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [14:20:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw1425 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:21:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2163.codfw.wmnet [14:23:14] (03PS2) 10Filippo Giunchedi: utils: use HEAD for get_config7.sh [puppet] - 10https://gerrit.wikimedia.org/r/1031462 [14:23:14] (03PS2) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463 [14:23:14] (03PS2) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 [14:23:14] (03PS2) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465 [14:23:55] RECOVERY - Check whether ferm is active by checking the default input chain on mw1451 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:24:06] !log depool cp4049 [14:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:35] RECOVERY - Check whether ferm is active by checking the default input 
chain on mw1382 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:25:46] (03PS1) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861) [14:27:46] (03PS3) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463 [14:27:46] (03PS3) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 [14:27:46] (03PS3) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465 [14:28:45] !log repool cp4049 [14:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:51] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:31] RECOVERY - Check whether ferm is active by checking the default input chain on parse1018 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:31:27] !log depool cp4049 [14:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:02] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:02] FIRING: JobUnavailable: Reduced availability for job lvs_realserver in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[14:33:55] !log repool cp4049 [14:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:18] lvs_realserver in ops@ulsfo is a side effect of me reverting the IPIP encapsulation change on upload@ulsfo [14:35:22] (03CR) 10Muehlenhoff: "Or we simply remove the option? All Kafka brokers use PKI these days and given that the variable was misnamed that also shows that no clou" [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi) [14:35:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2165.codfw.wmnet [14:36:09] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:36:47] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2436/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [14:37:51] (03PS1) 10Muehlenhoff: Switch db2165 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031489 (https://phabricator.wikimedia.org/T349619) [14:38:02] FIRING: [3x] JobUnavailable: Reduced availability for job lvs_realserver in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:09] !log installing dav1d security updates [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:02] (03CR) 10Muehlenhoff: [C:03+2] Switch db2165 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031489 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:39:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9794965 (10Marostegui) @btullis 
what is the status of this? I can see the host is up, but not yet provisioned? ` root@an-redacteddb1001:~# df -hT /srv Filesystem Type Size Used A... [14:39:50] well train looks fine this time :] [14:39:52] andre: ^ [14:40:13] (03CR) 10Bking: [C:03+1] cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse) [14:40:34] (03CR) 10Herron: [C:03+1] pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi) [14:41:06] Lucas_WMDE: train looks fine, so if you want to do backport you can do them now! [14:41:08] thanks :) [14:41:24] ok, thanks! [14:41:25] (03PS1) 10Muehlenhoff: Add library hint for dav1d [puppet] - 10https://gerrit.wikimedia.org/r/1031490 [14:41:41] jouncebot: next [14:41:41] In 0 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1500) [14:41:50] but 18 minutes is a bit short for non-config CI, I think [14:41:58] I’ll wait for the window to start and see if anyone’s using it [14:42:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9794972 (10Jclark-ctr) kafka-main1010 Rack: E 5 U 26 Cableid : 2013339101771 Port : 6 [14:43:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2165.codfw.wmnet [14:44:39] (03PS1) 10Vgutierrez: prometheus::ops: Filter lvs_realserver_clamper by enabled parameters [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) [14:45:09] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for dav1d [puppet] - 10https://gerrit.wikimedia.org/r/1031490 (owner: 10Muehlenhoff) [14:45:43] (03CR) 10Dzahn: [C:03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1027052 (https://phabricator.wikimedia.org/T364494) (owner: 10Dzahn) [14:46:41] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364810#9795004 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted. all in C6 up now. [14:47:31] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 224 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:48:02] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw2428 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:48:43] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2437/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:49:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2166.codfw.wmnet [14:50:37] (03PS1) 10Herron: pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) [14:50:45] 10ops-codfw, 06SRE: InterfaceSpeedError - https://phabricator.wikimedia.org/T364863#9795023 (10Jhancock.wm) the cable or the 1G SFP might need to be replaced. can we downtime the server for a small window to test the cabling? 
[14:51:33] (03CR) 10Ssingh: [C:03+1] prometheus::ops: Filter lvs_realserver_clamper by enabled parameters [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:51:58] (03PS2) 10Herron: pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) [14:52:17] (03CR) 10Vgutierrez: [V:03+1 C:03+2] "Thanks for the review sukhe!" [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:53:23] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2438/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [14:55:05] !loh installing openjdk-17/jetty9 security updates [14:56:41] (03PS1) 10Muehlenhoff: Switch db2166 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031493 (https://phabricator.wikimedia.org/T349619) [14:57:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 40 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:57:52] moritzm: it didn't log due to typo. 
thanks for the group approval [14:57:58] (03CR) 10Filippo Giunchedi: "I audited instance-puppet.git and the variable is mistyped there too unfortunately:" [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi) [14:58:02] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:08] (03CR) 10Muehlenhoff: [C:03+2] Switch db2166 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031493 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:58:14] yw :-) [15:00:05] eoghan, jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1500). Please do the needful. [15:00:23] (03PS1) 10David Caro: openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494 [15:01:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2166.codfw.wmnet [15:01:44] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870 (10RobH) 03NEW p:05Triage→03High [15:01:49] (03PS1) 10Jdlrobson: Enable night mode on Vector on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T363814) [15:02:29] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9795094 (10RobH) [15:02:32] (03CR) 10CI reject: [V:04-1] Enable night mode on Vector on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [15:03:14] (03PS2) 10Jdlrobson: Enable night mode on Vector on testwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T363814) [15:03:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2167.codfw.wmnet [15:03:32] (03CR) 10Arturo Borrero Gonzalez: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro) [15:03:47] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge update [15:04:01] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge update [15:04:10] !log brennen@deploy1002 Started deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850 [15:04:14] T364850: Deploy Phabricator/Phorge 2024-05-14 - https://phabricator.wikimedia.org/T364850 [15:04:44] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850 (duration: 00m 33s) [15:04:45] (03PS13) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [15:04:50] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:04:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:05:03] !log brennen@deploy1002 Started deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850 [15:05:33] (03PS1) 10Muehlenhoff: Switch db2167 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031496 (https://phabricator.wikimedia.org/T349619) [15:05:53] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850 (duration: 00m 50s) [15:09:14] (03CR) 10Dzahn: [C:03+1] Filter out addresses handled by gsuite that cannot be removed from VRTS [puppet] - 
10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski) [15:10:49] (03CR) 10Muehlenhoff: [C:03+2] Switch db2167 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031496 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:11:23] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:11:33] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:11:40] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2439/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:12:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:12:12] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:12:23] (03CR) 10Dzahn: [C:03+1] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [15:12:37] RECOVERY - snapshot of s5 in codfw on backupmon1001 is OK: Last snapshot for s5 at codfw (db2201) taken on 2024-05-14 14:16:13 (659 GiB, -0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:13:37] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [15:13:39] !log installing expat security updates [15:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:44] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto 
db1125.eqiad.wmnet [15:13:57] (03CR) 10Scott French: [C:03+1] Add CertProvider to hot reload TLS certs for gRPC service (032 comments) [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:15:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2167.codfw.wmnet [15:15:21] (03PS1) 10Scott French: aqs-http-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) [15:16:08] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [15:16:11] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:16:35] ^ hmm.. i'll tell Papaul [15:16:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2181.codfw.wmnet [15:18:07] (03PS2) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031071 [15:18:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:18:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:18:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T352010)', diff saved to https://phabricator.wikimedia.org/P62387 and previous config saved to /var/cache/conftool/dbconfig/20240514-151838-ladsgroup.json [15:18:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:20:22] (03CR) 10Peter Fischer: [C:03+2] cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse) [15:20:43] (03PS1) 10Muehlenhoff: Switch db2181 to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/1031498 (https://phabricator.wikimedia.org/T349619) [15:21:08] (03Merged) 10jenkins-bot: cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse) [15:21:13] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1031423/2441/gerrit2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1031423 (owner: 10Muehlenhoff) [15:22:50] (03CR) 10Muehlenhoff: [C:03+2] Switch db2181 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031498 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:23:42] mutante: thanks! [15:24:05] (03PS6) 10BCornwall: hieradata: Move acme certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) [15:25:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [15:25:11] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:25:12] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:25:28] (03CR) 10Volans: testing, please ignore (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1031071 (owner: 10BCornwall) [15:25:39] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2442/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [15:26:38] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:26:39] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:26:49] !log pfischer@deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [15:26:50] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:29:46] (03CR) 10Andrea Denisse: [C:03+1] pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [15:32:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2181.codfw.wmnet [15:32:27] jouncebot: nowandnext [15:32:27] For the next 0 hour(s) and 27 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1500) [15:32:27] In 0 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1600) [15:32:51] does anyone mind if I do some backports now? [15:34:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2195.codfw.wmnet [15:35:01] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.52 ms [15:35:18] (03CR) 10Dzahn: [C:03+2] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [15:36:03] I’ll start them now, but they’ll need a while in CI, so you have plenty of time to tell me to cancel the deployment :) [15:36:09] (03CR) 10Vgutierrez: [C:03+1] hieradata: Move acme certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [15:36:13] (03PS1) 10Muehlenhoff: Switch db2195 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031501 (https://phabricator.wikimedia.org/T349619) [15:36:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 
10Jdlrobson) [15:36:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031180 (https://phabricator.wikimedia.org/T364783) (owner: 10Jforrester) [15:36:47] ^ deploying those two backports [15:37:01] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi) [15:37:09] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:37:31] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:38:02] ^ aware that some codfw maintenance is going on [15:38:18] logmsgbot: here :) [15:38:54] (03CR) 10Muehlenhoff: "We can also file a task to move these Kafkak hosts in deployment-prep to PKI as well and then simply remove the option if there's no react" [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi) [15:38:58] (03CR) 10Scott French: "Thank you both in advance for the review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:39:16] (03CR) 10Muehlenhoff: [C:03+2] Switch db2195 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031501 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:40:23] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:27] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [15:40:55] (03CR) 10Dzahn: [C:03+2] Configure Docker builder GC settings for CI [puppet] - 10https://gerrit.wikimedia.org/r/1031045 (https://phabricator.wikimedia.org/T364773) (owner: 10Ahmon Dancy) [15:42:14] (03CR) 10Scott French: "Thank you both for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [15:42:25] (03CR) 10Scott French: [C:03+2] benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [15:42:39] RECOVERY - Host ps1-d2-codfw is UP: PING WARNING - Packet loss = 71%, RTA = 31.13 ms [15:42:43] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [15:42:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2195.codfw.wmnet [15:43:35] (03Merged) 10jenkins-bot: benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [15:44:34] (03CR) 10EoghanGaffney: [C:03+2] Filter out addresses handled by gsuite that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski) [15:47:53] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2005.codfw.wmnet [15:47:53] !log jayme@cumin1002 conftool action : set/weight=10; selector: name=kubestagemaster2005.codfw.wmnet [15:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:55] Lucas_WMDE: sorry that ping was meant for you not logmsgbot^ [15:49:02] mutante: thanks for merging the patches from the puppet window :) jhathaway and I have a conflicting meeting so I was going to get an early start, pleasant surprise to see them already done [15:49:16] Jdlrobson: ah, thanks, I missed that ^^ [15:49:32] 
rzl: is it okay if my deploy runs a bit into your window then? [15:49:44] Lucas_WMDE: yep, mine'll be a no-op [15:49:45] (Zuul still predicts 8 mins ETA before CI is even done) [15:49:47] yay [15:49:57] rzl: you're welcome. well, for me it was like that I was looking at merging those and it had first no relation to the window :) [15:50:11] only then noticed they are the same ones, heh [15:50:30] (cc dancy, no need to do anything in the window but feel free to grab us if you need a rollback or followup or anything) [15:56:54] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [15:57:16] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi) [15:57:25] (03PS4) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 [15:57:42] (03Merged) 10jenkins-bot: Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson) [15:57:45] (03Merged) 10jenkins-bot: Convert function to arrow function to fix context [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031180 (https://phabricator.wikimedia.org/T364783) (owner: 10Jforrester) [15:58:12] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi) [15:58:16] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1030984|Add notheme class to Echo (T363779)]], [[gerrit:1031180|Convert function to arrow function to fix context (T364783)]] [15:58:23] T363779: [Bug] Echo not compatible with desktop night theme - https://phabricator.wikimedia.org/T363779 [15:58:24] 
T364783: Large amount of errors in animateToolbarIntoView function in VisualEditor - https://phabricator.wikimedia.org/T364783 [16:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:01:29] (03CR) 10Cathal Mooney: [C:03+2] Support VM BGP automation using Netbox flag for L3 POPs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) (owner: 10Cathal Mooney) [16:01:32] hm, scap tells me that something failed [16:02:05] it’s expecting https://totoro.wikimedia.org/wiki/Main_Page to redirect to https://foundation.wikimedia.org/wiki/Main_Page on mwdebug2002 [16:02:33] (03CR) 10Herron: [V:03+1 C:03+2] pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [16:02:40] I can’t even resolve that host o_O [16:03:06] what.. 
I look quite a bit at DNS and never seen that [16:03:21] (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro) [16:03:56] https://gerrit.wikimedia.org/g/operations/puppet/+/08b0b935c4578b12fdadc6f1bc13df0adc207c2e/modules/profile/files/httpbb/appserver/test_wikimania_wikimedia.yaml#46 [16:04:04] * Lucas_WMDE peeks at blame [16:04:35] it's an NXDOMAIN for toroto [16:04:38] *totoro [16:04:51] (03PS2) 10David Caro: openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494 [16:04:59] I’m guessing the scap check isn’t even meant to use DNS [16:04:59] (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro) [16:05:04] let me retry the check and see what happens… [16:05:11] it's not appearing in "git log" in DNS repo either [16:05:33] !log lucaswerkmeister-wmde@deploy1002 jdlrobson and jforrester and lucaswerkmeister-wmde: Backport for [[gerrit:1030984|Add notheme class to Echo (T363779)]], [[gerrit:1031180|Convert function to arrow function to fix context (T364783)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:05:35] (and it seemingly didn’t use DNS either, as it actually got a 503 status code) [16:05:39] T363779: [Bug] Echo not compatible with desktop night theme - https://phabricator.wikimedia.org/T363779 [16:05:39] T364783: Large amount of errors in animateToolbarIntoView function in VisualEditor - https://phabricator.wikimedia.org/T364783 [16:05:43] that still doesn’t explain why we even check for this bizarre host name [16:05:51] but anyway – looks like the recheck worked 🤷 [16:05:54] Jdlrobson: can you test the changes? [16:06:10] rzl: ever heard of a "totoro.wikimedia.org" vhost on appservers? 
[16:06:20] Lucas_WMDE: yep [16:06:22] in a meeting, back to you in a bit [16:06:32] no rush [16:06:37] (test was seemingly introduced in https://gerrit.wikimedia.org/r/c/operations/puppet/+/444908/3/modules/profile/files/mediawiki/web_testing/tests/test_wikimania_wikimedia FWIW) [16:06:44] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified wmf homer plugin - cmooney@cumin1002 - T364480 [16:06:48] T364480: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480 [16:07:03] Lucas_WMDE: in .. 2018 ?:) [16:07:18] yeah [16:07:25] I think the scap failure was just a flake [16:07:26] but scap only complains now? odd [16:07:31] but I am now curious what the test even means [16:07:41] and whether we can just remove it [16:07:47] Lucas_WMDE: it's a test for https://wikitech.wikimedia.org/wiki/Httpbb [16:07:49] (03PS1) 10David Caro: cirrus_streaming_updater_cloudelastic: fix missing job_name [alerts] - 10https://gerrit.wikimedia.org/r/1031503 [16:07:51] but apparently the redirect must exist somewhere, if the check worked after a retry [16:07:58] scap runs `httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout` [16:08:00] notifications = good to sync. 
[16:08:23] Arrow function => good to sync Lucas_WMDE [16:08:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified wmf homer plugin - cmooney@cumin1002 - T364480 [16:08:27] ack [16:08:29] !log lucaswerkmeister-wmde@deploy1002 jdlrobson and jforrester and lucaswerkmeister-wmde: Continuing with sync [16:08:36] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480#9795483 (10ops-monitoring-bot) Deployed homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified... [16:08:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795486 (10Papaul) [16:08:54] (03CR) 10CI reject: [V:04-1] cirrus_streaming_updater_cloudelastic: fix missing job_name [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro) [16:09:04] (03CR) 10David Caro: "Adding you as reviewer as you added those tests :), feel free to direct me to someone else." 
[alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro) [16:09:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795489 (10Papaul) [16:09:54] (03CR) 10Cathal Mooney: [C:03+2] Increase timeout for Netbox Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1029226 (owner: 10Cathal Mooney) [16:10:15] (03CR) 10David Caro: "Oh, wait, they seem to only fail in my local, maybe it's the version of promtool/pint" [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro) [16:10:23] (03Merged) 10jenkins-bot: Increase timeout for Netbox Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1029226 (owner: 10Cathal Mooney) [16:10:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795496 (10Papaul) [16:11:52] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:12:18] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:12:37] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [16:12:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [16:13:24] (03CR) 10David Caro: "Yep, promtool 2.45.0 fails, 2.52 works 🎉" [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro) [16:13:34] (03Abandoned) 10David Caro: cirrus_streaming_updater_cloudelastic: fix missing job_name [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro) [16:13:44] mutante: Scap does a retry/continue/exit interaction loop around the testserver and canary checks as of version 4.70.0 (07 Mar 2024). 
[16:14:14] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:14:19] (if a tty is available) [16:14:40] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [16:14:45] dancy: soo.. I ran the same httpbb command that scap runs and that like.. always PASSes [16:14:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [16:14:56] dancy: it even passes with "cookiemonster.wikimedia.org" [16:15:03] Nod. I have nothing to say about that. [16:16:30] it also passes when I use --hosts=mwdebug1002.eqiad.wmnet instead of the discovery service name and drop the port [16:16:37] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 250.21 ms [16:16:39] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 244.41 ms [16:17:02] Lucas_WMDE: Can you provide the full transcript (ideally in phab ticket) ? 
[16:17:27] sure [16:17:51] thx [16:20:23] dancy (and mutante, rzl, sukhe if interested): https://phabricator.wikimedia.org/T364880 [16:20:28] absolutely no idea which tags to put on it [16:20:57] (03PS1) 10Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) [16:20:59] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1030984|Add notheme class to Echo (T363779)]], [[gerrit:1031180|Convert function to arrow function to fix context (T364783)]] (duration: 22m 43s) [16:21:04] T363779: [Bug] Echo not compatible with desktop night theme - https://phabricator.wikimedia.org/T363779 [16:21:05] T364783: Large amount of errors in animateToolbarIntoView function in VisualEditor - https://phabricator.wikimedia.org/T364783 [16:21:12] slaps "SRE" on it to start with [16:21:13] Jdlrobson, James_F: should be deployed now [16:21:18] <3 [16:21:39] 06SRE, 13Patch-For-Review: Failed scap check for totoro.wikimedia.org during deployment - https://phabricator.wikimedia.org/T364880#9795759 (10Dzahn) [16:22:02] 06SRE, 13Patch-For-Review: Failed scap check for totoro.wikimedia.org during deployment - https://phabricator.wikimedia.org/T364880#9795784 (10Lucas_Werkmeister_WMDE) [16:23:01] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:02] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:23:03] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:23:08] (03CR) 10Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE)) [16:23:14] 
Mystery solved. Nice work Lucas [16:23:16] * Lucas_WMDE done deploying btw [16:23:47] 06SRE, 10Scap, 13Patch-For-Review: Failed scap check for totoro.wikimedia.org during deployment - https://phabricator.wikimedia.org/T364880#9795831 (10dancy) [16:24:28] (03PS1) 10JMeybohm: kubernetes::master: Retry kube-publish-sa-certs 5 times [puppet] - 10https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) [16:27:04] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2443/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [16:27:39] 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during sacp deployment - https://phabricator.wikimedia.org/T364880#9795868 (10Dzahn) scap runs `httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout` This... [16:27:48] 06SRE, 06serviceops, 06Traffic-Icebox, 06Trust and Safety Product Team: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933#9795878 (10TAdeleye_WMF) [16:28:52] 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during sacp deployment - https://phabricator.wikimedia.org/T364880#9795900 (10dancy) [16:30:19] 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9795958 (10dancy) [16:30:57] 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9795971 (10Lucas_Werkmeister_WMDE) > ^ It seems wrong that this doesn't fail. I’m not sure why it should fail? It seem to match the behavior I can see... [16:32:24] dancy: Lucas_WMDE: confirmed. 
check passes whatever the virtual host is, as long as it's whatever.wikimedia.org and the path stays: /wiki/Main_Page . as soon as the path changes .. then it starts behaving as expected [16:33:15] Main_Page isn't in the rewrite rules directly though [16:34:39] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:37:30] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1006 - vriley@cumin1002" [16:38:05] (03CR) 10Andrew Bogott: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro) [16:38:27] unfortunately I don’t see the error that scap got in logstash [16:38:37] nothing around the right time in host:mwdebug2002 AFAICT [16:38:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1006 - vriley@cumin1002" [16:38:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:44] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: name=mw2286.codfw.wmnet [16:39:01] (unfortunately scap didn’t print the response body nor any other response headers, so there’s not much to go on…) [16:39:11] 06SRE, 10Scap, 06serviceops-radar, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9796100 (10Dzahn) Lucas is right. I can confirm the test passes with any wikimedia.org subdomain as long as the path stays /wiki/M... 
[16:39:17] !log depooled mw2286.codfw.wmnet because of interface error / needed cable replacement T364863 [16:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:20] T364863: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863 [16:39:25] 06SRE, 10Scap, 06serviceops-radar, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9796101 (10Dzahn) [16:39:37] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:40:46] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw2286.codfw.wmnet with reason: T364863 [16:41:00] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw2286.codfw.wmnet with reason: T364863 [16:41:10] mutante/Lucas_WMDE: Please file bugs against httpbb if you want changes [16:41:13] 10ops-codfw, 06SRE, 06serviceops: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863#9796105 (10Dzahn) [16:41:53] hmm. I wonder if it already takes a flag to be more spammy [16:42:28] Not that I see. [16:43:09] me neither [16:44:05] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:44:30] (03CR) 10BCornwall: [V:03+1 C:03+2] "Looks like CI is failing due to an unrelated puppet repo issue. I'll rebase/rerun later." [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [16:44:47] (03CR) 10BCornwall: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [16:44:57] 10ops-codfw, 06SRE, 06serviceops: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863#9796143 (10Dzahn) @Jhancock.wm cc: @RLazarus I depooled the server and set a downtime of 24 hours. 
[16:46:37] dancy: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/httpbb/+/a84d0d2703cfad340e0e479dc42582efd4ce893b/httpbb/main.py#162 and the following lines don’t look like the response is logged anywhere in general, only the parts that are relevant to the failed test [16:46:40] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:46:57] should I make a separate task for that? that it should drop details in a file or something? [16:48:03] (03CR) 10BCornwall: testing, please ignore (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1031071 (owner: 10BCornwall) [16:48:23] (03CR) 10Dzahn: [C:03+2] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [16:48:48] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1007 - vriley@cumin1002" [16:48:51] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 246.77 ms [16:48:53] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 230.65 ms [16:49:40] (03CR) 10Dzahn: [C:03+2] "Notice: /Stage[main]/Gitlab/Systemd::Service[gitlab-exporter]/Service[gitlab-exporter]/ensure: ensure changed 'stopped' to 'running'" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [16:49:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1007 - vriley@cumin1002" [16:49:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:11] Lucas_WMDE: Yeah, a separate ticket as a subtask of T364880 would be good. I think an option to print error details to stdout is sufficient. 
[16:50:12] T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880 [16:50:29] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1007.mgmt.eqiad.wmnet with reboot policy FORCED [16:51:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9796178 (10VRiley-WMF) [16:51:36] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:54:16] dancy: ok, filed https://phabricator.wikimedia.org/T364886 [16:54:16] 06SRE, 10Scap, 06serviceops-radar: httpbb should show more information / details about failed checks - https://phabricator.wikimedia.org/T364886 (10Lucas_Werkmeister_WMDE) 03NEW [16:55:16] Thanks! [16:55:40] thanks Lucas_WMDE! [16:55:54] (for the previous one as well) [16:56:03] (03CR) 10Jsn.sherman: Dont recalculate winners from scratch each round (031 comment) [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: 10Driedmueller) [16:56:16] mutante: Thanks for the merges! [16:56:26] (03CR) 10Dzahn: [C:03+1] "Antoine, any concerns?" [puppet] - 10https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) (owner: 10Addshore) [16:57:13] dancy: yw! it was mostly unrelated to the window [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1700) [17:00:09] !log ryankemper@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons.
[17:02:33] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1007.mgmt.eqiad.wmnet with reboot policy FORCED [17:05:48] (03PS1) 10Santiago Faci: Bumping mpic version: v.0.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031517 (https://phabricator.wikimedia.org/T364170) [17:06:56] (03CR) 10Clare Ming: [C:03+2] Bumping mpic version: v.0.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031517 (https://phabricator.wikimedia.org/T364170) (owner: 10Santiago Faci) [17:08:05] (03Merged) 10jenkins-bot: Bumping mpic version: v.0.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031517 (https://phabricator.wikimedia.org/T364170) (owner: 10Santiago Faci) [17:08:51] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:37] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. [17:11:10] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [17:11:31] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [17:12:00] !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[17:15:17] (03CR) 10Scott French: "Just to make sure I understand the motivation: does systemd-sysv-generator not work on bookworm, or are you doing this in advance of its d" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [17:16:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9796326 (10Papaul) [17:18:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [17:19:33] !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [17:24:20] 06SRE, 06Infrastructure-Foundations, 10netops: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480#9796348 (10cmooney) 05Open→03Resolved Patch to Homer wmf plugin merged now, so BGP to VMs at POPs / on L3 switches now under automation too. [17:25:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [17:27:14] !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [17:33:35] (03CR) 10Scott French: [C:03+1] "Sounds reasonable to me, though I don't have a good understanding of how long it might take for the local etcd node to become ready in thi" [puppet] - 10https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [17:33:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[17:38:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9796396 (10MMiller_WMF) I approve! [17:39:54] (03PS1) 10DCausse: cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837) [17:40:44] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891 (10RobH) 03NEW [17:41:16] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9796419 (10RobH) [17:41:35] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9796420 (10RobH) [17:42:04] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9796422 (10RobH) [17:43:30] (03PS2) 10DCausse: cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837) [17:45:01] (03PS3) 10Jdlrobson: Enable night mode on Vector on testwiki, disable on Special:Homepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699) [17:47:21] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs6003 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [17:47:47] (03PS2) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [17:47:51] (03CR) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:52:49] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs7003 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [17:53:21] ^ we are looking into this, we see what has changed so figuring out a reversal [17:55:06] (03PS1) 10Herron: pyrra: linkrecommendation: onboard slo from grizzly [puppet] - 10https://gerrit.wikimedia.org/r/1031527 (https://phabricator.wikimedia.org/T302995) [18:01:27] (03PS1) 10Ssingh: Revert "hiera: Enable IPIP on upload and upload-https services" [puppet] - 10https://gerrit.wikimedia.org/r/1031470 [18:04:00] (03CR) 10Ssingh: [C:03+2] Revert "hiera: Enable IPIP on upload and upload-https services" [puppet] - 10https://gerrit.wikimedia.org/r/1031470 (owner: 10Ssingh) [18:07:35] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs4010 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:08:19] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:08:30] (03PS2) 10Herron: pyrra: linkrecommendation: onboard slo from grizzly [puppet] - 10https://gerrit.wikimedia.org/r/1031527 (https://phabricator.wikimedia.org/T302995) [18:10:13] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:10:52] !log sudo cumin -b1 -s120 'A:lvs' 'systemctl restart pybal.service': clearing up alert for reverted pybal.conf CR 1031470 [18:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:37] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs3010 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:14:38] !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@6270c72]: (no justification provided) [18:15:12] !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@6270c72]: (no justification provided) (duration: 00m 34s) [18:17:46] !log [CORRECTION] above pybal restart was NOT run [18:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:00] !log restart pybal on backup LVSes [18:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:19] now I realize it should have been [below]. 
oh well [18:18:55] (03Abandoned) 10Herron: pyrra-filesystem: increase StartLimits and delay notified unit [puppet] - 10https://gerrit.wikimedia.org/r/1031050 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron) [18:22:11] (03CR) 10JMeybohm: [C:03+1] flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [18:22:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:22:45] (03PS1) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) [18:23:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:24:33] RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2198) taken on 2024-05-14 17:24:03 (1244 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [18:26:52] (03PS3) 10Scott French: conftool-data: bootstrap parser-cache sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) [18:33:07] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs3010 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:33:07] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs6003 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:33:07] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs7003 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:33:14] all should recover now [18:46:21] (03PS7) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786) [18:47:18] (03Abandoned) 10Scott French: WIP: etcd.php: ignore pc sections in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030496 (owner: 10Scott French) [18:48:35] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs4010 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:52:45] (03PS1) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031541 (https://phabricator.wikimedia.org/T363861) [18:53:42] (03PS2) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861) [18:54:00] (03Abandoned) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031541 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson) [18:54:38] (03PS1) 10CDanis: otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407) [19:07:30] (03CR) 10Krinkle: varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [19:11:24] (03CR) 
10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031175 (https://phabricator.wikimedia.org/T363904) (owner: 10Wargo) [19:13:56] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:14:06] (03CR) 10Krinkle: db-production: Generate sectionsByDB on the fly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe) [19:16:02] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1008 - vriley@cumin1002" [19:16:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1008 - vriley@cumin1002" [19:16:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:22] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:18:13] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1008.mgmt.eqiad.wmnet with reboot policy FORCED [19:18:27] !log T364907 💔cdanis@apt1002.wikimedia.org ~ 🕞🍵 sudo -i reprepro --keepunreferencedfiles includedeb bullseye-wikimedia ~/otelcol-contrib_0.100.0_linux_amd64.deb [19:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:31] T364907: upgrade to latest stable version of otelcol-contrib - https://phabricator.wikimedia.org/T364907 [19:18:53] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [19:20:29] (03PS1) 10Ryan Kemper: CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 [19:21:10] (03PS1) 10Jclark-ctr: add kafka-main[12]006-10 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1031544 (https://phabricator.wikimedia.org/T363212) [19:21:34] (03CR) 10CI reject: 
[V:04-1] CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 (owner: 10Ryan Kemper) [19:21:58] (03CR) 10Jclark-ctr: [C:03+2] add kafka-main[12]006-10 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1031544 (https://phabricator.wikimedia.org/T363212) (owner: 10Jclark-ctr) [19:23:57] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main1006'] [19:24:39] (03PS2) 10Ryan Kemper: CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 [19:25:13] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kafka-main1006'] [19:26:32] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye [19:26:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9796969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye [19:30:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1008.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:33] (03PS1) 10CDanis: otelcol: bump to v0.100.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1031546 (https://phabricator.wikimedia.org/T364907) [19:32:31] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:32:52] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:32:53] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:33:23] FIRING: ProbeDown: Service moscovium:443 has failed 
probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:43] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:34:21] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Sat 08 Jun 2024 03:25:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:37:22] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1010 - vriley@cumin1002" [19:38:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kafka-main1010 - vriley@cumin1002" [19:38:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:38:23] RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:18] (03CR) 10CDanis: [V:03+2 C:03+2] otelcol: bump to v0.100.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1031546 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis) [19:39:28] (03CR) 10Hashar: [C:03+1] Clarify totoro.wikimedia.org test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE)) [19:39:35] !log vriley@cumin1002 START - Cookbook 
sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [19:41:37] (03PS4) 10Jdlrobson: Enable night mode on Vector on testwiki, disable on Special:Homepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699) [19:41:49] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage [19:45:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage [19:46:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:20] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:32] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:55] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [19:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:46] (03PS2) 10CDanis: otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407) [19:50:46] (03PS1) 10CDanis: otelcol: do service name transform first of all [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031551 (https://phabricator.wikimedia.org/T363407) [19:51:16] (03CR) 10CDanis: [C:03+2] otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407) (owner: 
10CDanis) [19:51:23] (03CR) 10CDanis: [C:03+2] otelcol: do service name transform first of all [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031551 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:51:29] (03CR) 10Ladsgroup: "hmm, the svg files, specially the mediawiki ones are way too big." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [19:52:10] (03Merged) 10jenkins-bot: otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:52:13] (03Merged) 10jenkins-bot: otelcol: do service name transform first of all [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031551 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:53:10] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [19:53:31] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T2000). [20:00:04] ebernhardson and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:47] hi. 
I don't have a patch listed, but was wondering if I could debug two patches on an mwdebug host during this time slot per https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debug_via_Gerrit_and_Scap [20:01:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:01:25] (03PS1) 10CDanis: otelcol: bump version to v0.100.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031554 (https://phabricator.wikimedia.org/T363407) [20:02:13] o/ [20:02:15] i can deploy [20:02:29] o/ [20:02:55] kostajh: that sounds fine to me - guessing it won't interfere much with deploying the other patches? [20:03:14] (03CR) 10CDanis: [C:03+2] otelcol: bump version to v0.100.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031554 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [20:03:49] Jdlrobson: i'll start with yours unless ebernhardson is around? [20:04:07] (03CR) 10Clare Ming: [C:03+2] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson) [20:04:09] (03Merged) 10jenkins-bot: otelcol: bump version to v0.100.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031554 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [20:04:24] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [20:04:33] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [20:04:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699) (owner: 10Jdlrobson) [20:05:08] \o [20:05:20] (03Merged) 10jenkins-bot: Enable night mode on Vector on testwiki, disable on Special:Homepage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699) (owner: 10Jdlrobson) [20:05:34] hi ebernhardson: hope it's ok i started with Jon's patches -- i'll do yours imminently [20:05:49] cjming: yea no worries [20:05:54] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1031495|Enable night mode on Vector on testwiki, disable on Special:Homepage (T357699 T363814)]] [20:05:59] T357699: Prepare Special:Homepage for night mode - https://phabricator.wikimedia.org/T357699 [20:05:59] T363814: Release dark mode as a beta feature on desktop (May 15th) - https://phabricator.wikimedia.org/T363814 [20:06:37] !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [20:06:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED [20:06:54] !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [20:07:50] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [20:08:00] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [20:08:34] !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:1031495|Enable night mode on Vector on testwiki, disable on Special:Homepage (T357699 T363814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:55] Jdlrobson: can you check your 2nd patch? 
still waiting for 1st patch to merge [20:09:26] cjming: I'm not sure, I am not super familiar with the process [20:09:56] cjming: on it [20:10:14] kostajh: in theory, we should be done in about 15-20 minutes if all goes zippy [20:10:58] ok [20:11:29] cjming: lgtm please sync but i might need a follow up [20:11:43] ok [20:11:45] !log cjming@deploy1002 jdlrobson and cjming: Continuing with sync [20:13:23] kostajh: just reading that wikitech page -- i think we have to stagger our scaps -- aiui scap cmds need to be consecutive - can't be run in parallel [20:14:05] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@ecf603d]: update discolytics to 0.18.0 [20:14:05] (03PS5) 10Ebernhardson: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) [20:14:32] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@ecf603d]: update discolytics to 0.18.0 (duration: 00m 27s) [20:14:37] (03PS1) 10Peter Fischer: Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 [20:15:02] cjming: yeah, I can wait until the end [20:15:10] or I might pick this up again tomorrow when I'm more awake [20:15:41] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797204 (10Ladsgroup) a:03Ladsgroup [20:16:05] (03CR) 10Ebernhardson: [C:03+1] Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 (owner: 10Peter Fischer) [20:16:20] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797205 (10Ladsgroup) We will go with the name wikipedia-pl-admins to be consistent with other wikis. Hope that's fine with you. 
[20:16:54] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797209 (10Msz2001) Sure, no problem [20:19:03] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797224 (10Ladsgroup) 05Open→03Resolved {{done}} https://lists.wikimedia.org/postorius/lists/wikipedia-pl-admins.lists.wikimedia.org/ [20:20:28] cjming: two things I don't know how to do from reading https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debug_via_Gerrit_and_Scap -- exact steps to follow to cherry pick the patches, and what does "clean up the deployment server" mean? [20:20:40] it's in bold, so it sounds important :) [20:21:55] kostajh: `scap pull` is what cleans it up, but it's intended from mwdebug*, not mwmaint* [20:22:42] kostajh: i think it means just run `scap pull` on the debug server [20:22:47] i think the idea here is put patch on deploy server, pull it over to debug host, do testing, remove patch from deploy server, pull again [20:23:29] oh, i guess the highlighted bit is about fixing the git repo to match what it was before [20:24:34] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1031495|Enable night mode on Vector on testwiki, disable on Special:Homepage (T357699 T363814)]] (duration: 18m 40s) [20:24:39] T357699: Prepare Special:Homepage for night mode - https://phabricator.wikimedia.org/T357699 [20:24:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [20:24:39] T363814: Release dark mode as a beta feature on desktop (May 15th) - https://phabricator.wikimedia.org/T363814 [20:25:18] ebernhardson: doing yours next while waiting for Jon's other patch to merge - just go ahead and sync when the time comes? 
[20:25:52] (03Merged) 10jenkins-bot: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [20:25:57] Jdlrobson: 2nd patch should be live [20:26:22] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1031029|cirrus: Shift 25% of public wikis writes in eqiad to replacement updater (T363475)]] [20:26:26] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:26:44] PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:26:59] (03CR) 10Ebernhardson: [C:03+2] Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 (owner: 10Peter Fischer) [20:27:22] cjming: yea, we can only see it from the jobqueue [20:27:28] cjming: go ahead and push when ready [20:27:29] (03Merged) 10jenkins-bot: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson) [20:27:41] alrighty [20:27:50] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921 (10Eevans) 03NEW [20:27:51] (03Merged) 10jenkins-bot: Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 (owner: 10Peter Fischer) [20:28:02] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:28:08] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:28:16] (03CR) 10CDanis: [C:03+1] jaeger: 
update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [20:28:21] (03CR) 10CDanis: [C:03+1] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [20:28:25] (03CR) 10CDanis: [C:03+1] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi) [20:28:58] !log cjming@deploy1002 cjming and ebernhardson: Backport for [[gerrit:1031029|cirrus: Shift 25% of public wikis writes in eqiad to replacement updater (T363475)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:03] !log cjming@deploy1002 cjming and ebernhardson: Continuing with sync [20:31:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [20:34:34] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.43 ms [20:36:16] cjming: ready to test? 
[20:36:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [20:37:05] cjming: i suspect we messed this up, the log spam isn't resolving as i expected. might need a revert, but checking a few more things [20:37:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:38:10] Jdlrobson: almost - your backport was still merging so i snuck in Erik's patch [20:38:35] cjming: oh never mind, the logs did die down as expected. just took another minute [20:38:42] everything looks reasonable here [20:38:59] nice! 
[20:39:46] ok cjming np [20:41:25] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1031029|cirrus: Shift 25% of public wikis writes in eqiad to replacement updater (T363475)]] (duration: 15m 02s) [20:41:29] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:42:01] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1031466|Override VE overlays in night-mode (T363861)]] [20:42:05] T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861 [20:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:42:38] (03CR) 10Dzahn: [V:03+1 C:03+2] "just creates an empty directory we can start rsyncing to - will follow-up with rsync::quickdatacopy https://puppet-compiler.wmflabs.org/o" [puppet] - 10https://gerrit.wikimedia.org/r/1022193 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [20:44:19] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:44:43] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1031466|Override VE overlays in night-mode (T363861)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:44:43] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:44:56] Jdlrobson: 1st patch ready for testing [20:45:00] (03PS1) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031476 [20:45:20] (03Abandoned) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031476 (owner: 10BCornwall) [20:46:41] cjming: looking now [20:47:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9797373 
(10VRiley-WMF) [20:47:20] cjming: so backports to deploy branches are taking 40 mins these days? [20:47:45] Jdlrobson: merging takes like 20+ minutes [20:47:59] cjming: please sync that one [20:48:05] !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync [20:48:22] I have a follow up to my config flag but it's not going to fit into the last 10 mins of the window so I guess I need to schedule it tomorrow? [20:49:55] kostajh: not sure if you're still around - do you still want to use the remaining time in this window for your testing? sorry it all took a bit longer than i was hoping - should be done in another minute [20:49:56] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [20:52:40] Jdlrobson: if Kosta is N/A i'm happy to squeeze in one more config patch [20:54:16] (03PS1) 10Jdlrobson: [Follow-up] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) [20:56:06] (03PS1) 10Jdlrobson: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) [20:56:17] (03PS1) 10Jdlrobson: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) [20:56:44] RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:58:54] (03PS1) 10Jdlrobson: Enable night mode as a desktop beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814) [21:00:45] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1031466|Override VE overlays in night-mode (T363861)]] (duration: 18m 44s) [21:00:48] 
Jdlrobson: backport is live [21:00:50] T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861 [21:02:01] (03PS1) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [21:02:23] (03CR) 10CI reject: [V:04-1] stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:03:59] (03PS2) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [21:04:27] (03CR) 10Dzahn: [V:03+1 C:03+2] "followed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031565" [puppet] - 10https://gerrit.wikimedia.org/r/1022193 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:11:22] (03CR) 10Dzahn: [V:04-1 C:04-1] "https://puppet-compiler.wmflabs.org/output/1031565/2446/stewards2001.codfw.wmnet/change.stewards2001.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:12:53] (03CR) 10Dzahn: [V:04-1 C:04-1] "arr.. 
we would first have to move the definition of the lists server primary host to common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:13:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:38] PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 9360 MB (2% inode=99%): /tmp 9360 MB (2% inode=99%): /var/tmp 9360 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops [21:16:20] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797524 (10Scott_French) Thanks, @Eevans. If you can drive development of the new data gateway (i.e., base... [21:18:21] (03CR) 10JHathaway: [C:03+1] "looks good, any reason you dropped the `--`?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1031462 (owner: 10Filippo Giunchedi) [21:19:19] (03CR) 10CI reject: [V:04-1] Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [21:20:08] RECOVERY - MariaDB Replica Lag: s8 on db2154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:20:29] (03PS3) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [21:20:48] (03CR) 10CI reject: [V:04-1] stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [21:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364299)', diff saved to https://phabricator.wikimedia.org/P62390 and previous config saved to /var/cache/conftool/dbconfig/20240514-212052-marostegui.json [21:20:59] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:22:57] (03PS4) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) [21:24:52] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797608 (10Scott_French) [21:35:01] (03Abandoned) 10C. Scott Ananian: Enable ParserMigration extension on commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007394 (owner: 10C. 
Scott Ananian) [21:36:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P62391 and previous config saved to /var/cache/conftool/dbconfig/20240514-213601-marostegui.json [21:36:35] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797656 (10Eevans) [21:47:06] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797672 (10Eevans) [21:48:33] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797670 (10CodeReviewBot) eevans opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/74... [21:51:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P62392 and previous config saved to /var/cache/conftool/dbconfig/20240514-215109-marostegui.json [21:53:35] (03CR) 10Jdlrobson: "recheck" [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson) [21:58:07] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797746 (10CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/74 A... 
[22:02:14] RECOVERY - Host asw-c-codfw is UP: PING WARNING - Packet loss = 75%, RTA = 87.78 ms [22:02:44] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.193.0.18 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [22:06:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364299)', diff saved to https://phabricator.wikimedia.org/P62393 and previous config saved to /var/cache/conftool/dbconfig/20240514-220617-marostegui.json [22:06:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [22:06:23] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:06:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [22:06:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T364299)', diff saved to https://phabricator.wikimedia.org/P62394 and previous config saved to /var/cache/conftool/dbconfig/20240514-220640-marostegui.json [22:08:38] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [22:22:34] !log zabe@mwmaint1002:/tmp/upload$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Yann" . # T364877 [22:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:40] T364877: Server side upload for Yann - https://phabricator.wikimedia.org/T364877 [22:25:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9797806 (10Jclark-ctr) @akosiaris could you please update preseed.yaml file? 
I did take care of site.pp file for codfw and eqiad [22:33:38] PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 2860 MB (0% inode=99%): /tmp 2860 MB (0% inode=99%): /var/tmp 2860 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops [22:34:12] RECOVERY - MediaWiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:39:12] (03PS6) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [22:44:10] (03PS1) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 [22:48:31] !log start running migrateGuSalt.php in screen session # T364435 [22:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:36] T364435: Drop gu_salt from globaluser - https://phabricator.wikimedia.org/T364435 [23:02:27] (03PS2) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 [23:26:33] (03CR) 10Scott French: "Thanks for the follow-up, Amir! 
If I understand correctly, it sounds like you'd recommend something like https://gerrit.wikimedia.org/r/10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [23:30:01] (03CR) 10Scott French: configure parsercache servers via dbconfig in etcd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (owner: 10Scott French) [23:43:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T352010)', diff saved to https://phabricator.wikimedia.org/P62395 and previous config saved to /var/cache/conftool/dbconfig/20240514-234337-ladsgroup.json [23:43:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:58:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P62396 and previous config saved to /var/cache/conftool/dbconfig/20240514-235844-ladsgroup.json