[00:01:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1030559 (owner: TrainBranchBot)
[00:06:18] <wikibugs>	 (PS1) Tim Starling: ext.CodeMirror.visualEditor: don't load on RTL pages [extensions/CodeMirror] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031069 (https://phabricator.wikimedia.org/T363752)
[00:06:54] <wikibugs>	 (CR) Tim Starling: [C:+2] ext.CodeMirror.visualEditor: don't load on RTL pages [extensions/CodeMirror] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031069 (https://phabricator.wikimedia.org/T363752) (owner: Tim Starling)
[00:08:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:10:03] <wikibugs>	 (PS1) Tim Starling: Fix exception when creating an election with the OpenSSL encryption type [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031070 (https://phabricator.wikimedia.org/T209892)
[00:10:14] <wikibugs>	 (CR) Tim Starling: [C:+2] Fix exception when creating an election with the OpenSSL encryption type [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031070 (https://phabricator.wikimedia.org/T209892) (owner: Tim Starling)
[00:15:47] <wikibugs>	 (Merged) jenkins-bot: ext.CodeMirror.visualEditor: don't load on RTL pages [extensions/CodeMirror] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031069 (https://phabricator.wikimedia.org/T363752) (owner: Tim Starling)
[00:15:49] <wikibugs>	 (Merged) jenkins-bot: Fix exception when creating an election with the OpenSSL encryption type [extensions/SecurePoll] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031070 (https://phabricator.wikimedia.org/T209892) (owner: Tim Starling)
[00:19:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[00:19:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[00:19:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T364299)', diff saved to https://phabricator.wikimedia.org/P62372 and previous config saved to /var/cache/conftool/dbconfig/20240514-001956-marostegui.json
[00:20:01] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[00:20:37] <logmsgbot>	 !log tstarling@deploy1002 Started scap: Fix SecurePoll exception T209892 and CodeMirror 5 RTL T363752
[00:20:42] <stashbot>	 T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[00:20:42] <stashbot>	 T363752: CodeMirror shouldn't load in the 2017 editor on RTL pages - https://phabricator.wikimedia.org/T363752
[00:34:52] <wikibugs>	 (PS1) Scott French: DNM: ipiod: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638)
[00:35:34] <logmsgbot>	 !log tstarling@deploy1002 Finished scap: Fix SecurePoll exception T209892 and CodeMirror 5 RTL T363752 (duration: 14m 56s)
[00:35:40] <stashbot>	 T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[00:35:41] <stashbot>	 T363752: CodeMirror shouldn't load in the 2017 editor on RTL pages - https://phabricator.wikimedia.org/T363752
[00:38:16] <wikibugs>	 ops-codfw, SRE, Infrastructure-Foundations, netops: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9792816 (Papaul) Open→Resolved All the old mgmt switches are back in place
[00:41:50] <wikibugs>	 (PS2) Scott French: ipiod: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638)
[00:53:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:07:59] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.5 [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1030560 (https://phabricator.wikimedia.org/T361399)
[01:08:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.5 [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1030560 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[01:13:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:28:28] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.5 [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1030560 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[01:47:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T352010)', diff saved to https://phabricator.wikimedia.org/P62373 and previous config saved to /var/cache/conftool/dbconfig/20240514-014753-ladsgroup.json
[01:48:00] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:53:34] <wikibugs>	 (PS1) BCornwall: testing, please ignore [dns] - https://gerrit.wikimedia.org/r/1031071
[01:55:32] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:55:48] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0200)
[02:02:28] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:02:38] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.318 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P62374 and previous config saved to /var/cache/conftool/dbconfig/20240514-020301-ladsgroup.json
[02:08:23] <wikibugs>	 (PS1) BCornwall: testing, please ignore [dns] - https://gerrit.wikimedia.org/r/1031072
[02:18:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P62375 and previous config saved to /var/cache/conftool/dbconfig/20240514-021809-ladsgroup.json
[02:33:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T352010)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240514-023316-ladsgroup.json
[02:33:24] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:33:37] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:34:25] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:34:49] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:11] <jinxer-wm>	 FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:38:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:49] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:41:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0300)
[03:01:43] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031127 (https://phabricator.wikimedia.org/T361399)
[03:01:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031127 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[03:02:24] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1031127 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[03:02:52] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.5  refs T361399
[03:03:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:02] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:42] <stashbot>	 T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[03:05:11] <jinxer-wm>	 FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:10:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:29:12] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:39:12] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:46:34] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1349 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:47:06] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:47:16] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1023 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:52:06] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) (owner: C. Scott Ananian)
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0400)
[04:00:38] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.5  refs T361399 (duration: 57m 45s)
[04:00:42] <stashbot>	 T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[04:05:11] <jinxer-wm>	 FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:10:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:11:21] <wikibugs>	 (CR) KartikMistry: [C:+2] Update MinT to 2024-03-28-061726-production [deployment-charts] - https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[04:12:04] <kart_>	 Deploying MinT ^^
[04:12:08] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2024-03-28-061726-production [deployment-charts] - https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[04:14:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[04:16:34] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1349 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:17:06] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:17:16] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1023 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[04:18:43] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[04:23:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:25:47] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[04:33:48] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[04:45:11] <jinxer-wm>	 FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:45:49] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:48:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:50:49] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:59:02] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[04:59:42] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364809 (phaultfinder) NEW
[04:59:45] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364810 (phaultfinder) NEW
[05:08:57] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[05:15:22] <kart_>	 !log Updated MinT to 2024-03-28-061726-production (T333969)
[05:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:28] <stashbot>	 T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[05:16:09] <wikibugs>	 (PS3) KartikMistry: Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263)
[05:17:19] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263) (owner: KartikMistry)
[05:18:22] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263) (owner: KartikMistry)
[05:19:21] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:19:42] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:22:09] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:22:40] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:24:46] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:24:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:25:21] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:25:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:26:32] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:26:48] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:31:21] <kart_>	 !log Updated cxserver to 2024-04-23-221507-production (T363263, T333969, T360303, T360310)
[05:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:31] <stashbot>	 T363263: Post-creation work for iglwiki - https://phabricator.wikimedia.org/T363263
[05:31:32] <stashbot>	 T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[05:31:33] <stashbot>	 T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303
[05:31:33] <stashbot>	 T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310
[05:33:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:49:32] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364810#9793105 (phaultfinder)
[05:50:27] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1030561 (https://phabricator.wikimedia.org/T364814)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0600)
[06:00:04] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0600). Please do the needful.
[06:05:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:18:05] <wikibugs>	 (CR) Marostegui: Enable section-wide circuit breaking (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: Ladsgroup)
[06:18:54] <wikibugs>	 (CR) Marostegui: Enable section-wide circuit breaking (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: Ladsgroup)
[06:26:18] <wikibugs>	 (PS1) Marostegui: es1022: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031275
[06:26:45] <wikibugs>	 (CR) Marostegui: [C:+2] es1022: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031275 (owner: Marostegui)
[06:33:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2185.codfw.wmnet with OS bookworm
[06:33:56] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db2185.codfw.wmnet with OS bookworm
[06:35:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2185.codfw.wmnet with OS bookworm
[06:36:11] <jinxer-wm>	 FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:41:01] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailing list for English Wiktionary admins - https://phabricator.wikimedia.org/T364731#9793207 (Vininn126) Thank you very much!
[06:41:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:45:47] <wikibugs>	 (PS1) Marostegui: db2185: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031293 (https://phabricator.wikimedia.org/T364296)
[06:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:54:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2185.codfw.wmnet with reason: host reimage
[06:56:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2185.codfw.wmnet with reason: host reimage
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0700).
[07:00:05] <jouncebot>	 Msz2001 and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:04:26] * kart_ is here
[07:04:46] <moritzm>	 !log installing glib2.0 security updates
[07:04:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:56] <kart_>	 Msz2001: around?
[07:04:59] <Msz2001>	 yes
[07:05:30] <kart_>	 Are you going to self deploy or looking for the deployer?
[07:05:45] <Msz2001>	 I'm looking for one
[07:07:01] <kart_>	 OK. I can deploy.
[07:07:13] <Msz2001>	 Ok, thanks
[07:07:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1030978 (https://phabricator.wikimedia.org/T364769) (owner: Msz2001)
[07:08:31] <wikibugs>	 (Merged) jenkins-bot: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1030978 (https://phabricator.wikimedia.org/T364769) (owner: Msz2001)
[07:09:14] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1030978|Set $wgSignatureValidation to 'disallow' on Polish Wikipedia (T364769)]]
[07:09:19] <stashbot>	 T364769: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia - https://phabricator.wikimedia.org/T364769
[07:09:37] <kart_>	 Msz2001: I'll ping you when patch is available to test on mwdebug servers.
[07:09:45] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM. The new access group has been approved in yesterday's SRE IF meeting." [puppet] - https://gerrit.wikimedia.org/r/1027052 (https://phabricator.wikimedia.org/T364494) (owner: Dzahn)
[07:09:48] <Msz2001>	 ok
[07:11:11] <wikibugs>	 (CR) Marostegui: [C:+2] db2185: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1031293 (https://phabricator.wikimedia.org/T364296) (owner: Marostegui)
[07:11:35] <Kizule>	 Hi, what's going on with refreshing special pages? https://sr.wikipedia.org/wiki/Special:BrokenRedirects hasn't been updated since the 7th of May.
[07:12:22] <logmsgbot>	 !log kartik@deploy1002 kartik and msz2001: Backport for [[gerrit:1030978|Set $wgSignatureValidation to 'disallow' on Polish Wikipedia (T364769)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:12:44] <kart_>	 Msz2001: Please test.
[07:13:11] <Msz2001>	 Can confirm the patch works as intended
[07:14:18] <Msz2001>	 Kizule: We have a similar problem on plwiki, e.g. https://pl.wikipedia.org/wiki/Specjalna:Statystyki_oznaczania (8th May). I haven't checked if it's already filed
[07:14:45] <kart_>	 Msz2001: cool. Deploying.
[07:15:01] <logmsgbot>	 !log kartik@deploy1002 kartik and msz2001: Continuing with sync
[07:15:46] <dcausse>	 o/ I added a patch to the backport window (hope it's OK), I can deploy it once you're done with the scheduled patches
[07:16:13] <wikibugs>	 (CR) KartikMistry: [C:+2] CX: Add mw.cx.UserPermissionChecker [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959) (owner: KartikMistry)
[07:16:55] <kart_>	 (+2 my next patch for reducing CI wait time)
[07:17:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2185.codfw.wmnet with OS bookworm
[07:17:44] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1493 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:18:26] <kart_>	 dcausse: sure. 
[07:19:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:20:04] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:21:56] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1356 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:22:16] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:22:20] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubemaster1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:22:32] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Regex for es6 and es7 [puppet] - 10https://gerrit.wikimedia.org/r/1031302
[07:23:12] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] site.pp: Regex for es6 and es7 [puppet] - 10https://gerrit.wikimedia.org/r/1031302 (owner: 10Marostegui)
[07:27:43] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1030978|Set $wgSignatureValidation to 'disallow' on Polish Wikipedia (T364769)]] (duration: 18m 28s)
[07:27:46] <stashbot>	 T364769: Set $wgSignatureValidation to 'disallow' on Polish Wikipedia - https://phabricator.wikimedia.org/T364769
[07:28:05] <Msz2001>	 Thanks for deploying!
[07:28:23] <kart_>	 Msz2001: You're welcome!
[07:28:43] <ihurbain>	 can i also add a patch to the list for this morning? O:-) (i'd need a deployer)
[07:28:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959) (owner: 10KartikMistry)
[07:29:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:31:35] <ihurbain>	 kart_ i added one (1031067), if that works for you great and grateful, if not it'll wait a later window
[07:31:39] <kart_>	 ihurbain: It seems we will run out of the window, but let's see!
[07:31:44] <ihurbain>	 thank you :)
[07:32:12] <kart_>	 CI will take most of the time in my patch :/
[07:32:37] <ihurbain>	 yeah :/ (and there's David in the queue before mine, so i'm not holding my breath)
[07:33:34] <dcausse>	 ihurbain: mine might be more complicated than I initially thought so I might just reschedule it for this afternoon
[07:34:53] <ihurbain>	 (afternoon also looks quite full fwiw)
[07:35:18] <dcausse>	 yes just saw that :/
[07:36:09] <kart_>	 Backport/config window should be 2 hours :)
[07:36:25] <dcausse>	 :)
[07:36:52] <dcausse>	 jouncebot: next
[07:36:52] <jouncebot>	 In 0 hour(s) and 23 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800)
[07:37:13] <kart_>	 I doubt we can deploy more than 3 patches given CI+deployment is taking time for each patch. Add testing to that.
[07:37:21] <ihurbain>	 ah and we have early train too
[07:38:36] <wikibugs>	 (03Merged) 10jenkins-bot: CX: Add mw.cx.UserPermissionChecker [extensions/ContentTranslation] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030325 (https://phabricator.wikimedia.org/T349959) (owner: 10KartikMistry)
[07:39:07] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1030325|CX: Add mw.cx.UserPermissionChecker (T349959)]]
[07:39:11] <stashbot>	 T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959
[07:42:54] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1030325|CX: Add mw.cx.UserPermissionChecker (T349959)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:43:36] <dcausse>	 ihurbain: moved my patch out of the way, feel free to +2 your patch while kart_ is deploying
[07:44:25] <ihurbain>	 mmh do i actually want to do that if there's a chance the train rolls before my patch is deployed? (genuine question, i don't know)
[07:44:41] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:45:15] <ihurbain>	 (and can i actually do that if i'm not deploying myself, process-wise?)
[07:45:41] <dcausse>	 ihurbain: it might possibly take a bit of the train deploy window indeed
[07:46:39] <moritzm>	 !log installing libgd2 security updates
[07:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:55] <dcausse>	 re +2 I guess it does not matter as long as the patch gets deployed soon after it's merged
[07:47:44] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1493 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:49:32] <ihurbain>	 which i can't really guarantee considering timings :/
[07:50:04] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:51:35] <dcausse>	 ihurbain: yes now it's unlikely we'll have enough time :(
[07:51:56] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1356 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:52:14] <wikibugs>	 (03PS1) 10Hashar: Revert "Gerrit: update mail soy templates to match upstream" [puppet] - 10https://gerrit.wikimedia.org/r/1031172 (https://phabricator.wikimedia.org/T364484)
[07:52:16] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:52:20] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubemaster1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:52:28] <ihurbain>	 i'll drop mine in late utc and it'll work too. :)
[07:52:37] <hashar>	 o/
[07:52:47] <hashar>	 dcausse: ihurbain: you can extend the backport window if you want
[07:52:52] <hashar>	 jouncebot: next
[07:52:52] <jouncebot>	 In 0 hour(s) and 7 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800)
[07:53:16] <hashar>	 the next one is the train and I am running this week with andre as the backup
[07:53:23] <ihurbain>	 with the train rolling at that time? (fwiw: i'm not pushing back, i'm just trying really hard to not step on anyone's toes :D )
[07:53:25] <andre>	 o/
[07:53:35] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 215887
[07:53:37] <dcausse>	 hashar: I cancelled mine, but happy to help deploy the one from Isabelle
[07:53:38] <hashar>	 but there is not much happening on Tuesday beside waiting :)
[07:53:52] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 215887
[07:54:34] <moritzm>	 !log installing PHP 7.3 security updates
[07:54:35] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[07:54:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:06] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[07:56:08] <ihurbain>	 anyway. if we can extend & deploy, then i appreciate it (it'll fix DT on wikitech :P ), if not it can wait until this evening.
[07:56:59] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1030325|CX: Add mw.cx.UserPermissionChecker (T349959)]] (duration: 17m 52s)
[07:57:03] <stashbot>	 T349959: Limit or inhibit access to machine translation for users in Chinese Wikipedia - https://phabricator.wikimedia.org/T349959
[07:57:23] <kart_>	 OK. My patch is done.
[07:57:45] <dcausse>	 kart_: ack
[07:58:15] <dcausse>	 hashar: do we have enough time for a backport on wmf/1.43.0-wmf.4?
[08:00:04] <jouncebot>	 hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800)
[08:03:06] <andre>	 I'm cool with waiting a bit but up to hashar
[08:03:26] <hashar>	 dcausse: depends? ;)
[08:03:38] <hashar>	 yes please go ahead
[08:03:42] <dcausse>	 ok
[08:03:47] <ihurbain>	 \o/
[08:04:18] <wikibugs>	 (03CR) 10DCausse: [C:03+2] Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) (owner: 10C. Scott Ananian)
[08:07:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Revert "Gerrit: update mail soy templates to match upstream" [puppet] - 10https://gerrit.wikimedia.org/r/1031172 (https://phabricator.wikimedia.org/T364484) (owner: 10Hashar)
[08:09:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:04-1] "See my comment on related task" [puppet] - 10https://gerrit.wikimedia.org/r/1031050 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron)
[08:11:55] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Start monitoring es6, es7 for regular backups produced [puppet] - 10https://gerrit.wikimedia.org/r/1031387 (https://phabricator.wikimedia.org/T363812)
[08:12:05] <wikibugs>	 (03CR) 10Hashar: [C:03+1] ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:12:28] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "I guess am still confused by the Docker/containerd model :-]" [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:12:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823 (10MoritzMuehlenhoff) 03NEW
[08:13:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9793379 (10MoritzMuehlenhoff) p:05Triage→03Medium
[08:15:37] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubestagemaster2005.codfw.wmnet with OS bullseye
[08:15:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9793393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaste...
[08:17:40] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add the airflow profile to the statistics::explorer role [puppet] - 10https://gerrit.wikimedia.org/r/1029541 (https://phabricator.wikimedia.org/T364542) (owner: 10Btullis)
[08:19:13] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] dbbackups: Start monitoring es6, es7 for regular backups produced [puppet] - 10https://gerrit.wikimedia.org/r/1031387 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo)
[08:19:30] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:21:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1031391
[08:21:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Bdgreenlee out of all services on: 2208 hosts
[08:22:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1031391 (owner: 10Muehlenhoff)
[08:22:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Bdgreenlee out of all services on: 2208 hosts
[08:25:30] <wikibugs>	 (03Merged) 10jenkins-bot: Fix the loss of ParserOutput pointer in ContentDOMTransformStages [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031067 (https://phabricator.wikimedia.org/T364597) (owner: 10C. Scott Ananian)
[08:25:46] <wikibugs>	 (03PS2) 10Klausman: role::ml_cache::storage: Add staging and cross-DC IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1031390 (https://phabricator.wikimedia.org/T360428)
[08:26:53] <logmsgbot>	 !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1031067|Fix the loss of ParserOutput pointer in ContentDOMTransformStages (T364597)]]
[08:26:58] <stashbot>	 T364597: Missing content on discussion tools on Parsoid - https://phabricator.wikimedia.org/T364597
[08:27:07] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Start monitoring es6, es7 for regular backups produced [puppet] - 10https://gerrit.wikimedia.org/r/1031387 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo)
[08:27:53] <dcausse>	 ihurbain: started to deploy, is this something you can test on debug servers?
[08:27:59] <ihurbain>	 there is
[08:28:31] <ihurbain>	 should i look into that now? are we on eqiad?
[08:29:04] <dcausse>	 ihurbain: it's not yet there, but you mentioned wikitech and I'm not sure wikitech is run from test servers
[08:29:19] <ihurbain>	 it's not, but it's testable in other places too
[08:29:23] <dcausse>	 ok
[08:29:32] <ihurbain>	 it's just more visible on wikitech because we run parsoid by default there on DT :)
[08:29:35] <logmsgbot>	 !log dcausse@deploy1002 dcausse and cscott: Backport for [[gerrit:1031067|Fix the loss of ParserOutput pointer in ContentDOMTransformStages (T364597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:29:46] <dcausse>	 ihurbain: now it's there ^
[08:30:30] <ihurbain>	 aaaaand it works \o/ 
[08:30:33] <ihurbain>	 ship it!
[08:30:47] <dcausse>	 shipping!
[08:30:49] <logmsgbot>	 !log dcausse@deploy1002 dcausse and cscott: Continuing with sync
[08:31:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1031390 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman)
[08:31:49] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 10vm-requests, 07Kubernetes: Site: codfw 2 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T364740#9793453 (10JMeybohm) 05Open→03Resolved It's not clear to me what happened here. The makevm call was unable to...
[08:31:55] <wikibugs>	 (03CR) 10Klausman: [V:03+1 C:03+2] role::ml_cache::storage: Add staging and cross-DC IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1031390 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman)
[08:33:41] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "We got that due notably to Pytorch which is a fairly large installation (14G iirc), the context is T338317#9623848 and T364773 + this chan" [puppet] - 10https://gerrit.wikimedia.org/r/1031045 (https://phabricator.wikimedia.org/T364773) (owner: 10Ahmon Dancy)
[08:34:42] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Fix all-etcd, wikikube-master and wikikube-etcd aliases [puppet] - 10https://gerrit.wikimedia.org/r/1030995 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[08:34:49] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster100[345] [puppet] - 10https://gerrit.wikimedia.org/r/1030996 (https://phabricator.wikimedia.org/T364746) (owner: 10JMeybohm)
[08:37:58] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1393 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:39:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:41:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1003.eqiad.wmnet
[08:41:06] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:43:11] <logmsgbot>	 !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1031067|Fix the loss of ParserOutput pointer in ContentDOMTransformStages (T364597)]] (duration: 16m 17s)
[08:43:15] <stashbot>	 T364597: Missing content on discussion tools on Parsoid - https://phabricator.wikimedia.org/T364597
[08:43:18] <dcausse>	 ihurbain: done
[08:43:23] <ihurbain>	 yay!
[08:43:38] <dcausse>	 hashar, andre: we're done :)
[08:43:48] <ihurbain>	 dcausse: thank you very much; thank you hashar and andre too for accepting the train delay!
[08:44:19] <andre>	 thanks
[08:44:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002"
[08:44:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1004.eqiad.wmnet
[08:45:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002"
[08:45:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:45:21] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster1003.eqiad.wmnet on all recursors
[08:45:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1003.eqiad.wmnet on all recursors
[08:45:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:46:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002"
[08:47:05] <hashar>	 back with a coffee
[08:47:11] <hashar>	 andre: wanna do it over a google meet?
[08:47:17] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1003.eqiad.wmnet - jayme@cumin1002"
[08:48:09] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1005.eqiad.wmnet
[08:48:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002"
[08:48:41] <andre>	 hashar, would be nice to refresh my memories, feel free to join the one in your calendar
[08:49:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002"
[08:49:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:49:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster1004.eqiad.wmnet on all recursors
[08:49:08] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1004.eqiad.wmnet on all recursors
[08:49:26] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:49:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1003.eqiad.wmnet with OS bullseye
[08:49:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002"
[08:49:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793530 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[08:50:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1004.eqiad.wmnet - jayme@cumin1002"
[08:52:40] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1004.eqiad.wmnet with OS bullseye
[08:52:51] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002"
[08:52:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793542 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[08:54:09] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002"
[08:54:09] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:54:09] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster1005.eqiad.wmnet on all recursors
[08:54:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1005.eqiad.wmnet on all recursors
[08:54:40] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002"
[08:57:00] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Service mesh: rename local_service cluster (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030220 (owner: 10CDanis)
[08:57:37] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1005.eqiad.wmnet - jayme@cumin1002"
[08:58:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on serpens.wikimedia.org with reason: OS update
[08:58:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on serpens.wikimedia.org with reason: OS update
[08:58:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9793550 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=34ac3b76-436c-436c-afc2-20387cde43fb) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with...
[09:02:10] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage
[09:02:19] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage
[09:04:28] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage
[09:05:46] <wikibugs>	 (03PS3) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[09:06:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[09:06:49] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage
[09:07:59] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1393 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:09:03] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031396 (https://phabricator.wikimedia.org/T361399)
[09:09:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031396 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[09:09:27] <wikibugs>	 (03PS4) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[09:09:43] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031396 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[09:11:55] <wikibugs>	 (03PS7) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310)
[09:13:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:13:27] <wikibugs>	 (03CR) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service (032 comments) [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[09:14:09] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1005.eqiad.wmnet with OS bullseye
[09:14:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[09:17:01] <wikibugs>	 (03PS3) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656)
[09:18:02] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:18:11] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1435 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:18:47] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1003.eqiad.wmnet with OS bullseye
[09:18:48] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1003.eqiad.wmnet
[09:18:51] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:19:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793580 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:19:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[09:20:01] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1004.eqiad.wmnet with OS bullseye
[09:20:01] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1004.eqiad.wmnet
[09:20:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:21:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:22:19] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:23:02] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:23:05] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812)
[09:24:07] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2421/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[09:24:20] <wikibugs>	 (03PS4) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656)
[09:24:45] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.5  refs T361399
[09:24:49] <stashbot>	 T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[09:25:50] <wikibugs>	 (03CR) 10Vgutierrez: "Idea looks good. PCC run needs to be adjusted, cp hosts aren't involved here and acme-chief ones are missing" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[09:26:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo)
[09:26:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9793591 (10MoritzMuehlenhoff) serpens has been migrated to Bullseye, seaborgium to follow in a few days.
[09:27:01] <wikibugs>	 (03PS3) 10JMeybohm: Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307)
[09:27:11] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage
[09:27:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[09:27:30] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo)
[09:28:17] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:28:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9793592 (10ayounsi) @cmooney what do you think of duplicating the other POPs allocation scheme? For example looking at eqiad as example, keep 2a02:ec80:a000::/40 as "reserved for future growth" Then...
[09:31:25] <wikibugs>	 (03CR) 10Marostegui: db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui)
[09:31:32] <marostegui>	 jouncebot: now
[09:31:32] <jouncebot>	 For the next 0 hour(s) and 28 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T0800)
[09:31:35] <marostegui>	 jouncebot: next
[09:31:35] <jouncebot>	 In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1000)
[09:31:41] <wikibugs>	 (03PS3) 10Filippo Giunchedi: jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477)
[09:31:41] <wikibugs>	 (03PS3) 10Filippo Giunchedi: jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477)
[09:31:41] <wikibugs>	 (03PS3) 10Filippo Giunchedi: jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477)
[09:31:47] <hashar>	 marostegui: we have finished promoting the train :)
[09:31:55] <hashar>	 and currently browsing the log spam with andre
[09:31:55] <marostegui>	 hashar: thanks :)
[09:31:58] <hashar>	 so feel free to deploy
[09:31:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage
[09:32:03] <marostegui>	 hashar: <3
[09:32:08] <hashar>	 I have a couple of databases that have vanished though
[09:32:14] <marostegui>	 uh?
[09:32:21] <hashar>	 Unknown database 'wikishared' (db1223)
[09:32:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[09:32:35] <hashar>	 Error 1049: Unknown database 'cognate_wiktionary'
[09:32:35] <hashar>	 :)
[09:32:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[09:32:46] <Lucas_WMDE>	 hashar: I’m looking at the cognate_wiktionary rn
[09:32:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[09:33:03] <marostegui>	 hashar: db1223 isn't supposed to have wikishared
[09:33:06] <hashar>	 they happen from time to time, I guess because the code paths are not hit that often
[09:33:09] <marostegui>	 so something might be wrong with the code
[09:33:16] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:33:20] <wikibugs>	 (03PS2) 10JMeybohm: Add kubestagemaster2004 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1030958 (https://phabricator.wikimedia.org/T363307)
[09:33:28] * hashar points at DNS
[09:33:29] <hashar>	 err
[09:33:31] <hashar>	 PHP
[09:33:39] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui)
[09:33:42] <marostegui>	 db1223 is s3 and wikishared lives in x1
[09:33:44] <hashar>	 Lucas_WMDE: thanks a lot!
[09:33:47] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui)
[09:34:30] <wikibugs>	 (03Merged) 10jenkins-bot: db-production.php: Make es4 and es5 RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030918 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui)
[09:35:01] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:1030918|db-production.php: Make es4 and es5 RO (T364447)]]
[09:35:02] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2004 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1030957 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[09:35:06] <stashbot>	 T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447
[09:35:21] <hashar>	 marostegui: I will investigate a bit. Maybe it is known
[09:35:29] <marostegui>	 hashar: let me know if I can help
[09:36:14] <hashar>	 I suspect it is a regression with this week's code
[09:36:18] <hashar>	 I am filing a task 
[09:37:25] <Lucas_WMDE>	 hashar: I just filed T364827 for the cognate task
[09:37:26] <stashbot>	 T364827: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'cognate_wiktionary' - https://phabricator.wikimedia.org/T364827
[09:37:30] <Lucas_WMDE>	 no idea if wikishared is related though
[09:37:38] <hashar>	 I am digging into it
[09:37:52] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1030918|db-production.php: Make es4 and es5 RO (T364447)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:37:55] <hashar>	 apparently comes from a post job from LoginNotify
[09:37:58] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Continuing with sync
[09:39:59] <marostegui>	 Lucas_WMDE: just commented there
[09:40:20] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2004 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1030958 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[09:40:32] <marostegui>	 Both things are related, they are both looking for databases in s3, but those two databases: cognate_wiktionary and wikishared live in x1
[09:41:39] <claime>	 Is testwiki.ce_question_answers in that case as well?
[09:41:52] <claime>	 mediawiki_job_campaignevents-aggregateparticipantanswers-testwiki is failing due to not finding this table
[09:42:12] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:42:14] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:42:15] <marostegui>	 claime: test wiki does live in s3
[09:42:30] <marostegui>	 claime: let me check if that table exists
[09:42:50] <marostegui>	 claime: cumin2024@db1175.eqiad.wmnet[testwiki]> show tables like 'ce_qu%';
[09:42:50] <marostegui>	 Empty set (0.001 sec)
[09:43:28] <marostegui>	 So there is no such table in testwiki
[09:44:35] <claime>	 The timing feels weird though. The 3AM run went fine, the 6AM run failed
[09:45:07] <claime>	 Ah, the train ran in between
[09:45:13] <hashar>	 I am going to roll back the train
[09:45:16] <hashar>	 in a few minutes
[09:45:20] <Lucas_WMDE>	 hashar: I don’t think it’s the train
[09:45:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1005.eqiad.wmnet with OS bullseye
[09:45:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1005.eqiad.wmnet
[09:45:24] <Lucas_WMDE>	 since I can also reproduce it on dewiktionary
[09:45:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:45:29] <Lucas_WMDE>	 (I’ll leave a comment on the task in a sec)
[09:45:37] <hashar>	 ah
[09:45:56] <hashar>	 there is also https://phabricator.wikimedia.org/T364828  which is not able to find the wikishared database due to a misconfig 
[09:46:06] <hashar>	 so I suspect maybe the database layer might be confused / wrong
[09:46:27] <hashar>	 that other task is for the LoginNotify extension and I guess that breaks its feature
[09:46:45] <Lucas_WMDE>	 might be worth trying a train rollback anyway, I guess
[09:47:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1003.eqiad.wmnet to plain
[09:47:37] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of kubestagemaster1003.eqiad.wmnet to plain
[09:47:40] <wikibugs>	 (03CR) 10Volans: [C:03+2] external clouds: allow to get prefixes from RIPE [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans)
[09:47:46] <Lucas_WMDE>	 hashar: scratch all that, I fail at testing
[09:47:50] <Lucas_WMDE>	 dewiktionary not affected AFAICT
[09:47:52] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1003.eqiad.wmnet to plain
[09:47:54] <Lucas_WMDE>	 so probably train after all
[09:48:04] <wikibugs>	 (03CR) 10Ladsgroup: Enable section-wide circuit breaking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[09:48:20] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793708 (10ops-monitoring-bot) VM kubestagemaster1003.eqiad.wmnet switching disk type to plain
[09:48:28] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1003.eqiad.wmnet to plain
[09:48:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1004.eqiad.wmnet to plain
[09:48:37] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Sounds good, thanks a lot for working on this. This is super great" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[09:48:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793711 (10ops-monitoring-bot) VM kubestagemaster1004.eqiad.wmnet switching disk type to plain
[09:49:10] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1004.eqiad.wmnet to plain
[09:49:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1005.eqiad.wmnet to plain
[09:49:38] <hashar>	 Lucas_WMDE: yeah I think that issue with Cognate is similar to the one with LoginNotify
[09:49:45] <hashar>	 and probably share the same cause
[09:49:52] <hashar>	 I have marked both UBN / Blockers
[09:50:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793712 (10ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to plain
[09:50:08] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1005.eqiad.wmnet to plain
[09:50:08] <hashar>	 and they should be reproducible on the test wikis
[09:50:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[09:50:14] <wikibugs>	 (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402
[09:50:19] <claime>	 I'll file a bug for CampaignEvents, since it's a missing table and not a db not found, seems like a different issue
[09:50:29] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:1030918|db-production.php: Make es4 and es5 RO (T364447)]] (duration: 15m 28s)
[09:50:33] <stashbot>	 T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447
[09:51:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:05:00 on 6 hosts with reason: Primary switchover es4 T364451
[09:51:48] <stashbot>	 T364451: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T364451
[09:52:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on 6 hosts with reason: Primary switchover es4 T364451
[09:52:29] <wikibugs>	 (03CR) 10Ladsgroup: "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[09:52:32] <wikibugs>	 (03PS3) 10Ladsgroup: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930)
[09:52:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 0:05:00 on 6 hosts with reason: Checking RO status
[09:52:47] <hashar>	 I am rolling back now
[09:52:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 (owner: 10Muehlenhoff)
[09:53:00] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on 6 hosts with reason: Checking RO status
[09:53:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
[09:54:11] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031404 (https://phabricator.wikimedia.org/T361399)
[09:54:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031404 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[09:54:36] <wikibugs>	 (03Abandoned) 10Hashar: testwikis wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031404 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[09:55:41] <wikibugs>	 (03PS4) 10Filippo Giunchedi: jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477)
[09:55:41] <wikibugs>	 (03PS4) 10Filippo Giunchedi: jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477)
[09:56:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793752 (10JMeybohm) 05Open→03Resolved
[09:56:03] <wikibugs>	 (03PS1) 10Hashar: Revert "group0 wikis to 1.43.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031405 (https://phabricator.wikimedia.org/T361399)
[09:56:04] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Revert "group0 wikis to 1.43.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031405 (https://phabricator.wikimedia.org/T361399) (owner: 10Hashar)
[09:56:15] <hashar>	 andre: ^
[09:56:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.43.0-wmf.5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031405 (https://phabricator.wikimedia.org/T361399) (owner: 10Hashar)
[09:58:02] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406
[09:58:32] <wikibugs>	 (03PS2) 10Muehlenhoff: Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406
[09:58:51] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The CI failure is expected, I kept the change in aux values.yaml separate (Ie5b4213379b) to highlight what makes CI pass. Though I can mer" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1000)
[10:00:38] <Lucas_WMDE>	 (train window not done yet)
[10:00:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Service mesh: rename local_service cluster (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030220 (owner: 10CDanis)
[10:01:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 (owner: 10Muehlenhoff)
[10:01:29] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Clarify the status of each section [puppet] - 10https://gerrit.wikimedia.org/r/1031408 (https://phabricator.wikimedia.org/T364447)
[10:01:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:02:35] <wikibugs>	 (03PS2) 10Wargo: Assign applychangetags right to group "all" on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031174 (https://phabricator.wikimedia.org/T363638)
[10:03:02] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:10] <hashar>	 it is rolling back
[10:04:05] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] site.pp: Clarify the status of each section [puppet] - 10https://gerrit.wikimedia.org/r/1031408 (https://phabricator.wikimedia.org/T364447) (owner: 10Marostegui)
[10:04:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Service mesh: rename local_service cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030221 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[10:05:07] <wikibugs>	 (03CR) 10Marostegui: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[10:05:11] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2383 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:05:15] <wikibugs>	 (03PS3) 10Muehlenhoff: Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406
[10:05:34] <wikibugs>	 (03CR) 10Marostegui: "Thanks for working on this guys!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup)
[10:07:06] <wikibugs>	 (03PS3) 10Gmodena: EventStreamConfig: Add webrequest.frontend.v1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956)
[10:07:55] <wikibugs>	 (03CR) 10MVernon: [C:03+1] profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[10:08:31] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[10:11:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] ci: Enable profile::auto_restarts::service for docker/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1028795 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:11:38] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[10:11:58] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 (owner: 10Muehlenhoff)
[10:12:13] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:12:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Point Nova spec test to bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031406 (owner: 10Muehlenhoff)
[10:12:15] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:12:28] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.43.0-wmf.5" - T361399
[10:12:35] <stashbot>	 T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[10:13:30] <wikibugs>	 (03PS2) 10Wargo: Add alias for NS_PROJECT for Multilingual Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031175 (https://phabricator.wikimedia.org/T363904)
[10:13:50] <wikibugs>	 (03PS2) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402
[10:16:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:17:38] <Amir1>	 hashar: I can get the fix out of the door soon, wanna retry again soon?
[10:17:43] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2004.codfw.wmnet
[10:17:44] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/weight=10; selector: name=kubestagemaster2004.codfw.wmnet
[10:18:02] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1435 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:21:07] <wikibugs>	 (03PS2) 10Muehlenhoff: profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750)
[10:21:33] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 36093.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:21:49] <hashar>	 Amir1: well, the train is rolled back so we are all set :]   There is time to polish the patch, but maybe the code should be reverted since it might have other faults and lacks tests
[10:22:08] <hashar>	 or maybe it is trivial code that is already covered by tests, in which case one test is surely missing
[10:22:10] <hashar>	 anyway
[10:22:31] <Amir1>	 it is covered by tests, I'll add a regression test soon
[10:22:38] <hashar>	 yeah that would be great :)
[10:22:50] <hashar>	 there is no rush. Train got rolled back
[10:22:56] <marostegui>	 db2152 is expected
[10:22:59] <hashar>	 I am off for lunch break with kids :)
[10:23:01] <marostegui>	 But I thought I downtimed it
[10:23:41] <wikibugs>	 (03PS1) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534)
[10:23:43] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: s8 on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 36213.57 seconds Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:24:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[10:24:20] <hashar>	 Amir1: there is another blocker for VisualEditor which needs a backport as well. I will give it a try after lunch
[10:24:32] <hashar>	 so essentially: no rush
[10:25:42] <wikibugs>	 (03PS2) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534)
[10:26:34] <wikibugs>	 (03PS5) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656)
[10:26:35] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[10:29:03] <wikibugs>	 (03PS2) 10Wargo: $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291)
[10:33:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9793897 (10MoritzMuehlenhoff) But isn't it simpler to just grep in the output of a single cookbook as opposed to grep the output of multiple tools?
[10:33:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[10:35:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2383 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:35:58] <wikibugs>	 (03PS2) 10Muehlenhoff: Inline profile::swift::proxy_tls [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750)
[10:38:27] <wikibugs>	 (03PS1) 10Ladsgroup: rdbms: Fix picking the database from the LB domain [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827)
[10:38:39] <wikibugs>	 (03PS1) 10Klausman: ml-services: Change references to cassandra clusters from using _ to - [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428)
[10:38:40] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] rdbms: Fix picking the database from the LB domain [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) (owner: 10Ladsgroup)
[10:39:18] <wikibugs>	 (03PS1) 10Btullis: Improve dumps::web::rsync::nginxlogs management [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820)
[10:39:19] <wikibugs>	 (03PS1) 10Btullis: Manage the directory for dumps.wikimedia.org logs on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820)
[10:40:55] <wikibugs>	 (03PS3) 10Wargo: $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291)
[10:43:43] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[10:44:50] <wikibugs>	 (03CR) 10Volans: "I didn't check if the ASN correspond to the existing IPs in requestctl, but they are matching the names." [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[10:45:20] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2422/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis)
[10:46:42] <wikibugs>	 (03PS3) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534)
[10:47:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[10:48:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:48:45] <wikibugs>	 (03CR) 10Ladsgroup: "I'd say keep it simple. We don't need to introduce too many functions. I just deploy this then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup)
[10:49:00] <wikibugs>	 (03PS4) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534)
[10:49:43] <wikibugs>	 (03PS2) 10Btullis: Improve dumps::web::rsync::nginxlogs management [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820)
[10:49:43] <wikibugs>	 (03PS2) 10Btullis: Manage the directory for dumps.wikimedia.org logs on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820)
[10:51:21] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis)
[10:52:29] <wikibugs>	 (03CR) 10Jelto: "I added all related ASNs I could find as discussed in IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[10:56:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) (owner: 10Ladsgroup)
[10:59:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9793972 (10cmooney) >>! In T187929#9793592, @ayounsi wrote: > @cmooney what do you think of duplicating the other POPs allocation scheme? > For example looking at eqiad as example, keep 2a02:ec80:a00...
[11:00:04] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812)
[11:00:23] <wikibugs>	 (03PS1) 10Muehlenhoff: gerrit::migration: Let rsync handle the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1031423
[11:02:24] <wikibugs>	 (03Merged) 10jenkins-bot: rdbms: Fix picking the database from the LB domain [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031177 (https://phabricator.wikimedia.org/T364827) (owner: 10Ladsgroup)
[11:02:33] <wikibugs>	 (03PS5) 10Jelto: external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534)
[11:02:53] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1031177|rdbms: Fix picking the database from the LB domain (T364827)]]
[11:02:56] <stashbot>	 T364827: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'cognate_wiktionary' - https://phabricator.wikimedia.org/T364827
[11:03:17] <wikibugs>	 (03CR) 10Jelto: "let's start with a small set of ASNs first and expand if needed" [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[11:03:22] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[11:03:23] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031423 (owner: 10Muehlenhoff)
[11:03:27] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Looks reasonable to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[11:04:33] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Update the list of valid sections to check for WMFbackups [puppet] - 10https://gerrit.wikimedia.org/r/1031397 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo)
[11:05:31] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1031177|rdbms: Fix picking the database from the LB domain (T364827)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:05:57] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[11:06:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[11:06:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[11:07:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T364299)', diff saved to https://phabricator.wikimedia.org/P62378 and previous config saved to /var/cache/conftool/dbconfig/20240514-110704-marostegui.json
[11:07:12] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:10:17] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:11:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:12:54] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Improve dumps::web::rsync::nginxlogs management [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis)
[11:13:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P62379 and previous config saved to /var/cache/conftool/dbconfig/20240514-111302-root.json
[11:13:41] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on db2152 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:14:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9794015 (10WDoranWMF) Approved!
[11:16:13] <icinga-wm>	 PROBLEM - dump of es6 in eqiad on backupmon1001 is CRITICAL: Last dump for es6 at eqiad (es1036) taken on 2024-05-14 06:09:20 is 1.7 GiB, but the previous one was 328 KiB, a change of +544922.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:18:38] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM! One open-question in line but I am happy either way.  Nice one :)" [puppet] - 10https://gerrit.wikimedia.org/r/1030185 (https://phabricator.wikimedia.org/T363702) (owner: 10Bking)
[11:18:40] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1031177|rdbms: Fix picking the database from the LB domain (T364827)]] (duration: 15m 47s)
[11:18:46] <stashbot>	 T364827: Wikimedia\Rdbms\DBQueryError: Error 1049: Unknown database 'cognate_wiktionary' - https://phabricator.wikimedia.org/T364827
[11:19:13] <icinga-wm>	 PROBLEM - dump of es7 in codfw on backupmon1001 is CRITICAL: Last dump for es7 at codfw (es2040) taken on 2024-05-14 05:35:25 is 1.7 GiB, but the previous one was 329 KiB, a change of +533277.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:21:58] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Manage the directory for dumps.wikimedia.org logs on stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1031417 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis)
[11:22:13] <icinga-wm>	 PROBLEM - dump of es7 in eqiad on backupmon1001 is CRITICAL: Last dump for es7 at eqiad (es1040) taken on 2024-05-14 06:10:27 is 1.7 GiB, but the previous one was 329 KiB, a change of +543674.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:22:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Inline profile::swift::proxy_tls [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[11:23:02] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:19] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:24:44] <wikibugs>	 (03PS5) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[11:28:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P62381 and previous config saved to /var/cache/conftool/dbconfig/20240514-112807-root.json
[11:29:24] <wikibugs>	 (03PS2) 10Ladsgroup: etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786)
[11:29:29] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup)
[11:29:43] <Amir1>	 jouncebot: nowandnext
[11:29:43] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 30 minute(s)
[11:29:43] <jouncebot>	 In 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1200)
[11:29:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup)
[11:30:05] <wikibugs>	 (03Merged) 10jenkins-bot: etcd: Ignore parsercache clusters in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030938 (https://phabricator.wikimedia.org/T362786) (owner: 10Ladsgroup)
[11:30:35] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1030938|etcd: Ignore parsercache clusters in externalLoads (T362786)]]
[11:30:41] <stashbot>	 T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786
[11:31:49] <wikibugs>	 (03CR) 10JMeybohm: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[11:32:45] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9794071 (10KOfori) a:05KOfori→03Eevans Approved.
[11:33:12] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1030938|etcd: Ignore parsercache clusters in externalLoads (T362786)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:35:05] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[11:38:03] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm. In future it would be nice to use either the external-services-networkpolicy module or a shared approach for sessionstore and echost" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans)
[11:39:56] <hnowlan>	 jouncebot: nowandnext
[11:39:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[11:39:56] <jouncebot>	 In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1200)
[11:40:17] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1030 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:43:13] <icinga-wm>	 PROBLEM - dump of es6 in codfw on backupmon1001 is CRITICAL: Last dump for es6 at codfw (es2036) taken on 2024-05-14 05:33:02 is 1.7 GiB, but the previous one was 328 KiB, a change of +534582.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:43:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P62382 and previous config saved to /var/cache/conftool/dbconfig/20240514-114314-root.json
[11:47:58] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1030938|etcd: Ignore parsercache clusters in externalLoads (T362786)]] (duration: 17m 22s)
[11:48:02] <stashbot>	 T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786
[11:58:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P62383 and previous config saved to /var/cache/conftool/dbconfig/20240514-115820-root.json
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1200)
[12:01:49] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[12:02:28] <wikibugs>	 (03Merged) 10jenkins-bot: Enable section-wide circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031021 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[12:03:18] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]]
[12:03:23] <stashbot>	 T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930
[12:06:00] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:08:02] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:06] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9794186 (10Clement_Goubert) We are currently holding at 85% of global traffic, and as such not reimaging anymore serv...
[12:11:43] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:11:53] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[12:12:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Zookeeper: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1031429
[12:12:27] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis)
[12:13:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P62384 and previous config saved to /var/cache/conftool/dbconfig/20240514-121326-root.json
[12:14:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 (owner: 10Muehlenhoff)
[12:15:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031402 (owner: 10Muehlenhoff)
[12:16:15] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:16:36] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323)
[12:16:41] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1387 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:16:41] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1352 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:17:07] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2320 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:18:02] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:29] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031429 (owner: 10Muehlenhoff)
[12:20:07] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:23:19] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:24:31] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1031021|Enable section-wide circuit breaking (T360930)]] (duration: 21m 12s)
[12:24:34] <stashbot>	 T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930
[12:24:53] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[12:24:58] <wikibugs>	 (03PS2) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257)
[12:25:29] <wikibugs>	 (03CR) 10Jelto: [C:03+2] external clouds: add more cloud providers [puppet] - 10https://gerrit.wikimedia.org/r/1031412 (https://phabricator.wikimedia.org/T303534) (owner: 10Jelto)
[12:26:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host serpens.wikimedia.org
[12:27:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch serpens to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031431 (https://phabricator.wikimedia.org/T349619)
[12:29:45] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "LG thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman)
[12:30:29] <vgutierrez>	 is CI doing ok? I've been waiting ~6 minutes already for a CI check on https://gerrit.wikimedia.org/r/c/operations/dns/+/1030939
[12:30:58] <hashar>	 vgutierrez: https://integration.wikimedia.org/zuul/
[12:31:15] <hashar>	 that gives you an overview
[12:31:34] <cdanis>	 hashar: weirdly I don't see an operations/dns patch there at all
[12:31:49] <vgutierrez>	 yeah.. it doesn't seem to be queued
[12:31:52] <hashar>	 then without looking, we are reimaging one of the servers and currently running with half the capacity
[12:32:00] <hashar>	 though in practice it is rarely used fully
[12:32:02] <taavi>	 operations/dns is not configured to run any gate-and-submit jobs on a +2
[12:32:22] <vgutierrez>	 taavi: but it should be triggered on the rebase?
[12:32:35] <hashar>	 the bottleneck would be the `zuul-merger` process, which picks the proposed patch and merges it against the tip of the branch; the result is used by CI to run the tests
[12:32:49] <hashar>	 oh
[12:33:01] <taavi>	 not if there's a +2 applied at that point
[12:33:07] <vgutierrez>	 uh
[12:33:21] <wikibugs>	 (03CR) 10Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[12:33:48] <wikibugs>	 (03CR) 10CDanis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[12:33:50] <hashar>	 that repo is broken
[12:33:59] <hashar>	 the rebase should have cleared the +2
[12:34:15] <vgutierrez>	 hashar: it cleared the V:+2 not the C:+2 
[12:34:56] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1030939 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[12:34:57] * hashar shakes fist at CopyOnTrivialRebase
[12:35:14] <bblack>	 historically ops/dns had a simpler config (no gate-and-submit, etc) because we hoped it might still function when some things are broken :)
[12:35:17] <vgutierrez>	 !log depool upload@ulsfo before enabling IPIP encapsulation - T357257
[12:35:20] <hashar>	 + it is fast forward only
[12:35:28] <hashar>	 when in most cases we migrated repositories to use rebase if necessary
[12:36:03] <bblack>	 ff-only kinda makes sense there too, IMHO
[12:36:42] <hashar>	 possibly yes :)
[12:38:05] <bblack>	 but a lot of this is fear-driven engineering, and we've never had a compelling story for how an SRE deploys an emergency DNS change when all the things are broken (other than a very manual and tedious way)
[12:38:47] <bblack>	 if we ever "solve" the latter with something slightly-more-elegant, maybe we can care less about the dependencies involved in the "normal" flow
[12:39:04] <hashar>	 OH I FOUND OUT
[12:39:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host serpens.wikimedia.org
[12:39:17] <hashar>	 the Code-Review label got copied cause that is just a rebase
[12:39:21] <cdanis>	 bblack: IME, the only way you ever work your way out of that cycle is by having a process and exercising it semi-regularly (even 1x-2x/yr is enough)
[12:39:33] <hashar>	 and usually we do want to copy the Code-Review+1 and carry it between rebase
[12:40:12] <hashar>	 when there is a CR+2, that is usually intended to trigger a submit/merge, so it is unlikely that a rebase follows, and even if that is the case, I guess we might still want to carry the CR+2
[12:40:19] <hashar>	 but for operations/dns the CR+2 does nothing yeah
[12:40:56] <hashar>	 and there is an optimization in CI to not bother running tests from the `test` or `test-prio` pipelines for a change that already has a CR+2, since they will be run by the `gate-and-submit` pipeline
[12:41:00] <hashar>	 so yeah that is "normal"
[12:41:06] <hashar>	 (sorry I am thinking out loud)
[12:41:38] <Lucas_WMDE>	 (FWIW, in “normal” / extension repos I find CR+2 surviving a rebase to be a useful feature ^^)
[12:41:48] <hashar>	 yeah
[12:41:55] <cdanis>	 Lucas_WMDE: sure, when it would trigger anything at all :)
[12:42:02] <Lucas_WMDE>	 yeah ^^
[12:42:09] <hashar>	 but operations/dns does not have CI merging changes on a CR+2
[12:42:31] <hashar>	 I think the reason was that at the time we did not want CI to submit changes to sensitive repositories such as operations/dns and operations/puppet
[12:42:54] <hashar>	 so those two repos have a different process
[12:43:40] <wikibugs>	 (03PS1) 10LSobanski: Filter out addresses handled by gsuite that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145)
[12:44:37] <hashar>	 also operations/puppet used to have "fast forward only" merge strategy which caused SRE to spend their time racing to have their change rebased on tip of the branch https://phabricator.wikimedia.org/T224033
[12:45:17] <hashar>	 that got solved by changing the strategy to "rebase if necessary", which means that Gerrit rebases it under the hood and keeps a linear history
[12:45:45] <bblack>	 yeah, ff-only isn't really sustainable at a higher commit rate
[12:46:01] <bblack>	 but luckily ops/dns is relatively-slow
[12:46:15] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:46:26] <hashar>	 operations/dns is still using "fast forward only". I think the only advantage for it is that if the branch received a change to one of the files touched by the change, Gerrit will mark it as being in conflict in the web ui
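(Editorial sketch, not part of the channel log.) The difference between the two submit strategies discussed above can be illustrated with a small model. It assumes, in line with Gerrit's documented behaviour, that "fast forward only" needs the change's parent to be the branch tip, while "rebase if necessary" rebases automatically when there is no content conflict. The function name, the strategy strings, and the parameters are hypothetical.

```python
def can_submit(strategy: str, parent: str, tip: str, conflicts: bool) -> bool:
    """Toy model of whether a change can be submitted without a manual rebase.

    parent    -- commit the change is based on
    tip       -- current tip of the target branch
    conflicts -- whether rebasing onto the tip would hit a content conflict
    """
    if parent == tip:
        # Already on top of the branch: both strategies can fast-forward.
        return True
    if strategy == "fast-forward-only":
        # Someone else merged first; the author has to rebase by hand.
        return False
    if strategy == "rebase-if-necessary":
        # The rebase happens under the hood unless the contents conflict.
        return not conflicts
    raise ValueError(f"unknown strategy: {strategy}")
```

Under "fast-forward-only", every merge by someone else invalidates all other pending changes, which is the race described for operations/puppet in T224033; on a slow-moving repo such as operations/dns that cost is rarely paid.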
[12:46:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1387 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:46:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1352 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:47:03] <wikibugs>	 (03CR) 10Ladsgroup: "This is making the data flow a bit unclear to me. I prefer all etcd value overrides be set in one place. It involves setting global variab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French)
[12:47:07] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2320 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:48:47] <icinga-wm>	 RECOVERY - snapshot of s8 in eqiad on backupmon1001 is OK: Last snapshot for s8 at eqiad (db1171) taken on 2024-05-14 11:16:50 (1594 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[12:50:07] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:54:35] <wikibugs>	 (03PS23) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[12:55:00] <wikibugs>	 (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Make Translations extension work with upstream Phorge [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: 10Aklapper)
[12:55:59] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): admin: Update deployment description [puppet] - 10https://gerrit.wikimedia.org/r/1031435
[12:56:08] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Just a little suggestion :)" [puppet] - 10https://gerrit.wikimedia.org/r/1031435 (owner: 10Lucas Werkmeister (WMDE))
[12:57:02] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Skip ferm rules if firewall provider is none [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257)
[12:57:37] <icinga-wm>	 RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2198) taken on 2024-05-14 11:51:06 (1632 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[12:58:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[12:59:28] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[12:59:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 12:00:00 on db2114.codfw.wmnet,db1125.eqiad.wmnet with reason: Testing
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300).
[13:00:04] <jouncebot>	 MatmaRex and Jdlrobson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 12:00:00 on db2114.codfw.wmnet,db1125.eqiad.wmnet with reason: Testing
[13:00:14] <Lucas_WMDE>	 o/
[13:00:26] <MatmaRex>	 hi
[13:00:50] <MatmaRex>	 uhh, let's skip my first patch again, i'm looking at comments in slack now that say it might not be correct
[13:01:00] <Lucas_WMDE>	 ok, sure
[13:01:01] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2426/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[13:01:30] <MatmaRex>	 the other two are good to go (and should have no effect for users, but should reduce database load a tiny bit)
[13:01:49] <Lucas_WMDE>	 is it okay to deploy them together?
[13:01:59] <Lucas_WMDE>	 to save a bit of time
[13:02:14] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221)
[13:02:19] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: Use ConditionalUserOptions for "discussiontools-autotopicsub" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221)
[13:02:23] <MatmaRex>	 yeah
[13:03:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński)
[13:03:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński)
[13:04:04] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2427/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:04:12] <wikibugs>	 (03Merged) 10jenkins-bot: Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030532 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński)
[13:04:15] <wikibugs>	 (03Merged) 10jenkins-bot: Use ConditionalUserOptions for "discussiontools-autotopicsub" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030535 (https://phabricator.wikimedia.org/T357221) (owner: 10Bartosz Dziewoński)
[13:04:24] <wikibugs>	 (03CR) 10Jelto: [V:03+1] gitlab: enable custom exporter on all instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[13:04:31] * Lucas_WMDE subscribes to the slack thread
[13:04:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1030532|Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (T357221)]], [[gerrit:1030535|Use ConditionalUserOptions for "discussiontools-autotopicsub" (T357221)]]
[13:04:48] <stashbot>	 T357221: Handle preferences for new users using "ConditionalUserOptions" config instead of "LocalUserCreated" hook inserting preference rows - https://phabricator.wikimedia.org/T357221
[13:05:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2152.codfw.wmnet
[13:05:56] <wikibugs>	 (03PS3) 10Jelto: prometheus::ops: scrape custom gitlab exporter [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656)
[13:06:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2152 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031438 (https://phabricator.wikimedia.org/T349619)
[13:07:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:1030532|Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (T357221)]], [[gerrit:1030535|Use ConditionalUserOptions for "discussiontools-autotopicsub" (T357221)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2152 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031438 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:08:26] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.dhcp for host netmon2002.wikimedia.org
[13:08:31] <Lucas_WMDE>	 MatmaRex: can you test the two conditional options?
[13:08:52] <Lucas_WMDE>	 or do they not make a difference until https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/1030533/1/includes/Hooks/PreferenceHooks.php is merged?
[13:08:54] <wikibugs>	 (03CR) 10Klausman: [C:03+2] ml-services: Change references to cassandra clusters from using _ to - [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman)
[13:09:13] <wikibugs>	 (03PS1) 10Jforrester: Convert function to arrow function to fix context [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031180 (https://phabricator.wikimedia.org/T364783)
[13:09:19] <MatmaRex>	 Lucas_WMDE: not really, the extensions/DiscussionTools code redundantly does the same thing
[13:09:45] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Change references to cassandra clusters from using _ to - [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031415 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman)
[13:09:51] <Lucas_WMDE>	 ok
[13:10:01] <hashar>	 well I am back
[13:10:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 matmarex and lucaswerkmeister-wmde: Continuing with sync
[13:10:18] <hashar>	 I made the mistake of opening Slack and attempting to catch up on 2 weeks' worth of backlog
[13:10:23] <Lucas_WMDE>	 oh no
[13:10:39] <wikibugs>	 (03PS1) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324)
[13:10:53] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:11:10] <wikibugs>	 (03PS2) 10Vgutierrez: lvs: Skip ferm rules if firewall provider is not ferm [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257)
[13:11:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2152.codfw.wmnet
[13:12:23] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2429/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:12:44] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:15:10] <wikibugs>	 (03PS2) 10Muehlenhoff: Zookeeper: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1031429
[13:15:10] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2428/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[13:15:16] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on upload and upload-https services [puppet] - 10https://gerrit.wikimedia.org/r/1030022 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:16:13] <wikibugs>	 (03PS2) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324)
[13:16:51] <wikibugs>	 (03CR) 10Jelto: [V:03+1] prometheus::ops: scrape custom gitlab exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[13:17:36] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:17:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:18:19] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] "Thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1031436 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:18:20] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837)
[13:18:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[13:18:44] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[13:19:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1031435 (owner: 10Lucas Werkmeister (WMDE))
[13:19:19] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[13:19:28] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031429 (owner: 10Muehlenhoff)
[13:19:29] <wikibugs>	 (03PS1) 10Peter Fischer: Search update pipeline: fix for long rev IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031442
[13:19:30] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP encapsulation on high-traffic2@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030021 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:19:55] <wikibugs>	 (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: fix for long rev IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031442 (owner: 10Peter Fischer)
[13:19:56] <wikibugs>	 (03PS1) 10Clément Goubert: kubernetes: Space out ferm icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855)
[13:19:58] <vgutierrez>	 moritzm: ok to merge Moritz Mühlenhoff: admin: Update deployment description (af05b685a9) :?
[13:20:30] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[13:20:51] <wikibugs>	 (03Merged) 10jenkins-bot: Search update pipeline: fix for long rev IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031442 (owner: 10Peter Fischer)
[13:21:34] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: Raise saturation threshold to 75% [alerts] - 10https://gerrit.wikimedia.org/r/1031430 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[13:22:15] <hashar>	 andre: so A.mir fixed the database issue :)
[13:22:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:22:37] <andre>	 hashar, yay
[13:22:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1030532|Use ConditionalUserOptions for "echo-subscriptions-email-dt-subscription" (T357221)]], [[gerrit:1030535|Use ConditionalUserOptions for "discussiontools-autotopicsub" (T357221)]] (duration: 17m 59s)
[13:22:46] <stashbot>	 T357221: Handle preferences for new users using "ConditionalUserOptions" config instead of "LocalUserCreated" hook inserting preference rows - https://phabricator.wikimedia.org/T357221
[13:22:54] <hashar>	 and it looks like the train blocker was ... not a train blocker :)
[13:23:06] <andre>	 hashar: second deployment attempt? :) (should probably move to releng)
[13:23:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: Move Swift on thanos-fe1001 to PKI TLS cert (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:24:03] <hashar>	 andre: we do the mediawiki train sync up here :]
[13:24:07] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[13:24:08] <wikibugs>	 (03PS4) 10Vgutierrez: cache: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257)
[13:24:18] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[13:24:19] <andre>	 hashar, ah, alright. I'll shut up and watch.
[13:24:21] <hashar>	 albeit it is really spammy nowadays with all those bots :-\
[13:24:25] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host netmon2002.wikimedia.org
[13:24:46] <hashar>	 jouncebot: now
[13:24:46] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1300)
[13:25:02] <wikibugs>	 (03PS6) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[13:25:11] <Lucas_WMDE>	 Jdlrobson: around?
[13:25:18] <wikibugs>	 (03CR) 10Effie Mouzeli: (WIP) flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[13:25:22] <wikibugs>	 (03CR) 10Muehlenhoff: kubernetes: Space out ferm icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[13:25:32] <Lucas_WMDE>	 otherwise I would be done with the window at the moment (fyi hashar)
[13:25:36] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[13:25:42] <Lucas_WMDE>	 but I’d wait a few minutes to see if jon shows up
[13:25:43] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[13:25:52] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2433/co" [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:25:53] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "Just for precaution, I checked the list of IPs in thanos-fe1001 hitting the 443 port, and compared them with k8s eqiad IPs. This is the li" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:26:14] <hashar>	 the depends-on is not even correct
[13:26:44] <hashar>	 anyway I digress
[13:26:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski)
[13:27:51] <MatmaRex>	 James_F seemed like he wanted to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1031180
[13:28:07] <moritzm>	 vgutierrez: sorry, yes please
[13:28:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2154.codfw.wmnet
[13:28:49] <wikibugs>	 (03PS7) 10Effie Mouzeli: flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491)
[13:29:02] <wikibugs>	 (03PS3) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324)
[13:29:37] <wikibugs>	 (03PS2) 10Clément Goubert: kubernetes: Space out ferm icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855)
[13:29:37] <Lucas_WMDE>	 @seen James_F 
[13:29:54] <Lucas_WMDE>	 meh, I don’t remember the magic IRC incantation to see if it’s his working time or not ^^
[13:30:04] <Lucas_WMDE>	 but yeah MatmaRex that looks like a reasonable change to backport
[13:30:21] <Jdlrobson>	 hey Lucas_WMDE 
[13:30:22] <wikibugs>	 (03PS3) 10Hashar: Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson)
[13:30:23] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] "De-scheduled, I'm no longer sure that this is correct. See discussion in https://wikimedia.slack.com/archives/C01R06P8D1B/p171564938410743" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński)
[13:30:27] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:04-1] Update wgCdnMaxAge value and documentation to match Varnish [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński)
[13:30:29] <Jdlrobson>	 sorry i got the time wrong by an hour
[13:30:41] <Jdlrobson>	 (and relying on jetlag haha)
[13:31:09] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2434/" [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[13:31:44] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+1] kubernetes: Space out ferm icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[13:31:59] <Lucas_WMDE>	 hi!
[13:32:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "kicking off gate-and-submit" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson)
[13:32:09] <wikibugs>	 (03CR) 10Hashar: "I have removed the `Depends-On` which I guess was for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1031068  and then if th" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson)
[13:32:20] <wikibugs>	 (03Abandoned) 10Hashar: Suppress phan errors caused by UserMerge undeploy [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031068 (https://phabricator.wikimedia.org/T364610) (owner: 10Jdlrobson)
[13:32:40] <hashar>	 Lucas_WMDE: I removed the depends-on on that patch
[13:32:42] <Jdlrobson>	 hashar: thanks!
[13:32:43] <hashar>	 it was confusing 
[13:32:43] <vgutierrez>	 !log disable puppet on A:cp before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030051 - T357257
[13:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:47] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[13:32:47] <hashar>	 and would have prevented the merge I believe
[13:32:49] <Jdlrobson>	 that was driving me mad yesterday
[13:33:05] <hashar>	 + that was unrelated to Echo or the proposed patch but an issue in CI configuration :)
[13:33:35] <hashar>	 then
[13:34:06] <Lucas_WMDE>	 I’m confused by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1030200
[13:34:09] <hashar>	 we have code in production still relying on the undeployed UserMerge
[13:34:25] <hashar>	 my guess is that those code paths are never reached in prod :)
[13:34:42] <Lucas_WMDE>	 there are some differences between what https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1030200/3/wmf-config/InitialiseSettings.php removes and what https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1030289/2/skin.json does in Vector
[13:34:57] <Lucas_WMDE>	 do we no longer need the Special:UserLogin / Special:CreateAccount part?
[13:34:59] <Jdlrobson>	 looking
[13:35:06] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "please give me a shout when you merge this" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:35:10] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] cache: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1030051 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:35:35] <hashar>	 andre: I will run the train immediately after the backport window has completed
[13:35:45] <andre>	 ok
[13:36:05] <Jdlrobson>	 Lucas_WMDE: oh it looks like Kim changed the patchset. Let's revert back to patchset 2.
[13:36:07] <icinga-wm>	 PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[13:36:08] <Jdlrobson>	 thanks for checking that
[13:36:13] <icinga-wm>	 RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.83 ms
[13:36:18] <vgutierrez>	 !log re-enable puppet on A:cp-text - T357257
[13:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:37:03] <wikibugs>	 (03PS4) 10Jdlrobson: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia)
[13:37:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[13:37:10] <wikibugs>	 (03PS5) 10Jdlrobson: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia)
[13:37:21] <Jdlrobson>	 Lucas_WMDE: amended
[13:37:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2154 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031445 (https://phabricator.wikimedia.org/T349619)
[13:38:19] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch db2154 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031445 (https://phabricator.wikimedia.org/T349619)
[13:38:24] <Lucas_WMDE>	 Jdlrobson: is it okay to deploy both config changes at once?
[13:38:33] <Jdlrobson>	 yep
[13:38:43] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.26 ms
[13:39:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia)
[13:39:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031047 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson)
[13:39:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2154 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031445 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:39:50] <moritzm>	 vgutierrez: ok to merge the upload@ulsfo ipip patch along?
[13:40:01] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy disabled limited width on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia)
[13:40:01] <vgutierrez>	 moritzm: errr I merged both
[13:40:05] <wikibugs>	 (03Merged) 10jenkins-bot: Phase 5: Vector-2022.js should no longer load legacy Vector code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031047 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson)
[13:40:18] <vgutierrez>	 moritzm: or not...
[13:40:21] <moritzm>	 it's still being shown to me?
[13:40:22] <vgutierrez>	 moritzm: yes please :)
[13:40:24] <moritzm>	 I'll merge
[13:40:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1030200|Deploy disabled limited width on main page (T357706)]], [[gerrit:1031047|Phase 5: Vector-2022.js should no longer load legacy Vector code (T301212)]]
[13:40:42] <stashbot>	 T357706: [config] Disable limited width on the main page  and associated history page - https://phabricator.wikimedia.org/T357706
[13:40:43] <stashbot>	 T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212
[13:40:57] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "Kim: I reverted to PS2, as deleting the config here had other consequences." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030200 (https://phabricator.wikimedia.org/T357706) (owner: 10Kimberly Sarabia)
[13:41:18] <moritzm>	 vgutierrez: puppet merge complete
[13:41:19] <icinga-wm>	 PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[13:41:22] <vgutierrez>	 moritzm: thx
[13:42:28] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[13:42:35] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[13:43:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 jdlrobson and ksarabia and lucaswerkmeister-wmde: Backport for [[gerrit:1030200|Deploy disabled limited width on main page (T357706)]], [[gerrit:1031047|Phase 5: Vector-2022.js should no longer load legacy Vector code (T301212)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:43:20] <Lucas_WMDE>	 Jdlrobson: both should be ready to test with WikimediaDebug now
[13:43:24] <Jdlrobson>	 thanks checking
[13:43:26] <James_F>	 Lucas_WMDE: Oh, hey, sorry, wasn't looking at IRC.
[13:43:34] <Lucas_WMDE>	 hi!
[13:43:41] <James_F>	 Yes, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1031180 would be nice to reduce logspam and make Jdlrobson happy.
[13:43:43] <Lucas_WMDE>	 the backport window is looking fuller now than a few minutes ago, I’m afraid
[13:43:47] <James_F>	 No worries.
[13:43:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2154.codfw.wmnet
[13:44:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2161.codfw.wmnet
[13:44:07] <Lucas_WMDE>	 and I don’t want to overrun too much today as hashar is waiting ^^
[13:44:14] <Lucas_WMDE>	 but I guess I could deploy it out-of-window after the train is done
[13:44:27] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 9378.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:44:29] <hashar>	 if at all possible yeah
[13:44:29] <James_F>	 Sure, no rush on my part.
[13:44:35] <hashar>	 I need to leave early today
[13:44:39] <Lucas_WMDE>	 hm, ok
[13:44:42] <Jdlrobson>	 Lucas_WMDE: both look great! please sync
[13:44:44] <Lucas_WMDE>	 then maybe I should remove my +2 on the Echo patch
[13:44:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 jdlrobson and ksarabia and lucaswerkmeister-wmde: Continuing with sync
[13:44:50] <Lucas_WMDE>	 and postpone that too
[13:44:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch 2161 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031447 (https://phabricator.wikimedia.org/T349619)
[13:45:00] <Jdlrobson>	 James_F: I can also deploy it later today if that's helpful
[13:45:07] <Jdlrobson>	 (I do like being happy! haha)
[13:45:18] <Jdlrobson>	 Which I'll need to do if the Echo one doesn't merge
[13:45:19] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Rescinding +2 – let’s delay this a bit so the train isn’t postponed even more than necessary. We can deploy it later." [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson)
[13:46:09] <Jdlrobson>	 Lucas_WMDE: so will later be 1pm PST (UTC late backport window), or were you thinking of an out-of-window deploy after the train?
[13:46:14] <vgutierrez>	 !log re-enable puppet on A:cp-upload - T357257
[13:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:20] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[13:46:23] <Lucas_WMDE>	 I was thinking out of window
[13:46:43] <Lucas_WMDE>	 maybe already before the SRE Collaboration Services office hours
[13:47:03] <Jdlrobson>	 okay cool. I just need to grab some breakfast but I'll be back in 2hrs
[13:47:08] <Lucas_WMDE>	 to me both fixes seem obvious enough that I’d be okay syncing them without a test
[13:47:30] <Lucas_WMDE>	 or see how much I can test myself, maybe
[13:47:41] <Jdlrobson>	 I'll be around no problem. Thanks for your help this morning, the config catch, and for waiting for me!
[13:48:12] <wikibugs>	 (03PS4) 10Elukey: Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324)
[13:49:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Move Swift on thanos-fe1001 to PKI TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[13:49:11] <hashar>	 Jdlrobson: have a good breakfast! :)
[13:50:17] <papaul>	 i think it went down and came back up
[13:50:29] <wikibugs>	 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364809#9794704 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated. pings on mgmt.
[13:50:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1425 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:50:52] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "hiera: Enable IPIP encapsulation on high-traffic2@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1031182
[13:51:14] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "hiera: Enable IPIP encapsulation on high-traffic2@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1031182 (https://phabricator.wikimedia.org/T357257)
[13:51:34] <wikibugs>	 (03PS1) 10Elukey: Revert "services: move Tegola's Swift config in staging to local envoy proxy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031183
[13:52:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch 2161 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031447 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:52:39] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Disable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031450 (https://phabricator.wikimedia.org/T357257)
[13:53:08] <wikibugs>	 (03CR) 10Krinkle: db-production: Generate sectionsByDB on the fly (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe)
[13:53:41] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Revert "services: move Tegola's Swift config in staging to local envoy proxy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031183 (owner: 10Elukey)
[13:53:55] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1451 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:54:17] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2435/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031450 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:54:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1382 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:54:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/1031451
[13:54:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Enable IPIP encapsulation on high-traffic2@ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1031182 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:56:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[13:56:26] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Disable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031450 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:57:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1030200|Deploy disabled limited width on main page (T357706)]], [[gerrit:1031047|Phase 5: Vector-2022.js should no longer load legacy Vector code (T301212)]] (duration: 16m 32s)
[13:57:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2161.codfw.wmnet
[13:57:15] <stashbot>	 T357706: [config] Disable limited width on the main page  and associated history page - https://phabricator.wikimedia.org/T357706
[13:57:15] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:57:15] <stashbot>	 T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212
[13:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:20] <Lucas_WMDE>	 hashar: all yours
[13:57:24] <hashar>	 \o/
[13:57:33] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Remove obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/1031451 (owner: 10Muehlenhoff)
[13:57:36] <hashar>	 andre: I am running the train
[13:57:37] <wikibugs>	 (03CR) 10Ladsgroup: db-production: Generate sectionsByDB on the fly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe)
[13:57:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2162.codfw.wmnet
[13:58:05] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031452 (https://phabricator.wikimedia.org/T361399)
[13:58:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031452 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[13:58:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2162 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031453 (https://phabricator.wikimedia.org/T349619)
[13:58:56] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031452 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[13:59:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2162 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031453 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:59:58] <andre>	 hashar: yay! A bit of unexpected real-life interference over here but I'm gonna check logstash too
[14:00:31] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on parse1018 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:01:53] <wikibugs>	 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T364863 (10phaultfinder) 03NEW
[14:03:17] <hashar>	 scap is restarting fpm
[14:03:22] <hashar>	 oh my
[14:04:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2162.codfw.wmnet
[14:04:17] <wikibugs>	 (03CR) 10Volans: [C:04-1] "Many functionalities are available via wmflib that is already installed in all systems." [puppet] - 10https://gerrit.wikimedia.org/r/1030185 (https://phabricator.wikimedia.org/T363702) (owner: 10Bking)
[14:05:03] <wikibugs>	 (03PS1) 10JMeybohm: Add kubestagemaster2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1031457 (https://phabricator.wikimedia.org/T363307)
[14:05:53] <wikibugs>	 (03PS1) 10Jdlrobson: Disable last remaining projects using share user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031458 (https://phabricator.wikimedia.org/T301212)
[14:05:54] <wikibugs>	 (03PS1) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212)
[14:06:09] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:06:47] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184
[14:06:51] <wikibugs>	 (03PS1) 10JMeybohm: Add kubestagemaster2005 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1031460 (https://phabricator.wikimedia.org/T363307)
[14:06:53] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[14:07:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184 (owner: 10Vgutierrez)
[14:07:18] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[14:07:22] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+1 C:03+2] kubernetes: Space out ferm icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1031440 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[14:08:45] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1031457 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[14:09:31] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184
[14:10:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2163.codfw.wmnet
[14:11:22] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add kubestagemaster2005 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1031460 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[14:11:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031184 (owner: 10Vgutierrez)
[14:11:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2163 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031461 (https://phabricator.wikimedia.org/T349619)
[14:12:25] <vgutierrez>	 !log repool upload@ulsfo IPIP encapsulation NOT enabled - T357257
[14:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:28] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:14:18] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.5  refs T361399
[14:14:19] <Lucas_WMDE>	 is scap still restarting php-fpm?
[14:14:22] <stashbot>	 T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[14:14:25] <Lucas_WMDE>	 ah ^^
[14:14:32] <Lucas_WMDE>	 impeccable timing
[14:15:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2163 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031461 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:16:37] <hashar>	 yeah
[14:16:57] <hashar>	 Lucas_WMDE: yeah we still restart php-fpm on baremetal hosts 
[14:17:04] <hashar>	 I guess because PHP 7.4 still gets some opcache corruption
[14:17:07] <hashar>	 or to clear some cache
[14:17:09] <hashar>	 or whatever
[14:17:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9794867 (10Jclark-ctr)
[14:17:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9794868 (10Jclark-ctr) @Volans  i also see this as a learning opportunity most of these are just logs.  Some dcops members are very light on linux and we could be expanding  knowledge  and cou...
[14:17:27] <Lucas_WMDE>	 yeah, I remember we disabled automatically rereading PHP files based on mtime or something like that
[14:17:32] <Lucas_WMDE>	 it just took longer than I expected
[14:17:45] <hashar>	 and there is some stuff being off by one
[14:18:08] <hashar>	 like a class name magically changing from, say, Vector2022 to Uector2022
[14:18:11] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2428 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:18:44] <hashar>	 I don't think we ever tried to reproduce the issue or investigated the root cause
[14:18:53] <hashar>	 given restarting / clearing the cache fixes it
[14:19:25] <Lucas_WMDE>	 I thought our best guess was cosmic rays
[14:19:28] <Lucas_WMDE>	 but I might be imagining that
[14:19:52] <Lucas_WMDE>	 are there more things to deploy for the train or could I do some more backports now?
[14:20:06] <Lucas_WMDE>	 (also happy to wait if you want to verify first whether a rollback is needed or not)
[14:20:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: utils: use HEAD for get_config7.sh [puppet] - 10https://gerrit.wikimedia.org/r/1031462
[14:20:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463
[14:20:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464
[14:20:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465
[14:20:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi)
[14:20:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1425 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:21:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2163.codfw.wmnet
[14:23:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: utils: use HEAD for get_config7.sh [puppet] - 10https://gerrit.wikimedia.org/r/1031462
[14:23:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463
[14:23:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464
[14:23:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465
[14:23:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1451 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:24:06] <vgutierrez>	 !log depool cp4049
[14:24:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:35] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1382 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:25:46] <wikibugs>	 (03PS1) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861)
[14:27:46] <wikibugs>	 (03PS3) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463
[14:27:46] <wikibugs>	 (03PS3) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464
[14:27:46] <wikibugs>	 (03PS3) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465
[14:28:45] <vgutierrez>	 !log repool cp4049
[14:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:51] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:30:31] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on parse1018 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:31:27] <vgutierrez>	 !log depool cp4049
[14:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:02] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:33:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job lvs_realserver in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:33:55] <vgutierrez>	 !log repool cp4049
[14:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:18] <vgutierrez>	 lvs_realserver in ops@ulsfo is a side effect of me reverting the IPIP encapsulation change on upload@ulsfo
[14:35:22] <wikibugs>	 (03CR) 10Muehlenhoff: "Or we simply remove the option? All Kafka brokers use PKI these days and given that the variable was misnamed that also shows that no clou" [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi)
[14:35:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2165.codfw.wmnet
[14:36:09] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:36:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2436/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi)
[14:37:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2165 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031489 (https://phabricator.wikimedia.org/T349619)
[14:38:02] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job lvs_realserver in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:09] <moritzm>	 !log installing dav1d security updates
[14:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2165 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031489 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:39:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9794965 (10Marostegui) @btullis what is the status of this? I can see the host is up, but not yet provisioned? ` root@an-redacteddb1001:~# df -hT /srv Filesystem            Type  Size  Used A...
[14:39:50] <hashar>	 well train looks fine this time :]
[14:39:52] <hashar>	 andre: ^
[14:40:13] <wikibugs>	 (03CR) 10Bking: [C:03+1] cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse)
[14:40:34] <wikibugs>	 (03CR) 10Herron: [C:03+1] pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi)
[14:41:06] <hashar>	 Lucas_WMDE: train looks fine, so if you want to do backports you can do them now!
[14:41:08] <hashar>	 thanks :)
[14:41:24] <Lucas_WMDE>	 ok, thanks!
[14:41:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for dav1d [puppet] - 10https://gerrit.wikimedia.org/r/1031490
[14:41:41] <Lucas_WMDE>	 jouncebot: next
[14:41:41] <jouncebot>	 In 0 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1500)
[14:41:50] <Lucas_WMDE>	 but 18 minutes is a bit short for non-config CI, I think
[14:41:58] <Lucas_WMDE>	 I’ll wait for the window to start and see if anyone’s using it
[14:42:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9794972 (10Jclark-ctr) kafka-main1010 Rack: E 5 U 26  Cableid : 2013339101771 Port : 6
[14:43:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2165.codfw.wmnet
[14:44:39] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ops: Filter lvs_realserver_clamper by enabled parameters [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257)
[14:45:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hint for dav1d [puppet] - 10https://gerrit.wikimedia.org/r/1031490 (owner: 10Muehlenhoff)
[14:45:43] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1027052 (https://phabricator.wikimedia.org/T364494) (owner: 10Dzahn)
[14:46:41] <wikibugs>	 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364810#9795004 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted. all in C6 up now.
[14:47:31] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 224 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:48:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2428 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:48:43] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2437/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:49:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2166.codfw.wmnet
[14:50:37] <wikibugs>	 (03PS1) 10Herron: pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645)
[14:50:45] <wikibugs>	 10ops-codfw, 06SRE: InterfaceSpeedError - https://phabricator.wikimedia.org/T364863#9795023 (10Jhancock.wm) the cable or the 1G SFP might need to be replaced. can we downtime the server for a small window to test the cabling?
[14:51:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] prometheus::ops: Filter lvs_realserver_clamper by enabled parameters [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:51:58] <wikibugs>	 (03PS2) 10Herron: pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645)
[14:52:17] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] "Thanks for the review sukhe!" [puppet] - 10https://gerrit.wikimedia.org/r/1031491 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:53:23] <wikibugs>	 (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2438/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron)
[14:55:05] <moritzm>	 !loh installing openjdk-17/jetty9 security updates
[14:56:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2166 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031493 (https://phabricator.wikimedia.org/T349619)
[14:57:31] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 40 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:57:52] <mutante>	 moritzm: it didn't log due to typo. thanks for the group approval
[14:57:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I audited instance-puppet.git and the variable is mistyped there too unfortunately:" [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi)
[14:58:02] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:58:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2166 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031493 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:58:14] <moritzm>	 yw :-)
[15:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1500). Please do the needful.
[15:00:23] <wikibugs>	 (03PS1) 10David Caro: openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494
[15:01:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2166.codfw.wmnet
[15:01:44] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870 (10RobH) 03NEW p:05Triage→03High
[15:01:49] <wikibugs>	 (03PS1) 10Jdlrobson: Enable night mode on Vector on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T363814)
[15:02:29] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9795094 (10RobH)
[15:02:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable night mode on Vector on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson)
[15:03:14] <wikibugs>	 (03PS2) 10Jdlrobson: Enable night mode on Vector on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T363814)
[15:03:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2167.codfw.wmnet
[15:03:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[15:03:47] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge update
[15:04:01] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge update
[15:04:10] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850
[15:04:14] <stashbot>	 T364850: Deploy Phabricator/Phorge 2024-05-14 - https://phabricator.wikimedia.org/T364850
[15:04:44] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850 (duration: 00m 33s)
[15:04:45] <wikibugs>	 (03PS13) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706)
[15:04:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:04:57] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:05:03] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850
[15:05:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2167 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031496 (https://phabricator.wikimedia.org/T349619)
[15:05:53] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@7d858df]: test deploy phab2002 for T364850 (duration: 00m 50s)
[15:09:14] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Filter out addresses handled by gsuite that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski)
[15:10:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2167 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031496 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:11:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:11:33] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:11:40] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2439/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:12:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:12:12] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:12:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[15:12:37] <icinga-wm>	 RECOVERY - snapshot of s5 in codfw on backupmon1001 is OK: Last snapshot for s5 at codfw (db2201) taken on 2024-05-14 14:16:13 (659 GiB, -0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:13:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:13:39] <moritzm>	 !log installing expat security updates
[15:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:44] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[15:13:57] <wikibugs>	 (03CR) 10Scott French: [C:03+1] Add CertProvider to hot reload TLS certs for gRPC service (032 comments) [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[15:15:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2167.codfw.wmnet
[15:15:21] <wikibugs>	 (03PS1) 10Scott French: aqs-http-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978)
[15:16:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry
[15:16:11] <icinga-wm>	 PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:35] <mutante>	 ^ hmm.. i'll tell Papaul
[15:16:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2181.codfw.wmnet
[15:18:07] <wikibugs>	 (03PS2) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031071
[15:18:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:18:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:18:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T352010)', diff saved to https://phabricator.wikimedia.org/P62387 and previous config saved to /var/cache/conftool/dbconfig/20240514-151838-ladsgroup.json
[15:18:43] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:20:22] <wikibugs>	 (03CR) 10Peter Fischer: [C:03+2] cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse)
[15:20:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2181 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031498 (https://phabricator.wikimedia.org/T349619)
[15:21:08] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: fix the error topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031441 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse)
[15:21:13] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1031423/2441/gerrit2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1031423 (owner: 10Muehlenhoff)
[15:22:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2181 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031498 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:23:42] <sukhe>	 mutante: thanks!
[15:24:05] <wikibugs>	 (03PS6) 10BCornwall: hieradata: Move acme certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189)
[15:25:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry
[15:25:11] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:25:12] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:25:28] <wikibugs>	 (03CR) 10Volans: testing, please ignore (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1031071 (owner: 10BCornwall)
[15:25:39] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2442/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[15:26:38] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:26:39] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:26:49] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:26:50] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:29:46] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron)
[15:32:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2181.codfw.wmnet
[15:32:27] <Lucas_WMDE>	 jouncebot: nowandnext
[15:32:27] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1500)
[15:32:27] <jouncebot>	 In 0 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1600)
[15:32:51] <Lucas_WMDE>	 does anyone mind if I do some backports now?
[15:34:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2195.codfw.wmnet
[15:35:01] <icinga-wm>	 RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.52 ms
[15:35:18] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy)
[15:36:03] <Lucas_WMDE>	 I’ll start them now, but they’ll need a while in CI, so you have plenty of time to tell me to cancel the deployment :)
[15:36:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hieradata: Move acme certificates to its own file [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[15:36:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db2195 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031501 (https://phabricator.wikimedia.org/T349619)
[15:36:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson)
[15:36:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031180 (https://phabricator.wikimedia.org/T364783) (owner: 10Jforrester)
[15:36:47] <Lucas_WMDE>	 ^ deploying those two backports
[15:37:01] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi)
[15:37:09] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:31] <icinga-wm>	 PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:02] <mutante>	 ^ aware that some codfw maintenance is going on
[15:38:18] <Jdlrobson>	 logmsgbot: here :)
[15:38:54] <wikibugs>	 (03CR) 10Muehlenhoff: "We can also file a task to move these Kafka hosts in deployment-prep to PKI as well and then simply remove the option if there's no react" [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi)
[15:38:58] <wikibugs>	 (03CR) 10Scott French: "Thank you both in advance for the review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[15:39:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db2195 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031501 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:40:23] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:27] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:55] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Configure Docker builder GC settings for CI [puppet] - 10https://gerrit.wikimedia.org/r/1031045 (https://phabricator.wikimedia.org/T364773) (owner: 10Ahmon Dancy)
[15:42:14] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[15:42:25] <wikibugs>	 (03CR) 10Scott French: [C:03+2] benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[15:42:39] <icinga-wm>	 RECOVERY - Host ps1-d2-codfw is UP: PING WARNING - Packet loss = 71%, RTA = 31.13 ms
[15:42:43] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms
[15:42:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2195.codfw.wmnet
[15:43:35] <wikibugs>	 (03Merged) 10jenkins-bot: benthos: adopt securityContext and base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French)
[15:44:34] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] Filter out addresses handled by gsuite that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1031432 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski)
[15:47:53] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2005.codfw.wmnet
[15:47:53] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/weight=10; selector: name=kubestagemaster2005.codfw.wmnet
[15:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:55] <Jdlrobson>	 Lucas_WMDE: sorry that ping was meant for you not logmsgbot^
[15:49:02] <rzl>	 mutante: thanks for merging the patches from the puppet window :) jhathaway and I have a conflicting meeting so I was going to get an early start, pleasant surprise to see them already done
[15:49:16] <Lucas_WMDE>	 Jdlrobson: ah, thanks, I missed that ^^
[15:49:32] <Lucas_WMDE>	 rzl: is it okay if my deploy runs a bit into your window then?
[15:49:44] <rzl>	 Lucas_WMDE: yep, mine'll be a no-op
[15:49:45] <Lucas_WMDE>	 (Zuul still predicts 8 mins ETA before CI is even done)
[15:49:47] <Lucas_WMDE>	 yay
[15:49:57] <mutante>	 rzl: you're welcome. well, for me it was like that I was looking at merging those and it had first no relation to the window :)
[15:50:11] <mutante>	 only then noticed they are the same ones, heh
[15:50:30] <rzl>	 (cc dancy, no need to do anything in the window but feel free to grab us if you need a rollback or followup or anything) 
[15:56:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron)
[15:57:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi)
[15:57:25] <wikibugs>	 (03PS4) 10Filippo Giunchedi: pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464
[15:57:42] <wikibugs>	 (03Merged) 10jenkins-bot: Add notheme class to Echo [extensions/Echo] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1030984 (https://phabricator.wikimedia.org/T363779) (owner: 10Jdlrobson)
[15:57:45] <wikibugs>	 (03Merged) 10jenkins-bot: Convert function to arrow function to fix context [extensions/VisualEditor] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031180 (https://phabricator.wikimedia.org/T364783) (owner: 10Jforrester)
[15:58:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: refactor hiera settings in their own files [puppet] - 10https://gerrit.wikimedia.org/r/1031464 (owner: 10Filippo Giunchedi)
[15:58:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1030984|Add notheme class to Echo (T363779)]], [[gerrit:1031180|Convert function to arrow function to fix context (T364783)]]
[15:58:23] <stashbot>	 T363779: [Bug] Echo not compatible with desktop night theme - https://phabricator.wikimedia.org/T363779
[15:58:24] <stashbot>	 T364783: Large amount of errors in animateToolbarIntoView function in VisualEditor - https://phabricator.wikimedia.org/T364783
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1600).
[16:00:05] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:01:29] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Support VM BGP automation using Netbox flag for L3 POPs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) (owner: 10Cathal Mooney)
[16:01:32] <Lucas_WMDE>	 hm, scap tells me that something failed
[16:02:05] <Lucas_WMDE>	 it’s expecting https://totoro.wikimedia.org/wiki/Main_Page to redirect to https://foundation.wikimedia.org/wiki/Main_Page on mwdebug2002
[16:02:33] <wikibugs>	 (03CR) 10Herron: [V:03+1 C:03+2] pyrra-filesystem: set prom url to local thanos rule instance [puppet] - 10https://gerrit.wikimedia.org/r/1031492 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron)
[16:02:40] <Lucas_WMDE>	 I can’t even resolve that host o_O
[16:03:06] <mutante>	 what.. I look quite a bit at DNS and never seen that
[16:03:21] <wikibugs>	 (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[16:03:56] <Lucas_WMDE>	 https://gerrit.wikimedia.org/g/operations/puppet/+/08b0b935c4578b12fdadc6f1bc13df0adc207c2e/modules/profile/files/httpbb/appserver/test_wikimania_wikimedia.yaml#46
[16:04:04] * Lucas_WMDE peeks at blame
[16:04:35] <sukhe>	 it's an NXDOMAIN for toroto
[16:04:38] <sukhe>	 *totoro
[16:04:51] <wikibugs>	 (03PS2) 10David Caro: openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494
[16:04:59] <Lucas_WMDE>	 I’m guessing the scap check isn’t even meant to use DNS
[16:04:59] <wikibugs>	 (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[16:05:04] <Lucas_WMDE>	 let me retry the check and see what happens…
[16:05:11] <mutante>	 it's not appearing in "git log" in DNS repo either
[16:05:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 jdlrobson and jforrester and lucaswerkmeister-wmde: Backport for [[gerrit:1030984|Add notheme class to Echo (T363779)]], [[gerrit:1031180|Convert function to arrow function to fix context (T364783)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:05:35] <Lucas_WMDE>	 (and it seemingly didn’t use DNS either, as it actually got a 503 status code)
[16:05:39] <stashbot>	 T363779: [Bug] Echo not compatible with desktop night theme - https://phabricator.wikimedia.org/T363779
[16:05:39] <stashbot>	 T364783: Large amount of errors in animateToolbarIntoView function in VisualEditor - https://phabricator.wikimedia.org/T364783
[16:05:43] <Lucas_WMDE>	 that still doesn’t explain why we even check for this bizarre host name
[16:05:51] <Lucas_WMDE>	 but anyway – looks like the recheck worked 🤷
[16:05:54] <Lucas_WMDE>	 Jdlrobson: can you test the changes?
[16:06:10] <mutante>	 rzl: ever heard of a "totoro.wikimedia.org" vhost on appservers?
[16:06:20] <Jdlrobson>	 Lucas_WMDE: yep
[16:06:22] <rzl>	 in a meeting, back to you in a bit
[16:06:32] <mutante>	 no rush
[16:06:37] <Lucas_WMDE>	 (test was seemingly introduced in https://gerrit.wikimedia.org/r/c/operations/puppet/+/444908/3/modules/profile/files/mediawiki/web_testing/tests/test_wikimania_wikimedia FWIW)
[16:06:44] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified wmf homer plugin - cmooney@cumin1002 - T364480
[16:06:48] <stashbot>	 T364480: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480
[16:07:03] <mutante>	 Lucas_WMDE: in .. 2018 ?:)
[16:07:18] <Lucas_WMDE>	 yeah
[16:07:25] <Lucas_WMDE>	 I think the scap failure was just a flake
[16:07:26] <mutante>	 but scap only complains now? odd
[16:07:31] <Lucas_WMDE>	 but I am now curious what the test even means
[16:07:41] <Lucas_WMDE>	 and whether we can just remove it
[16:07:47] <sukhe>	 Lucas_WMDE: it's a test for https://wikitech.wikimedia.org/wiki/Httpbb
[16:07:49] <wikibugs>	 (03PS1) 10David Caro: cirrus_streaming_updater_cloudelastic: fix missing job_name [alerts] - 10https://gerrit.wikimedia.org/r/1031503
[16:07:51] <Lucas_WMDE>	 but apparently the redirect must exist somewhere, if the check worked after a retry
[16:07:58] <dancy>	 scap runs `httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout`
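Note on the command above: Lucas_WMDE's earlier guess that the check "isn't even meant to use DNS" fits a check that connects to the server given in --hosts and sends the tested hostname only in the Host header. Under that model, totoro.wikimedia.org can be an NXDOMAIN and the test can still run. The following is a minimal sketch of that idea, not httpbb's actual code: the names build_request, check_response and Expectation are hypothetical, and the 301 status in the usage lines is an assumed value (the log only says a redirect was expected).

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Expectation:
    """What a single check expects from the response (hypothetical shape)."""
    status: int
    location: Optional[str] = None


def build_request(test_url: str, backend_host: str, https_port: int):
    """Return (connect_address, path, headers) for one check.

    The connection goes to backend_host; the hostname from test_url is only
    sent in the Host header, so it never has to resolve in DNS.
    """
    parts = urlsplit(test_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    headers = {"Host": parts.netloc}
    return (backend_host, https_port), path, headers


def check_response(status: int, headers: dict, expected: Expectation) -> list:
    """Compare a response to the expectation and return failure messages."""
    failures = []
    if status != expected.status:
        failures.append(f"expected status {expected.status}, got {status}")
    if expected.location is not None:
        got = headers.get("Location")
        if got != expected.location:
            failures.append(f"expected Location {expected.location}, got {got}")
    return failures


# Usage: the request targets the mwdebug service, not totoro.wikimedia.org.
address, path, headers = build_request(
    "https://totoro.wikimedia.org/wiki/Main_Page",
    "mwdebug.discovery.wmnet",
    4444,
)
# A 503 like the one Lucas_WMDE saw fails both the status and Location checks.
problems = check_response(
    503,
    {},
    Expectation(301, "https://foundation.wikimedia.org/wiki/Main_Page"),
)
```

Keeping the request building separate from the comparison makes the comparison testable without any network access.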
[16:08:00] <Jdlrobson>	 notifications = good to sync. 
[16:08:23] <Jdlrobson>	 Arrow function => good to sync Lucas_WMDE 
[16:08:24] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified wmf homer plugin - cmooney@cumin1002 - T364480
[16:08:27] <Lucas_WMDE>	 ack
[16:08:29] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 jdlrobson and jforrester and lucaswerkmeister-wmde: Continuing with sync
[16:08:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480#9795483 (10ops-monitoring-bot) Deployed homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified...
[16:08:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795486 (10Papaul)
[16:08:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrus_streaming_updater_cloudelastic: fix missing job_name [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro)
[16:09:04] <wikibugs>	 (03CR) 10David Caro: "Adding you as reviewer as you added those tests :), feel free to direct me to someone else." [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro)
[16:09:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795489 (10Papaul)
[16:09:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Increase timeout for Netbox Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1029226 (owner: 10Cathal Mooney)
[16:10:15] <wikibugs>	 (03CR) 10David Caro: "Oh, wait, they seem to only fail in my local, maybe it's the version of promtool/pint" [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro)
[16:10:23] <wikibugs>	 (03Merged) 10jenkins-bot: Increase timeout for Netbox Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1029226 (owner: 10Cathal Mooney)
[16:10:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795496 (10Papaul)
[16:11:52] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[16:12:18] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:12:37] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[16:12:40] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[16:13:24] <wikibugs>	 (03CR) 10David Caro: "Yep, promtool 2.45.0 fails, 2.52 works 🎉" [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro)
[16:13:34] <wikibugs>	 (03Abandoned) 10David Caro: cirrus_streaming_updater_cloudelastic: fix missing job_name [alerts] - 10https://gerrit.wikimedia.org/r/1031503 (owner: 10David Caro)
[16:13:44] <dancy>	 mutante:  Scap does a retry/continue/exit interaction loop around the testserver and canary checks as of version 4.70.0 (07 Mar 2024).
[16:14:14] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:14:19] <dancy>	 (if a tty is available)
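Note on the retry/continue/exit loop dancy describes: it explains why a flaky check now reaches the deployer as a prompt, and why Lucas_WMDE's retry could simply pass. The following is a minimal sketch of such a loop, not scap's implementation. The function name, the prompt text and the return values are hypothetical, and aborting when no tty is available is an assumption; the log only says the loop applies "if a tty is available".

```python
from typing import Callable


def run_check_interactively(
    check: Callable[[], bool],
    ask: Callable[[str], str],
    has_tty: bool,
) -> str:
    """Run check(); on failure, let the operator retry, continue, or exit.

    Returns "passed", "continued", or "exited".
    """
    while True:
        if check():
            return "passed"
        if not has_tty:
            # Assumption: with no terminal to prompt on, a failure aborts.
            return "exited"
        answer = ask("Check failed. [r]etry, [c]ontinue, or [e]xit? ")
        answer = answer.strip().lower()
        if answer in ("c", "continue"):
            return "continued"
        if answer in ("e", "exit"):
            return "exited"
        # Anything else, including "r", reruns the check.


# Usage: a check that fails once and then passes, like the flake in this log.
outcomes = iter([False, True])
result = run_check_interactively(
    check=lambda: next(outcomes),
    ask=lambda prompt: "r",
    has_tty=True,
)
```

Passing the prompt in as a callable (ask) keeps the loop testable without a real terminal.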
[16:14:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[16:14:45] <mutante>	 dancy: soo.. I ran the same httpbb command that scap runs and that like.. always PASSes
[16:14:46] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[16:14:56] <mutante>	 dancy: it even passes with "cookiemonster.wikimedia.org"
[16:15:03] <dancy>	 Nod.  I have nothing to say about that.
[16:16:30] <mutante>	 it also passes when I use --hosts=mwdebug1002.eqiad.wmnet instead of the discovery service name and drop the port
[16:16:37] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 250.21 ms
[16:16:39] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 244.41 ms
[16:17:02] <dancy>	 Lucas_WMDE: Can you provide the full transcript (ideally in  phab ticket) ?
[16:17:27] <Lucas_WMDE>	 sure
[16:17:51] <dancy>	 thx
[16:20:23] <Lucas_WMDE>	 dancy (and mutante, rzl, sukhe if interested): https://phabricator.wikimedia.org/T364880
[16:20:28] <Lucas_WMDE>	 absolutely no idea which tags to put on it
[16:20:57] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880)
[16:20:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1030984|Add notheme class to Echo (T363779)]], [[gerrit:1031180|Convert function to arrow function to fix context (T364783)]] (duration: 22m 43s)
[16:21:04] <stashbot>	 T363779: [Bug] Echo not compatible with desktop night theme - https://phabricator.wikimedia.org/T363779
[16:21:05] <stashbot>	 T364783: Large amount of errors in animateToolbarIntoView function in VisualEditor - https://phabricator.wikimedia.org/T364783
[16:21:12] <mutante>	 slaps "SRE" on it to start with
[16:21:13] <Lucas_WMDE>	 Jdlrobson, James_F: should be deployed now
[16:21:18] <James_F>	 <3
[16:21:39] <wikibugs>	 06SRE, 13Patch-For-Review: Failed scap check for totoro.wikimedia.org during deployment - https://phabricator.wikimedia.org/T364880#9795759 (10Dzahn)
[16:22:02] <wikibugs>	 06SRE, 13Patch-For-Review: Failed scap check for totoro.wikimedia.org during deployment - https://phabricator.wikimedia.org/T364880#9795784 (10Lucas_Werkmeister_WMDE)
[16:23:01] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:23:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:03] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:23:08] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE))
[16:23:14] <dancy>	 Mystery solved.  Nice work Lucas
[16:23:16] * Lucas_WMDE done deploying btw
[16:23:47] <wikibugs>	 06SRE, 10Scap, 13Patch-For-Review: Failed scap check for totoro.wikimedia.org during deployment - https://phabricator.wikimedia.org/T364880#9795831 (10dancy)
[16:24:28] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Retry kube-publish-sa-certs 5 times [puppet] - 10https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307)
[16:27:04] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2443/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[16:27:39] <wikibugs>	 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during sacp deployment - https://phabricator.wikimedia.org/T364880#9795868 (10Dzahn) scap runs `httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout`  This...
[16:27:48] <wikibugs>	 06SRE, 06serviceops, 06Traffic-Icebox, 06Trust and Safety Product Team: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933#9795878 (10TAdeleye_WMF)
[16:28:52] <wikibugs>	 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during sacp deployment - https://phabricator.wikimedia.org/T364880#9795900 (10dancy)
[16:30:19] <wikibugs>	 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9795958 (10dancy)
[16:30:57] <wikibugs>	 06SRE, 10Scap, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9795971 (10Lucas_Werkmeister_WMDE) > ^ It seems wrong that this doesn't fail.  I’m not sure why it should fail? It seem to match the behavior I can see...
[16:32:24] <mutante>	 dancy: Lucas_WMDE: confirmed. check passes whatever the virtual host is, as long as it's whatever.wikimedia.org and the path stays: /wiki/Main_Page . as soon as the path changes .. then it starts behaving as expected
[16:33:15] <mutante>	 Main_Page isn't in the rewrite rules directly though
[16:34:39] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[16:37:30] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1006 - vriley@cumin1002"
[16:38:05] <wikibugs>	 (03CR) 10Andrew Bogott: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[16:38:27] <Lucas_WMDE>	 unfortunately I don’t see the error that scap got in logstash
[16:38:37] <Lucas_WMDE>	 nothing around the right time in host:mwdebug2002 AFAICT
[16:38:42] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1006 - vriley@cumin1002"
[16:38:42] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:38:44] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: name=mw2286.codfw.wmnet
[16:39:01] <Lucas_WMDE>	 (unfortunately scap didn’t print the response body nor any other response headers, so there’s not much to go on…)
[16:39:11] <wikibugs>	 06SRE, 10Scap, 06serviceops-radar, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9796100 (10Dzahn) Lucas is right. I can confirm the test passes with any wikimedia.org subdomain as long as the path stays /wiki/M...
[16:39:17] <mutante>	 !log depooled mw2286.codfw.wmnet because of interface error / needed cable replacement T364863
[16:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:20] <stashbot>	 T364863: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863
[16:39:25] <wikibugs>	 06SRE, 10Scap, 06serviceops-radar, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9796101 (10Dzahn)
[16:39:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1006.mgmt.eqiad.wmnet with reboot policy FORCED
[16:40:46] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw2286.codfw.wmnet with reason: T364863
[16:41:00] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw2286.codfw.wmnet with reason: T364863
[16:41:10] <dancy>	 mutante/Lucas_WMDE: Please file bugs against httpbb if you want changes
[16:41:13] <wikibugs>	 10ops-codfw, 06SRE, 06serviceops: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863#9796105 (10Dzahn)
[16:41:53] <dancy>	 hmm. I wonder if it already takes a flag to be more spammy
[16:42:28] <dancy>	 Not that I see.
[16:43:09] <Lucas_WMDE>	 me neither
[16:44:05] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:44:30] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "Looks like CI is failing due to an unrelated puppet repo issue. I'll rebase/rerun later." [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[16:44:47] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1031046 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[16:44:57] <wikibugs>	 10ops-codfw, 06SRE, 06serviceops: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863#9796143 (10Dzahn) @Jhancock.wm  cc: @RLazarus   I depooled the server and set a downtime of 24 hours.
[16:46:37] <Lucas_WMDE>	 dancy: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/httpbb/+/a84d0d2703cfad340e0e479dc42582efd4ce893b/httpbb/main.py#162 and the following lines don’t look like the response is logged anywhere in general, only the parts that are relevant to the failed test
[16:46:40] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[16:46:57] <Lucas_WMDE>	 should I make a separate task for that? that it should drop details in a file or something?
[16:48:03] <wikibugs>	 (03CR) 10BCornwall: testing, please ignore (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1031071 (owner: 10BCornwall)
[16:48:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[16:48:48] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1007 - vriley@cumin1002"
[16:48:51] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 246.77 ms
[16:48:53] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 230.65 ms
[16:49:40] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "Notice: /Stage[main]/Gitlab/Systemd::Service[gitlab-exporter]/Service[gitlab-exporter]/ensure: ensure changed 'stopped' to 'running'" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[16:49:42] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1007 - vriley@cumin1002"
[16:49:42] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:50:11] <dancy>	 Lucas_WMDE:  Yeah, a separate ticket as a subtask of T364880 would be good.    I think an option to print error details to stdout is sufficient.  
[16:50:12] <stashbot>	 T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880
[16:50:29] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1007.mgmt.eqiad.wmnet with reboot policy FORCED
[16:51:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9796178 (10VRiley-WMF)
[16:51:36] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1006.mgmt.eqiad.wmnet with reboot policy FORCED
[16:54:16] <Lucas_WMDE>	 dancy: ok, filed https://phabricator.wikimedia.org/T364886
[16:54:16] <wikibugs>	 06SRE, 10Scap, 06serviceops-radar: httpbb should show more information / details about failed checks - https://phabricator.wikimedia.org/T364886 (10Lucas_Werkmeister_WMDE) 03NEW
[16:55:16] <dancy>	 Thanks!
[16:55:40] <sukhe>	 thanks Lucas_WMDE!
[16:55:54] <sukhe>	 (for the previous one as well)
[16:56:03] <wikibugs>	 (03CR) 10Jsn.sherman: Dont recalculate winners from scratch each round (031 comment) [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: 10Driedmueller)
[16:56:16] <dancy>	 mutante: Thanks for the merges!
[16:56:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "Antoine, any concerns?" [puppet] - 10https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) (owner: 10Addshore)
[16:57:13] <mutante>	 dancy: yw! it was mostly unrelated to the window
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T1700)
[17:00:09] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons.
[17:02:33] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1007.mgmt.eqiad.wmnet with reboot policy FORCED
[17:05:48] <wikibugs>	 (03PS1) 10Santiago Faci: Bumping mpic version: v.0.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031517 (https://phabricator.wikimedia.org/T364170)
[17:06:56] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Bumping mpic version: v.0.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031517 (https://phabricator.wikimedia.org/T364170) (owner: 10Santiago Faci)
[17:08:05] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping mpic version: v.0.0.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031517 (https://phabricator.wikimedia.org/T364170) (owner: 10Santiago Faci)
[17:08:51] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:37] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons.
[17:11:10] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[17:11:31] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[17:12:00] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
[17:15:17] <wikibugs>	 (03CR) 10Scott French: "Just to make sure I understand the motivation: does systemd-sysv-generator not work on bookworm, or are you doing this in advance of its d" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi)
[17:16:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9796326 (10Papaul)
[17:18:25] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
[17:19:33] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
[17:24:20] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480#9796348 (10cmooney) 05Open→03Resolved Patch to Homer wmf plugin merged now, so BGP to VMs at POPs / on L3 switches now under automation too.
[17:25:58] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
[17:27:14] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
[17:33:35] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Sounds reasonable to me, though I don't have a good understanding of how long it might take for the local etcd node to become ready in thi" [puppet] - 10https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[17:33:38] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
[17:38:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9796396 (10MMiller_WMF) I approve!
[17:39:54] <wikibugs>	 (03PS1) 10DCausse: cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837)
[17:40:44] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891 (10RobH) 03NEW
[17:41:16] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9796419 (10RobH)
[17:41:35] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9796420 (10RobH)
[17:42:04] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9796422 (10RobH)
[17:43:30] <wikibugs>	 (03PS2) 10DCausse: cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837)
[17:45:01] <wikibugs>	 (03PS3) 10Jdlrobson: Enable night mode on Vector on testwiki, disable on Special:Homepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699)
[17:47:21] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs6003 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[17:47:47] <wikibugs>	 (03PS2) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094)
[17:47:51] <wikibugs>	 (03CR) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[17:52:49] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs7003 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[17:53:21] <sukhe>	 ^ we are looking into this, we see what has changed so figuring out a reversal
[17:55:06] <wikibugs>	 (03PS1) 10Herron: pyrra: linkrecommendation: onboard slo from grizzly [puppet] - 10https://gerrit.wikimedia.org/r/1031527 (https://phabricator.wikimedia.org/T302995)
[18:01:27] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: Enable IPIP on upload and upload-https services" [puppet] - 10https://gerrit.wikimedia.org/r/1031470
[18:04:00] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "hiera: Enable IPIP on upload and upload-https services" [puppet] - 10https://gerrit.wikimedia.org/r/1031470 (owner: 10Ssingh)
[18:07:35] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs4010 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:08:19] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:30] <wikibugs>	 (03PS2) 10Herron: pyrra: linkrecommendation: onboard slo from grizzly [puppet] - 10https://gerrit.wikimedia.org/r/1031527 (https://phabricator.wikimedia.org/T302995)
[18:10:13] <icinga-wm>	 PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[18:10:52] <sukhe>	 !log sudo cumin -b1 -s120 'A:lvs' 'systemctl restart pybal.service': clearing up alert for reverted pybal.conf CR 1031470
[18:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:37] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs3010 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:14:38] <logmsgbot>	 !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@6270c72]: (no justification provided)
[18:15:12] <logmsgbot>	 !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@6270c72]: (no justification provided) (duration: 00m 34s)
[18:17:46] <sukhe>	 !log [CORRECTION] above pybal restart was NOT run
[18:17:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:00] <sukhe>	 !log restart pybal on backup LVSes
[18:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:19] <sukhe>	 now I realize it should have been [below]. oh well
[18:18:55] <wikibugs>	 (03Abandoned) 10Herron: pyrra-filesystem: increase StartLimits and delay notified unit [puppet] - 10https://gerrit.wikimedia.org/r/1031050 (https://phabricator.wikimedia.org/T364645) (owner: 10Herron)
[18:22:11] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[18:22:33] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:22:45] <wikibugs>	 (03PS1) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667)
[18:23:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:24:33] <icinga-wm>	 RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2198) taken on 2024-05-14 17:24:03 (1244 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[18:26:52] <wikibugs>	 (03PS3) 10Scott French: conftool-data: bootstrap parser-cache sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786)
[18:33:07] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs3010 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:33:07] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs6003 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:33:07] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs7003 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:33:14] <sukhe>	 all should recover now
[18:46:21] <wikibugs>	 (03PS7) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786)
[18:47:18] <wikibugs>	 (03Abandoned) 10Scott French: WIP: etcd.php: ignore pc sections in externalLoads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030496 (owner: 10Scott French)
[18:48:35] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs4010 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:52:45] <wikibugs>	 (03PS1) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031541 (https://phabricator.wikimedia.org/T363861)
[18:53:42] <wikibugs>	 (03PS2) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861)
[18:54:00] <wikibugs>	 (03Abandoned) 10Jdlrobson: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031541 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson)
[18:54:38] <wikibugs>	 (03PS1) 10CDanis: otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407)
[19:07:30] <wikibugs>	 (03CR) 10Krinkle: varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[19:11:24] <wikibugs>	 (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031175 (https://phabricator.wikimedia.org/T363904) (owner: 10Wargo)
[19:13:56] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:14:06] <wikibugs>	 (03CR) 10Krinkle: db-production: Generate sectionsByDB on the fly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe)
[19:16:02] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1008 - vriley@cumin1002"
[19:16:51] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1008 - vriley@cumin1002"
[19:16:51] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:17:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:18:13] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1008.mgmt.eqiad.wmnet with reboot policy FORCED
[19:18:27] <cdanis>	 !log T364907 💔cdanis@apt1002.wikimedia.org ~ 🕞🍵 sudo -i reprepro --keepunreferencedfiles includedeb bullseye-wikimedia ~/otelcol-contrib_0.100.0_linux_amd64.deb 
[19:18:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:31] <stashbot>	 T364907: upgrade to latest stable version of otelcol-contrib - https://phabricator.wikimedia.org/T364907
[19:18:53] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:19:50] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[19:20:29] <wikibugs>	 (03PS1) 10Ryan Kemper: CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543
[19:21:10] <wikibugs>	 (03PS1) 10Jclark-ctr: add kafka-main[12]006-10 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1031544 (https://phabricator.wikimedia.org/T363212)
[19:21:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 (owner: 10Ryan Kemper)
[19:21:58] <wikibugs>	 (03CR) 10Jclark-ctr: [C:03+2] add kafka-main[12]006-10 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1031544 (https://phabricator.wikimedia.org/T363212) (owner: 10Jclark-ctr)
[19:23:57] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main1006']
[19:24:39] <wikibugs>	 (03PS2) 10Ryan Kemper: CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543
[19:25:13] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kafka-main1006']
[19:26:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye
[19:26:44] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9796969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye
[19:30:04] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1008.mgmt.eqiad.wmnet with reboot policy FORCED
[19:30:33] <wikibugs>	 (03PS1) 10CDanis: otelcol: bump to v0.100.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1031546 (https://phabricator.wikimedia.org/T364907)
[19:32:31] <icinga-wm>	 PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:32:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:32:53] <icinga-wm>	 PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:33:23] <jinxer-wm>	 FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:43] <icinga-wm>	 RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:34:21] <icinga-wm>	 RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Sat 08 Jun 2024 03:25:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:37:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1010 - vriley@cumin1002"
[19:38:15] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  kafka-main1010 - vriley@cumin1002"
[19:38:15] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:38:23] <jinxer-wm>	 RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:39:18] <wikibugs>	 (03CR) 10CDanis: [V:03+2 C:03+2] otelcol: bump to v0.100.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1031546 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis)
[19:39:28] <wikibugs>	 (03CR) 10Hashar: [C:03+1] Clarify totoro.wikimedia.org test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE))
[19:39:35] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[19:41:37] <wikibugs>	 (03PS4) 10Jdlrobson: Enable night mode on Vector on testwiki, disable on Special:Homepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699)
[19:41:49] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage
[19:45:10] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage
[19:46:00] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[19:47:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[19:47:32] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[19:47:55] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[19:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:50:46] <wikibugs>	 (03PS2) 10CDanis: otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407)
[19:50:46] <wikibugs>	 (03PS1) 10CDanis: otelcol: do service name transform first of all [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031551 (https://phabricator.wikimedia.org/T363407)
[19:51:16] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[19:51:23] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: do service name transform first of all [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031551 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[19:51:29] <wikibugs>	 (03CR) 10Ladsgroup: "hmm, the svg files, specially the mediawiki ones are way too big." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester)
[19:52:10] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: tweak rollout params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031542 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[19:52:13] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: do service name transform first of all [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031551 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[19:53:10] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[19:53:31] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240514T2000).
[20:00:04] <jouncebot>	 ebernhardson and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:47] <kostajh>	 hi. I don't have a patch listed, but was wondering if I could debug two patches on an mwdebug host during this time slot per https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debug_via_Gerrit_and_Scap
[20:01:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:01:25] <wikibugs>	 (03PS1) 10CDanis: otelcol: bump version to v0.100.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031554 (https://phabricator.wikimedia.org/T363407)
[20:02:13] <cjming>	 o/
[20:02:15] <cjming>	 i can deploy
[20:02:29] <Jdlrobson>	 o/
[20:02:55] <cjming>	 kostajh: that sounds fine to me - guessing it won't interfere much with deploying the other patches?
[20:03:14] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: bump version to v0.100.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031554 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[20:03:49] <cjming>	 Jdlrobson: i'll start with yours unless ebernhardson is around?
[20:04:07] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson)
[20:04:09] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: bump version to v0.100.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031554 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[20:04:24] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[20:04:33] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[20:04:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699) (owner: 10Jdlrobson)
[20:05:08] <ebernhardson>	 \o
[20:05:20] <wikibugs>	 (03Merged) 10jenkins-bot: Enable night mode on Vector on testwiki, disable on Special:Homepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031495 (https://phabricator.wikimedia.org/T357699) (owner: 10Jdlrobson)
[20:05:34] <cjming>	 hi ebernhardson: hope it's ok i started with Jon's patches -- i'll do yours imminently
[20:05:49] <ebernhardson>	 cjming: yea no worries
[20:05:54] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1031495|Enable night mode on Vector on testwiki, disable on Special:Homepage (T357699 T363814)]]
[20:05:59] <stashbot>	 T357699: Prepare Special:Homepage for night mode - https://phabricator.wikimedia.org/T357699
[20:05:59] <stashbot>	 T363814: Release dark mode as a beta feature on desktop (May 15th)  - https://phabricator.wikimedia.org/T363814
[20:06:37] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[20:06:41] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[20:06:54] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[20:07:50] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[20:08:00] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[20:08:34] <logmsgbot>	 !log cjming@deploy1002 jdlrobson and cjming: Backport for [[gerrit:1031495|Enable night mode on Vector on testwiki, disable on Special:Homepage (T357699 T363814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:08:55] <cjming>	 Jdlrobson: can you check your 2nd patch? still waiting for 1st patch to merge
[20:09:26] <kostajh>	 cjming: I'm not sure, I am not super familiar with the process
[20:09:56] <Jdlrobson>	 cjming: on it
[20:10:14] <cjming>	 kostajh: in theory, we should be done in about 15-20 minutes if all goes zippy
[20:10:58] <kostajh>	 ok
[20:11:29] <Jdlrobson>	 cjming: lgtm please sync but i might need a follow up
[20:11:43] <cjming>	 ok
[20:11:45] <logmsgbot>	 !log cjming@deploy1002 jdlrobson and cjming: Continuing with sync
[20:13:23] <cjming>	 kostajh: just reading that wikitech page -- i think we have to stagger our scaps -- aiui scap cmds need to be consecutive - can't be run in parallel
[20:14:05] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@ecf603d]: update discolytics to 0.18.0
[20:14:05] <wikibugs>	 (03PS5) 10Ebernhardson: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475)
[20:14:32] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@ecf603d]: update discolytics to 0.18.0 (duration: 00m 27s)
[20:14:37] <wikibugs>	 (03PS1) 10Peter Fischer: Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558
[20:15:02] <kostajh>	 cjming: yeah, I can wait until the end
[20:15:10] <kostajh>	 or I might pick this up again tomorrow when I'm more awake
[20:15:41] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797204 (10Ladsgroup) a:03Ladsgroup
[20:16:05] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 (owner: 10Peter Fischer)
[20:16:20] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797205 (10Ladsgroup) We will go with the name wikipedia-pl-admins to be consistent with other wikis. Hope that's fine with you.
[20:16:54] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797209 (10Msz2001) Sure, no problem
[20:19:03] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for plwiki sysops - https://phabricator.wikimedia.org/T364906#9797224 (10Ladsgroup) 05Open→03Resolved {{done}} https://lists.wikimedia.org/postorius/lists/wikipedia-pl-admins.lists.wikimedia.org/
[20:20:28] <kostajh>	 cjming: two things I don't know how to do from reading https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debug_via_Gerrit_and_Scap -- exact steps to follow to cherry pick the patches, and what does "clean up the deployment server" mean?
[20:20:40] <kostajh>	 it's in bold, so it sounds important :)
[20:21:55] <ebernhardson>	 kostajh: `scap pull` is what cleans it up, but it's intended to be run from mwdebug*, not mwmaint*
[20:22:42] <cjming>	 kostajh: i think it means just run `scap pull` on the debug server
[20:22:47] <ebernhardson>	 i think the idea here is put patch on deploy server, pull it over to debug host, do testing, remove patch from deploy server, pull again
[20:23:29] <ebernhardson>	 oh, i guess the highlighted bit is about fixing the git repo to match what it was before
[20:24:34] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1031495|Enable night mode on Vector on testwiki, disable on Special:Homepage (T357699 T363814)]] (duration: 18m 40s)
[20:24:39] <stashbot>	 T357699: Prepare Special:Homepage for night mode - https://phabricator.wikimedia.org/T357699
[20:24:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson)
[20:24:39] <stashbot>	 T363814: Release dark mode as a beta feature on desktop (May 15th)  - https://phabricator.wikimedia.org/T363814
[20:25:18] <cjming>	 ebernhardson: doing yours next while waiting for Jon's other patch to merge - just go ahead and sync when the time comes?
[20:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Shift 25% of public wikis writes in eqiad to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031029 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson)
[20:25:57] <cjming>	 Jdlrobson: 2nd patch should be live
[20:26:22] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1031029|cirrus: Shift 25% of public wikis writes in eqiad to replacement updater (T363475)]]
[20:26:26] <stashbot>	 T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475
[20:26:44] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:26:59] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 (owner: 10Peter Fischer)
[20:27:22] <ebernhardson>	 cjming: yea, we can only see it from the jobqueue
[20:27:28] <ebernhardson>	 cjming: go ahead and push when ready
[20:27:29] <wikibugs>	 (03Merged) 10jenkins-bot: Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031466 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson)
[20:27:41] <cjming>	 alrighty
[20:27:50] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921 (10Eevans) 03NEW
[20:27:51] <wikibugs>	 (03Merged) 10jenkins-bot: Search update pipeline: prepare eqiad rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031558 (owner: 10Peter Fischer)
[20:28:02] <logmsgbot>	 !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:28:08] <logmsgbot>	 !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:28:16] <wikibugs>	 (03CR) 10CDanis: [C:03+1] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[20:28:21] <wikibugs>	 (03CR) 10CDanis: [C:03+1] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[20:28:25] <wikibugs>	 (03CR) 10CDanis: [C:03+1] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[20:28:58] <logmsgbot>	 !log cjming@deploy1002 cjming and ebernhardson: Backport for [[gerrit:1031029|cirrus: Shift 25% of public wikis writes in eqiad to replacement updater (T363475)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:29:03] <logmsgbot>	 !log cjming@deploy1002 cjming and ebernhardson: Continuing with sync
[20:31:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[20:34:34] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.43 ms
[20:36:16] <Jdlrobson>	 cjming: ready to test?
[20:36:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[20:37:05] <ebernhardson>	 cjming: i suspect we messed this up, the log spam isn't resolving as i expected. might need a revert, but checking a few more things
[20:37:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:38:10] <cjming>	 Jdlrobson: almost - your backport was still merging so i snuck in Erik's patch
[20:38:35] <ebernhardson>	 cjming: oh never mind, the logs did die down as expected. just took another minute
[20:38:42] <ebernhardson>	 everything looks reasonable here
[20:38:59] <cjming>	 nice!
[20:39:46] <Jdlrobson>	 ok cjming np
[20:41:25] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1031029|cirrus: Shift 25% of public wikis writes in eqiad to replacement updater (T363475)]] (duration: 15m 02s)
[20:41:29] <stashbot>	 T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475
[20:42:01] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:1031466|Override VE overlays in night-mode (T363861)]]
[20:42:05] <stashbot>	 T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861
[20:42:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:42:38] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "just creates an empty directory we can start rsyncing to - will follow-up with rsync::quickdatacopy  https://puppet-compiler.wmflabs.org/o" [puppet] - 10https://gerrit.wikimedia.org/r/1022193 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[20:44:19] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:44:43] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:1031466|Override VE overlays in night-mode (T363861)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:44:43] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:44:56] <cjming>	 Jdlrobson: 1st patch ready for testing
[20:45:00] <wikibugs>	 (03PS1) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031476
[20:45:20] <wikibugs>	 (03Abandoned) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031476 (owner: 10BCornwall)
[20:46:41] <Jdlrobson>	 cjming: looking now
[20:47:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9797373 (10VRiley-WMF)
[20:47:20] <Jdlrobson>	 cjming: so backports to deploy branches are taking 40 mins these days?
[20:47:45] <cjming>	 Jdlrobson: merging takes like 20+ minutes
[20:47:59] <Jdlrobson>	 cjming: please sync that one
[20:48:05] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Continuing with sync
[20:48:22] <Jdlrobson>	 I have a follow-up to my config flag but it's not going to fit into the last 10 mins of the window so I guess I need to schedule it tomorrow?
[20:49:55] <cjming>	 kostajh: not sure if you're still around - do you still want to use the remaining time in this window for your testing?  sorry it all took a bit longer than i was hoping - should be done in another minute
[20:49:56] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[20:52:40] <cjming>	 Jdlrobson: if Kosta is N/A i'm happy to squeeze in one more config patch
[20:54:16] <wikibugs>	 (03PS1) 10Jdlrobson: [Follow-up] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861)
[20:56:06] <wikibugs>	 (03PS1) 10Jdlrobson: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814)
[20:56:17] <wikibugs>	 (03PS1) 10Jdlrobson: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814)
[20:56:44] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:58:54] <wikibugs>	 (03PS1) 10Jdlrobson: Enable night mode as a desktop beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814)
[21:00:45] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1031466|Override VE overlays in night-mode (T363861)]] (duration: 18m 44s)
[21:00:48] <cjming>	 Jdlrobson: backport is live
[21:00:50] <stashbot>	 T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861
[21:02:01] <wikibugs>	 (03PS1) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202)
[21:02:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[21:03:59] <wikibugs>	 (03PS2) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202)
[21:04:27] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "followed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031565" [puppet] - 10https://gerrit.wikimedia.org/r/1022193 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[21:11:22] <wikibugs>	 (03CR) 10Dzahn: [V:04-1 C:04-1] "https://puppet-compiler.wmflabs.org/output/1031565/2446/stewards2001.codfw.wmnet/change.stewards2001.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[21:12:53] <wikibugs>	 (03CR) 10Dzahn: [V:04-1 C:04-1] "arr.. we would first have to move the definition of the lists server primary host to common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[21:13:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:38] <icinga-wm>	 PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 9360 MB (2% inode=99%): /tmp 9360 MB (2% inode=99%): /var/tmp 9360 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[21:16:20] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797524 (10Scott_French) Thanks, @Eevans. If you can drive development of the new data gateway (i.e., base...
[21:18:21] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "looks good, any reason you dropped the `--`?" [puppet] - 10https://gerrit.wikimedia.org/r/1031462 (owner: 10Filippo Giunchedi)
[21:19:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson)
[21:20:08] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on db2154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:20:29] <wikibugs>	 (03PS3) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202)
[21:20:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[21:20:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364299)', diff saved to https://phabricator.wikimedia.org/P62390 and previous config saved to /var/cache/conftool/dbconfig/20240514-212052-marostegui.json
[21:20:59] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[21:22:57] <wikibugs>	 (03PS4) 10Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - 10https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202)
[21:24:52] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797608 (10Scott_French)
[21:35:01] <wikibugs>	 (03Abandoned) 10C. Scott Ananian: Enable ParserMigration extension on commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007394 (owner: 10C. Scott Ananian)
[21:36:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P62391 and previous config saved to /var/cache/conftool/dbconfig/20240514-213601-marostegui.json
[21:36:35] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), 07Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797656 (10Eevans)
[21:47:06] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797672 (10Eevans)
[21:48:33] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797670 (10CodeReviewBot) eevans opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/74...
[21:51:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P62392 and previous config saved to /var/cache/conftool/dbconfig/20240514-215109-marostegui.json
[21:53:35] <wikibugs>	 (03CR) 10Jdlrobson: "recheck" [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson)
[21:58:07] <wikibugs>	 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797746 (10CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/74  A...
[22:02:14] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING WARNING - Packet loss = 75%, RTA = 87.78 ms
[22:02:44] <icinga-wm>	 PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.193.0.18 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[22:06:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T364299)', diff saved to https://phabricator.wikimedia.org/P62393 and previous config saved to /var/cache/conftool/dbconfig/20240514-220617-marostegui.json
[22:06:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[22:06:23] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[22:06:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[22:06:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T364299)', diff saved to https://phabricator.wikimedia.org/P62394 and previous config saved to /var/cache/conftool/dbconfig/20240514-220640-marostegui.json
[22:08:38] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:34] <zabe>	 !log zabe@mwmaint1002:/tmp/upload$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Yann" . # T364877
[22:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:40] <stashbot>	 T364877: Server side upload for Yann - https://phabricator.wikimedia.org/T364877
[22:25:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9797806 (10Jclark-ctr) @akosiaris   could you please update preseed.yaml file?    I did take care of site.pp file for codfw and eqiad
[22:33:38] <icinga-wm>	 PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 2860 MB (0% inode=99%): /tmp 2860 MB (0% inode=99%): /var/tmp 2860 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[22:34:12] <icinga-wm>	 RECOVERY - MediaWiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:39:12] <wikibugs>	 (03PS6) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647)
[22:44:10] <wikibugs>	 (03PS1) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583
[22:48:31] <zabe>	 !log start running migrateGuSalt.php in screen session # T364435
[22:48:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:36] <stashbot>	 T364435: Drop gu_salt from globaluser - https://phabricator.wikimedia.org/T364435
[23:02:27] <wikibugs>	 (03PS2) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583
[23:26:33] <wikibugs>	 (03CR) 10Scott French: "Thanks for the follow-up, Amir! If I understand correctly, it sounds like you'd recommend something like https://gerrit.wikimedia.org/r/10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French)
[23:30:01] <wikibugs>	 (03CR) 10Scott French: configure parsercache servers via dbconfig in etcd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (owner: 10Scott French)
[23:43:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T352010)', diff saved to https://phabricator.wikimedia.org/P62395 and previous config saved to /var/cache/conftool/dbconfig/20240514-234337-ladsgroup.json
[23:43:41] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:58:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P62396 and previous config saved to /var/cache/conftool/dbconfig/20240514-235844-ladsgroup.json