[00:05:07] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1015152|cirrus: Move small wiki traffic to eqiad]] (duration: 15m 27s)
[00:05:22] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015157 (owner: Ebernhardson)
[00:05:25] (SystemdUnitFailed) firing: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:04] (Merged) jenkins-bot: cirrus: Move small wiki traffic to eqiad (take two) [mediawiki-config] - https://gerrit.wikimedia.org/r/1015157 (owner: Ebernhardson)
[00:06:34] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1015157|cirrus: Move small wiki traffic to eqiad (take two)]]
[00:08:58] !log ebernhardson@deploy1002 ebernhardson: Backport for [[gerrit:1015157|cirrus: Move small wiki traffic to eqiad (take two)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:09:06] !log ebernhardson@deploy1002 ebernhardson: Continuing with sync
[00:10:25] SRE, Infrastructure-Foundations, MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9667628 (andrea.denisse) I'm removing the observability tag because directly addressing emails generated by services owned by...
[00:11:51] SRE, Infrastructure-Foundations, MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9667630 (andrea.denisse)
[00:14:46] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2050-2054].codfw.wmnet
[00:15:04] (PS1) Ladsgroup: Avoid left join when getting templates needing review [extensions/FlaggedRevs] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015065 (https://phabricator.wikimedia.org/T361166)
[00:15:10] (CR) Ladsgroup: [C:+2] Avoid left join when getting templates needing review [extensions/FlaggedRevs] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015065 (https://phabricator.wikimedia.org/T361166) (owner: Ladsgroup)
[00:15:20] SRE, Infrastructure-Foundations, MediaWiki-Email: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9667636 (andrea.denisse)
[00:16:03] (PS1) Andrew Bogott: pcc-db1001.yaml: fix the hostname for cloudinfra-internal-puppetserver-1 [puppet] - https://gerrit.wikimedia.org/r/1015162
[00:20:04] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1015157|cirrus: Move small wiki traffic to eqiad (take two)]] (duration: 13m 30s)
[00:21:49] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015065 (https://phabricator.wikimedia.org/T361166) (owner: Ladsgroup)
[00:22:31] (Merged) jenkins-bot: Avoid left join when getting templates needing review [extensions/FlaggedRevs] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015065 (https://phabricator.wikimedia.org/T361166) (owner: Ladsgroup)
[00:23:03] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1015065|Avoid left join when getting templates needing review (T361166)]]
[00:23:10]
T361166: All (or almost all) pages in plwikisource are displayed as "pending review" - https://phabricator.wikimedia.org/T361166
[00:25:27] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1015065|Avoid left join when getting templates needing review (T361166)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:26:17] (CR) Andrew Bogott: [C:+2] pcc-db1001.yaml: fix the hostname for cloudinfra-internal-puppetserver-1 [puppet] - https://gerrit.wikimedia.org/r/1015162 (owner: Andrew Bogott)
[00:26:28] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[00:37:01] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1014654
[00:37:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1014654 (owner: TrainBranchBot)
[00:37:59] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1015065|Avoid left join when getting templates needing review (T361166)]] (duration: 14m 56s)
[00:38:05] T361166: All (or almost all) pages in plwikisource are displayed as "pending review" - https://phabricator.wikimedia.org/T361166
[00:44:16] (PS1) Andrea Denisse: Added Apache 2.0 License to repository and test_alerts.py tool [alerts] - https://gerrit.wikimedia.org/r/1015164 (https://phabricator.wikimedia.org/T361010)
[00:57:46] (PS1) Andrew Bogott: pcc-db1001.yaml: remove meaningless cloudinfra-internal realm [puppet] - https://gerrit.wikimedia.org/r/1015165
[00:58:38] (CR) Andrew Bogott: [C:+2] pcc-db1001.yaml: remove meaningless cloudinfra-internal realm [puppet]
- https://gerrit.wikimedia.org/r/1015165 (owner: Andrew Bogott)
[00:59:58] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1014654 (owner: TrainBranchBot)
[01:01:18] dzahn@cumin1002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[01:04:36] SRE, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9667728 (andrea.denisse) >>! In T358506#9666364, @Volans wrote: > The premise seems to mix differe...
[01:08:35] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release
[01:24:33] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1002 using scap backport" [extensions/Linter] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015054 (https://phabricator.wikimedia.org/T360865) (owner: Tim Starling)
[01:24:33] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1002 using scap backport" [extensions/Linter] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1015055 (https://phabricator.wikimedia.org/T360865) (owner: Tim Starling)
[01:27:11] (Merged) jenkins-bot: Fix index usage when searching for page titles [extensions/Linter] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015054 (https://phabricator.wikimedia.org/T360865) (owner: Tim Starling)
[01:27:14] (Merged) jenkins-bot: Fix index usage when searching for page titles [extensions/Linter] (wmf/1.42.0-wmf.23) - https://gerrit.wikimedia.org/r/1015055 (https://phabricator.wikimedia.org/T360865) (owner: Tim Starling)
[01:27:57] !log tstarling@deploy1002 Started scap: Backport for [[gerrit:1015054|Fix index usage when searching for page titles (T360865)]], [[gerrit:1015055|Fix index usage when searching for page
titles (T360865)]]
[01:28:02] T360865: Slow query in Special:LintErrors - https://phabricator.wikimedia.org/T360865
[01:32:04] !log tstarling@deploy1002 tstarling: Backport for [[gerrit:1015054|Fix index usage when searching for page titles (T360865)]], [[gerrit:1015055|Fix index usage when searching for page titles (T360865)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:36:33] !log tstarling@deploy1002 tstarling: Continuing with sync
[01:48:02] !log tstarling@deploy1002 Finished scap: Backport for [[gerrit:1015054|Fix index usage when searching for page titles (T360865)]], [[gerrit:1015055|Fix index usage when searching for page titles (T360865)]] (duration: 20m 04s)
[01:48:06] T360865: Slow query in Special:LintErrors - https://phabricator.wikimedia.org/T360865
[02:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:34:34] (CR) David Martin: [C:+1] Update the WikiLambda instrumentation to use core interaction events (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci)
[02:37:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:02:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable
- https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:38:25] (PS1) Tim Starling: Hooks: restore respect of $wgCodeMirrorLineNumberingNamespaces in CM5 [extensions/CodeMirror] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015186 (https://phabricator.wikimedia.org/T347211)
[03:48:42] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1002 using scap backport" [extensions/CodeMirror] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015186 (https://phabricator.wikimedia.org/T347211) (owner: Tim Starling)
[03:49:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.179s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:54:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.161s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:56:09] (Merged) jenkins-bot: Hooks: restore respect of $wgCodeMirrorLineNumberingNamespaces in CM5 [extensions/CodeMirror] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1015186 (https://phabricator.wikimedia.org/T347211) (owner:
Tim Starling)
[03:56:39] !log tstarling@deploy1002 Started scap: Backport for [[gerrit:1015186|Hooks: restore respect of $wgCodeMirrorLineNumberingNamespaces in CM5 (T347211)]]
[03:56:44] T347211: Enable line numbering in all namespaces for all wikis - https://phabricator.wikimedia.org/T347211
[03:59:09] !log tstarling@deploy1002 tstarling: Backport for [[gerrit:1015186|Hooks: restore respect of $wgCodeMirrorLineNumberingNamespaces in CM5 (T347211)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[03:59:48] !log tstarling@deploy1002 tstarling: Continuing with sync
[04:05:40] (SystemdUnitFailed) firing: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:11:09] !log tstarling@deploy1002 Finished scap: Backport for [[gerrit:1015186|Hooks: restore respect of $wgCodeMirrorLineNumberingNamespaces in CM5 (T347211)]] (duration: 14m 30s)
[04:11:24] T347211: Enable line numbering in all namespaces for all wikis - https://phabricator.wikimedia.org/T347211
[04:12:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:15:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki -
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:17:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:37:16] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:39] SRE, Maps: Allow Wikimedia Maps usage on academic researches - https://phabricator.wikimedia.org/T361146#9667958 (Bugreporter) You don't need permission to use static images from Wikimedia Maps. Task should only be filed if (1) you want to use interactive map from Wikimedia Maps, and (2) there are reason...
[05:52:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.354s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:57:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 1.026s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T0600).
[06:12:37] (CR) Ilias Sarantopoulos: [C:+2] ml-services: remove redundant deployments from ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1015080 (https://phabricator.wikimedia.org/T361117) (owner: Ilias Sarantopoulos)
[06:13:39] (Merged) jenkins-bot: ml-services: remove redundant deployments from ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1015080 (https://phabricator.wikimedia.org/T361117) (owner: Ilias Sarantopoulos)
[06:15:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:20:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:28:29] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[06:29:41] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[06:32:01] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:32:18] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[06:33:26] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[06:49:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:57:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from proton_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[06:57:57] <_joe_> siigh
[06:58:04] <_joe_> so this scraper is really determined
[06:59:24] <_joe_> well I'm oncall in 30 minutes and I need to finish preparing for the day, I hope someone else gets around to look into it
[06:59:33] <_joe_> probably needs adapting the requestctl rule
[07:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:36] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from proton_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[07:28:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 2:00:00 on db[2115,2215].codfw.wmnet with reason: Downtime until tuesday (T361133)
[07:28:47] !log arnaudb@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 5 days, 2:00:00 on db[2115,2215].codfw.wmnet with reason: Downtime until tuesday (T361133)
[07:28:49] T361133: replication failure on db2115 and db2215 - https://phabricator.wikimedia.org/T361133
[07:29:21] (CR) Giuseppe Lavagetto: [C:+1] "I think we never got to merge this change because what we needed it for went away as an issue for us." [puppet] - https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: Volans)
[07:29:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T0800)
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:05:40] (SystemdUnitFailed) firing: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:25] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Verify Upgrade cookbook on GitLab Replica
[08:19:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Verify Upgrade cookbook on GitLab Replica
[08:20:46] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:27:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:29:47] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[09:26:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet
[09:26:54] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4046.ulsfo.wmnet
[09:27:16] !log temp depooled cp4038 and cp4046 to install benthos (T358109)
[09:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:19] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[09:30:02] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet
[09:30:10] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet
[09:34:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage db2178', diff saved to https://phabricator.wikimedia.org/P58963 and previous config saved to /var/cache/conftool/dbconfig/20240328-093424-arnaudb.json
[09:40:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2178.codfw.wmnet with reason: Silence for reimaging
[09:40:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2178.codfw.wmnet with reason: Silence for reimaging
[09:41:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2178.codfw.wmnet with OS bookworm
[09:44:18] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4046.ulsfo.wmnet
[09:44:53] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet
[09:58:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2178.codfw.wmnet with reason: host reimage
[10:01:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2178.codfw.wmnet with reason: host reimage
[10:01:37] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) (owner: AOkoth)
[10:01:53] (CR) Jaime Nuche: [C:+1] "From what I can tell LGTM" [puppet] - https://gerrit.wikimedia.org/r/1014611 (owner: Dzahn)
[10:03:26] (PS1) KartikMistry: Update MinT to 2024-03-28-061726-production [deployment-charts] - https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969)
[10:04:29] (PS1) Gmodena: webrequest: disable canary events.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956)
[10:04:37] (CR) Filippo Giunchedi: [C:+1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/1015164 (https://phabricator.wikimedia.org/T361010) (owner: Andrea Denisse)
[10:05:09] SRE, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9668177 (Volans) Yes but to which endpoint is trying to connect? Please try to use `puppetdb-api.d...
[10:05:35] (CR) Brouberol: [C:+2] hue: remove manifests and configuration [puppet] - https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) (owner: Brouberol)
[10:06:47] (PS2) SD hehua: zhwiki:Add centralauth-createlocal to ipblock exempt granter [mediawiki-config] - https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184)
[10:06:59] SRE, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9668223 (fgiunchedi) I've been working on debugging this too, here's my understanding: * naggen2 i...
[10:07:41] (PS1) Fabfur: benthos: enable benthos on two new hosts (text|upload) [puppet] - https://gerrit.wikimedia.org/r/1015262 (https://phabricator.wikimedia.org/T358109)
[10:08:11] (CR) Fabfur: [C:+2] benthos: enable benthos on two new hosts (text|upload) [puppet] - https://gerrit.wikimedia.org/r/1015262 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[10:09:17] (CR) JMeybohm: envoy: Add missing service mesh listeners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) (owner: Clément Goubert)
[10:09:58] SRE, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9668310 (Volans) Sorry, ignore my previous comments, there was some misunderstanding: * naggen ru...
[10:13:52] (CR) Joal: webrequest: disable canary events. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956) (owner: Gmodena)
[10:16:37] (PS4) Clément Goubert: envoy: Add missing service mesh listeners [puppet] - https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625)
[10:17:01] (PS1) Jon Harald Søby: Add 'mainpage-title-loggedin' to $wgForceUIMsgAsContentMsg [mediawiki-config] - https://gerrit.wikimedia.org/r/1015151 (https://phabricator.wikimedia.org/T361171)
[10:21:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2178.codfw.wmnet with OS bookworm
[10:23:17] (PS1) Btullis: Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531)
[10:24:25] (CR) Clément Goubert: envoy: Add missing service mesh listeners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) (owner:
Clément Goubert)
[10:25:02] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:25:28] (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1015276 (https://phabricator.wikimedia.org/T356933)
[10:26:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 1%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58965 and previous config saved to /var/cache/conftool/dbconfig/20240328-102615-arnaudb.json
[10:27:05] (CR) CI reject: [V:-1] Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[10:27:31] (CR) Peter Fischer: [C:+2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1015276 (https://phabricator.wikimedia.org/T356933) (owner: Peter Fischer)
[10:28:36] (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1015276 (https://phabricator.wikimedia.org/T356933) (owner: Peter Fischer)
[10:29:44] (PS2) Btullis: Create a new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531)
[10:31:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[10:31:56] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:32:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
- https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:59] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:35:20] (PS1) Clément Goubert: trafficserver: move 65% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1015277 (https://phabricator.wikimedia.org/T360763)
[10:36:56] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:38:44] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:41:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 2%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58966 and previous config saved to /var/cache/conftool/dbconfig/20240328-104121-arnaudb.json
[10:42:05] (PS1) Clément Goubert: mw-api-int: Double envoy concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213)
[10:44:52] (PS2) Clément Goubert: mw-api-int: Double envoy concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213)
[10:49:23] (PS1) Btullis: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531)
[10:49:28] (PS1) Btullis: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531)
[10:49:29] (PS1) Btullis: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] -
https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531)
[10:49:31] (PS1) Btullis: Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531)
[10:49:32] (PS1) Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531)
[10:49:34] (PS1) Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531)
[10:49:35] (PS1) Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - https://gerrit.wikimedia.org/r/1014662 (https://phabricator.wikimedia.org/T360531)
[10:49:37] (PS1) Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531)
[10:50:21] (CR) JMeybohm: [C:-1] "Also: Would it make sense to incorporate this somehow with the external-services stuff?"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 (owner: 10Hnowlan) [10:53:30] (03CR) 10JMeybohm: [C:04-1] mw-api-int: Double envoy concurrency (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:54:25] (SystemdUnitFailed) firing: php7.4-fpm.service on mwdebug1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:29] (03CR) 10Clément Goubert: mw-api-int: Double envoy concurrency (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:54:50] (03PS3) 10Clément Goubert: mw-api-int: Double envoy concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) [10:55:16] (03CR) 10Clément Goubert: mw-api-int: Double envoy concurrency (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:56:16] (03CR) 10JMeybohm: [C:03+1] mw-api-int: Double envoy concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:56:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 4%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58967 and previous config saved to /var/cache/conftool/dbconfig/20240328-105626-arnaudb.json [10:57:39] (03CR) 10Clément Goubert: [C:03+2] mw-api-int: Double envoy concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:58:30] (03Merged) 10jenkins-bot: mw-api-int: Double envoy concurrency [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1015278 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:59:25] (SystemdUnitFailed) resolved: php7.4-fpm.service on mwdebug1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:42] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:00:05] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1100). [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1100) [11:00:05] claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:52] (03PS2) 10Gmodena: webrequest: disable canary events. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956) [11:01:41] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:01:50] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:03:44] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:04:35] (03PS1) 10JMeybohm: eventrouter: Increase memory limit by 150M [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015286 [11:04:37] !log RESTbase: Migrate backend traffic to mw-api-int - T358213 [11:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:42] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [11:04:44] !log Disabling puppet on P:restbase - T358213 [11:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:37] (03CR) 10Kamila Součková: [C:03+1] eventrouter: Increase memory limit by 150M [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015286 (owner: 10JMeybohm) [11:06:57] (03CR) 10Clément Goubert: [C:03+2] restbase: Migrate backend traffic to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1014493 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [11:09:40] (03CR) 10Kamila Součková: [C:03+1] trafficserver: move 65% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1015277 (https://phabricator.wikimedia.org/T360763) (owner: 10Clément Goubert) [11:09:50] !log enabling and running puppet on restbase2021.codfw.wmnet - T358213 [11:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:54] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [11:11:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 8%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58968 and 
previous config saved to /var/cache/conftool/dbconfig/20240328-111132-arnaudb.json [11:12:27] !log enabling and running puppet on restbase1035.eqiad.wmnet - T358213 [11:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:26] !log enabling and running puppet on P:restbase - T358213 [11:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:31] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [11:17:23] claime: can you ping me when you're done! tnx [11:18:25] (SystemdUnitFailed) firing: php7.4-fpm.service on mwdebug1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:34] mvolz: sure! [11:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:02] (03CR) 10JMeybohm: [C:03+2] eventrouter: Increase memory limit by 150M [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015286 (owner: 10JMeybohm) [11:20:15] (03CR) 10Gmodena: webrequest: disable canary events. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [11:20:44] (03CR) 10Btullis: [C:03+2] Migrate datahub to use external-services for CAS IDP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014065 (https://phabricator.wikimedia.org/T331894) (owner: 10Btullis) [11:21:22] (03CR) 10Brouberol: Create a new aqs-http-gateway chart (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [11:22:50] (03Merged) 10jenkins-bot: eventrouter: Increase memory limit by 150M [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015286 (owner: 10JMeybohm) [11:22:59] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:23:25] (SystemdUnitFailed) resolved: php7.4-fpm.service on mwdebug1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:56] (03CR) 10Brouberol: "Btullis, I just saw your comment in the phab ticket" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [11:25:03] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:25:21] (03CR) 10Phuedx: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [11:25:28] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:25:33] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Lock backups/restores on switch_from host after backup creation [cookbooks] - 10https://gerrit.wikimedia.org/r/1014103 (owner: 10EoghanGaffney) [11:25:38] (03CR) 10Phuedx: [C:03+1] Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [11:25:47] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:26:06] (03PS17) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [11:26:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 16%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58969 and previous config saved to /var/cache/conftool/dbconfig/20240328-112638-arnaudb.json [11:26:51] (03CR) 10Klausman: [C:03+2] charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:26:52] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:27:10] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["76082583"]' 2>&1 | tee -a ~/T315510-enwiki-4; date [11:27:24] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:27:59] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:28:10] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:28:16] um, stashbot’s gone [11:28:32] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:28:33] last SAL message 11:25, so only a few minutes ago [11:28:35] (03PS4) 10Dreamy Jazz: Move checkuser grant configuration to CheckUser extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) (owner: 10Gergő Tisza) [11:28:42] (03Merged) 10jenkins-bot: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:28:48] mvolz: you can go ahead [11:30:18] (03Merged) 10jenkins-bot: [gitlab] Lock backups/restores on switch_from host after backup creation [cookbooks] - 10https://gerrit.wikimedia.org/r/1014103 (owner: 10EoghanGaffney) [11:30:54] stashbot is back, jayme and arnaudb might want to re-log a few missed !log messages [11:30:55] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [11:31:00] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["76082583"]' 2>&1 | tee -a ~/T315510-enwiki-4; date [11:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:14] thanks Lucas_WMDE [11:31:48] (03CR) 10Dreamy Jazz: [C:03+1] Move checkuser grant configuration to CheckUser extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) (owner: 10Gergő Tisza) [11:32:13] !log deployed helmfile.d/admin to staging-codfw,staging-eqiad,codfw,eqiad [11:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:05] tnx! 
[11:34:10] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014054 (owner: 10PipelineBot) [11:35:02] (03CR) 10S8321414: [C:03+1] zhwiki:Add centralauth-createlocal to ipblock exempt granter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [11:35:12] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014054 (owner: 10PipelineBot) [11:36:18] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:36:41] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:39:41] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:40:16] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:40:40] (03CR) 10JMeybohm: [C:03+1] envoy: Add missing service mesh listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) (owner: 10Clément Goubert) [11:40:52] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:41:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P58970 and previous config saved to /var/cache/conftool/dbconfig/20240328-114110-ladsgroup.json [11:41:22] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:41:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58971 and previous config saved to /var/cache/conftool/dbconfig/20240328-114144-arnaudb.json [11:42:59] (03Abandoned) 10Majavah: lxc: Rely on default network config [puppet] - 10https://gerrit.wikimedia.org/r/1002357 (https://phabricator.wikimedia.org/T356551) (owner: 10Majavah) [11:43:42] (03PS1) 10Klausman: 
ml-servics/experimental: Fix transposed app name for netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015292 (https://phabricator.wikimedia.org/T360428) [11:45:09] (03PS5) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 [11:45:42] (03CR) 10Hnowlan: "In theory yes, but I'd say outside of this CR. It's a weird one because the cassandra_client module gives us the networking stuff coupled " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 (owner: 10Hnowlan) [11:51:10] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [11:53:20] (03PS1) 10Kosta Harlan: beta: Disable wgWikimediaEventsIPoidUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015295 (https://phabricator.wikimedia.org/T354597) [11:53:32] (03CR) 10CI reject: [V:04-1] zhwiki:Add centralauth-createlocal to ipblock exempt granter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [11:55:06] (03PS1) 10Kosta Harlan: WikimediaEvents: Set IPoid URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015296 (https://phabricator.wikimedia.org/T354597) [11:55:09] Hello [11:55:53] I will be deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1015146 [11:56:13] (03CR) 10Mabualruz: [C:03+1] Revert donatewiki and thankyouwiki for fundraising [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015146 (https://phabricator.wikimedia.org/T360628) (owner: 10Kimberly Sarabia) [11:56:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P58972 and previous config saved to /var/cache/conftool/dbconfig/20240328-115616-ladsgroup.json [11:56:29] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: 14Migrate restbase from 
mwapi-async to mw-api-int - 14https://phabricator.wikimedia.org/T358213#9668903 (10Clement_Goubert) 05In progress→03Resolved [11:56:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58973 and previous config saved to /var/cache/conftool/dbconfig/20240328-115649-arnaudb.json [11:56:52] (03PS1) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [11:58:30] (03PS1) 10Kosta Harlan: EventStreamConfig: Register ip_reputation/score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015299 (https://phabricator.wikimedia.org/T354597) [11:58:58] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9668916 (10Clement_Goubert) [11:59:27] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9668917 (10Clement_Goubert) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1200) [12:00:22] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014052 (owner: 10PipelineBot) [12:00:33] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1011438 (owner: 10PipelineBot) [12:02:03] (03PS1) 10Clément Goubert: Revert "mw-api-int: Double envoy concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015190 [12:02:21] (03CR) 10Clément Goubert: [C:03+2] envoy: Add missing service mesh listeners [puppet] - 10https://gerrit.wikimedia.org/r/1013300 (https://phabricator.wikimedia.org/T360625) (owner: 10Clément Goubert) [12:04:15] !log trafficserver: move 65% of traffic to mw on k8s - T360763 [12:04:19] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:22] (03CR) 10Clément Goubert: [C:03+2] trafficserver: move 65% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1015277 (https://phabricator.wikimedia.org/T360763) (owner: 10Clément Goubert) [12:04:29] T360763: Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763 [12:05:40] (SystemdUnitFailed) firing: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:38] (03CR) 10Hnowlan: [C:03+1] Revert "mw-api-int: Double envoy concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015190 (owner: 10Clément Goubert) [12:08:24] (03CR) 10Clément Goubert: [C:03+2] Revert "mw-api-int: Double envoy concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015190 (owner: 10Clément Goubert) [12:09:15] (03Merged) 10jenkins-bot: Revert "mw-api-int: Double envoy concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015190 (owner: 10Clément Goubert) [12:10:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:11:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P58974 and previous config saved to /var/cache/conftool/dbconfig/20240328-121122-ladsgroup.json [12:11:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58975 and previous config saved to /var/cache/conftool/dbconfig/20240328-121155-arnaudb.json [12:12:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:12:20] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:13:35] 06SRE, 
10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763#9668960 (10Clement_Goubert) [12:14:31] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:19:58] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1015304 (owner: 10L10n-bot) [12:26:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P58976 and previous config saved to /var/cache/conftool/dbconfig/20240328-122628-ladsgroup.json [12:27:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58977 and previous config saved to /var/cache/conftool/dbconfig/20240328-122701-arnaudb.json [12:32:33] (03CR) 10A2093064: [C:04-1] zhwiki:Add centralauth-createlocal to ipblock exempt granter (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [12:36:44] (03PS3) 10SD hehua: zhwiki:Add centralauth-createlocal to ipblock exempt granter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) [12:41:49] (03CR) 10SD hehua: zhwiki:Add centralauth-createlocal to ipblock exempt granter (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015187 (https://phabricator.wikimedia.org/T361184) (owner: 10SD hehua) [12:57:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage db1200', diff saved to https://phabricator.wikimedia.org/P58978 and previous config saved to /var/cache/conftool/dbconfig/20240328-125721-arnaudb.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do 
UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1300). [13:00:05] Daimona, tgr, Dreamy_Jazz, and mo_abualruz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] \o [13:00:18] Hello again I will be deploying 1015146 in a minute [13:00:22] (03PS1) 10Majavah: P:toolforge::proxy: remove unused hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1015321 [13:00:22] I can self deploy [13:00:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mabualruz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015146 (https://phabricator.wikimedia.org/T360628) (owner: 10Kimberly Sarabia) [13:00:52] Can this wait for the other config patches? [13:01:19] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1754/console" [puppet] - 10https://gerrit.wikimedia.org/r/1015321 (owner: 10Majavah) [13:01:27] I can cancel sure [13:01:34] Hiiiiiiii sorry I'm late for the deployment [13:01:39] It is no big issue, just generally it goes first to last on that list. [13:02:04] (03Merged) 10jenkins-bot: Revert donatewiki and thankyouwiki for fundraising [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015146 (https://phabricator.wikimedia.org/T360628) (owner: 10Kimberly Sarabia) [13:02:13] Hi there Daimona. [13:02:23] Hi Daimona [13:02:36] !log mabualruz@deploy1002 Started scap: Backport for [[gerrit:1015146|Revert donatewiki and thankyouwiki for fundraising (T360628)]] [13:02:50] T360628: Deploy Vector 2022 skin to Wikisource wikis, internal wikis, and wikipedias - https://phabricator.wikimedia.org/T360628 [13:03:22] should I cancel or keep it going? 
[13:03:40] Keep going, otherwise you'd need to revert the patch in operations/mediawiki-config [13:04:08] So it'll probably be faster to continue than cancel. [13:04:24] * Lucas_WMDE around now [13:04:31] ah ok sorry I was hasty to run that I did not know that we should go top to bottom of the list [13:04:37] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::proxy: remove unused hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1015321 (owner: 10Majavah) [13:05:11] !log mabualruz@deploy1002 ksarabia and mabualruz: Backport for [[gerrit:1015146|Revert donatewiki and thankyouwiki for fundraising (T360628)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1200.eqiad.wmnet with reason: Silence for reimaging [13:05:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1200.eqiad.wmnet with reason: Silence for reimaging [13:05:45] checking the sites I will proceed in a minute [13:06:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1200.eqiad.wmnet with OS bookworm [13:06:55] 10SRE-tools, 10Cloud-VPS, 10Spicerack: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; exiting - https://phabricator.wikimedia.org/T361218 (10taavi) 03NEW [13:07:02] ok seems good I will proceed [13:07:05] !log mabualruz@deploy1002 ksarabia and mabualruz: Continuing with sync [13:07:34] !log temporarily depooling wdqs2009 (test query rate when depooled T360993) [13:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:39] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993 [13:09:22] (03CR) 10Kosta Harlan: extension-list: Add IPReputation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [13:09:24] (03PS1) 10EoghanGaffney: [gitlab] Narrow scope of gitlab backup rsync commands [puppet] - 10https://gerrit.wikimedia.org/r/1015323 [13:09:56] (03PS2) 10EoghanGaffney: [gitlab] Narrow scope of gitlab backup rsync commands [puppet] - 10https://gerrit.wikimedia.org/r/1015323 (https://phabricator.wikimedia.org/T361219) [13:15:02] (03PS2) 10Dreamy Jazz: extension-list: Add IPReputation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [13:17:10] !log repooling wdqs2009 (test query rate when depooled T360993) [13:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:18] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993 [13:17:21] (03PS1) 10Filippo Giunchedi: puppetserver: use client certs for naggen2 puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) [13:17:29] (03CR) 10EoghanGaffney: "jelto" [puppet] - 10https://gerrit.wikimedia.org/r/1015323 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [13:17:47] (03PS2) 10Klausman: ml-servics/experimental: Fix transposed app name and chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015292 (https://phabricator.wikimedia.org/T360428) [13:17:49] ... oops [13:18:15] ? [13:18:31] !log mabualruz@deploy1002 Finished scap: Backport for [[gerrit:1015146|Revert donatewiki and thankyouwiki for fundraising (T360628)]] (duration: 15m 55s) [13:18:33] Was it your comment on the gerrit change? [13:18:35] T360628: Deploy Vector 2022 skin to Wikisource wikis, internal wikis, and wikipedias - https://phabricator.wikimedia.org/T360628 [13:18:40] Dreamy_Jazz: My gerrit change above, commented instead of assigned. [13:18:48] Ah I see. Thanks. 
[13:19:01] At least the intended recipient now has two chances to see it :D [13:19:03] Daimona: I think you are next. [13:19:11] Sure [13:19:23] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9669160 (10Papaul) [13:19:23] I don't mind doing the deploy. [13:19:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage [13:19:30] Thanks all is good here [13:21:02] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: 14Decom asw-a-codfw switch stack - 14https://phabricator.wikimedia.org/T358244#9669161 (10Papaul) 05Open→03Resolved a:03Papaul 14complete  [13:21:08] (03CR) 10Filippo Giunchedi: [C:03+1] logstash: remove configuration for logstash101[012] [puppet] - 10https://gerrit.wikimedia.org/r/1014048 (https://phabricator.wikimedia.org/T360950) (owner: 10Cwhite) [13:21:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:21:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:21:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:21:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage [13:22:31] (03CR) 10Dreamy Jazz: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) 
[13:22:41] (03CR) 10CI reject: [V:04-1] Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:22:42] (03CR) 10CI reject: [V:04-1] Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:23:06] (03PS2) 10Dreamy Jazz: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) [13:23:12] Fixing a small spelling mistake in a comment [13:23:32] (03PS2) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) [13:23:35] (03PS2) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) [13:23:40] (03PS3) 10Dreamy Jazz: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) [13:23:46] (03PS3) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) [13:23:51] (03PS3) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) [13:23:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:23:55] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:23:56] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:25:24] (03Merged) 10jenkins-bot: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:25:28] (03Merged) 10jenkins-bot: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:25:34] (03Abandoned) 10Paladox: letsencrypt: Sync acme-tiny upstream [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [13:25:39] (03Abandoned) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 (owner: 10Paladox) [13:26:09] (03PS1) 10JMeybohm: admin/namespaces: Remove net.beta.kubernetes.io/network-policy annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 [13:26:25] (03Merged) 10jenkins-bot: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [13:26:56] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1015042|Add setting to determine if CampaignEvents should use the global DB (T348281)]], [[gerrit:1015043|Add virtual domain mapping for CampaignEvents (prod) (T348281)]], [[gerrit:1015044|Add virtual domain mapping for CampaignEvents (beta) (T348281)]] [13:27:06] T348281: Make the CampaignEvents database configuration use the new DatabaseVirtualDomains config - https://phabricator.wikimedia.org/T348281 [13:28:01] 
(03PS2) 10JMeybohm: admin/namespaces: Remove net.beta.kubernetes.io/network-policy annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 [13:28:11] Daimona: Is there anything that can be tested for these config changes? [13:28:16] (03CR) 10Brouberol: [C:03+1] "Technically, you _can_ split that in two: the release of the new chart version, and the change of a release configuration value, as these " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015292 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:28:28] They should all be no-ops, but I can test that nothing caught fire [13:28:33] 👍 [13:28:53] (03CR) 10Klausman: [C:03+2] ml-servics/experimental: Fix transposed app name and chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015292 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:29:20] Just poke me when it's ready to test, I'm multitasking (v badly) as usual [13:29:25] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1015042|Add setting to determine if CampaignEvents should use the global DB (T348281)]], [[gerrit:1015043|Add virtual domain mapping for CampaignEvents (prod) (T348281)]], [[gerrit:1015044|Add virtual domain mapping for CampaignEvents (beta) (T348281)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:29:30] Daimona: Now :) [13:29:45] Oh ok :9 [13:29:47] :) [13:30:43] (03Merged) 10jenkins-bot: ml-servics/experimental: Fix transposed app name and chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015292 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:32:09] (03Abandoned) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/492314 (owner: 10Paladox) [13:33:12] Things seem fine to me. Have you finished testing Daimona? [13:34:11] Yup, just finished testing and it looks good [13:34:21] Great. 
[13:34:23] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [13:35:30] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:41:16] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9669229 (10Andrew) I suspect that on re-run partman isn't properly zeroing out the partition table before it starts so we're accumula... [13:42:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1200.eqiad.wmnet with OS bookworm [13:42:36] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:45:45] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1015042|Add setting to determine if CampaignEvents should use the global DB (T348281)]], [[gerrit:1015043|Add virtual domain mapping for CampaignEvents (prod) (T348281)]], [[gerrit:1015044|Add virtual domain mapping for CampaignEvents (beta) (T348281)]] (duration: 18m 49s) [13:45:52] T348281: Make the CampaignEvents database configuration use the new DatabaseVirtualDomains config - https://phabricator.wikimedia.org/T348281 [13:46:28] Tgr: Do you want to do your patch? [13:47:09] Thanks Dreamy_Jazz! 
[13:47:20] Np [13:48:45] (03PS3) 10EoghanGaffney: [gitlab] Narrow scope of gitlab backup rsync commands [puppet] - 10https://gerrit.wikimedia.org/r/1015323 (https://phabricator.wikimedia.org/T361219) [13:51:32] I'm going to do my patch as it seems tgr is away [13:52:21] (03PS5) 10Dreamy Jazz: Move checkuser grant configuration to CheckUser extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) (owner: 10Gergő Tisza) [13:52:25] (03PS6) 10Dreamy Jazz: Move checkuser grant configuration to CheckUser extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) (owner: 10Gergő Tisza) [13:52:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) (owner: 10Gergő Tisza) [13:53:29] (03Merged) 10jenkins-bot: Move checkuser grant configuration to CheckUser extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009865 (https://phabricator.wikimedia.org/T359537) (owner: 10Gergő Tisza) [13:53:57] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1009865|Move checkuser grant configuration to CheckUser extension (T359537)]] [13:54:04] T359537: Special:BotPasswords grant for "access checkuser data" should have the "grants with security risk" icon - https://phabricator.wikimedia.org/T359537 [13:54:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 1%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58980 and previous config saved to /var/cache/conftool/dbconfig/20240328-135450-arnaudb.json [13:55:51] (03PS1) 10Brouberol: rbac: grant RBAC perms on calico networkpolicis to the kserve-deploy clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015333 (https://phabricator.wikimedia.org/T360428) [13:56:20] (03CR) 10Volans: [C:03+1] "I didn't test it 
but looks sane, one question inline. Does it work for the old setup too?" [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) (owner: 10Filippo Giunchedi) [13:56:23] !log dreamyjazz@deploy1002 tgr and dreamyjazz: Backport for [[gerrit:1009865|Move checkuser grant configuration to CheckUser extension (T359537)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:56:41] (03CR) 10Klausman: [C:03+1] rbac: grant RBAC perms on calico networkpolicis to the kserve-deploy clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015333 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [13:57:10] Testing.... [13:57:26] (03CR) 10Kosta Harlan: "There is no urgency to deploy this, just removing the -2 as I believe this is no longer blocked" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [13:57:38] 10ops-codfw, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229 (10RobH) 03NEW p:05Triage→03Medium [13:58:54] !log dreamyjazz@deploy1002 tgr and dreamyjazz: Continuing with sync [13:59:02] 10ops-codfw, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9669350 (10RobH) [13:59:09] We may go over the window by a few minutes [13:59:13] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015333 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [13:59:31] (03CR) 10Brouberol: [C:03+2] rbac: grant RBAC perms on calico networkpolicis to the kserve-deploy clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015333 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [13:59:39] Is anyone wanting to do something at the end of this window? If so, I can ping you when I'm done. 
[14:00:24] !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:01:01] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:01:04] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:01:35] Tgr: We are at the end of the deployment window. Does your patch need merging now or can it be delayed until another window? [14:02:57] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:03:20] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1015323 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [14:05:44] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:08:17] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:09:31] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) (owner: 10Filippo Giunchedi) [14:09:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 2%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58981 and previous config saved to /var/cache/conftool/dbconfig/20240328-140956-arnaudb.json [14:10:05] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1009865|Move checkuser grant configuration to CheckUser extension (T359537)]] (duration: 16m 08s) [14:10:09] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:10:09] T359537: Special:BotPasswords grant for "access checkuser data" should have the "grants with security risk" icon - https://phabricator.wikimedia.org/T359537 [14:10:19] !log Afternoon UTC backport window done [14:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:51] (03PS1) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) [14:13:54] Dreamy_Jazz: sorry, got distracted. We'll deploy a fix for the train blocker soon, I'll deploy it along with that. [14:15:22] 👍 [14:15:23] !log re-enabling puppet on wdqs1013 [14:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage db2157 (T360116)', diff saved to https://phabricator.wikimedia.org/P58982 and previous config saved to /var/cache/conftool/dbconfig/20240328-141844-arnaudb.json [14:18:51] T360116: Upgrade s5 to MariaDB 10.6 - https://phabricator.wikimedia.org/T360116 [14:19:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2157.codfw.wmnet with reason: T360116 [14:19:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: T360116 [14:21:08] (03CR) 10CI reject: [V:04-1] Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [14:21:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2157.codfw.wmnet with OS bookworm [14:22:36] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, and 3 others: db2100 crashed (memory error) - https://phabricator.wikimedia.org/T361037#9669479 (10Jhancock.wm) @jcrespo The host is at eol as of tomorrow. Is it possible to decommission this host? 
[14:23:19] (03PS2) 10Andrea Denisse: Added Apache 2.0 License to repository and test_alerts.py tool [alerts] - 10https://gerrit.wikimedia.org/r/1015164 (https://phabricator.wikimedia.org/T361010) [14:23:46] (03PS1) 10Ayounsi: Example cookbook using gNMI module [cookbooks] - 10https://gerrit.wikimedia.org/r/1015335 (https://phabricator.wikimedia.org/T344325) [14:23:49] (03CR) 10Andrea Denisse: Added Apache 2.0 License to repository and test_alerts.py tool (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1015164 (https://phabricator.wikimedia.org/T361010) (owner: 10Andrea Denisse) [14:25:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 4%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58983 and previous config saved to /var/cache/conftool/dbconfig/20240328-142502-arnaudb.json [14:26:50] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:26:52] (03CR) 10Filippo Giunchedi: [C:03+1] Added Apache 2.0 License to repository and test_alerts.py tool [alerts] - 10https://gerrit.wikimedia.org/r/1015164 (https://phabricator.wikimedia.org/T361010) (owner: 10Andrea Denisse) [14:27:19] (03CR) 10Andrea Denisse: [C:03+2] Added Apache 2.0 License to repository and test_alerts.py tool [alerts] - 10https://gerrit.wikimedia.org/r/1015164 (https://phabricator.wikimedia.org/T361010) (owner: 10Andrea Denisse) [14:28:22] (03CR) 10CI reject: [V:04-1] Example cookbook using gNMI module [cookbooks] - 10https://gerrit.wikimedia.org/r/1015335 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [14:28:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:29:07] expected ^ [14:30:07] !log brouberol@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:30:40] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9669564 (10Jhancock.wm) I have located and set aside the parts to be installed. I am available every week day between 1300 UTC and 1700 UTC. Please let me know what time/day in that wo... [14:32:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:28] (03CR) 10Elukey: [C:03+1] "Makes sense, is there any follow up to do or the functionality is totally not needed? Meaning: were we relying on the annotation doing som" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 (owner: 10JMeybohm) [14:32:57] (ProbeDown) firing: Service search-omega-https:9443 has failed probes (http_search-omega-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#search-omega-https:9443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:05] hello [14:33:08] ACKing [14:33:25] !incidents [14:33:25] 4554 (ACKED) ProbeDown sre (10.2.1.30 ip4 search-omega-https:9443 probes/service http_search-omega-https_ip4 codfw) [14:33:26] 4553 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [14:33:26] inflatador: ^ expected? 
[14:33:26] 4552 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [14:33:26] 4551 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [14:33:48] this is codfw, so going by your email [14:34:20] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015337 (https://phabricator.wikimedia.org/T356933) [14:34:28] > We have depooled codfw omega for now [14:34:35] Yeah, I recall seeing Bryans email about it. [14:34:37] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015337 (https://phabricator.wikimedia.org/T356933) (owner: 10Peter Fischer) [14:35:40] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015337 (https://phabricator.wikimedia.org/T356933) (owner: 10Peter Fischer) [14:36:33] I am just going to ACK it for now on Karma while we wait to hear back [14:37:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:28] (ProbeDown) firing: Service search-omega-https:9443 has failed probes (http_search-omega-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#search-omega-https:9443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:40] ^ denisse I silenced that for two hours for now [14:38:03] Thank you! [14:38:03] ryankemper: inflatador: let us know if that should be extended, or if this is not expected. 
but going by your email, it is [14:38:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2157.codfw.wmnet with reason: host reimage [14:39:55] (03PS1) 10Effie Mouzeli: php7.4-fpm: introa [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [14:40:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 8%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58984 and previous config saved to /var/cache/conftool/dbconfig/20240328-144008-arnaudb.json [14:40:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: host reimage [14:41:34] (03CR) 10Herron: [C:03+1] logstash: remove configuration for logstash101[012] [puppet] - 10https://gerrit.wikimedia.org/r/1014048 (https://phabricator.wikimedia.org/T360950) (owner: 10Cwhite) [14:42:12] (HelmReleaseBadStatus) firing: Helm release experimental/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=experimental - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:44:28] (03PS1) 10Klausman: charts/kserve-inference: move netpol generation outside the service loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015340 (https://phabricator.wikimedia.org/T360428) [14:47:08] (03CR) 10Brouberol: [C:03+1] charts/kserve-inference: move netpol generation outside the service loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015340 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [14:47:09] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:47:12] (HelmReleaseBadStatus) resolved: Helm release experimental/main on k8s-mlstaging@codfw in state failed - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=experimental - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:47:17] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:47:22] (03CR) 10Klausman: [C:03+2] charts/kserve-inference: move netpol generation outside the service loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015340 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [14:47:32] (03PS2) 10Ayounsi: Example cookbook using gNMI module [cookbooks] - 10https://gerrit.wikimedia.org/r/1015335 (https://phabricator.wikimedia.org/T344325) [14:48:50] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9669671 (10RobH) a:05Jhancock.wm→03fgiunchedi >>! In T361229#9669564, @Jhancock.wm wrote: > I have located and set aside the parts to be installed. > > I am available every week da... [14:49:19] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, and 3 others: db2100 crashed (memory error) - https://phabricator.wikimedia.org/T361037#9669674 (10Jhancock.wm) @Marostegui I was informed that the above question should be directed at you, no jcrespo. my bad. What do you think? [14:49:31] (03Merged) 10jenkins-bot: charts/kserve-inference: move netpol generation outside the service loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015340 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [14:50:00] (03PS1) 10Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) [14:51:24] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:52:45] (03PS2) 10Filippo Giunchedi: puppetserver: use client certs for naggen2 puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) [14:53:48] (03PS2) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) [14:54:34] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1755/console" [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) (owner: 10Filippo Giunchedi) [14:55:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 16%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58985 and previous config saved to /var/cache/conftool/dbconfig/20240328-145514-arnaudb.json [14:55:22] (03PS2) 10Effie Mouzeli: php7.4-fpm: introa [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [14:56:16] (03PS1) 10JMeybohm: Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) [14:56:38] (03PS3) 10JMeybohm: admin/namespaces: Remove net.beta.kubernetes.io/network-policy annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 [14:56:39] (03PS2) 10JMeybohm: Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) [14:56:43] (03PS2) 10Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) [14:57:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:28] (ProbeDown) resolved: Service search-omega-https:9443 has failed probes (http_search-omega-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#search-omega-https:9443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:03] denisse: ^ it resolved :) [14:58:07] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1756/console" [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) (owner: 10Filippo Giunchedi) [14:58:09] how? I don't know! [14:58:16] I can't see anything that has changed [14:58:47] (03Abandoned) 10Ayounsi: Initial gNMI support for network automation cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/924896 (owner: 10Ayounsi) [14:59:23] (03CR) 10Filippo Giunchedi: [V:03+1] "For some reason I can't preview this change in PCC" [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) (owner: 10Filippo Giunchedi) [15:00:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2157.codfw.wmnet with OS bookworm [15:00:59] (03CR) 10JMeybohm: "The annotation did not had any effect since k8s >=1.7 - so it's safe to remove." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015329 (owner: 10JMeybohm) [15:01:24] Thanks sukhe! 
[15:02:45] (03CR) 10CI reject: [V:04-1] Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [15:06:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:10:19] ^ ACKing it. [15:10:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58986 and previous config saved to /var/cache/conftool/dbconfig/20240328-151019-arnaudb.json [15:10:34] It looks like it's related to the omega depool. [15:10:34] !incidents [15:10:34] 4554 (ACKED) ProbeDown sre (10.2.1.30 ip4 search-omega-https:9443 probes/service http_search-omega-https_ip4 codfw) [15:10:34] 4553 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [15:10:35] 4552 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [15:12:30] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#9669742 (10ayounsi) [15:14:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: 14hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - 14https://phabricator.wikimedia.org/T360446#9669745 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm 14new drive has been inserted and the alert has cleared. retu... 
[15:15:17] (03PS3) 10Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) [15:16:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 5%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58987 and previous config saved to /var/cache/conftool/dbconfig/20240328-151603-arnaudb.json [15:16:43] (03PS3) 10Btullis: Create a new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) [15:16:43] (03PS2) 10Btullis: Migrate editor-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014656 (https://phabricator.wikimedia.org/T360531) [15:16:43] (03PS2) 10Btullis: Migrate edit-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014657 (https://phabricator.wikimedia.org/T360531) [15:16:44] (03PS2) 10Btullis: Migrate device-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014658 (https://phabricator.wikimedia.org/T360531) [15:16:45] (03PS2) 10Btullis: Migrate geo-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014659 (https://phabricator.wikimedia.org/T360531) [15:16:46] (03PS2) 10Btullis: Migrate image-suggestions to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) [15:16:50] (03PS2) 10Btullis: Migrate media-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014661 (https://phabricator.wikimedia.org/T360531) [15:16:54] (03PS2) 10Btullis: Migrate page-analytics to use the new aqs-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014662 
(https://phabricator.wikimedia.org/T360531) [15:16:58] (03PS2) 10Btullis: Remove separate charts for druid and cassandra AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014663 (https://phabricator.wikimedia.org/T360531) [15:17:20] (03PS3) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [15:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:02] (03PS4) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [15:21:32] (03CR) 10Btullis: Create a new aqs-http-gateway chart (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014655 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [15:22:19] (03PS3) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [15:23:48] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [15:23:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9669818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [15:24:03] (03CR) 10DCausse: [C:04-1] "needs MW 1.42.0-wmf.25 to be deployed first" [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [15:25:26] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58989 and previous config saved to /var/cache/conftool/dbconfig/20240328-152525-arnaudb.json [15:31:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 10%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58990 and previous config saved to /var/cache/conftool/dbconfig/20240328-153109-arnaudb.json [15:33:48] (03CR) 10Hashar: "We twice had the issue of a deployment failing due to that page taking longer than 10 seconds." [puppet] - 10https://gerrit.wikimedia.org/r/1014425 (https://phabricator.wikimedia.org/T360867) (owner: 10Hashar) [15:37:09] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:39:39] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9669905 (10RobH) @Jhancock.wm: I put a typo in the top, it should be (3) dimms per host not 2, not changing it but updating in this comment so you can acknowledge and update the task des... [15:40:15] 06SRE, 10Maps: 14Allow Wikimedia Maps usage on academic researches - 14https://phabricator.wikimedia.org/T361146#9669902 (10Aklapper) 05Open→03Invalid 14> If you do not fill in all three fields your request will be declined. Thus closing. 
[15:40:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58991 and previous config saved to /var/cache/conftool/dbconfig/20240328-154031-arnaudb.json [15:41:00] (03CR) 10JMeybohm: mediawiki: add MW__MCROUTER_SERVER variable in chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:42:19] (03CR) 10JMeybohm: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:46:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 15%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58992 and previous config saved to /var/cache/conftool/dbconfig/20240328-154615-arnaudb.json [15:46:59] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T360862#9669941 (10Jclark-ctr) Replaced failed ssd with extra from onhands at eqiad [15:47:26] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-03-25-122226-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015349 [15:50:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:05] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1010.eqiad.wmnet [15:52:16] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-03-25-122226-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015349 (owner: 10BryanDavis) [15:53:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudbackup2003.codfw.wmnet with OS bookworm [15:53:53] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9669968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudbackup2003.codfw.... [15:55:02] I'm going to do my developer-portal deployment about an hour early today because of meeting conflicts at the scheduled time. [15:55:11] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [15:55:28] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [15:55:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58993 and previous config saved to /var/cache/conftool/dbconfig/20240328-155537-arnaudb.json [15:55:40] (03CR) 10Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:55:47] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [15:56:20] (03PS4) 10Effie Mouzeli: mediawiki: add MW__MCROUTER_SERVER variable in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015342 (https://phabricator.wikimedia.org/T346690) [15:56:23] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [15:57:13] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [15:58:14] (03PS5) 10DCausse: updateQueryServiceLag: tune the min query rate of a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) [15:58:46] (03CR) 10DCausse: "needs MW 1.42.0-wmf.25 to be deployed first" [puppet] - 10https://gerrit.wikimedia.org/r/1014584 
(https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [15:59:31] jouncebot nowandnext [15:59:31] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [15:59:31] In 0 hour(s) and 0 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1600) [16:00:04] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:33] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:00:34] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts logstash1010.eqiad.wmnet [16:01:18] jhathaway, rzl: i'm doing a train rollback, per Krinkle (see #wikimedia-releng for context) [16:01:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58994 and previous config saved to /var/cache/conftool/dbconfig/20240328-160121-arnaudb.json [16:01:45] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015350 (https://phabricator.wikimedia.org/T360156) [16:01:47] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015350 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [16:01:48] (03CR) 10Hashar: ci: switch envoy SSL provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014132 
(https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [16:02:45] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015350 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [16:03:15] !log brennen@deploy1002 Started scap: testwikis wikis to 1.42.0-wmf.24 refs T360156 [16:03:22] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [16:03:24] !incidents [16:03:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:03:28] 4554 (ACKED) ProbeDown sre (10.2.1.30 ip4 search-omega-https:9443 probes/service http_search-omega-https_ip4 codfw) [16:03:28] 4553 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [16:03:28] 4552 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [16:05:40] (SystemdUnitFailed) firing: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:37] brennen: ack, nothing for us to do in the puppet window today anyway [16:09:23] !resolve 4554 [16:09:24] 4554 (ACKED) ProbeDown sre (10.2.1.30 ip4 search-omega-https:9443 probes/service http_search-omega-https_ip4 codfw) [16:09:32] 10ops-eqiad, 10observability: titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251 (10RobH) 03NEW p:05Triage→03Medium [16:09:45] 10ops-eqiad, 10observability: titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9670065 (10RobH) [16:10:52] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9670069 
(10Jhancock.wm) [16:11:17] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9670074 (10Jhancock.wm) retrieved the extra sticks. all good. ty for update. [16:13:00] (03CR) 10DCausse: [C:03+1] Remove flink RBAC snowflakes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015343 (https://phabricator.wikimedia.org/T326409) (owner: 10JMeybohm) [16:16:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58997 and previous config saved to /var/cache/conftool/dbconfig/20240328-161627-arnaudb.json [16:17:30] jouncebot: next [16:17:30] In 0 hour(s) and 42 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1700) [16:17:30] In 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1700) [16:17:33] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.42.0-wmf.24 refs T360156 (duration: 14m 17s) [16:17:33] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] puppetserver: use client certs for naggen2 puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/1015326 (https://phabricator.wikimedia.org/T358506) (owner: 10Filippo Giunchedi) [16:17:45] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [16:18:05] jeena: rollback finished, train ops back over to you. 
:) [16:20:13] (03PS4) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [16:22:24] brennen: 👍 thanks again [16:23:45] !log btullis@deploy1002 Started deploy [analytics/refinery@9c2ca38]: Regular analytics weekly train [analytics/refinery@9c2ca387] [16:23:56] (03PS3) 10Effie Mouzeli: WIP: mcrouter: update comments in mcrouter image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692371 [16:25:02] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [16:25:25] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [16:26:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage [16:26:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage [16:26:32] !log btullis@deploy1002 Finished deploy [analytics/refinery@9c2ca38]: Regular analytics weekly train [analytics/refinery@9c2ca387] (duration: 02m 46s) [16:27:29] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9670141 (10fgiunchedi) a:05fgiunchedi→03herron Thank you @RobH, I've coordinated with @herron and he'll be helping with this [16:30:11] !log btullis@deploy1002 Started deploy [analytics/refinery@9c2ca38]: Regular analytics weekly train [analytics/refinery@9c2ca387] [16:31:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P58998 and previous config saved to /var/cache/conftool/dbconfig/20240328-163132-arnaudb.json [16:33:08] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2052-2054].codfw.wmnet [16:33:26] 
(RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:33:32] (03CR) 10Clément Goubert: [C:04-1] "A small build breaking issues with the changelog bump for `php7.4-fpm-multiversion-base` and a nit" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [16:34:39] sukhe: I'm not sure precisely when, but inflatador started resurrecting the omega cluster ~a couple hours ago, so that likely explains the alerts resolving [16:34:51] ryankemper: yeah thanks, I ran it by him and resolved it [16:35:03] kk [16:35:58] (03CR) 10DCausse: cirrus: More reliable reporting of reindexing status (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014593 (owner: 10Ebernhardson) [16:39:25] !log btullis@deploy1002 Finished deploy [analytics/refinery@9c2ca38]: Regular analytics weekly train [analytics/refinery@9c2ca387] (duration: 09m 13s) [16:41:53] (03PS5) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) [16:42:37] (03CR) 10Effie Mouzeli: php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015338 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [16:43:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:45:01] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [16:45:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 
13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9670250 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... [16:45:27] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1010.eqiad.wmnet [16:46:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: Post reimage repool', diff saved to https://phabricator.wikimedia.org/P59001 and previous config saved to /var/cache/conftool/dbconfig/20240328-164639-arnaudb.json [16:47:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:47:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup2003.codfw.wmnet with OS bookworm [16:47:25] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9670266 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudbackup2003.codfw.wmne... 
[16:47:49] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9670267 (10Jhancock.wm) [16:50:13] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: 14Q#:rack/setup/install (2) cloudbackup hosts - 14https://phabricator.wikimedia.org/T356216#9670269 (10Jhancock.wm) 05Open→03Resolved 14issue fixed and ready to go @Andrew  [16:50:34] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [16:53:33] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [16:54:26] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [16:54:26] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:54:27] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts logstash1010.eqiad.wmnet [16:56:00] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1012.eqiad.wmnet [16:56:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 846.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:00:04] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1700). [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1700) [17:00:55] (03CR) 10EoghanGaffney: [C:03+2] [gitlab] Narrow scope of gitlab backup rsync commands [puppet] - 10https://gerrit.wikimedia.org/r/1015323 (https://phabricator.wikimedia.org/T361219) (owner: 10EoghanGaffney) [17:01:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 846.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:03:18] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [17:04:16] (03PS1) 10Filippo Giunchedi: Revert "puppetserver: use client certs for naggen2 puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/1015203 [17:05:25] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [17:06:37] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [17:06:37] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:06:39] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1012.eqiad.wmnet [17:07:18] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "puppetserver: use client certs for naggen2 puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/1015203 (owner: 10Filippo Giunchedi) [17:09:33] 
!log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1005.eqiad.wmnet with OS bullseye [17:09:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9670392 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye [17:11:40] (03PS1) 10JMeybohm: k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) [17:13:40] (03PS1) 10Gergő Tisza: objectcache: Restore default keyspace for LocalServerCache service [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1015204 (https://phabricator.wikimedia.org/T358346) [17:14:03] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [17:14:19] 06SRE, 06Infrastructure-Foundations: Reduce 'root' Email Noise by Migrating Reprepro Emails to Google Group - https://phabricator.wikimedia.org/T361262#9670409 (10andrea.denisse) [17:14:48] (03CR) 10CI reject: [V:04-1] k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [17:15:03] (03PS2) 10JMeybohm: k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) [17:15:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:27] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[2052-2054].codfw.wmnet [17:16:50] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic2037.codfw.wmnet [17:17:06] (03PS3) 10JMeybohm: k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) [17:17:38] !log 
btullis@deploy1002 Started deploy [analytics/refinery@9c2ca38]: Analytics refinery deploy to test git-lfs [analytics/refinery@9c2ca387] [17:17:58] !log btullis@deploy1002 Finished deploy [analytics/refinery@9c2ca38]: Analytics refinery deploy to test git-lfs [analytics/refinery@9c2ca387] (duration: 00m 19s) [17:18:27] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1758/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [17:19:02] !log btullis@deploy1002 Started deploy [analytics/refinery@9c2ca38] (thin): Analytics refinery deploy to test git-lfs THIN [analytics/refinery@9c2ca387] [17:20:37] (03CR) 10CI reject: [V:04-1] k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [17:21:16] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [17:22:29] !log btullis@deploy1002 Finished deploy [analytics/refinery@9c2ca38] (thin): Analytics refinery deploy to test git-lfs THIN [analytics/refinery@9c2ca387] (duration: 03m 26s) [17:23:07] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic2037.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [17:23:57] !log btullis@deploy1002 Started deploy [analytics/refinery@9c2ca38] (hadoop-test): Analytics refinery deploy to test git-lfs TEST [analytics/refinery@9c2ca387] [17:26:18] !log btullis@deploy1002 Finished deploy [analytics/refinery@9c2ca38] (hadoop-test): Analytics refinery deploy to test git-lfs TEST [analytics/refinery@9c2ca387] (duration: 02m 20s) [17:27:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic2037.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [17:27:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:24] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic2037.codfw.wmnet [17:27:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1005.eqiad.wmnet with OS bullseye [17:27:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9670462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed w... [17:35:27] (03CR) 10Dzahn: ci: switch envoy SSL provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [17:39:03] James_F: Can you have a look at https://gerrit.wikimedia.org/r/c/integration/config/+/1015079 ? [17:39:28] !log joal@deploy1002 Started deploy [airflow-dags/analytics@f64680f]: Regular deploy of Analytics airflow dags [airflow-dags/analytics@f64680fc] [17:39:56] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@f64680f]: Regular deploy of Analytics airflow dags [airflow-dags/analytics@f64680fc] (duration: 00m 27s) [17:41:08] 10ops-eqiad, 06SRE, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9670481 (10RKemper) 05Resolved→03In progress @VRiley-WMF In netbox I see `cloudelastic1003` still listed as decommissioning, whereas the other cloudelastic hosts are ma... [17:53:59] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission cloudelastic100[1-4].wikimedia.org - 14https://phabricator.wikimedia.org/T358046#9670564 (10VRiley-WMF) 14@RKemper Thanks for bringing this up! 
I missed running the script for this device. It's been run and decommissioned. [17:54:00] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission cloudelastic100[1-4].wikimedia.org - 14https://phabricator.wikimedia.org/T358046#9670577 (10VRiley-WMF) 05In progress→03Resolved [17:54:13] 10ops-eqiad, 06SRE, 10decommission-hardware: 14decommission cloudelastic100[1-4].wikimedia.org - 14https://phabricator.wikimedia.org/T358046#9670578 (10RKemper) 14>>! In T358046#9670564, @VRiley-WMF wrote: > @RKemper Thanks for bringing this up! I missed running the script for this device. It's been run... [17:57:08] (03PS1) 10Andrew Bogott: etcd:v3: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) [17:57:37] (03CR) 10CI reject: [V:04-1] etcd:v3: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [17:59:19] (03PS2) 10Gergő Tisza: objectcache: Restore default keyspace for LocalServerCache service [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1015204 (https://phabricator.wikimedia.org/T358346) [17:59:27] (03PS2) 10Andrew Bogott: etcd:v3: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) [18:00:04] jeena and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1800). 
[18:00:42] (03PS3) 10Andrew Bogott: etcd:v3: Don't rely on 'etcd' group or user before installing etcd package [puppet] - 10https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) [18:00:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1015363 (https://phabricator.wikimedia.org/T349207) (owner: 10Andrew Bogott) [18:06:53] jeena: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1013392 is a train blocker fix, will need backport. We ran into some CI difficulties with the master version, hopefully it will merge now. [18:07:20] tgr I am standing by whenever you are ready [18:10:05] (03CR) 10Dzahn: [C:03+2] releases: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014611 (owner: 10Dzahn) [18:11:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:11:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:16:28] (03PS2) 10Dzahn: releases: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014611 [18:20:26] (03CR) 10Dzahn: [C:04-1] "the certificate removal wasn't supposed to be related to this.. 
should be on another change" [puppet] - 10https://gerrit.wikimedia.org/r/1014610 (owner: 10Dzahn) [18:20:31] (03PS2) 10Dzahn: doc: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014610 [18:22:59] (03PS3) 10Ebernhardson: cirrus: More reliable reporting of reindexing status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014593 [18:22:59] (03CR) 10Ebernhardson: cirrus: More reliable reporting of reindexing status (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014593 (owner: 10Ebernhardson) [18:25:24] (03CR) 10Dzahn: [V:03+2 C:03+2] releases: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014611 (owner: 10Dzahn) [18:25:30] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1005.eqiad.wmnet with OS bullseye [18:25:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9670709 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye [18:29:13] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:30:56] (03CR) 10Gmodena: [C:03+1] webrequest: disable canary events. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015260 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [18:31:44] (03CR) 10Ebernhardson: cirrus: More reliable reporting of reindexing status (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014593 (owner: 10Ebernhardson) [18:32:03] jeena: I think best bet is to deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1015204 without merging into master (which would break postgres CI) [18:32:14] we can figure out later how to make that test pass [18:32:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:32:40] okay [18:33:14] are you deploying the patch or should I? [18:33:27] yes I will backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1015204 now [18:34:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1015204 (https://phabricator.wikimedia.org/T358346) (owner: 10Gergő Tisza) [18:43:42] (03CR) 10Dzahn: [V:03+2] "This installed the package and service. 
To actually get data on dashboards the next step would be to add config to prometheus to scrape it" [puppet] - 10https://gerrit.wikimedia.org/r/1014611 (owner: 10Dzahn) [18:44:53] (03PS6) 10Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [18:46:07] (03PS1) 10Dzahn: prometheus: add config to scrape apache data from releases servers [puppet] - 10https://gerrit.wikimedia.org/r/1015372 [18:47:59] (03PS1) 10Tchanders: Deploy partial action blocks everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015373 (https://phabricator.wikimedia.org/T353496) [18:48:01] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1005.eqiad.wmnet with OS bullseye [18:48:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9670854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed w... 
[18:49:16] (03CR) 10Dzahn: [V:03+2 C:03+2] "next: https://gerrit.wikimedia.org/r/1015372" [puppet] - 10https://gerrit.wikimedia.org/r/1014611 (owner: 10Dzahn) [18:51:17] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:51:19] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:52:08] (03PS3) 10Dzahn: ci: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) [18:52:13] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:52:21] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:52:42] (03CR) 10Dzahn: ci: switch envoy SSL provider to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [18:53:47] (03CR) 10Scott French: Improve support for mirroring the full keyspace (031 comment) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [18:53:50] (03Merged) 10jenkins-bot: objectcache: Restore default keyspace for LocalServerCache service [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1015204 (https://phabricator.wikimedia.org/T358346) (owner: 10Gergő Tisza) [18:54:24] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1015204|objectcache: Restore default keyspace for LocalServerCache service (T358346 T361177)]] [18:54:29] T358346: Introduce ObjectCacheFactory to MediaWiki core - https://phabricator.wikimedia.org/T358346 [18:54:30] T361177: APCU cache mixup across wikis (Incorrect namespace displayed as title on a zh.wikivoyage page) - https://phabricator.wikimedia.org/T361177 [18:56:52] !log jhuneidi@deploy1002 tgr and jhuneidi: Backport for [[gerrit:1015204|objectcache: Restore default keyspace for 
LocalServerCache service (T358346 T361177)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:57:11] tgr do you need to check anything on mwdebug? [18:58:30] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1005.eqiad.wmnet with OS bullseye [18:58:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9670900 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye [19:01:12] 10ops-codfw, 06Data-Platform-SRE: Fatal error detected on elastic2088 - https://phabricator.wikimedia.org/T361286 (10bking) 03NEW [19:03:36] (03CR) 10Dzahn: [C:03+2] ci: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:03:42] jouncebot: nowandenext [19:03:46] jouncebot: nowandnext [19:03:46] For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T1800) [19:03:47] In 0 hour(s) and 56 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T2000) [19:04:14] !log CI (contint) - replacing envoy SSL cert (puppet CA -> cfssl) [19:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:53] jeena: will do some tests [19:06:40] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1015372 (owner: 10Dzahn) [19:06:59] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9670979 (10Aklapper) T330944 is a task which requires being a member of #WMF-NDA on Phab. @DBu-WMF: I've made you a member now (after verifying your account via https://meta.wikimedia.org/wiki/Special:... 
[19:08:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:08:51] (CR) Dzahn: [C:+2] "before:" [puppet] - https://gerrit.wikimedia.org/r/1014132 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [19:08:57] ugh, I have no idea if https://phabricator.wikimedia.org/T361177 is fixed [19:09:06] umm yeah I tried to check too [19:09:09] none of the strings mentioned in the task seem to appear on the page [19:09:41] but since it was rolled back, and it seems like there's no difference between the two, that's good right? [19:10:14] oh right, stupid me. That code is only on testwikis now. [19:11:02] :P I forgot about that too [19:11:02] still seems not ideal that neither the "expected" or the "actual" part of the task matches the page. Looks like the expected version isn't actually expected... [19:11:29] I can't seem to find where is [19:12:21] <tgr> nvm, it does work when logged out [19:12:35] <jeena> okay cool, so should I go ahead and continue sync? [19:12:53] <tgr> let me check if anything is on fire on testwiki [19:12:58] <jeena> okay [19:14:19] <logmsgbot> !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1005.eqiad.wmnet with OS bullseye [19:14:38] <wikibugs> ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9671011 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed w... [19:16:41] <tgr> jeena: seems OK [19:16:48] <jeena> thanks tgr! [19:16:55] <logmsgbot> !log jhuneidi@deploy1002 tgr and jhuneidi: Continuing with sync [19:17:18] <mutante> I just replaced SSL certs on CI machines..
nobody noticed anything :) [19:17:28] <jeena> :D [19:19:40] <jinxer-wm> (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:23:26] <jinxer-wm> (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:28:24] <logmsgbot> !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1015204|objectcache: Restore default keyspace for LocalServerCache service (T358346 T361177)]] (duration: 34m 00s) [19:28:30] <stashbot> T358346: Introduce ObjectCacheFactory to MediaWiki core - https://phabricator.wikimedia.org/T358346 [19:28:30] <stashbot> T361177: APCU cache mixup across wikis (Incorrect namespace displayed as title on a zh.wikivoyage page) - https://phabricator.wikimedia.org/T361177 [19:30:24] <jeena> the train blocker has been deployed, so I'll start a gradual rollout to all wikis now [19:31:03] <wikibugs> (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015376 (https://phabricator.wikimedia.org/T360156) [19:31:04] <wikibugs> (CR) TrainBranchBot: [C:+2] group0 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015376 (https://phabricator.wikimedia.org/T360156) (owner: TrainBranchBot) [19:31:49] <wikibugs> (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015376 (https://phabricator.wikimedia.org/T360156) (owner: TrainBranchBot) [19:35:55] <logmsgbot> !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:36:04]
<logmsgbot> !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:45:04] <logmsgbot> !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.24 refs T360156 [19:45:09] <stashbot> T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [19:48:56] <ryankemper> !log T353878 Updated cross cluster remote seed conf with latest master info: `ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst` [19:49:00] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:00] <stashbot> T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 [19:49:44] <wikibugs> (PS1) Dzahn: delete contint.wikimedia.org dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1015377 (https://phabricator.wikimedia.org/T360413) [19:51:02] <jeena> proceeding to group1 deployment [19:51:17] <logmsgbot> !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:51:25] <logmsgbot> !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:51:28] <wikibugs> (PS2) Dzahn: delete contint.wikimedia.org dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1015377 (https://phabricator.wikimedia.org/T360413) [19:51:38] <wikibugs> (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015378 (https://phabricator.wikimedia.org/T360156) [19:51:39] <wikibugs> (CR) TrainBranchBot: [C:+2] group1 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015378 (https://phabricator.wikimedia.org/T360156) (owner: TrainBranchBot) [19:52:25]
<wikibugs> (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015378 (https://phabricator.wikimedia.org/T360156) (owner: TrainBranchBot) [19:54:03] <wikibugs> ops-codfw, SRE, observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9671111 (herron) >>! In T361229#9669564, @Jhancock.wm wrote: > I have located and set aside the parts to be installed. > > I am available every week day between 1300 UTC and 1700 UTC... [19:54:19] <wikibugs> (CR) Dzahn: [V:+2 C:+2] delete contint.wikimedia.org dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1015377 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [19:56:41] <wikibugs> (PS1) Bking: elasticsearch: remove elastic2090 from psi cluster [puppet] - https://gerrit.wikimedia.org/r/1015379 (https://phabricator.wikimedia.org/T353878) [19:57:28] <wikibugs> (PS2) Bking: elasticsearch: remove elastic2090 from psi cluster [puppet] - https://gerrit.wikimedia.org/r/1015379 (https://phabricator.wikimedia.org/T353878) [19:58:04] <wikibugs> (PS1) Dzahn: ssl: delete contint.wikimedia.org cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1015380 (https://phabricator.wikimedia.org/T360413) [19:58:16] <wikibugs> (PS1) Ryan Kemper: elastic: move elastic2088 to insetup [puppet] - https://gerrit.wikimedia.org/r/1015381 (https://phabricator.wikimedia.org/T353878) [19:58:33] <wikibugs> (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1015381 (https://phabricator.wikimedia.org/T353878) (owner: Ryan Kemper) [19:58:59] <wikibugs> (CR) Dzahn: [C:+2] ssl: delete contint.wikimedia.org cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1015380 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [19:59:30] <wikibugs> (CR) Bking: [C:+2] elastic: move elastic2088 to insetup
[puppet] - https://gerrit.wikimedia.org/r/1015381 (https://phabricator.wikimedia.org/T353878) (owner: Ryan Kemper) [19:59:45] <wikibugs> (CR) Ryan Kemper: [C:+1] elasticsearch: remove elastic2090 from psi cluster [puppet] - https://gerrit.wikimedia.org/r/1015379 (https://phabricator.wikimedia.org/T353878) (owner: Bking) [20:00:05] <jouncebot> RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T2000). [20:00:05] <jouncebot> tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:44] <wikibugs> (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1015379 (https://phabricator.wikimedia.org/T353878) (owner: Bking) [20:00:45] <jeena> tgr: I'm sure you know already but just in case, I'm still rolling out the train [20:01:15] <tgr> no rush, I can deploy that patch another day if needed [20:02:30] <wikibugs> ops-codfw, SRE, Data-Platform-SRE: Fatal error detected on elastic2088 - https://phabricator.wikimedia.org/T361286#9671174 (RKemper) a:Papaul [20:04:01] <wikibugs> (CR) Bking: [C:+2] elasticsearch: remove elastic2090 from psi cluster [puppet] - https://gerrit.wikimedia.org/r/1015379 (https://phabricator.wikimedia.org/T353878) (owner: Bking) [20:05:40] <jinxer-wm> (SystemdUnitFailed) firing: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:49] <wikibugs> SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services -
https://phabricator.wikimedia.org/T360413#9671198 (Dzahn) [20:06:03] <logmsgbot> !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.24 refs T360156 [20:06:09] <wikibugs> SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9671199 (Dzahn) [20:06:17] <stashbot> T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [20:06:23] <logmsgbot> !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [20:06:25] <logmsgbot> !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [20:07:12] <wikibugs> (PS1) Andrew Bogott: gitlab-runners hiera: switch to newer puppetserver [puppet] - https://gerrit.wikimedia.org/r/1015382 (https://phabricator.wikimedia.org/T351452) [20:07:47] <logmsgbot> !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878 [20:07:49] <logmsgbot> !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878 [20:07:53] <stashbot> T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 [20:08:04] <wikibugs> (CR) Andrew Bogott: [C:+2] gitlab-runners hiera: switch to newer puppetserver [puppet] - https://gerrit.wikimedia.org/r/1015382 (https://phabricator.wikimedia.org/T351452) (owner: Andrew Bogott) [20:12:18] <wikibugs> (Abandoned) Hashar: httpbb: raise timeout for Barack Obama [puppet] - https://gerrit.wikimedia.org/r/1014425 (https://phabricator.wikimedia.org/T360867) (owner: Hashar) [20:18:37] <logmsgbot> !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.42.0-wmf.24
refs T360156 (duration: 12m 33s) [20:18:44] <stashbot> T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [20:19:33] <jeena> I'll do the final deploy to all wikis in a few minutes [20:22:13] <logmsgbot> !log hashar@deploy1002 Started deploy [integration/docroot@c89a404]: add CodeMirror to opensource.yaml - T359986 [20:22:18] <stashbot> T359986: Generate docs for CodeMirror 6 - https://phabricator.wikimedia.org/T359986 [20:22:20] <logmsgbot> !log hashar@deploy1002 Finished deploy [integration/docroot@c89a404]: add CodeMirror to opensource.yaml - T359986 (duration: 00m 06s) [20:29:51] <wikibugs> SRE, Infrastructure-Foundations, Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9671363 (Dzahn) fwiw, I don't see how it's related to S4 [20:30:31] <wikibugs> (PS1) TrainBranchBot: group2 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015385 (https://phabricator.wikimedia.org/T360156) [20:30:33] <wikibugs> (CR) TrainBranchBot: [C:+2] group2 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015385 (https://phabricator.wikimedia.org/T360156) (owner: TrainBranchBot) [20:30:46] <wikibugs> (PS3) Dzahn: doc: include ::profile::prometheus::apache_exporter [puppet] - https://gerrit.wikimedia.org/r/1014610 [20:31:04] <wikibugs> (CR) Dzahn: [C:+1] "fixed, it's just like the change on releases* now" [puppet] - https://gerrit.wikimedia.org/r/1014610 (owner: Dzahn) [20:31:18] <wikibugs> (Merged) jenkins-bot: group2 wikis to 1.42.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1015385 (https://phabricator.wikimedia.org/T360156) (owner: TrainBranchBot) [20:33:43] <wikibugs> (PS1) Bking: cumin: Add aliases for Elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1015387 (https://phabricator.wikimedia.org/T361292) [20:34:00] <wikibugs> (PS2) Bking: cumin: Add aliases for Elastic clusters [puppet] -
https://gerrit.wikimedia.org/r/1015387 (https://phabricator.wikimedia.org/T361292) [20:34:15] <jinxer-wm> (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:34:31] <wikibugs> (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1015387 (https://phabricator.wikimedia.org/T361292) (owner: Bking) [20:35:58] <wikibugs> (CR) Dzahn: [C:+2] prometheus: add config to scrape apache data from releases servers [puppet] - https://gerrit.wikimedia.org/r/1015372 (owner: Dzahn) [20:37:20] <wikibugs> (CR) Ryan Kemper: [C:+1] cumin: Add aliases for Elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1015387 (https://phabricator.wikimedia.org/T361292) (owner: Bking) [20:37:28] <wikibugs> (CR) Bking: [C:+2] cumin: Add aliases for Elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1015387 (https://phabricator.wikimedia.org/T361292) (owner: Bking) [20:37:48] <jeena> tgr: would you be able to take a look at this test failure I got during scap deployment? https://phabricator.wikimedia.org/P59007 [20:38:02] <jeena> I think it might be related to the earlier issue [20:38:53] <wikibugs> SRE, Infrastructure-Foundations, Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9671428 (Aklapper) @dzahn: The Space is displayed as a prefix of the task title, separated by a pipeline character.
[20:39:15] <jinxer-wm> (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:41:32] <wikibugs> ops-codfw, decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305 (RKemper) NEW [20:41:51] <wikibugs> ops-codfw, decommission-hardware: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305#9671463 (RKemper) [20:42:31] <thcipriani> hrm, looks like scap got a 503 when it requested https://de.wikipedia.org/wiki/Wikipedia:Hauptseite on mwdebug1001... [20:42:56] <thcipriani> but it looks fine now on that server [20:43:14] <tgr> yeah, I can't repro either [20:43:32] <tgr> does the test response get recorded in full somewhere? [20:43:33] <jeena> ohh I misinterpreted the body, got: ... expected: part [20:43:41] <thcipriani> manually retrying the httpbb test [20:43:48] <thcipriani> and they all seem to work. [20:43:53] <jeena> the got must have been the 503 response html [20:44:19] <thcipriani> jeena: I think it's maybe ok to try it again [20:44:24] <jeena> yeah I think so too [20:46:34] <wikibugs> SRE, Infrastructure-Foundations: Reduce 'root' Email Noise by Migrating Reprepro Emails to Google Group - https://phabricator.wikimedia.org/T361262#9671474 (bd808) I will be the jerk to ask why we should choose a Google Group rather than a Mailman list. Is this sensitive data that needs to be hidden from... [20:47:05] <tgr> There are no mwdebug1001 errors in logstash, so either the error happened outside MediaWiki or it failed so badly it couldn't even log it.
[20:47:23] <jeena> tests passed [20:48:29] <thcipriani> I would say maybe something happened too quickly after php restart, but why only dewiki: unclear. [20:51:29] <wikibugs> (PS3) Gergő Tisza: Enter deprecation trial for third-party cookie blocking [mediawiki-config] - https://gerrit.wikimedia.org/r/1015145 (https://phabricator.wikimedia.org/T359957) [20:52:22] <logmsgbot> !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1005.eqiad.wmnet with OS bullseye [20:52:39] <wikibugs> ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9671488 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye [20:52:47] <urbanecm> tgr: et al: is anyone still deploying anything, please? [20:53:47] <thcipriani> urbanecm: jeena is still rolling train, I believe [20:53:56] <jeena> yeah it's still deploying, sorry [20:53:57] <urbanecm> ah, okay [20:54:10] <urbanecm> please let me know when it's an appropriate time to make a change :) [20:54:28] <jeena> will do [20:54:52] <thcipriani> like become a farmer? I've been wondering the same. [20:56:00] <jeena> I want to be a farmer of something low maintenance :P [20:56:47] <thcipriani> probably the same amount of maintenance work, but much more tangible. [20:57:02] <bd808> hair farmer? [20:57:14] <jeena> :D [20:58:04] <jeena> daphnia?
[20:58:09] <logmsgbot> !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.24 refs T360156 [20:58:15] <stashbot> T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [20:58:27] <jeena> woohoo the train is at the final station [20:58:27] <wikibugs> (CR) Cwhite: [C:+2] logstash: remove configuration for logstash101[012] [puppet] - https://gerrit.wikimedia.org/r/1014048 (https://phabricator.wikimedia.org/T360950) (owner: Cwhite) [20:58:38] <rzl> cows: the original train blockers [20:58:50] <jeena> tgr: urbanecm I am releasing scap to you [20:59:01] <urbanecm> tgr: feel free to start :) [20:59:28] <thcipriani> rzl: hahaha [20:59:42] <urbanecm> or if you want, run scap for your change and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1014634. mine shouldn't be really impacting anything in prod anyway. [20:59:55] <urbanecm> thcipriani: you asked for it :). https://bash.toolforge.org/quip/9hfcho4BxE1_1c7smQs2 [20:59:58] <inflatador> !log bking@mwmaint1002 sudo apt-get install ripgrep (faster recursive grep) [21:00:00] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:03] <wikibugs> (PS3) Gergő Tisza: Add CommunityConfiguration log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) (owner: Urbanecm) [21:01:44] <wikibugs> ops-eqiad, decommission-hardware, Patch-For-Review: decommission logstash101[012] - https://phabricator.wikimedia.org/T360950#9671499 (colewhite) [21:01:49] <wikibugs> ops-eqiad, decommission-hardware, Patch-For-Review: decommission logstash101[012] - https://phabricator.wikimedia.org/T360950#9671502 (colewhite) a:colewhite→None [21:01:54] <wikibugs> (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015145
(https://phabricator.wikimedia.org/T359957) (owner: Gergő Tisza) [21:01:54] <wikibugs> (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) (owner: Urbanecm) [21:02:12] <tgr> thx jeena [21:02:42] <wikibugs> (Merged) jenkins-bot: Enter deprecation trial for third-party cookie blocking [mediawiki-config] - https://gerrit.wikimedia.org/r/1015145 (https://phabricator.wikimedia.org/T359957) (owner: Gergő Tisza) [21:05:02] <wikibugs> (PS4) Gergő Tisza: Add CommunityConfiguration log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) (owner: Urbanecm) [21:05:17] <wikibugs> (CR) TrainBranchBot: "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) (owner: Urbanecm) [21:06:04] <wikibugs> (Merged) jenkins-bot: Add CommunityConfiguration log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) (owner: Urbanecm) [21:06:22] <logmsgbot> !log tgr@deploy1002 Started scap: Backport for [[gerrit:1015145|Enter deprecation trial for third-party cookie blocking (T359957)]], [[gerrit:1014634|Add CommunityConfiguration log channel (T361072)]] [21:06:29] <stashbot> T359957: Enroll in Chrome third-party cookies deprecation trial - https://phabricator.wikimedia.org/T359957 [21:06:29] <stashbot> T361072: Include CommunityConfiguration-originating logs in Logstash - https://phabricator.wikimedia.org/T361072 [21:07:33] <wikibugs> (PS1) Andrew Bogott: wmcs puppetservers: stop pulling hiera from /etc/puppet/secrets [puppet] - https://gerrit.wikimedia.org/r/1015392 [21:07:37] <wikibugs> (PS1) Ebernhardson: cirrus: Restore traffic to codfw clusters [mediawiki-config] -
https://gerrit.wikimedia.org/r/1015393 [21:08:24] <wikibugs> (CR) CI reject: [V:-1] cirrus: Restore traffic to codfw clusters [mediawiki-config] - https://gerrit.wikimedia.org/r/1015393 (owner: Ebernhardson) [21:08:36] <logmsgbot> !log tgr@deploy1002 urbanecm and tgr: Backport for [[gerrit:1015145|Enter deprecation trial for third-party cookie blocking (T359957)]], [[gerrit:1014634|Add CommunityConfiguration log channel (T361072)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:05] * urbanecm cannot test anything [21:09:39] <jinxer-wm> (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2090-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:10:23] <wikibugs> (PS2) Ebernhardson: cirrus: Restore traffic to codfw clusters [mediawiki-config] - https://gerrit.wikimedia.org/r/1015393 [21:14:21] <wikibugs> (CR) Andrew Bogott: "The hosts with cruft in /etc/puppet/secrets are:" [puppet] - https://gerrit.wikimedia.org/r/1015392 (owner: Andrew Bogott) [21:14:39] <jinxer-wm> (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2090-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:14:39] <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress -
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:14:41] <logmsgbot> !log tgr@deploy1002 urbanecm and tgr: Continuing with sync [21:16:46] <wikibugs> (PS1) Cwhite: beta-logs: move jobs host duties to logging-logstash-03 [puppet] - https://gerrit.wikimedia.org/r/1014664 (https://phabricator.wikimedia.org/T353912) [21:17:33] <wikibugs> (CR) Cwhite: [C:+2] beta-logs: move jobs host duties to logging-logstash-03 [puppet] - https://gerrit.wikimedia.org/r/1014664 (https://phabricator.wikimedia.org/T353912) (owner: Cwhite) [21:24:31] <wikibugs> SRE, Infrastructure-Foundations: Reduce 'root' Email Noise by Migrating Reprepro Emails to Google Group - https://phabricator.wikimedia.org/T361262#9671541 (andrea.denisse) >>! In T361262#9671474, @bd808 wrote: > I will be the jerk to ask why we should choose a Google Group rather than a Mailman list. Is...
[21:25:53] <logmsgbot> !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1015145|Enter deprecation trial for third-party cookie blocking (T359957)]], [[gerrit:1014634|Add CommunityConfiguration log channel (T361072)]] (duration: 19m 30s) [21:25:58] <stashbot> T359957: Enroll in Chrome third-party cookies deprecation trial - https://phabricator.wikimedia.org/T359957 [21:25:59] <stashbot> T361072: Include CommunityConfiguration-originating logs in Logstash - https://phabricator.wikimedia.org/T361072 [21:33:06] <wikibugs> (PS1) Cwhite: opensearch: ensure cluster_wide curator job absent [puppet] - https://gerrit.wikimedia.org/r/1014665 [21:33:33] <wikibugs> (CR) CI reject: [V:-1] opensearch: ensure cluster_wide curator job absent [puppet] - https://gerrit.wikimedia.org/r/1014665 (owner: Cwhite) [21:34:15] <wikibugs> SRE, Infrastructure-Foundations: Reduce 'root' Email Noise by Migrating Reprepro Emails to Google Group - https://phabricator.wikimedia.org/T361262#9671570 (bd808) >>! In T361262#9671541, @andrea.denisse wrote: > because our emails are already integrated with Gmail, facilitating an effortless opt-in mech...
[21:34:49] <wikibugs> (PS2) Cwhite: opensearch: ensure cluster_wide curator job absent [puppet] - https://gerrit.wikimedia.org/r/1014665 [21:38:14] <wikibugs> (CR) Cwhite: [C:+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1014665/1759/" [puppet] - https://gerrit.wikimedia.org/r/1014665 (owner: Cwhite) [22:29:13] <jinxer-wm> (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:32:16] <jinxer-wm> (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:15] <jinxer-wm> (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 942.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:19:40] <jinxer-wm> (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:23:41] <jinxer-wm> (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki -
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:56:15] <jinxer-wm> (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 897.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded