[00:06:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060508 (owner: TrainBranchBot) [00:14:23] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:15] (PS1) RLazarus: mwscript_k8s: Add --image_version. [puppet] - https://gerrit.wikimedia.org/r/1060512 [00:19:34] (CR) RLazarus: [C:+2] mediawiki: Bump ttlSecondsAfterFinished for Jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1060184 (owner: RLazarus) [00:20:41] FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:38] (Merged) jenkins-bot: mediawiki: Bump ttlSecondsAfterFinished for Jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1060184 (owner: RLazarus) [00:24:23] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:41] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:26:02] jouncebot: nowandnext [00:26:02] No deployments scheduled for the next 5 hour(s) and 33 minute(s) [00:26:02] In 5 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0600) [00:26:02] In 5 hour(s) and 33 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0600) [00:26:38] scapping a quick helmfile-only scap deploy -- only touches the job template so it's a no-op for the prod deployments, only doing it to clean up the diffs [00:28:01] (CR) Scott French: [C:+1] "LGTM. Ping me when you need this merged and I'd be happy to do so." [puppet] - https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) (owner: Ahmon Dancy) [00:29:21] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1060184 [00:29:23] FIRING: [7x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:58] !log rzl@deploy1003 Finished scap: https://gerrit.wikimedia.org/r/1060184 (duration: 02m 33s) [00:34:23] FIRING: [7x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:26:18] (PS1) RLazarus: mediawiki: Build sidecars annotation dynamically [deployment-charts] - https://gerrit.wikimedia.org/r/1060515 [01:41:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:52:31] FIRING: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:52:44] (CR) Scott French: mwscript_k8s: Add --image_version. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1060512 (owner: RLazarus) [01:57:31] RESOLVED: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:05] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:10:08] (CR) Scott French: "Would it be possible to turn on at least one of these sidecars in the `mwscript_enabled` fixture to verify it shows up in the annotation?" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1060515 (owner: RLazarus) [02:20:25] FIRING: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:31] FIRING: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:27:31] RESOLVED: [3x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:28:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:23] FIRING: [2x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:24] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:25] RESOLVED: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2018:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:41] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:28:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:00:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:59:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:04:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:41:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0600) [06:00:05] marostegui, Amir1, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0600). [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:07:05] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:16:30] (PS1) Giuseppe Lavagetto: 
statsd-exporter: only scrape metrics from the prometheus ports [deployment-charts] - https://gerrit.wikimedia.org/r/1060669 (https://phabricator.wikimedia.org/T371885) [06:24:14] SRE, SRE-Access-Requests, Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10050082 (SLyngshede-WMF) Hi @XiaoXiao-WMF, when logging in, please try to run the command "kinit" that should prompt you for your password, and will give you a valid Kerber... [06:25:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) (owner: DCausse) [06:29:09] <_joe_> jouncebot: nowandnext [06:29:09] For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0600) [06:29:09] For the next 0 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0600) [06:29:09] In 0 hour(s) and 30 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0700) [06:29:28] <_joe_> ok still in my window [06:29:32] (CR) Giuseppe Lavagetto: [C:+2] statsd-exporter: only scrape metrics from the prometheus ports [deployment-charts] - https://gerrit.wikimedia.org/r/1060669 (https://phabricator.wikimedia.org/T371885) (owner: Giuseppe Lavagetto) [06:31:19] (Merged) jenkins-bot: statsd-exporter: only scrape metrics from the prometheus ports [deployment-charts] - https://gerrit.wikimedia.org/r/1060669 (https://phabricator.wikimedia.org/T371885) (owner: Giuseppe Lavagetto) [06:31:36] (PS1) Giuseppe Lavagetto: statsd-exporter: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1060673 [06:31:40] 
(CR) Slyngshede: [C:+2] data.yaml: Add toyofuku to deployment group. [puppet] - https://gerrit.wikimedia.org/r/1060338 (https://phabricator.wikimedia.org/T371650) (owner: Slyngshede) [06:33:19] (CR) Giuseppe Lavagetto: [C:+2] statsd-exporter: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1060673 (owner: Giuseppe Lavagetto) [06:34:56] (Merged) jenkins-bot: statsd-exporter: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1060673 (owner: Giuseppe Lavagetto) [06:35:52] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:36:15] (CR) Slyngshede: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene) [06:36:39] (CR) Slyngshede: [C:+1] "Sorry, I hadn't see the patch." [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene) [06:39:23] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:12] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [06:41:25] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [06:41:37] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [06:41:46] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [06:42:27] I am restarting gerrit [06:42:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:42:39] !log restarting Gerrit [06:42:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:24] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:46:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:47:51] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [06:48:04] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [06:48:16] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [06:48:25] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [06:49:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:50:19] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [06:50:28] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [06:50:29] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [06:50:44] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [06:50:45] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [06:50:54] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [06:50:55] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [06:51:06] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [06:51:07] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [06:51:14] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-misc: 
apply [06:51:15] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-misc: apply [06:51:21] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [06:51:22] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [06:51:31] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [06:51:32] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [06:51:43] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [06:51:45] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [06:51:51] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [06:51:52] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [06:51:59] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [06:57:26] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [06:57:33] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [06:57:34] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [06:57:44] <_joe_> just in time for the end of my window :) [06:57:45] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0700). [07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:11] o/ [07:00:32] I can deploy [07:01:11] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) (owner: DCausse) [07:02:00] (Merged) jenkins-bot: search: index stems for mul labels [mediawiki-config] - https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) (owner: DCausse) [07:02:31] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1060430|search: index stems for mul labels (T371401)]] [07:02:33] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [07:04:41] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1060430|search: index stems for mul labels (T371401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:06:51] !log dcausse@deploy1003 dcausse: Continuing with sync [07:11:34] !log dcausse@deploy1003 Finished scap: Backport for [[gerrit:1060430|search: index stems for mul labels (T371401)]] (duration: 09m 03s) [07:11:37] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [07:13:22] I'm done with the deploy [07:19:50] !log Restarted CI Jenkins for upgrade and plugin update # T371976 [07:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T367856)', diff saved to https://phabricator.wikimedia.org/P67248 and previous config saved to /var/cache/conftool/dbconfig/20240808-072458-marostegui.json [07:25:01] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:32:03] !log T371401: reindexing testwikidatawiki to index mul labels [07:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:06] T371401: Adapt search ranking for mul language code - 
https://phabricator.wikimedia.org/T371401 [07:37:35] (PS1) Brouberol: cloudnative-pg-cluster: small bugfixes [deployment-charts] - https://gerrit.wikimedia.org/r/1060752 (https://phabricator.wikimedia.org/T368240) [07:40:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P67249 and previous config saved to /var/cache/conftool/dbconfig/20240808-074005-marostegui.json [07:42:33] !log @deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:42:38] !log @deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:48:20] (PS1) NMW03: Enable protection indicators for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1060753 (https://phabricator.wikimedia.org/T371440) [07:49:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060753 (https://phabricator.wikimedia.org/T371440) (owner: NMW03) [07:55:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P67250 and previous config saved to /var/cache/conftool/dbconfig/20240808-075512-marostegui.json [07:55:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:57:25] FIRING: SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[07:59:52] SRE, SRE-tools, Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10050175 (hashar) That is still happening and is bit annoying. The reason for the warning is `su` is invoked with `-` which starts the shell as a long shell... [08:00:05] jnuche and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0800). [08:00:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:01:55] morning, train is currently blocked by https://phabricator.wikimedia.org/T371966 [08:02:25] RESOLVED: SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:53] SRE, Infrastructure-Foundations, netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10050198 (dcaro) That's to be expected when moving lots of data around, we can try to be smarter and/or limit the backfill throughput (and/or add QoS, coming soon!), but it's... 
[08:07:22] (CR) Filippo Giunchedi: [C:+1] icinga: add Tiziano Fogli to ctrl variables [puppet] - https://gerrit.wikimedia.org/r/1060438 (owner: Tiziano Fogli) [08:08:54] (PS2) NMW03: Enable protection indicators for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1060753 (https://phabricator.wikimedia.org/T371440) [08:09:47] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Openjdk upgrade - elukey@cumin1002 [08:10:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T367856)', diff saved to https://phabricator.wikimedia.org/P67251 and previous config saved to /var/cache/conftool/dbconfig/20240808-081019-marostegui.json [08:10:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [08:10:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:10:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [08:10:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T367856)', diff saved to https://phabricator.wikimedia.org/P67252 and previous config saved to /var/cache/conftool/dbconfig/20240808-081041-marostegui.json [08:12:43] (CR) Sohom Datta: [C:+1] Enable protection indicators for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1060753 (https://phabricator.wikimedia.org/T371440) (owner: NMW03) [08:13:38] !log restart tomcat on idp[1,2]003 to pick up the new openjdk [08:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:17] !log restart dump_ip_reputation.service on puppetserver1001 [08:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:04] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"vtrs1003+gerrit1004 - ayounsi@cumin1002" [08:23:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "vtrs1003+gerrit1004 - ayounsi@cumin1002" [08:24:23] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:29:56] (PS2) Filippo Giunchedi: grafana: set timeinterval 30s for Thanos [puppet] - https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) [08:30:04] !log T371401: reindexing wikidatawiki@eqiad to index mul labels [08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:08] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [08:33:56] (PS3) Ayounsi: Add request argument to validate() method [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) [08:33:56] (PS2) Ayounsi: Add validators for console(server) and power ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) [08:33:56] (PS1) Ayounsi: Remove Python 3.12 from env list [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060759 [08:36:19] (CR) CI reject: [V:-1] Add validators for console(server) and power ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi) [08:45:38] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:45:45] (PS1) Dreamy Jazz: Fix DefaultPresenter rejecting IPCountInfo instances [extensions/IPInfo] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1060760 
(https://phabricator.wikimedia.org/T371966) [08:46:00] (CR) Bugreporter: "Currently we also unblock Phabricator accounts when Wikitech account is unblocked. See https://gerrit.wikimedia.org/g/operations/mediawiki" [software/bitu] - https://gerrit.wikimedia.org/r/1060092 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede) [08:48:08] jouncebot: nowandnext [08:48:08] For the next 1 hour(s) and 11 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T0800) [08:48:08] In 1 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1000) [08:48:20] I want to backport a patch to unblock the train. [08:49:25] (CR) Dreamy Jazz: [C:+2] Fix DefaultPresenter rejecting IPCountInfo instances [extensions/IPInfo] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1060760 (https://phabricator.wikimedia.org/T371966) (owner: Dreamy Jazz) [08:50:08] (PS3) Ayounsi: Add validators for console(server) and power ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) [08:50:22] (PS1) Filippo Giunchedi: sre: disable series lint for KubernetesContainerReachingMemoryLimit + mcrouter [alerts] - https://gerrit.wikimedia.org/r/1060761 [08:52:52] (PS1) Filippo Giunchedi: mediawiki: remove etcdconfig up-to-date check [puppet] - https://gerrit.wikimedia.org/r/1060762 (https://phabricator.wikimedia.org/T322523) [08:53:44] (CR) Giuseppe Lavagetto: [C:+1] mediawiki: remove etcdconfig up-to-date check [puppet] - https://gerrit.wikimedia.org/r/1060762 (https://phabricator.wikimedia.org/T322523) (owner: Filippo Giunchedi) [08:54:22] (PS1) Chlod Alejandro: dtpwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1060763 (https://phabricator.wikimedia.org/T372031) [08:55:53] (CR) TrainBranchBot: [C:+2] 
"Approved by dreamyjazz@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1060760 (https://phabricator.wikimedia.org/T371966) (owner: Dreamy Jazz) [08:57:27] SRE, Observability-Alerting, serviceops-radar: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118#10050355 (fgiunchedi) Open→Resolved I'm going to call this resolved, we can reopen or start a new task to audit alerts causing floods on irc during incidents [08:57:59] (CR) Btullis: idp-test: Register airflow-analytics-test IDP services (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene) [08:59:14] (CR) Filippo Giunchedi: [C:+2] mediawiki: remove etcdconfig up-to-date check [puppet] - https://gerrit.wikimedia.org/r/1060762 (https://phabricator.wikimedia.org/T322523) (owner: Filippo Giunchedi) [09:00:27] (PS1) Chlod Alejandro: bdrwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1060764 (https://phabricator.wikimedia.org/T372031) [09:01:23] (PS2) Chlod Alejandro: bdrwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1060764 (https://phabricator.wikimedia.org/T372031) [09:09:11] (PS1) Chlod Alejandro: mswikisource: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1060765 (https://phabricator.wikimedia.org/T372031) [09:16:36] (PS1) Filippo Giunchedi: icinga: remove check_etcd_mw_config_lastindex [puppet] - https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) [09:21:26] (Merged) jenkins-bot: Fix DefaultPresenter rejecting IPCountInfo instances [extensions/IPInfo] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1060760 (https://phabricator.wikimedia.org/T371966) (owner: Dreamy Jazz) [09:21:46] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1060760|Fix 
DefaultPresenter rejecting IPCountInfo instances (T371966)]] [09:21:49] T371966: Wikimedia\Assert\ParameterElementTypeException: Bad value for parameter info['data']: all elements must be MediaWiki\IPInfo\Info\Info|MediaWiki\IPInfo\Info\IPoidInfo|MediaWiki\IPInfo\Info\BlockInfo|MediaWiki\IPInfo\Info\Contrib - https://phabricator.wikimedia.org/T371966 [09:21:49] (03PS1) 10David Caro: wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 [09:23:52] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1060760|Fix DefaultPresenter rejecting IPCountInfo instances (T371966)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:24:29] !log powercycle ml-serve2004 - host frozen, no ssh access, get sel shows "Multi-bit memory errors detected on a memory device at location(s) DIMM_A2." [09:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:34] klausman: --^ [09:24:56] oh, another :-/ [09:25:10] They must be sensing that the GPU hosts are already racked [09:25:57] I'll take care of it [09:27:57] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [09:27:57] (03PS1) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [09:28:22] (03CR) 10CI reject: [V:04-1] service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [09:31:50] RESOLVED: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:32:25] !log dreamyjazz@deploy1003 Finished scap: Backport for [[gerrit:1060760|Fix DefaultPresenter rejecting IPCountInfo instances (T371966)]] (duration: 10m 38s) [09:32:27] T371966: 
Wikimedia\Assert\ParameterElementTypeException: Bad value for parameter info['data']: all elements must be MediaWiki\IPInfo\Info\Info|MediaWiki\IPInfo\Info\IPoidInfo|MediaWiki\IPInfo\Info\BlockInfo|MediaWiki\IPInfo\Info\Contrib - https://phabricator.wikimedia.org/T371966 [09:32:56] I have a CPU soft lockup on dbstore1009. Power cycling it. [09:34:47] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Memory issues (ECC) with ml-serve2004.codfw.wmnet - https://phabricator.wikimedia.org/T372036 (10klausman) 03NEW [09:36:28] (03PS2) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [09:36:53] (03CR) 10CI reject: [V:04-1] service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [09:37:22] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1009.eqiad.wmnet with reason: Rebooting due to CPU soft lockup [09:37:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1009.eqiad.wmnet with reason: Rebooting due to CPU soft lockup [09:38:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Openjdk upgrade - elukey@cumin1002 [09:38:14] (03PS2) 10David Caro: wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 [09:39:49] (03PS3) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [09:40:22] train blocker is resolved, I'm gonna start the deployment in a couple of mins [09:40:32] (03PS4) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [09:41:45] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [09:42:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Thursday, August 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060763 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [09:42:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060764 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [09:42:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060765 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [09:45:56] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060780 (https://phabricator.wikimedia.org/T366962) [09:45:58] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060780 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [09:47:00] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060780 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [09:49:05] (03PS4) 10Ayounsi: Netbox script proxy: set to absent where possible [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [09:52:13] (03PS5) 10Ayounsi: Netbox script proxy: set to absent where possible [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [09:52:13] (03PS4) 10Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) [09:52:43] (03CR) 
10Btullis: [C:03+1] cloudnative-pg-cluster: small bugfixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060752 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [09:53:37] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.17 refs T366962 [09:53:39] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962 [09:53:45] (03PS5) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [09:53:45] (03PS6) 10Ayounsi: Netbox script proxy: set to absent [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [09:53:45] (03PS5) 10Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) [09:54:19] (03CR) 10FNegri: [C:03+1] "LGTM, maybe we can add "mypy" to the CI in this same patch?" [puppet] - 10https://gerrit.wikimedia.org/r/1060771 (owner: 10David Caro) [09:54:21] (03CR) 10Ayounsi: Netbox script proxy: set to absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:54:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:55:26] jnuche: okay to do a config deployment? 
[09:56:23] TheresNoTime: give me a few minutes to verify everything looks healthy [09:56:37] ack :) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1000) [10:02:31] (03PS3) 10Stevemunene: idp-test: Register airflow-test-k8s IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) [10:03:22] (03CR) 10Stevemunene: idp-test: Register airflow-test-k8s IDP services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:03:25] TheresNoTime: I'm going to have to rollback [10:04:11] jnuche: ack, will do another time! [10:05:15] (03CR) 10Slyngshede: [C:03+1] "Still OK from me :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:12:44] (03PS1) 10David Caro: rbd2backy2: fix types and minor bugs [puppet] - 10https://gerrit.wikimedia.org/r/1060783 [10:12:44] (03PS1) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [10:13:33] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060785 (https://phabricator.wikimedia.org/T366962) [10:13:36] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060785 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [10:14:35] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060785 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [10:16:11] (03CR) 10CI reject: [V:04-1] rbd2backy2: fix types and minor bugs [puppet] - 10https://gerrit.wikimedia.org/r/1060783 (owner: 10David Caro) [10:16:22] (03CR) 10CI reject: [V:04-1] wmcs-backups: add empty_trash 
command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [10:22:12] (03CR) 10FNegri: [C:03+1] "I think this is a reasonable workaround until we find the time to address the root cause, which is the broken ImageBackup.remove method." [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [10:26:54] (03PS1) 10Hashar: Do not use a login shell when dropping privileges [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1060789 (https://phabricator.wikimedia.org/T216832) [10:29:02] (03CR) 10Hashar: [C:03+1] Remove Python 3.12 from env list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060759 (owner: 10Ayounsi) [10:29:35] (03PS1) 10Kevin Bazira: ml-services: fix langid python module usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060790 (https://phabricator.wikimedia.org/T369344) [10:31:38] (03PS3) 10David Caro: wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 [10:31:39] (03CR) 10David Caro: wmcs-backup: fix all typing issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060771 (owner: 10David Caro) [10:31:39] (03PS2) 10David Caro: rbd2backy2: fix types and minor bugs [puppet] - 10https://gerrit.wikimedia.org/r/1060783 [10:31:39] (03PS2) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [10:36:21] (03CR) 10CI reject: [V:04-1] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [10:36:33] (03CR) 10CI reject: [V:04-1] wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 (owner: 10David Caro) [10:36:34] (03CR) 10CI reject: [V:04-1] rbd2backy2: fix types and minor bugs [puppet] - 
10https://gerrit.wikimedia.org/r/1060783 (owner: 10David Caro) [10:39:38] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.17 refs T366962 [10:39:41] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962 [10:42:21] (03PS1) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [10:44:25] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:19] (03CR) 10CI reject: [V:04-1] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 (owner: 10David Caro) [10:45:55] (03CR) 10David Caro: "hmpf...different black versions give different results xd" [puppet] - 10https://gerrit.wikimedia.org/r/1060783 (owner: 10David Caro) [10:50:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:55:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:02:41] (03CR) 10Klausman: [C:03+1] ml-services: fix langid python module usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060790 
(https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [11:13:11] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [11:14:55] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042 (10phaultfinder) 03NEW [11:18:05] (03PS1) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [11:19:11] (03PS4) 10David Caro: wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 [11:19:12] (03PS3) 10David Caro: rbd2backy2: fix types and minor bugs [puppet] - 10https://gerrit.wikimedia.org/r/1060783 [11:19:12] (03PS3) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [11:19:12] (03PS2) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [11:19:13] (03PS2) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [11:20:36] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1057000 (https://phabricator.wikimedia.org/T371573) (owner: 10Dzahn) [11:21:43] (03CR) 10CI reject: [V:04-1] wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 (owner: 10David Caro) [11:22:31] (03CR) 10David Caro: "The image backup does not trash anything, I thought it was cinder (openstack) when deleting the volume, are you sure it happens when the b" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:25:22] (03CR) 10CI reject: [V:04-1] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:27:59] (03CR) 10CI reject: [V:04-1] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 
10https://gerrit.wikimedia.org/r/1060794 (owner: 10David Caro) [11:28:05] (03CR) 10CI reject: [V:04-1] wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 (owner: 10David Caro) [11:29:53] (03PS4) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [11:29:54] (03PS3) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [11:29:54] (03PS3) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [11:34:08] (03CR) 10CI reject: [V:04-1] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:34:56] (03CR) 10CI reject: [V:04-1] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 (owner: 10David Caro) [11:35:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica to new version [11:35:13] (03CR) 10CI reject: [V:04-1] wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 (owner: 10David Caro) [11:35:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:47] ^ should resolve soon [11:39:24] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:26] (03CR) 
10FNegri: [C:03+1] wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 (owner: 10David Caro) [11:40:41] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:04] (03CR) 10FNegri: "What I think is happening (but I'm not 100% sure) is described more in detail in the task:" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:43:49] Hi! I'd like to deploy a couple of config patches early, any concerns? 
(I know we did a rollback recently) [11:44:46] (03PS2) 10Chlod Alejandro: dtpwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060763 (https://phabricator.wikimedia.org/T372031) [11:45:13] (03CR) 10FNegri: "So Cinder is the one putting the image in the Trash, but if wmcs-backup cleaned up the backup snapshots correctly I expect the images in t" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:46:36] TheresNoTime: hi, fine from my side [11:46:40] ack [11:47:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060763 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [11:47:50] (03Merged) 10jenkins-bot: dtpwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060763 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [11:48:12] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060763|dtpwiki: add custom logos (T372031)]] [11:48:14] T372031: Set logos for new Malaysian wikis - https://phabricator.wikimedia.org/T372031 [11:51:46] (03PS5) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [11:51:46] (03PS4) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [11:51:46] (03PS4) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [11:52:18] !log samtar@deploy1003 chlod, samtar: Backport for [[gerrit:1060763|dtpwiki: add custom logos (T372031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:53:40] (03PS6) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [11:53:40] 
(03PS5) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [11:53:40] (03PS5) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [11:53:46] !log samtar@deploy1003 chlod, samtar: Continuing with sync [11:55:38] (03CR) 10David Caro: "The backups always keep one snapshot around iirc, not sure if that is strictly needed, but as long as that snapshot is there it has no con" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:57:09] (03CR) 10FNegri: [C:03+1] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:57:53] (03CR) 10CI reject: [V:04-1] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [11:58:23] !log samtar@deploy1003 Finished scap: Backport for [[gerrit:1060763|dtpwiki: add custom logos (T372031)]] (duration: 10m 10s) [11:58:25] T372031: Set logos for new Malaysian wikis - https://phabricator.wikimedia.org/T372031 [11:58:40] 06SRE, 06Infrastructure-Foundations, 10netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10050758 (10cmooney) >>! In T371879#10049699, @Dzahn wrote: > We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 toda... 
[11:58:43] (2 more to do) [11:58:44] (03CR) 10CI reject: [V:04-1] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 (owner: 10David Caro) [11:58:53] (03CR) 10CI reject: [V:04-1] wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 (owner: 10David Caro) [11:59:04] (03PS3) 10Chlod Alejandro: bdrwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060764 (https://phabricator.wikimedia.org/T372031) [11:59:09] (03CR) 10Ayounsi: [C:03+2] Remove Python 3.12 from env list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060759 (owner: 10Ayounsi) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1200) [12:00:05] (03CR) 10Samtar: [C:03+2] bdrwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060764 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [12:00:51] (03Merged) 10jenkins-bot: bdrwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060764 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [12:01:16] (03Merged) 10jenkins-bot: Remove Python 3.12 from env list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060759 (owner: 10Ayounsi) [12:01:57] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060764|bdrwiki: add custom logos (T372031)]] [12:05:34] !log samtar@deploy1003 chlod, samtar: Backport for [[gerrit:1060764|bdrwiki: add custom logos (T372031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:05:36] T372031: Set logos for new Malaysian wikis - https://phabricator.wikimedia.org/T372031 [12:05:52] (03CR) 10FNegri: [C:03+1] "I don't remember the details but yes, the snapshots can not be deleted _before_ the image is deleted. 
IIRC the reason is that snapshots ar" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [12:06:47] !log samtar@deploy1003 chlod, samtar: Continuing with sync [12:08:40] (03PS1) 10Slyngshede: PermissionRequest validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 [12:11:17] !log samtar@deploy1003 Finished scap: Backport for [[gerrit:1060764|bdrwiki: add custom logos (T372031)]] (duration: 09m 20s) [12:11:20] T372031: Set logos for new Malaysian wikis - https://phabricator.wikimedia.org/T372031 [12:11:57] (03PS2) 10Chlod Alejandro: mswikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060765 (https://phabricator.wikimedia.org/T372031) [12:12:07] (1 more) [12:12:07] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: small bugfixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060752 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:13:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060765 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [12:14:06] (03Merged) 10jenkins-bot: mswikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060765 (https://phabricator.wikimedia.org/T372031) (owner: 10Chlod Alejandro) [12:14:29] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060765|mswikisource: add custom logos (T372031)]] [12:16:50] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1060816 (owner: 10L10n-bot) [12:18:15] !log samtar@deploy1003 chlod, samtar: Backport for [[gerrit:1060765|mswikisource: add custom logos (T372031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:18:17] T372031: Set logos for new Malaysian wikis - https://phabricator.wikimedia.org/T372031 [12:18:45] !log samtar@deploy1003 chlod, samtar: Continuing with sync [12:19:01] (03CR) 10David Caro: "But when images are in the trash, the snapshots (and the images) are not listed for the wmcs-backups to know they still exist, it would ha" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [12:19:12] (03PS7) 10David Caro: wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) [12:19:12] (03PS6) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [12:19:12] (03PS6) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [12:20:14] (03CR) 10David Caro: "bad quotes..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [12:20:32] (03CR) 10David Caro: [V:03+1] "Working and ready" [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [12:22:32] !log T371401: reindexing wikidatawiki@codfw to index mul labels [12:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:34] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [12:23:17] !log samtar@deploy1003 Finished scap: Backport for [[gerrit:1060765|mswikisource: add custom logos (T372031)]] (duration: 08m 47s) [12:23:19] T372031: Set logos for new Malaysian wikis - https://phabricator.wikimedia.org/T372031 [12:24:06] (03PS3) 10DCausse: search: use mul fallback for manually-tuned search profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) [12:24:06] (03PS4) 10DCausse: search: use the stem field when searching mul labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) [12:24:14] (03CR) 10CI reject: [V:04-1] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 (owner: 10David Caro) [12:24:25] (03CR) 10CI reject: [V:04-1] wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 (owner: 10David Caro) [12:25:32] I am done with my config changes, thanks :D [12:27:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10050899 (10XiaoXiao-WMF) HI @SLyngshede-WMF, I have done these steps a while ago, here is what I see - it seems that it is only valid for a few days, does it look right to yo... 
[12:30:13] I'm rolling the train forward again in a few minutes [12:32:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10050928 (10SLyngshede-WMF) Yes, the tickets are short lived, so you need to run the kinit regularly. See: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Kerberos... [12:36:19] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060825 (https://phabricator.wikimedia.org/T366962) [12:36:21] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060825 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [12:37:03] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060825 (https://phabricator.wikimedia.org/T366962) (owner: 10TrainBranchBot) [12:37:35] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10050973 (10ayounsi) This went very well until it didn't. Changes fully rolled back. The cookbook chang... [12:43:10] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10050996 (10fgiunchedi) Thank you @ayounsi for the write up! I agree with your preferred option, and arg... 
[12:43:41] (03PS1) 10Brouberol: deployment_server: register a mapping between PG versions and image tags [puppet] - 10https://gerrit.wikimedia.org/r/1060827 (https://phabricator.wikimedia.org/T368240) [12:43:53] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060827 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:44:29] (03PS1) 10Brouberol: cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) [12:45:26] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:45:29] (03PS2) 10Brouberol: cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) [12:47:04] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.17 refs T366962 [12:47:08] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962 [12:47:53] (03CR) 10Elukey: service::uwsgi: add $ensure variable for clean removal (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [12:49:06] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [12:50:02] (03PS3) 10Brouberol: cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) [12:52:00] (03PS4) 10Brouberol: cloudnative-pg-cluster: rely on the common_images 
data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) [12:54:37] (03PS5) 10Brouberol: cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) [12:54:58] (03CR) 10David Caro: [V:03+1] wmcs-backups: add empty_trash command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [12:56:33] (03CR) 10Elukey: Netbox script proxy: set to absent (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [12:56:50] (03CR) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [12:57:58] (03CR) 10Elukey: [C:03+1] Remove netbox 3 references [puppet] - 10https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [12:58:36] (03CR) 10Elukey: [C:03+1] Remove "netbox4" upgrade flag [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1300). [13:00:04] Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] (03CR) 10Ayounsi: [C:03+2] Remove netbox 3 references [puppet] - 10https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [13:00:24] * TheresNoTime can't deploy now, sorry :D [13:00:29] (03CR) 10Elukey: check_netbox_report.py: reports -> scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [13:00:41] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:49] (03CR) 10Elukey: [C:03+1] Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: 10Ayounsi) [13:02:08] (03CR) 10Elukey: [C:03+1] Enable validators on Netbox-next for console(server) and power ports [puppet] - 10https://gerrit.wikimedia.org/r/1060435 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:02:44] (03CR) 10Elukey: "This is similar to what filed for netbox-next right? Should this wait for more testing in there?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1060436 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:04:24] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:06:01] (03CR) 10Ayounsi: [C:03+2] Remove "netbox4" upgrade flag [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [13:06:11] (03PS3) 10Ayounsi: Remove "netbox4" upgrade flag [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) [13:08:19] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: 10Ayounsi) [13:09:07] (03CR) 10FNegri: [C:03+1] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [13:11:00] (03CR) 10David Caro: [C:03+2] wmcs-backup: fix all typing issues [puppet] - 10https://gerrit.wikimedia.org/r/1060771 (owner: 10David Caro) [13:11:11] (03CR) 10David Caro: [C:03+2] rbd2backy2: fix types and minor bugs [puppet] - 10https://gerrit.wikimedia.org/r/1060783 (owner: 10David Caro) [13:11:15] (03CR) 10David Caro: [V:03+1 C:03+2] wmcs-backups: add empty_trash command [puppet] - 10https://gerrit.wikimedia.org/r/1060784 (https://phabricator.wikimedia.org/T358774) (owner: 10David Caro) [13:21:49] (03PS7) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [13:21:49] (03PS7) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [13:24:13] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: 
name=clouddb1018.eqiad.wmnet,service=s2 [13:24:16] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7 [13:24:54] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10051164 (10phaultfinder) [13:25:04] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Reimaging clouddb1018 T365424 [13:25:15] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:25:17] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Reimaging clouddb1018 T365424 [13:25:40] (03CR) 10CI reject: [V:04-1] wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 (owner: 10David Caro) [13:26:02] (03CR) 10Ayounsi: Netbox script proxy: set to absent (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:28:35] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1018.eqiad.wmnet with OS bookworm [13:29:38] (03PS6) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [13:29:38] (03PS7) 10Ayounsi: Netbox script proxy: set to absent [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [13:29:38] (03PS6) 10Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) [13:29:51] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:30:03] (03CR) 10CI reject: [V:04-1] service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 (owner: 10Ayounsi) [13:32:31] FIRING: [2x] 
ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:33:02] (03PS7) 10Ayounsi: service::uwsgi: add $ensure variable for clean removal [puppet] - 10https://gerrit.wikimedia.org/r/1060773 [13:33:02] (03PS8) 10Ayounsi: Netbox script proxy: set to absent [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [13:33:03] (03PS7) 10Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) [13:33:39] (03CR) 10Kevin Bazira: [C:03+2] ml-services: fix langid python module usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060790 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [13:34:39] (03Merged) 10jenkins-bot: ml-services: fix langid python module usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060790 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [13:34:47] (03CR) 10Ayounsi: "yep, I'll merge that only if no issues on -next." 
[puppet] - 10https://gerrit.wikimedia.org/r/1060436 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:35:01] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:37:31] RESOLVED: [3x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:14] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1018.eqiad.wmnet with reason: host reimage [13:41:19] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [13:43:10] (03PS8) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [13:44:11] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1018.eqiad.wmnet with reason: host reimage [13:44:33] (03CR) 10Ayounsi: Netbox script proxy: set to absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:44:41] (03CR) 10Ayounsi: [C:03+2] Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: 10Ayounsi) [13:47:03] (03Merged) 10jenkins-bot: Add request argument to validate() method [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: 10Ayounsi) [13:47:32] (03PS1) 10Ladsgroup: Add missing close tags to #contentSub message [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060839 (https://phabricator.wikimedia.org/T372054) [13:47:43] jouncebot: nowandnext 
[13:47:43] For the next 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1300) [13:47:43] In 1 hour(s) and 12 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1500) [13:47:47] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:47:57] (03CR) 10Ladsgroup: [C:03+2] Add missing close tags to #contentSub message [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060839 (https://phabricator.wikimedia.org/T372054) (owner: 10Ladsgroup) [13:48:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:51:09] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [13:51:19] (03PS11) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [13:51:19] (03PS13) 10Andrew Bogott: wmf_sink: replace targeted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [13:51:19] (03PS1) 10Andrew Bogott: wmf_sink: use project_id rather than project_name for proxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1060841 [13:52:13] (03CR) 10Ayounsi: check_netbox_report.py: reports -> scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [13:53:23] (03CR) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [13:53:28] (03CR) 10Andrew Bogott: [C:03+2] dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] 
- 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [13:53:52] (03CR) 10Andrew Bogott: [C:03+2] wmf_sink: replace targeted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [13:54:04] (03CR) 10Andrew Bogott: [C:03+2] wmf_sink: use project_id rather than project_name for proxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1060841 (owner: 10Andrew Bogott) [13:54:43] (03PS1) 10JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) [13:54:45] (03PS1) 10JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928) [13:58:15] (03Merged) 10jenkins-bot: Add missing close tags to #contentSub message [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060839 (https://phabricator.wikimedia.org/T372054) (owner: 10Ladsgroup) [13:59:29] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1060839|Add missing close tags to #contentSub message (T372054)]] [13:59:32] T372054: Unclosed #contentSub tag causes pages to be rendered incorrectly - https://phabricator.wikimedia.org/T372054 [14:00:54] !log stevemunene@deploy1003 Started deploy [airflow-dags/analytics_test@2a3060e]: (no justification provided) [14:01:28] !log stevemunene@deploy1003 Finished deploy [airflow-dags/analytics_test@2a3060e]: (no justification provided) (duration: 00m 33s) [14:02:56] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1060839|Add missing close tags to #contentSub message (T372054)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:05:17] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [14:05:42] (03CR) 10Andrew Bogott: 
[C:03+2] New files, templates and manifests for OpenStack Caracal [puppet] - 10https://gerrit.wikimedia.org/r/1059408 (https://phabricator.wikimedia.org/T369044) (owner: 10Andrew Bogott) [14:07:23] (03PS1) 10Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) [14:07:25] (03PS1) 10Elukey: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) [14:08:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10051338 (10Ladsgroup) >>! In T371342#10042745, @VRiley-WMF wrote: > @Marostegui Is there a preferred time for us to maybe offline this device? I would like try updating some of the firm... [14:09:45] !log ladsgroup@deploy1003 Finished scap: Backport for [[gerrit:1060839|Add missing close tags to #contentSub message (T372054)]] (duration: 10m 15s) [14:10:13] (03CR) 10Ladsgroup: "LGTM to roll out to mwdebug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [14:12:22] (03CR) 10Btullis: [C:03+1] deployment_server: register a mapping between PG versions and image tags [puppet] - 10https://gerrit.wikimedia.org/r/1060827 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [14:12:36] (03PS2) 10Slyngshede: PermissionRequest validation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 [14:12:39] (03CR) 10Btullis: [C:03+1] cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [14:13:06] (03CR) 10Brouberol: [C:03+2] deployment_server: register a mapping between PG versions and image tags [puppet] - 10https://gerrit.wikimedia.org/r/1060827 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [14:18:17] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060828 (https://phabricator.wikimedia.org/T368240) (owner: 10Brouberol) [14:22:39] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10051427 (10Jhancock.wm) [14:23:49] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10051435 (10Jhancock.wm) [14:23:58] (03CR) 10Vgutierrez: [C:04-1] ACMEChiefConfig: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055232 (owner: 10Ncmonitor) [14:24:53] !log fnegri@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fnegri@cumin1002" [14:24:57] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fnegri@cumin1002" [14:24:58] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1018.eqiad.wmnet with OS bookworm [14:25:05] 06SRE, 06Infrastructure-Foundations, 10netops: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061 (10cmooney) 03NEW 
p:05Triage→03Low [14:28:00] 07sre-alert-triage, 10Data-Platform-SRE (2024.07.29 - 2024.08.16): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10051476 (10BTullis) [14:28:16] (03CR) 10CI reject: [V:04-1] doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [14:32:11] 06SRE, 06Infrastructure-Foundations, 10netops: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061#10051491 (10cmooney) [14:32:31] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2009.codfw.wmnet with OS bookworm [14:32:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2010.codfw.wmnet with OS bookworm [14:32:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2011.codfw.wmnet with OS bookworm [14:33:30] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [14:33:57] (03PS11) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [14:35:09] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10051517 (10BTullis) This sounds sensible to me, too. Given that we have approval for the proposal as it stands, how do we proceed? Do we just need to update th... 
[14:35:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:31] RESOLVED: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:24] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:24] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:27] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [14:40:31] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 [14:40:48] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10051529 (10Jelto) [14:40:52] (03PS2) 10Kevin Bazira: ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) [14:41:20] (03PS4) 10Andrew Bogott: Add wmcs-empty-rbd-trash script 
[puppet] - 10https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904) [14:42:09] (03CR) 10Andrew Bogott: [C:03+2] Add wmcs-empty-rbd-trash script [puppet] - 10https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904) (owner: 10Andrew Bogott) [14:43:43] (03PS1) 10David Caro: empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 [14:43:53] (03CR) 10Kevin Bazira: "This new image has first been tested in staging as shown in: https://phabricator.wikimedia.org/P67245#269285" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [14:44:01] (03PS2) 10David Caro: empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 [14:44:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:08] (03PS3) 10David Caro: empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 [14:45:43] (03PS1) 10Andrew Bogott: Upgrade codfw1dev openstack to version Caracal [puppet] - 10https://gerrit.wikimedia.org/r/1060862 (https://phabricator.wikimedia.org/T369044) [14:46:27] 06SRE, 06Infrastructure-Foundations, 10netops: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061#10051545 (10cmooney) [14:47:52] (03CR) 10CI reject: [V:04-1] WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:47:56] (03CR) 10CI reject: [V:04-1] empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 
(owner: 10David Caro) [14:48:09] (03CR) 10Scott French: mediawiki: fetch active deployment host (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [14:48:32] (03PS4) 10David Caro: empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 [14:50:49] (03PS5) 10David Caro: empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 [14:51:58] !log fnegri@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Running sync-netbox-hiera manually because it failed during the reimage - fnegri@cumin1002 - T365424" [14:52:01] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [14:52:28] (03CR) 10David Caro: [V:03+1] "Noop run looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/1060861 (owner: 10David Caro) [14:52:32] !log fnegri@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Running sync-netbox-hiera manually because it failed during the reimage - fnegri@cumin1002 - T365424" [14:53:32] (03CR) 10CI reject: [V:04-1] empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 (owner: 10David Caro) [14:58:46] (03PS8) 10David Caro: wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - 10https://gerrit.wikimedia.org/r/1060794 [14:58:46] (03PS9) 10David Caro: wmcs: enable mypy on all our modules [puppet] - 10https://gerrit.wikimedia.org/r/1060800 [14:58:46] (03PS6) 10David Caro: empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 [14:58:47] (03PS1) 10David Caro: backy2/*: use new type annotations [puppet] - 10https://gerrit.wikimedia.org/r/1060865 [15:00:04] jnuche and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1500). [15:00:41] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10051598 (10Jhancock.wm) [15:04:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10051599 (10Jhancock.wm) @JMeybohm drives are installed. Lemme know if it all looks good or if you need anything else. [15:07:59] (03PS1) 10Giuseppe Lavagetto: php-fpm: make /healthz smarter [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 [15:08:14] (03CR) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:11:12] it looks like stashbot crashed. can anyone reboot it or something? [15:12:05] again [15:13:13] dhinus: can you please restart it again? thanks <3 [15:14:50] (03CR) 10Hnowlan: [C:04-1] php-fpm: make /healthz smarter (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [15:15:51] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10051639 (10Ottomata) That and maybe some comments in puppet admin data.yaml to instruct SREs on the right thing to do? 
[15:15:57] (03CR) 10Hnowlan: [C:03+1] php-fpm: make /healthz smarter (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060867 (owner: 10Giuseppe Lavagetto) [15:16:26] (03CR) 10Dzahn: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1057000 (https://phabricator.wikimedia.org/T371573) (owner: 10Dzahn) [15:16:36] !log test [15:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:01] Reedy: :P [15:18:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Memory issues (ECC) with ml-serve2004.codfw.wmnet - https://phabricator.wikimedia.org/T372036#10051654 (10Jhancock.wm) a:05Papaul→03Jhancock.wm @klausman I'm onsite and can do a DIMM swap on this if you have time. [15:21:41] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ml-serve2004.codfw.wmnet with reason: Hardware maintenance for memory errors [15:21:57] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-serve2004.codfw.wmnet with reason: Hardware maintenance for memory errors [15:22:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Memory issues (ECC) with ml-serve2004.codfw.wmnet - https://phabricator.wikimedia.org/T372036#10051680 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=773e931e-1862-4d24-b0bd-52100c4ad9bb) set by klausman@cumin20... 
[15:22:53] sukhe: I think someone was faster :) [15:23:09] yep, thanks nonetheless (and to Reedy) [15:34:24] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:41] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:36:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001.codfw.wmnet'] [15:42:12] (03PS1) 10Dzahn: wikistats: re-add puppetized mounting of cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1060877 (https://phabricator.wikimedia.org/T371573) [15:43:43] the debmonitor failures are due to me, upgrading the client on all bullseye nodes [15:45:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2009.codfw.wmnet with OS bookworm [15:45:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2010.codfw.wmnet with OS bookworm [15:45:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2011.codfw.wmnet with OS bookworm [15:48:13] (03CR) 10FNegri: [C:03+1] backy2/*: use new type annotations [puppet] - 10https://gerrit.wikimedia.org/r/1060865 (owner: 10David Caro) [15:50:23] (03CR) 10David Caro: [V:03+1] "Finished the run, cleaned up all the images in the trash :)" [puppet] - 10https://gerrit.wikimedia.org/r/1060861 (owner: 10David Caro) [15:50:41] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes 
(http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:38] (03CR) 10Andrew Bogott: [C:03+2] Upgrade codfw1dev openstack to version Caracal [puppet] - 10https://gerrit.wikimedia.org/r/1060862 (https://phabricator.wikimedia.org/T369044) (owner: 10Andrew Bogott) [15:51:47] (03PS5) 10Effie Mouzeli: mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) [15:52:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['testhost2001.codfw.wmnet'] [15:53:45] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:54:02] hello [15:54:24] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:24] debmonitor [15:54:27] ok [15:54:34] sukhe: yeah it's me sorry :( [15:54:39] all good, let us know if we can help <3 [15:55:07] are we looking at debmonitor or the widespread puppet failures? [15:55:19] same thing? [15:55:30] mutante: yep [15:55:33] ack! 
[15:55:41] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:58] (03PS1) 10Andrew Bogott: wmcs-empty-rbd-trash: fix name of source file [puppet] - 10https://gerrit.wikimedia.org/r/1060880 [15:56:32] (03CR) 10Andrew Bogott: [C:03+2] wmcs-empty-rbd-trash: fix name of source file [puppet] - 10https://gerrit.wikimedia.org/r/1060880 (owner: 10Andrew Bogott) [15:56:57] (03PS1) 10Klausman: site.pp: move new ML hosts to insetup for imaging [puppet] - 10https://gerrit.wikimedia.org/r/1060881 [15:56:59] (03CR) 10CI reject: [V:04-1] site.pp: move new ML hosts to insetup for imaging [puppet] - 10https://gerrit.wikimedia.org/r/1060881 (owner: 10Klausman) [15:57:34] (03PS2) 10Klausman: site.pp: move new ML hosts to insetup for imaging [puppet] - 10https://gerrit.wikimedia.org/r/1060881 [15:57:42] sukhe, mutante - IIUC it should just be a matter of running puppet again on failed nodes, trying [15:58:03] (03PS3) 10Klausman: site.pp: move new ML hosts to insetup for imaging [puppet] - 10https://gerrit.wikimedia.org/r/1060881 [15:58:58] (03CR) 10Klausman: [C:03+2] site.pp: move new ML hosts to insetup for imaging [puppet] - 10https://gerrit.wikimedia.org/r/1060881 (owner: 10Klausman) [15:59:24] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:04] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1600). 
[16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:01:12] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10051874 (10XiaoXiao-WMF) okay, so how can I renew this now? I have passed the 7/22 deadline [16:01:15] (03CR) 10FNegri: [C:03+1] empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 (owner: 10David Caro) [16:02:53] (03CR) 10David Caro: [C:03+2] backy2/*: use new type annotations [puppet] - 10https://gerrit.wikimedia.org/r/1060865 (owner: 10David Caro) [16:02:55] (03CR) 10David Caro: [V:03+1 C:03+2] empty_trash: unprotect snapshots of the image if needed [puppet] - 10https://gerrit.wikimedia.org/r/1060861 (owner: 10David Caro) [16:03:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:03:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:03:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10051878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm [16:04:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm [16:04:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10051900 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors... 
[16:04:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2009.codfw.wmnet with OS bookworm [16:05:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2010.codfw.wmnet with OS bookworm [16:06:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2009.codfw.wmnet with OS bookworm [16:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 22.48% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:07:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2010.codfw.wmnet with OS bookworm [16:07:57] !log on cumin1002 "sudo cumin -b 20 -p 95 'P{F:lsbdistcodename="bullseye"} and A:codfw' 'run-puppet-agent -q --failed-only'" [16:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:32] seems solved [16:10:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm [16:10:23] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10051910 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm [16:10:41] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) 
for host testhost2001.codfw.wmnet with OS bookworm [16:11:45] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10051916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm execut... [16:12:01] mw-api-int saturation lines up with an increase in parsoidCachePrewarm jobs [16:13:17] *could* be the long tail from a big transclusion update in changeprop [16:14:05] could be recovering a little bit, too early to say [16:14:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm [16:14:23] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10051924 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm [16:14:24] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testhost2001.codfw.wmnet with OS bookworm [16:14:53] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10051925 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm execut... 
[16:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 24.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:17:31] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:24] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:39] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:20:41] FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating mgmt ips in codfw - jhancock@cumin2002" [16:24:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating mgmt ips in codfw - jhancock@cumin2002" [16:24:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:53] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:26:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10051967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm [16:26:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm [16:26:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10051969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors... [16:27:31] RESOLVED: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:56] !log debmonitor-client 0.4.0 rolled out to all bullseye nodes [16:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:41] RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:28] there are some failed nodes among the puppetboard list, but they should clear out during the next puppet run [16:37:31] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:31] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:31] RESOLVED: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:03] elukey: thanks, looks good on puppetboard [16:49:22] (03PS1) 10BCornwall: ncmonitor: Set ignored domains configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060891 (https://phabricator.wikimedia.org/T372076) [16:50:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:08] (03PS1) 10Srishakatux: Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [16:54:28] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3591/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060891 (https://phabricator.wikimedia.org/T372076) (owner: 10BCornwall) [17:00:04] bd808: That opportune time for a Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1700). 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1700) [17:01:17] I've got some developer-portal changes to roll out in my window today. [17:04:41] (03PS1) 10Urbanecm: Convert gb_id to integer in GlobalBlock [extensions/GlobalBlocking] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060897 (https://phabricator.wikimedia.org/T372063) [17:05:34] jouncebot: nowandnext [17:05:34] For the next 0 hour(s) and 54 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1700) [17:05:35] For the next 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1700) [17:05:35] In 0 hour(s) and 54 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1800) [17:05:37] (03CR) 10Dreamy Jazz: [C:03+2] Convert gb_id to integer in GlobalBlock [extensions/GlobalBlocking] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060897 (https://phabricator.wikimedia.org/T372063) (owner: 10Urbanecm) [17:05:57] Can I deploy please :) [17:06:03] It's probably a train issue [17:07:05] Go ahead. 
[17:07:51] (03CR) 10Dzahn: [C:03+2] wikistats: re-add puppetized mounting of cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1060877 (https://phabricator.wikimedia.org/T371573) (owner: 10Dzahn) [17:08:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060897 (https://phabricator.wikimedia.org/T372063) (owner: 10Urbanecm) [17:08:57] !log bking@wdqs1020 restart wdqs-blazegraph service due to excessive GC [17:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2009.codfw.wmnet with OS bookworm [17:11:00] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20240729/ using stat1009.eqiad.wmnet) [17:11:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2009.codfw.wmnet with reason: host reimage [17:11:38] (03CR) 10Dzahn: [C:03+1] "sounds good! we just have to watch out that the logs don't become huge like on gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/1060131 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:11:58] (03CR) 10Dzahn: [C:03+2] "current list is full of googlebot IPs but also some legit looking users." 
[puppet] - 10https://gerrit.wikimedia.org/r/1060502 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [17:13:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2009.codfw.wmnet with reason: host reimage [17:14:16] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-08-05-122022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) [17:15:11] (03CR) 10CI reject: [V:04-1] developer-portal: Bump container to 2024-08-05-122022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: 10BryanDavis) [17:15:19] (03Merged) 10jenkins-bot: Convert gb_id to integer in GlobalBlock [extensions/GlobalBlocking] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060897 (https://phabricator.wikimedia.org/T372063) (owner: 10Urbanecm) [17:15:33] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1060897|Convert gb_id to integer in GlobalBlock (T372063)]] [17:15:36] T372063: TypeError: Typed property MediaWiki\Extension\GlobalBlocking\GlobalBlock::$id must be int, string used - https://phabricator.wikimedia.org/T372063 [17:17:31] (03PS1) 10Dzahn: idp:standalone: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1060901 [17:17:46] !log dreamyjazz@deploy1003 urbanecm, dreamyjazz: Backport for [[gerrit:1060897|Convert gb_id to integer in GlobalBlock (T372063)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:17:51] !log dreamyjazz@deploy1003 urbanecm, dreamyjazz: Continuing with sync [17:19:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10052216 (10BTullis) >>! In T369517#10051874, @XiaoXiao-WMF wrote: > okay, so how can I renew this now? 
I have passed the 7/22 deadline You just have to run `kinit` and then... [17:20:51] (03CR) 10Dzahn: "Might not be used. Is https://wikitech.wikimedia.org/wiki/Nova_Resource:Sso in use or abandoned?" [puppet] - 10https://gerrit.wikimedia.org/r/1060901 (owner: 10Dzahn) [17:22:21] !log dreamyjazz@deploy1003 Finished scap: Backport for [[gerrit:1060897|Convert gb_id to integer in GlobalBlock (T372063)]] (duration: 06m 48s) [17:22:24] T372063: TypeError: Typed property MediaWiki\Extension\GlobalBlocking\GlobalBlock::$id must be int, string used - https://phabricator.wikimedia.org/T372063 [17:23:12] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: 10BryanDavis) [17:24:44] helmfile lint go :boom: [17:28:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10052245 (10Dzahn) This should be resolved now. @SToyofuku-WMF I can see your user exists now on the deployment server. You should be able to log in. The currently active server i... [17:28:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:29:41] I have no idea what this linter failure is telling me. I've sent a plea for help in the -serviceops and -sre channels, but if there is a deployment-charts/helmfile knower here who can point me in the right direction I would be grateful: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060899 https://integration.wikimedia.org/ci/job/helm-lint/19680/console [17:33:17] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10052258 (10XiaoXiao-WMF) Okay perfect. Thanks, I will mark this as resolved! 
[17:34:17] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10052259 (10XiaoXiao-WMF) 05Open→03Resolved a:03XiaoXiao-WMF [17:36:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:36:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2009.codfw.wmnet with OS bookworm [17:44:30] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1024.eqiad.wmnet with OS bullseye [17:45:07] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [17:48:55] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10052313 (10phaultfinder) [17:49:22] (03PS1) 10Ryan Kemper: wdqs: update scap wdqs hostlist [puppet] - 10https://gerrit.wikimedia.org/r/1060902 (https://phabricator.wikimedia.org/T364368) [17:55:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10052356 (10VRiley-WMF) Currently, not planning on shutting it down. However, some firmware requires a reboot of the server. I just wanted to ensure that there would be no disruptions. [18:00:05] jnuche and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1800). 
[18:00:15] o/ [18:00:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:00:17] nothing for this window [18:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 19.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:18:24] (03PS1) 10RLazarus: Revert "cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060905 [18:19:22] (03PS2) 10RLazarus: Revert "cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060905 [18:20:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:21:46] (03CR) 10RLazarus: [C:03+2] Revert "cloudnative-pg-cluster: rely on the common_images data to 
ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060905 (owner: 10RLazarus) [18:22:29] (03Merged) 10jenkins-bot: Revert "cloudnative-pg-cluster: rely on the common_images data to ingfer the PG image tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060905 (owner: 10RLazarus) [18:23:42] (03CR) 10Cwhite: "Although this tells grafana that we are using a 30s scrape interval when we've configured Prometheus to use a 60s scrape interval, this do" [puppet] - 10https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) (owner: 10Filippo Giunchedi) [18:34:35] (03CR) 10RLazarus: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: 10BryanDavis) [18:40:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2010.codfw.wmnet with OS bookworm [18:40:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2011.codfw.wmnet with OS bookworm [18:41:41] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:42:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2010.codfw.wmnet with reason: host reimage [18:44:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2011.codfw.wmnet with reason: host reimage [18:44:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:45:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2010.codfw.wmnet with reason: host reimage [18:45:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS 
bookworm [18:45:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10052499 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm [18:48:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2011.codfw.wmnet with reason: host reimage [18:58:34] (03PS1) 10Cathal Mooney: Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) [18:58:51] !log [Elastic] `ryankemper@cumin2002:~$ sudo -E cumin 'elastic2062*,elastic2082*,elastic2088*,elastic2090*,elastic2099*,elastic2103*' 'pool'` (hosts that had not been repooled after previous maintenance) [18:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:00:57] jouncebot nowandnext [19:00:57] For the next 0 hour(s) and 59 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T1800) [19:00:57] In 0 hour(s) and 59 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T2000) [19:01:20] I'm going to do some scap testing. 
[19:01:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:01:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2010.codfw.wmnet with OS bookworm [19:02:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:02:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:02:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2011.codfw.wmnet with OS bookworm [19:03:21] !log dancy@deploy1003 Started scap sync-world: testing T371904 [19:03:24] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 [19:05:16] (03PS1) 10Cathal Mooney: Use Netbox data to build tunnel configuration on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) [19:05:17] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1024.eqiad.wmnet with OS bullseye [19:05:54] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS bullseye [19:06:02] !log dancy@deploy1003 Finished scap: testing T371904 (duration: 02m 40s) [19:06:29] (03PS2) 10Cathal Mooney: Use Netbox data to build tunnel configuration on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) [19:07:28] (03PS2) 10Cathal Mooney: Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 
(https://phabricator.wikimedia.org/T369351) [19:12:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm [19:13:05] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10052548 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm [19:13:36] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T371874 [19:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 18.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:18:53] (03PS2) 10BryanDavis: developer-portal: Bump container to 2024-08-05-122022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) [19:19:49] (03CR) 10CI reject: [V:04-1] developer-portal: Bump container to 2024-08-05-122022-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: 10BryanDavis) [19:20:10] !log dancy@deploy1003 Started scap sync-world: testing T371904 [19:20:13] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 [19:22:55] (03PS1) 10Ssingh: sre.dns.admin: add cookbook for GeoDNS pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) [19:23:25] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) 
for ElasticSearch cluster relforge: security update - bking@cumin2002 - T371874 [19:26:02] (03PS2) 10Ssingh: sre.dns.admin: add cookbook for GeoDNS pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) [19:27:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10052625 (10Jhancock.wm) @klausman these servers are ready! [19:29:26] (03PS1) 10BryanDavis: deployment_server: quote numeric map key in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1060915 (https://phabricator.wikimedia.org/T368240) [19:29:36] (03PS1) 10Dzahn: gerrit: further increase throttling threshold for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/1060916 (https://phabricator.wikimedia.org/T365259) [19:29:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10052622 (10Jhancock.wm) 05Open→03Resolved [19:31:10] !log dancy@deploy1003 Started scap sync-world: testing T371904 [19:31:15] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 [19:32:00] (03CR) 10Dzahn: [C:03+2] gerrit: further increase throttling threshold for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/1060916 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:32:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testhost2001.codfw.wmnet with OS bookworm [19:32:22] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10052647 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm execut... 
[19:33:20] (CR) RLazarus: [C:+2] deployment_server: quote numeric map key in hiera [puppet] - https://gerrit.wikimedia.org/r/1060915 (https://phabricator.wikimedia.org/T368240) (owner: BryanDavis)
[19:33:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm
[19:33:30] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10052654 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm executed with errors...
[19:38:46] (CR) Dzahn: [C:+2] ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[19:39:44] (CR) BryanDavis: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: BryanDavis)
[19:47:53] (PS10) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[19:49:40] (PS11) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[19:51:30] (PS3) Ahmon Dancy: scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904)
[19:51:59] (CR) Dzahn: [C:+2] "actual diff on the contint machines:" [puppet] - https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[19:52:15] (CR) BryanDavis: [C:+2] developer-portal: Bump container to 2024-08-05-122022-production [deployment-charts] - https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: BryanDavis)
[19:53:16] (Merged) jenkins-bot: developer-portal: Bump container to 2024-08-05-122022-production [deployment-charts] - https://gerrit.wikimedia.org/r/1060899 (https://phabricator.wikimedia.org/T371385) (owner: BryanDavis)
[19:54:30] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[19:55:36] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[19:55:56] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[19:56:19] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[19:56:52] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[19:57:16] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[19:57:23] (CR) Ryan Kemper: [C:+2] wdqs: update scap wdqs hostlist [puppet] - https://gerrit.wikimedia.org/r/1060902 (https://phabricator.wikimedia.org/T364368) (owner: Ryan Kemper)
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240808T2000).
[20:00:04] Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:53] (CR) RLazarus: [C:+2] "This original commit was an innocent bystander -- feel free to roll it forward again, and sorry for the inconvenience. The actual issue wa" [deployment-charts] - https://gerrit.wikimedia.org/r/1060905 (owner: RLazarus)
[20:02:56] (PS1) Andrew Bogott: puppetserver-deploy-code: don't use sudo when checking current branch [puppet] - https://gerrit.wikimedia.org/r/1060919 (https://phabricator.wikimedia.org/T364492)
[20:03:51] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060919 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott)
[20:04:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.6% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:09:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.21% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:09:50] If anyone is around to deploy, I'll be happy to test Nemoralis's patch
[20:10:58] Sohom_Datta: I can!
[20:11:31] Sounds good :) (Thank you!)
[20:11:47] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@0266527]: (no justification provided)
[20:12:00] \o/
[20:12:14] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060753 (https://phabricator.wikimedia.org/T371440) (owner: NMW03)
[20:12:36] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@0266527]: (no justification provided) (duration: 00m 49s)
[20:12:57] (Merged) jenkins-bot: Enable protection indicators for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1060753 (https://phabricator.wikimedia.org/T371440) (owner: NMW03)
[20:13:09] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060753|Enable protection indicators for azwiki (T371440)]]
[20:13:13] T371440: Enable protection indicators for azwiki - https://phabricator.wikimedia.org/T371440
[20:15:22] !log samtar@deploy1003 samtar, nmw03: Backport for [[gerrit:1060753|Enable protection indicators for azwiki (T371440)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:15:36] Sohom_Datta: ready for testing :)
[20:16:11] Looking :)
[20:16:28] Yep, it appears to be working
[20:16:36] :D
[20:16:45] o/
[20:17:01] TheresNoTime: I'm here
[20:17:03] !log samtar@deploy1003 samtar, nmw03: Continuing with sync
[20:17:32] (PS11) BCornwall: Create corto deployment/configuration [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789)
[20:18:15] Nemoralis: Sohom_Datta offered to test your patch
[20:18:25] thanks Sohom
[20:18:39] :)
[20:20:09] (CR) BCornwall: Create corto deployment/configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[20:21:32] !log samtar@deploy1003 Finished scap: Backport for [[gerrit:1060753|Enable protection indicators for azwiki (T371440)]] (duration: 08m 22s)
[20:21:41] T371440: Enable protection indicators for azwiki - https://phabricator.wikimedia.org/T371440
[20:22:03] thank you all
[20:22:08] Sohom_Datta: Nemoralis: done! :D
[20:22:16] <3
[20:22:46] (PS2) Jdlrobson: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633)
[20:22:52] (PS2) Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041
[20:23:58] <3
[20:24:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.17% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:42:18] (CR) BCornwall: [C:+1] ACMEChiefConfig: Automated MarkMonitor domain sync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1055232 (owner: Ncmonitor)
[20:42:27] (CR) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1055232 (owner: Ncmonitor)
[20:42:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:42:55] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10052900 (SToyofuku-WMF) Can confirm I just got in - thank you so much!!
[20:50:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:05:03] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: security update - bking@cumin2002 - T371874
[21:09:51] SRE-OnFire, Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#10052992 (BCornwall) Proposal: $name = "2024-08-08 Database outage" $status = (Investigating|Identified|Monitoring|Resolved) $message = "The database is experiencing hea...
[21:21:00] !log ebernhardson@deploy1003 Synchronized private/PrivateSettings.php: Update NetworkSession users list for T341332 (duration: 06m 15s)
[21:21:03] T341332: [EPIC] The CirrusSearch streaming updater should support private wikis - https://phabricator.wikimedia.org/T341332
[21:29:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: security update - bking@cumin2002 - T371874
[21:29:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm
[21:29:32] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053037 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm
[21:31:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testhost2001.codfw.wmnet with reason: host reimage
[21:34:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testhost2001.codfw.wmnet with reason: host reimage
[21:42:22] SRE, SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10053066 (Dzahn) In progress→Resolved Thanks for confirming! I'll call it resolved. Cheers
[21:45:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testhost2001.codfw.wmnet with OS bookworm
[21:45:24] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm comple...
[21:55:05] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001.codfw.wmnet']
[21:57:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['testhost2001.codfw.wmnet']
[21:59:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm
[21:59:24] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053122 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm
[22:18:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testhost2001.codfw.wmnet with OS bookworm
[22:18:59] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053141 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm execut...
[22:20:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bookworm
[22:20:09] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053145 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm
[22:44:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:13] (CR) Scott French: "Took a quick look this afternoon, and overall no big surprises. A couple of items of note and / or questions." [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[22:59:48] (PS12) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[23:01:16] (PS13) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[23:33:43] (PS2) RLazarus: mwscript_k8s: Add --mediawiki_image [puppet] - https://gerrit.wikimedia.org/r/1060512
[23:34:13] (CR) RLazarus: mwscript_k8s: Add --mediawiki_image (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1060512 (owner: RLazarus)
[23:35:09] (PS5) Pppery: WIP: Add wmf-config changes for mos: interwiki hack [mediawiki-config] - https://gerrit.wikimedia.org/r/1051814 (https://phabricator.wikimedia.org/T363538)
[23:35:19] (CR) CI reject: [V:-1] WIP: Add wmf-config changes for mos: interwiki hack [mediawiki-config] - https://gerrit.wikimedia.org/r/1051814 (https://phabricator.wikimedia.org/T363538) (owner: Pppery)
[23:37:38] (PS6) Pppery: WIP: Add wmf-config changes for mos: interwiki hack [mediawiki-config] - https://gerrit.wikimedia.org/r/1051814 (https://phabricator.wikimedia.org/T363538)
[23:38:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060943
[23:38:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060943 (owner: TrainBranchBot)
[23:39:32] (CR) Scott French: [C:+1] mwscript_k8s: Add --mediawiki_image [puppet] - https://gerrit.wikimedia.org/r/1060512 (owner: RLazarus)
[23:41:30] SRE, MW-on-K8s, serviceops, Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#10053269 (Scott_French) Though mainly focused on supporting the php 8.1 migration, there's ongoing work to support multiple base-image “flavors” and a helm-rele...
[23:51:16] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10053303 (KFrancis) Hello all, the NDA has been signed. Thanks!
[23:57:16] (CR) RLazarus: [C:+2] mwscript_k8s: Add --mediawiki_image [puppet] - https://gerrit.wikimedia.org/r/1060512 (owner: RLazarus)